gpu012 and gpu017 down

Hi,

The nodes have been down for around an hour; Slurm reports them as not responding. They are also unreachable by ping, so I don’t think it is a Slurm daemon issue. Could it be power or network?

Cheers,
Johnny

Hi there,

  • gpu012.baobab does not respond via IPMI either, and there is nothing in the netconsole logs, so the power circuit breaker is probably the cause. This will have to wait until Monday.

  • gpu017.baobab was powered off, with nothing in the IPMI logs or in the system ones (not even the netconsole ones).
    According to the Slurm logs, @Lukas.Ehrke from your group was running the interactive JobID 45860249d just before the node stopped responding. I will contact them next Monday for more information about the job type.
    In the meantime, the node has been RESUMEd.

Thx, bye,
Luca

Hi there,

Indeed, the current drawn by gpu004.baobab and gpu012.baobab together reached 19.1 A, so the 16 A power circuit breaker did its job.
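
For anyone who wants to see the arithmetic, here is a minimal sketch using only the two figures quoted above (19.1 A combined draw, 16 A rating). The function names are hypothetical and made up for illustration. It also assumes a simple threshold comparison, whereas a real breaker follows a time/current trip curve.

```python
def exceeds_rating(total_current_a: float, rating_a: float) -> bool:
    """Return True when the combined draw is above the breaker's rated current."""
    return total_current_a > rating_a


def overload_amps(total_current_a: float, rating_a: float) -> float:
    """Return how many amps the draw is above the rating (0.0 if within it)."""
    return max(0.0, total_current_a - rating_a)


# Figures from this thread: combined draw of gpu004 and gpu012, breaker rating.
COMBINED_DRAW_A = 19.1
BREAKER_RATING_A = 16.0

if __name__ == "__main__":
    over = overload_amps(COMBINED_DRAW_A, BREAKER_RATING_A)
    ratio = COMBINED_DRAW_A / BREAKER_RATING_A
    print(f"exceeds rating: {exceeds_rating(COMBINED_DRAW_A, BREAKER_RATING_A)}")
    print(f"overload: {over:.1f} A ({ratio:.0%} of the rating)")
```

The per-node split is not known from the thread, so the sketch works on the combined figure only.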

Nodes RESUMEd.

Thx, bye,
Luca

Hi @Luca.Capello,

It looks like gpu017 is down again, and it is unreachable when I try to ping it.

Hi there,

Sorry for the delay.

gpu017.baobab is back in production. There is no problem in the logs, and @Lukas.Ehrke was again the last one to have jobs running there.

Thx, bye,
Luca