Gpu017 down (again)

Hi,

gpu017 is down again.
Is there a persistent issue with this node that keeps causing it to go into a drain state?

Cheers,
Johnny


Hi there,

gpu017.baobab is back in production.

I am not aware of any persistent issue with this specific node.

FYI, the “not-in-IDLE” Slurm reason for a specific node is available via the following command:

$ scontrol show Node=${NODE} | \
 grep -E '(State|Reason)'
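
As a complement, and standard Slurm rather than anything Baobab-specific: sinfo can list the reason for every drained/down node at once, which is handy when you do not already know which node to query (a minimal sketch; the output format string is just one possible choice):

# reason, user and timestamp for every node that is down, drained or failing
$ sinfo -R

# or a per-node view restricted to the DRAIN/DOWN states
$ sinfo -N -t drain,down -o '%N %T %E'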

Going back to this node: AFAIK, since the beginning of April 2021, gpu017.baobab has been DOWN (i.e. Slurm lost contact with the node) twice, both times because the node was physically shut down for a still-unknown reason (cf. Gpu012 and gpu017 down - #5 by Luca.Capello).

This time, however, the node was not in the Slurm DOWN state but in DRAIN, which basically happens for two main reasons:

  1. Slurm itself found an error (which was the case here: a job that had exceeded its time limit could not be properly killed)
  2. the every-3-minute health check considered the node not fit for production (in which case the Slurm reason starts with health_); a rough sketch of such a configuration is shown below
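
For reference, such a periodic health check is typically wired up in slurm.conf; the snippet below is only a sketch of a standard setup, not the actual Baobab configuration, and the script path and reason string are hypothetical:

# slurm.conf (sketch, assuming a standard Slurm health-check setup)
HealthCheckProgram=/usr/local/sbin/node_health_check.sh   # hypothetical script path
HealthCheckInterval=180                                   # run every 3 minutes
HealthCheckNodeState=ANY                                  # check nodes in any state

# a failing test inside such a script would typically drain the node itself, e.g.:
#   scontrol update NodeName=$(hostname -s) State=DRAIN Reason="health_<failed check>"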

In both cases, the idea is that a manual check is needed to confirm that the problem was transient. Given that we have daily reporting for the nodes in DRAIN/DOWN states, we usually do this check every day, unless more urgent tasks get in the way.
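
For completeness, once the manual check confirms the problem was transient, putting a drained node back into production is standard scontrol usage along these lines (gpu017 used purely as an example):

# return a drained node to production after the manual check
$ scontrol update NodeName=gpu017 State=RESUME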

Thx, bye,
Luca

Hi,

I am aware of the scontrol command, and used it to check before reporting here; by “down” I clearly meant unable to run jobs, as indicated by the fact that I referred to the node being in the drain state.

I think it is a reasonable question to ask whether this is related to the other issues with this node, as it is persistently this one node giving us frequent trouble, and it is our latest investment in the cluster. We also reported issues with the read speed, which again were on gpu017. If there are any underlying issues with it, it would be great to address them whilst everything is still under warranty and before we make our next investment pledge!

Thanks for getting it back online. If you could please check whether the kill signal not being received is related to the issues that previously caused the node to become unresponsive, and to the slow scratch-disk read speed we reported, that would be greatly appreciated.

Thanks in advance.
Cheers,
Johnny

Hi Johnny,

The good news is that it’s fixed: we had an issue with the InfiniBand cabling.
See here: Slow CPU usage on gpu017

Hopefully that also solves the possibly related issues. Please let us know if you run into an issue again.

Best


Hi there,

I am sorry if my words were too harsh; that was not my intention, so my apologies.

And the scontrol example, as well as the node DRAIN/DOWN Slurm states (which is the reason I wrote them in uppercase), was not strictly for you (I know you are a cluster/Slurm power user), but for other forum users.

This was a way to be more open about the cluster situation and management, especially for those things that are not (yet?) in the documentation, as happened in the past for the ${HOME} and ${SCRATCH} permissions (cf. Some directory permission seem to change on their own - #3 by Volodymyr.Savchenko).

Fully agree, and FWIW I have not considered your question out of scope.

Thx, bye,
Luca
