Hi there,
gpu017.baobab is back in production.
I am not aware of any persistent issue with this specific node.
FYI, the “not-in-IDLE” Slurm reason for a specific node is available via the following command:
$ scontrol show Node=${NODE} | \
grep -E '(State|Reason)'
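To illustrate what the grep filter keeps, here is a minimal sketch run against a hypothetical (and heavily truncated) scontrol output; the node name, state, and reason below are made up for the example, and the real output contains many more fields:

```shell
# Hypothetical, truncated sample of `scontrol show Node=...` output,
# used here only to demonstrate the filter; real output is much longer.
sample='NodeName=gpu017 Arch=x86_64 CoresPerSocket=6
   State=IDLE+DRAIN ThreadsPerCore=1
   Reason=health_check_failed [root@2021-04-12T09:00:00]'

# The grep keeps only the lines carrying the State and Reason fields:
printf '%s\n' "$sample" | grep -E '(State|Reason)'
```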
Going back to this node: AFAIK, since the beginning of April 2021, gpu017.baobab has been DOWN (i.e. Slurm lost contact with the node) twice, both times because the node was physically shut down for a still-unknown reason (cf. Gpu012 and gpu017 down - #5 by Luca.Capello ).
This time, however, the node was not in the Slurm DOWN state, but in DRAIN, which basically happens for 2 main reasons:
- Slurm itself found an error (which was the case here: an out-of-time job could not be properly killed)
- the every-3-minute health check considered the node unfit for production (in which case the Slurm reason starts with health_)
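The two cases above can be told apart from the Reason string alone, since health-check failures carry the health_ prefix. A minimal sketch (the helper name and sample reason strings are hypothetical, not part of Slurm itself):

```shell
# classify_reason: hypothetical helper that distinguishes the two DRAIN
# causes by the "health_" prefix convention described above.
classify_reason() {
    case "$1" in
        health_*) echo "health-check failure" ;;
        *)        echo "Slurm-detected error" ;;
    esac
}

classify_reason 'health_check_failed'   # -> health-check failure
classify_reason 'Kill task failed'      # -> Slurm-detected error
```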
In both cases, the idea is that a manual check is needed to confirm that the problem was transient. Given that we have daily reporting for nodes in the DRAIN/DOWN states, we usually do this check every day, unless more urgent tasks take priority.
Thx, bye,
Luca