Hi there,
gpu017.baobab is back in production.
I am not aware of any persistent issue with this specific node.
FYI, the “not-in-IDLE” Slurm reason for a specific node is available via the following command:
$ scontrol show Node=${NODE} | \
grep -E '(State|Reason)'
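To illustrate what the grep filter keeps, here is a minimal sketch run against a hypothetical (and heavily truncated) scontrol output; the node name, state, and reason below are made up for the example, and the real output contains many more fields:

```shell
# Hypothetical, truncated sample of `scontrol show Node=...` output,
# used here only to demonstrate the filter; real output is much longer.
sample='NodeName=gpu017 Arch=x86_64 CoresPerSocket=6
   State=IDLE+DRAIN ThreadsPerCore=1
   Reason=health_check_failed [root@2021-04-12T09:00:00]'

# The grep keeps only the lines carrying the State and Reason fields:
printf '%s\n' "$sample" | grep -E '(State|Reason)'
```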
Going back to this node: AFAIK, since the beginning of April 2021, gpu017.baobab has been DOWN (i.e. Slurm lost contact with the node) twice, both times because the node was physically shut down for a still-unknown reason (cf. Gpu012 and gpu017 down - #5 by Luca.Capello ).
This time, however, the node was not in the Slurm DOWN state, but in DRAIN, which basically happens for 2 main reasons:
- Slurm itself found an error (which was the case here: an out-of-time job could not be properly killed)
- the every-3-minute health check considered the node unfit for production (in which case the Slurm reason starts with health_)
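The two cases above can be told apart from the Reason string alone, since health-check failures carry the health_ prefix. A minimal sketch (the helper name and sample reason strings are hypothetical, not part of Slurm itself):

```shell
# classify_reason: hypothetical helper that distinguishes the two DRAIN
# causes by the "health_" prefix convention described above.
classify_reason() {
    case "$1" in
        health_*) echo "health-check failure" ;;
        *)        echo "Slurm-detected error" ;;
    esac
}

classify_reason 'health_check_failed'   # -> health-check failure
classify_reason 'Kill task failed'      # -> Slurm-detected error
```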
In both cases, the idea is that a manual check is needed to confirm that the problem was transient. Given that we have daily reporting for nodes in the DRAIN/DOWN states, we usually do this check every day, unless more urgent tasks take priority.
Thx, bye,
Luca