It appears that several nodes are currently in drain/down state.
Here is the status of this incident:
The compute nodes are back in production.
We have identified the root cause of the issue. An InfiniBand switch is experiencing a hardware malfunction, which is interfering with InfiniBand traffic across the entire cluster. The nodes connected to this switch have been temporarily disabled and powered off. We are awaiting the replacement hardware to restore these nodes.
We apologize for any inconvenience this may have caused.
It might be a related issue, but it seems like Scratch is not working at the moment.
It may be related; we’re working on it. We will keep you informed.