Hi,
It appears that several nodes are currently in drain/down state.
Regards,
Debajyoti
Dear @Debajyoti.Sengupta
Here is the status of this incident:
The compute nodes are back in production.
Best Regards
Dear @Debajyoti.Sengupta
We have identified the root cause of the issue. An InfiniBand switch has a hardware malfunction that is interfering with InfiniBand traffic across the entire cluster. The nodes connected to this switch have been temporarily disabled and powered off, and we are awaiting replacement hardware to restore them.
We apologize for any inconvenience this may have caused.
Best Regards,
It might be a related issue, but it seems like Scratch is not working at the moment.