Is gpu029 down?

Dear HPC Team,

I am writing on behalf of the SIP Group. We have been having difficulty accessing gpu029 via the private partition, and we suspect that the node may be down or experiencing a technical problem.

Upon attempting to salloc an interactive session, we received the following error message:

salloc: Required node not available (down, drained or reserved)

We would greatly appreciate it if you could kindly look into this matter and check the status of the gpu029 node.

Please let us know if you require any further information to assist with the troubleshooting. Your prompt attention to this issue is highly appreciated.

Thank you for your continuous support.

Best regards,

Vitaliy Kinakh

Dear @Vitaliy.Kinakh,

Our support rounds include checking down/drained nodes, so we are aware of all nodes that are out of production.

You can list the reasons why nodes are in the down, drained, fail or failing state with sinfo -R, for example for cpu287:

(baobab)-[root@admin1 ~]$ sinfo -R -n cpu287
REASON               USER      TIMESTAMP           NODELIST
health_BEEGFS__tcp_c root      2023-07-26T22:30:07 cpu287
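
If you want to check a node yourself before opening a ticket, something like the following might help. It is only a sketch: `node_reason` is a hypothetical helper (not part of Slurm) that reads `sinfo -R` output on standard input and prints the reason for one node. It assumes the default column layout shown above (REASON, USER, TIMESTAMP, NODELIST) and that the node appears on its own line, which is the case when `sinfo` is restricted with `-n`.

```shell
# Hypothetical helper, not a Slurm command: filter `sinfo -R` output for one node.
# Assumes the default columns: REASON USER TIMESTAMP NODELIST.
node_reason() {
    awk -v node="$1" '
        $NF == node {
            # The last three fields are USER, TIMESTAMP and NODELIST;
            # everything before them is the (possibly multi-word) reason.
            reason = $1
            for (i = 2; i <= NF - 3; i++) reason = reason " " $i
            printf "%s: %s (set by %s at %s)\n", node, reason, $(NF-2), $(NF-1)
            found = 1
        }
        # Exit status 1 means no reason was listed for this node.
        END { exit found ? 0 : 1 }
    '
}

# On a login node you would run, for example:
#   sinfo -R -n gpu029 | node_reason gpu029
```

A non-zero exit status only means that no down/drain reason is listed for that node. The job can still fail to start for other reasons, such as a reservation.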

Unfortunately, some nodes are drained by the administrators because of a hardware, software or unknown issue, and they need further analysis to make sure they will not impact jobs.

gpu029 is back in production.

We apologize for any inconvenience caused.

Best Regards,