[2024] Current issues on HPC Cluster

Baobab: Login node down

Dear users,

The login node on Baobab crashed. The server has been rebooted and is available again.

We apologize for any inconvenience caused.

Thank you for your understanding.

Status : Solved :green_circle:

start: 2024-07-21T18:42:00Z
end:Invalid date

Bamboo Scratch Storage Unavailable

Dear HPC Users,

The scratch storage on Bamboo is currently unavailable due to an ongoing issue. Our team has already contacted the provider and we are actively working with them to resolve the situation as quickly as possible.

Please note that the scratch storage has been unmounted on the compute and login nodes and will remain unavailable until further notice. We will update you as soon as we have more information on the situation.

Thank you for your understanding,

Best Regards,

Status : Solved :green_circle:

start: 2024-09-10T22:33:00Z
end: 2024-09-26T07:33:00Z

Update: the vendor will carry out an intervention on the 25th of September to fix the issue.

The service is back in production without data loss!

Yggdrasil nodes unavailable

Dear HPC Users,

Yggdrasil is currently experiencing issues with its electrical power supply, which has resulted in a reduced number of available nodes on the cluster.

Electricians are working to resolve the issue.

Thank you for your understanding.

Best Regards,

Status : Solved :green_circle:

start: 2024-09-13T21:30:00Z
end: 2024-09-17T12:24:00Z

Dear HPC Users,

Yggdrasil is currently experiencing issues with its electrical power supply, which has resulted in a reduced number of available nodes on the cluster.

This is the same issue as in mid-September. We’ll check with the datacenter manager to find out what is going on.

Thank you for your understanding.

Best Regards,

Status : Partially solved :green_circle:

start: 2024-09-27T22:02:00Z
end: 2024-09-30T09:45:00Z

Edit: The electrical cabling on Yggdrasil was modified incorrectly by someone at Astro, without notifying us. The Astro IT team is reverting the change. This is only a partial workaround, as it appears we still have an overload issue that has to be solved.

Dear HPC Users,

We’ve set all nodes on every cluster to drain. Because of an issue with the scratch storage, we need to upgrade scripts on every node. No worries: as soon as a node is upgraded, we’ll resume it.

Thank you for your understanding.

Best Regards,

Status : In progress :red_circle:

start: 2024-09-29T22:02:00Z
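
For users who want to see which nodes are still drained while the upgrade is rolled out, here is a minimal sketch. It assumes the clusters are scheduled with Slurm and that `sinfo` is available on the login node, neither of which is stated in this announcement. The function names and the node names in the usage note are hypothetical.

```python
import subprocess
from collections import Counter

# Assumption: the scheduler is Slurm and `sinfo` is on the PATH.
# -h: no header, -N: one line per node, -o "%N %T": node name and extended state.
SINFO_CMD = ["sinfo", "-h", "-N", "-o", "%N %T"]


def read_sinfo() -> str:
    """Run sinfo and return its raw text output (only works on a Slurm host)."""
    result = subprocess.run(SINFO_CMD, capture_output=True, text=True, check=True)
    return result.stdout


def parse_node_states(sinfo_output: str) -> dict:
    """Map node name -> state from lines of the form '<node> <state>'."""
    states = {}
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        node, state = parts
        # Slurm can append a flag character (for example '*' for a node
        # that is not responding); strip such trailing symbols.
        states[node] = state.rstrip("*~#!%$@^-+")
    return states


def drained_nodes(states: dict) -> list:
    """Return the sorted names of nodes that are draining or drained."""
    return sorted(n for n, s in states.items() if s.startswith("drain"))


def summarize(states: dict) -> Counter:
    """Count how many nodes are in each state."""
    return Counter(states.values())
```

On a login node, `drained_nodes(parse_node_states(read_sinfo()))` would list the nodes that have not been resumed yet. A node that belongs to several partitions appears on several `sinfo` lines, so the dictionary keeps one entry per node.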