[2024] Current issues on HPC Cluster

Baobab: Login node down

Dear users,

The login node on Baobab has crashed. The server has been rebooted and is available again.

We apologize for any inconvenience caused.

Thank you for your understanding.

Status : Solved :green_circle:

start: 2024-07-21T18:42:00Z
end: Invalid date

Bamboo Scratch Storage Unavailable

Dear HPC Users,

The scratch storage on Bamboo is currently unavailable due to an ongoing issue. Our team has already contacted the provider and we are actively working with them to resolve the situation as quickly as possible.

Please note that the scratch storage has been unmounted on compute and login nodes and will remain unavailable until further notice. We will update you as soon as we have more information on the situation.

Thank you for your understanding,

Best Regards,

Status : Solved :green_circle:

start: 2024-09-10T22:33:00Z
end: 2024-09-26T07:33:00Z

Update: the vendor will carry out an intervention on 25 September to fix the issue.

The service is back in production without data loss!

Yggdrasil nodes unavailable

Dear HPC Users,

Yggdrasil is currently experiencing issues with its electrical power supply, which has resulted in a reduced number of available nodes on the cluster.

Electricians are working to resolve the issue.

Thank you for your understanding.

Best Regards,

Status : Solved :green_circle:

start: 2024-09-13T21:30:00Z
end: 2024-09-17T12:24:00Z

Dear HPC Users,

Yggdrasil is currently experiencing issues with its electrical power supply, which has resulted in a reduced number of available nodes on the cluster.

This is the same issue as in mid-September. We’ll check with the datacenter manager to find out what is going on.

Thank you for your understanding.

Best Regards,

Status : Partially solved :green_circle:

start: 2024-09-27T22:02:00Z
stop: 2024-09-30T09:45:00Z

Edit: The electrical cabling on Yggdrasil was modified incorrectly by someone at Astro, without notifying us. The Astro IT team is reverting the change. This is only a partial workaround, as it appears we still have an overload issue that has to be solved.

Dear HPC Users,

We’ve set all nodes to drain on every cluster. Because of an issue with scratch storage, we need to upgrade scripts on every node. No worries: as soon as a node is upgraded, we’ll resume it.

Thank you for your understanding.

Best Regards,

Status : Solved :green_circle:

start: 2024-09-29T22:02:00Z
stop: 2024-10-02T22:02:00Z

Dear HPC Users,

The Bamboo cluster is currently experiencing issues with quotas on the home filesystem. The symptom is that the reported disk usage may be incorrect. We are investigating.

Thank you for your understanding.

Best Regards,

Status : Solved :green_circle:

start: 2024-10-16T22:02:00Z
stop: 2024-11-07T23:12:00Z

Dear HPC Users,

The Baobab cluster is currently experiencing issues with home storage. We restarted the servers this morning and will now investigate the cause of the crash.

Thank you for your understanding.

Best Regards,

Status : Solved :green_circle:

start: 2024-10-20T22:02:00Z
stop: 2024-11-08T23:16:00Z

Bamboo Scratch Storage Unavailable

Dear users,

We are having problems with scratch storage and are investigating to find a solution.

For the time being, scratch storage is not available on Bamboo.

We’ll keep you informed. Thank you for your understanding.

Update :

Disk enclosures have been flashed to avoid the bug.
All scratch storage is now available in read/write mode and no data has been lost.

Kind regards

Status : Solved :green_circle:

start: 2024-12-08T14:38:00Z
end: 2025-01-10T13:00:00Z

Dear users,

We are having problems with the Slurm controller on Baobab and are investigating to find a solution.

For the time being, running jobs continue, but no new jobs can start and user commands such as sinfo and squeue aren’t working.

We’ll keep you informed. Thank you for your understanding.

Kind regards

Status : Solved :green_circle:

start: 2024-12-18T10:00:00Z
end: 2024-12-19T10:00:00Z