[2024] Current issues on HPC Cluster

Adrien.Albert · July 22, 2024, 11:01am

Baobab: Login node down

Dear users,

The login node on Baobab has crashed. The server has been rebooted and is available again.

We apologize for any inconvenience caused.

Thank you for your understanding.

Status : Solved

start: 2024-07-21T18:42:00Z
end: Invalid date

Adrien.Albert · September 12, 2024, 9:29am

Bamboo Scratch Storage Unavailable

Dear HPC Users,

The scratch storage on Bamboo is currently unavailable due to an ongoing issue. Our team has already contacted the provider and we are actively working with them to resolve the situation as quickly as possible.

Please note that the scratch storage has been unmounted on compute and login nodes and will remain unavailable until further notice. We will keep you updated as soon as we have more information on the situation.

Thank you for your understanding,

Best Regards,

Status : Solved

start: 2024-09-10T22:33:00Z
end: 2024-09-26T07:33:00Z

Update: the vendor will carry out an intervention on the 25th of September to fix the issue.

The service is back in production without data loss!

Gael.Rossignol · September 17, 2024, 11:02am

Yggdrasil nodes unavailable

Dear HPC Users,

Yggdrasil is currently experiencing issues with its electrical power supply, which has resulted in a reduced number of available nodes on the cluster.

Electricians are working to resolve the issue.

Thank you for your understanding.

Best Regards,

Status : Solved

start: 2024-09-13T21:30:00Z
end: 2024-09-17T12:24:00Z

Yann.Sagon · September 30, 2024, 7:15am

Dear HPC Users,

Yggdrasil is currently experiencing issues with its electrical power supply, which has resulted in a reduced number of available nodes on the cluster.

This is the same issue as in mid-September. We’ll check with the datacenter manager to find out what is going on.

Thank you for your understanding.

Best Regards,

Status : Partially solved

start: 2024-09-27T22:02:00Z
stop: 2024-09-30T09:45:00Z

edit: The electrical cabling on Yggdrasil was incorrectly modified by someone at Astro without notifying us. The Astro IT team is reverting the change. This is only a partial workaround, as it appears we still have an overload issue that has to be solved.

Yann.Sagon · September 30, 2024, 10:05am

Dear HPC Users,

We’ve set all the nodes on every cluster to drain. Because of an issue with scratch storage, we need to upgrade scripts on every node. No worries: as soon as a node is upgraded, we’ll resume it.

Thank you for your understanding.

Best Regards,

Status : Solved

start: 2024-09-29T22:02:00Z
stop: 2024-10-02T22:02:00Z
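
For users who want to see which nodes are still in drain while the upgrade progresses, here is a minimal sketch. It is not the HPC team's tooling: the function names `parse_draining` and `query_sinfo` and the node names in the usage note are hypothetical, and it assumes `sinfo` is on your PATH and that node states are printed with the `%T` format field.

```python
import subprocess


def parse_draining(sinfo_text):
    """Return the sorted, de-duplicated node names whose state starts with 'drain'.

    Expects one "<node> <state>" pair per line, as printed by
    `sinfo -h -N -o "%N %T"`. A node listed once per partition is counted once.
    """
    nodes = set()
    for line in sinfo_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, state = parts
        # Matches "draining" and "drained", including flagged variants such as "drained*".
        if state.lower().startswith("drain"):
            nodes.add(name)
    return sorted(nodes)


def query_sinfo():
    """Run sinfo in node-oriented mode without a header and return its raw output."""
    result = subprocess.run(
        ["sinfo", "-h", "-N", "-o", "%N %T"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On a login node, `parse_draining(query_sinfo())` would list the nodes that have not been resumed yet. The parser is kept separate from the `sinfo` call so it can be tried on pasted output, for example `parse_draining("cpu001 drained\ncpu002 idle\n")`.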

Yann.Sagon · October 17, 2024, 7:50am

Dear HPC Users,

The Bamboo cluster is currently experiencing issues with quotas on the home filesystem. The symptom is that the reported disk usage may be incorrect. We are investigating.

Thank you for your understanding.

Best Regards,

Status : Solved

start: 2024-10-16T22:02:00Z
stop: 2024-11-07T23:12:00Z

Yann.Sagon · October 21, 2024, 7:23am

Dear HPC Users,

The Baobab cluster is currently experiencing issues with home storage. We restarted the servers this morning and will now investigate the cause of the crash.

Thank you for your understanding.

Best Regards,

Status : Solved

start: 2024-10-20T22:02:00Z
stop: 2024-11-08T23:16:00Z

Gael.Rossignol · December 9, 2024, 9:11am

Bamboo Scratch Storage Unavailable

Dear users,

We have some problems with scratch storage; we are investigating to find a solution.

For the time being, scratch storage is not available on Bamboo.

We’ll keep you informed. Thank you for your understanding.

Update :

Disk enclosures have been flashed to avoid the bug.
All scratch storage is now available in read/write and no data have been lost.

Kind regards

Status : Solved

start: 2024-12-08T14:38:00Z
end: 2025-01-10T13:00:00Z

Yann.Sagon · December 18, 2024, 11:24am

Dear users,

We have some problems with the Slurm controller on Baobab; we are investigating to find a solution.

For the time being, running jobs continue, but no new jobs can start and user commands such as sinfo and squeue aren’t working.

We’ll keep you informed. Thank you for your understanding.

Kind regards

Status : Solved

start: 2024-12-18T10:00:00Z
end: 2024-12-19T10:00:00Z
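
When the controller is unreachable, commands such as sinfo and squeue tend to hang or fail rather than return quickly. The sketch below shows one way a user script could detect that before submitting work. It is an illustration only, not how the HPC team monitors the cluster: `command_responds`, `check_slurm_commands`, and the 10-second timeout are assumptions.

```python
import subprocess


def command_responds(cmd, timeout_s=10.0):
    """Return True if cmd exits with status 0 within timeout_s seconds.

    A timeout, a missing executable, or a non-zero exit status all count
    as "not responding".
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def check_slurm_commands():
    """Probe the two user commands mentioned above and map each to True/False."""
    probes = (["sinfo", "-h"], ["squeue", "-h"])
    return {" ".join(cmd): command_responds(cmd) for cmd in probes}
```

A wrapper script could call `check_slurm_commands()` and postpone job submission while any value is `False`. The probe only reports that a command failed or timed out; it cannot tell a controller outage apart from, say, a network problem on the login node.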