Current issues on Baobab and Yggdrasil

What we know so far:

Due to a serious security vulnerability found in Linux distributions, all access to the clusters has been closed until the fix is deployed.

There is no need to contact us; we will not respond to emails on this subject.

We will keep you informed about progress and the reopening of access.

Baobab has been powered off since this morning due to an over-temperature condition in the Dufour datacentre where Baobab is hosted.

2022-03-14T02:00:00Z

We are investigating an issue with the home storage. It seems the solution will be to restore the content from backup, which will take several days.

We’ll update this post as soon as we have more information.

edit: 24th of March 2022

We had to erase all home content, and we are now restoring individual homes according to the answers you gave in the survey we sent last week.

We also had to replace a couple of power supplies in our storage and network infrastructure.

Many compute nodes in the public-cpu partition seem to crash as soon as a job runs on them. We are investigating.

  • We have already fully restored data and access for ~120 users, totalling 30 TB and many millions of files.

  • We have re-enabled the accounts of ~70 more users without restoring their homes, as requested.

  • ~30 users have a home restore in progress.

  • ~40 users have a home restore pending.

Thanks for your patience.

edit: 25th of March 2022

Restoration of the last homes is still ongoing.

While checking the restore logs, we saw a lot of files that are not worth backing up or restoring:

/home/users/x/xxx/.local/share/Trash/files/MATLAB_RUNTIME/
/home/users/x/xxx/.conda
/home/users/x/xxx/.vscode-server

as well as many other temporary files and old logs.
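Paths like these can be skipped at backup time with an exclude list. A minimal sketch in rsync-style patterns follows; the tool, file name, and exact patterns are hypothetical illustrations, not the cluster's actual backup configuration:

```text
# backup-exclude.txt — rsync-style exclude patterns (hypothetical example)
.local/share/Trash/   # per-user trash folders
.conda/               # conda package caches, easily recreated
.vscode-server/       # VS Code remote server binaries, reinstalled on connect
.cache/               # generic application caches
```

Such a list could then be applied with, e.g., `rsync -a --exclude-from=backup-exclude.txt /home/users/ /backup/home/`.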

Hi,

login2.baobab.hpc.unige.ch has an issue. It rebooted itself at 2022-03-31T15:00:00Z and this morning it is powered off.

We are investigating.

Issue solved: 2022-04-04T12:30:00Z

Issue with storage server home3. 2022-04-04T19:20:00Z

We are investigating.

Symptom: remote I/O error.

We are still investigating: we need to shut down the storage server.

Solved: 2022-04-06T10:00:00Z

Issue with storage server home2: 2022-05-23T07:00:00Z

The server is rebooting. Home not fully working.

Solved: 2022-05-23T07:40:00Z

The storage is now ultra fast :upside_down_face:

Issue on login2: CPU stuck → node rebooted. 2022-06-20T13:00:00Z to 2022-06-20T15:00:00Z

Issue on gpu001.yggdrasil.

2022-06-01T22:00:00Z

We have to send the server back to the vendor; the mainboard is probably broken.

Dear Users,

Start: 2022-07-24T06:07:00Z

End: 2022-07-25T07:00:00Z

One of our Home storage servers on baobab encountered a system problem requiring a reboot.

When writing/reading files or directories, messages like “${path_to_file}: Remote I/O error” appeared in the output.

The problem has been fixed. If the error persists, please contact us using the contact template.

Dear Users,

Start: 2022-08-12T16:00:00Z

End: 2022-08-15T07:00:00Z

We made a mistake in a script and many nodes on the Yggdrasil cluster went into drain mode, preventing new jobs from being scheduled. This is now fixed.

Dear users,

we had an issue with the scratch server on Yggdrasil.

Start: 2022-09-05T12:00:00Z
End: 2022-09-05T12:30:00Z

Dear users,

we had an issue with the scratch server on Baobab.

Start: 2022-09-05T10:00:00Z
End: 2022-09-05T16:15:00Z

Dear users,

we have a storage performance issue on several nodes of Baobab.

Instead of using the fast RDMA transport, the nodes are connecting over TCP, which degrades performance when accessing files from the nodes.

Start: 2022-09-04T22:00:00Z
End: 2022-09-11T22:00:00Z

The only way to fix this issue is to wait for the node to be idle and to restart the BeeGFS client.

Edit: we now run a check on the node itself; if this happens again, the node is set to drain.
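A check of that kind could be sketched as below. This is a hypothetical illustration, not the actual script: it assumes the connection list comes from `beegfs-net` and that fallback connections are reported with a `TCP:` label (real output may differ), and the Slurm drain command is left commented out:

```shell
#!/bin/sh
# Hypothetical sketch: detect a BeeGFS client that fell back from RDMA to TCP
# and drain the node in Slurm so no new jobs land on it.

# Return success (0) only if no storage connection in the given text uses TCP.
all_rdma() {
    ! printf '%s\n' "$1" | grep -q 'TCP:'
}

# In production this would be: conns=$(beegfs-net)
# Here we use a canned example for illustration.
conns='beegfs_storage
=============
storage01 [ID: 1]
   Connections: TCP: 1 (10.0.0.1:8003);'

if all_rdma "$conns"; then
    echo "OK: all BeeGFS connections use RDMA"
else
    echo "WARNING: TCP fallback detected, draining node"
    # scontrol update NodeName="$(hostname -s)" State=DRAIN Reason="BeeGFS TCP fallback"
fi
```

Running this kind of check periodically (e.g. from cron or a Slurm node health check) lets the node drain automatically instead of serving jobs with degraded storage performance.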

Dear users,

we had an issue with the scratch server on Baobab.

Start: 2022-09-16T09:30:00Z
End: 2022-09-16T11:00:00Z

Dear users,

we had an issue with the scratch server on Yggdrasil and had to reboot it.

Start: 2022-11-16T13:30:00Z
End: 2022-11-16T13:35:00Z

Dear users, we had an issue with the login node of Yggdrasil and we had to reboot it.

Start: 2022-11-17T14:00:00Z
End: 2022-11-17T15:20:00Z

Dear users,

we had an issue with login1.yggdrasil: it was impossible to use srun or salloc.

Start: 2022-11-17T15:30:00Z
End: 2022-11-18T10:15:00Z

Best

Dear users,

login1.yggdrasil crashed; we need to restart it.

2022-11-22T13:00:00Z

Dear users,

Yesterday we encountered an issue with scratch2.yggdrasil: while we were replacing a failed disk, the storage became unstable with I/O errors. A server reboot resolved the problems, but we are still in contact with our provider to get a fix for this hardware problem.

Start: 2022-11-30T08:15:00Z
End: 2022-11-30T10:15:00Z

Sorry for the inconvenience,