Baobab has been powered off since this morning due to an over-temperature event in the Dufour datacentre, where Baobab is hosted.
2022-03-14T02:00:00Z
We are investigating: there is an issue with the home storage. It seems the solution will be to restore the content from backup, which will take several days.
We’ll update this post as soon as we have more information.
edit: 24th of March 2022
We had to erase the entire home content, and we are now restoring individual homes according to the answers you gave in the survey we sent last week.
We also had to replace a couple of power supplies for our storage and network infrastructure.
Many compute nodes in the public-cpu partition seem to crash as soon as a job runs on them. We are investigating.
- We have already fully restored data and access for ~120 users, for a total of 30 TB and many millions of files.
- We have re-enabled the accounts of ~70 more users without restoring their home, as they requested.
- There are ~30 users with an ongoing restore of their home.
- There are ~40 users whose home restore is pending.
Thanks for your patience.
edit: 25th of March 2022
The restore of the last homes is still ongoing.
While checking the restore logs, we saw a lot of files that are not worth backing up or restoring:
/home/users/x/xxx/.local/share/Trash/files/MATLAB_RUNTIME/
/home/users/x/xxx/.conda
/home/users/x/xxx/.vscode-server
Many other temporary files and old logs as well.
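To see how much space such directories take in your own home before the next backup, something like the following could be used. It is only a minimal sketch: the `CANDIDATES` list is illustrative and taken from the example paths above, and the function names are hypothetical, not a tool provided on the cluster.

```python
import os
from pathlib import Path

# Illustrative list only: relative paths based on the examples in this post.
CANDIDATES = (
    ".local/share/Trash",
    ".conda",
    ".vscode-server",
)


def dir_size(path: Path) -> int:
    """Return the total size in bytes of the files under `path`."""
    total = 0
    # os.walk does not descend into symlinked directories by default.
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            file_path = os.path.join(root, name)
            try:
                # lstat: a symlink counts as the link itself, not its target.
                total += os.lstat(file_path).st_size
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
    return total


def report_candidates(home: Path, candidates=CANDIDATES) -> dict:
    """Map each candidate directory that exists under `home` to its size in bytes."""
    sizes = {}
    for rel in candidates:
        target = home / rel
        if target.is_dir():
            sizes[rel] = dir_size(target)
    return sizes


if __name__ == "__main__":
    report = report_candidates(Path.home())
    # Largest directories first.
    for rel, size in sorted(report.items(), key=lambda kv: -kv[1]):
        print(f"{size / 1e9:8.2f} GB  {rel}")
```

The script only reports sizes and deletes nothing, so deciding what to clean up remains up to you.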
Hi,
login2.baobab.hpc.unige.ch has an issue. It rebooted itself at 2022-03-31T15:00:00Z and this morning it is powered off.
We are investigating.
Issue solved: 2022-04-04T12:30:00Z
Issue with storage server home3. 2022-04-04T19:20:00Z
We are investigating.
Symptom: remote I/O error.
We are still investigating: we need to shut down the storage server.
Solved 2022-04-06T10:00:00Z
Issue with storage server home2: 2022-05-23T07:00:00Z
The server is rebooting. Home is not fully working.
Solved: 2022-05-23T07:40:00Z
The storage is now ultra fast
Issue on login2: CPU stuck, node rebooted. 2022-06-20T13:00:00Z to 2022-06-20T15:00:00Z
Issue on gpu001.yggdrasil.
2022-06-01T22:00:00Z
We have to send the server back to the vendor; the mainboard is probably broken.
Dear Users,
Start: 2022-07-24T06:07:00Z
End: 2022-07-25T07:00:00Z
One of our home storage servers on Baobab encountered a system problem requiring a reboot.
When reading or writing files or directories, messages such as “${path_to_file} Remote I/O error” appeared in the output.
The problem has been fixed. If the error persists, please contact us using the contact template.
Dear Users,
Start: 2022-08-12T16:00:00Z
End: 2022-08-15T07:00:00Z
We made a mistake in a script and many nodes on the Yggdrasil cluster went into drain mode, preventing new jobs from being scheduled. This is now fixed.
Dear users,
we had an issue with the scratch server on Yggdrasil.
Start: 2022-09-05T12:00:00Z
End: 2022-09-05T12:30:00Z
Dear users,
we had an issue with the scratch server on Baobab.
Start: 2022-09-05T10:00:00Z
End: 2022-09-05T16:15:00Z
Dear users,
we have a storage performance issue on several Baobab nodes. Instead of using the fast RDMA method, the nodes are connecting over TCP, which leads to a loss of performance when accessing files from the nodes.
Start: 2022-09-04T22:00:00Z
End: 2022-09-11T22:00:00Z
The only way to fix this issue is to wait for the node to be idle and to restart the BeeGFS client.
Edit: we now have a check on the node itself and if this happens again, the node is set to drain.
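For the curious, a check of this kind could look roughly like the sketch below. It is not the script we actually run. It assumes the node-side connection report resembles the output of the `beegfs-net` tool, and the sample text, server names, addresses and function names are all hypothetical. Management connections are skipped on the assumption that they normally use TCP anyway. The Slurm command is only built, never executed.

```python
import re

# Hypothetical sample, assumed to resemble `beegfs-net` output.
SAMPLE = """\
mgmt_nodes
=============
mgmt01 [ID: 1]
   Connections: TCP: 1 (192.0.2.1:8008);

meta_nodes
=============
meta01 [ID: 1]
   Connections: RDMA: 1 (192.0.2.2:8005);

storage_nodes
=============
storage01 [ID: 1]
   Connections: RDMA: 2 (192.0.2.3:8003);
storage02 [ID: 2]
   Connections: TCP: 1 (192.0.2.4:8003);
"""

SECTION_RE = re.compile(r"^(\w+)_nodes\s*$")
SERVER_RE = re.compile(r"^(\S+) \[ID: \d+\]\s*$")
CONN_RE = re.compile(r"^\s+Connections:\s*(.*)$")


def tcp_fallbacks(net_output, checked_sections=("meta", "storage")):
    """Return (section, server) pairs whose connection list includes TCP."""
    section = None
    server = None
    found = []
    for line in net_output.splitlines():
        if m := SECTION_RE.match(line):
            # New section header, e.g. "storage_nodes".
            section, server = m.group(1), None
        elif m := SERVER_RE.match(line):
            server = m.group(1)
        elif (m := CONN_RE.match(line)) and section in checked_sections and server:
            if re.search(r"\bTCP:\s*\d+", m.group(1)):
                found.append((section, server))
    return found


def drain_command(node, reason="BeeGFS client fell back to TCP"):
    """Build (but do not run) the Slurm command that sets `node` to drain."""
    return ["scontrol", "update", f"NodeName={node}", "State=DRAIN", f"Reason={reason}"]


if __name__ == "__main__":
    print(tcp_fallbacks(SAMPLE))
    print(drain_command("node001"))
```

Draining rather than restarting the client immediately matches the fix described above: running jobs finish, the node becomes idle, and the BeeGFS client can then be restarted safely.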
Dear users,
we had an issue with the scratch server on Baobab.
Start: 2022-09-16T09:30:00Z
End: 2022-09-16T11:00:00Z
Dear users,
we had an issue with the scratch server on Yggdrasil and we had to reboot it.
2022-11-16T13:30:00Z→2022-11-16T13:35:00Z
Dear users, we had an issue with the login node of Yggdrasil and we had to reboot it.
2022-11-17T14:00:00Z→2022-11-17T15:20:00Z
Dear users,
we had an issue with login1.yggdrasil: it was impossible to use srun or salloc.
2022-11-17T15:30:00Z→2022-11-18T10:15:00Z
Best
Dear users,
login1.yggdrasil crashed; we need to restart it.
2022-11-22T13:00:00Z
Dear users,
Yesterday we encountered an issue with scratch2.yggdrasil: while we were replacing a failed disk, the storage became unstable with I/O errors. A server reboot resolved the problems, but we are still in contact with our provider to get a fix for this hardware problem.
2022-11-30T08:15:00Z→2022-11-30T10:15:00Z
Sorry for the inconvenience.