Yggdrasil storage issue 05/09/2022

I think there might be a problem with storage on Yggdrasil. My two jobs (12579916, 12580175) that read from scratch got the following error:

OSError: [Errno 70] Communication error on send: <file_path>

Also, when logging in to the login node, it takes a long time before bash starts.

We had to restart a scratch storage server this afternoon due to an issue with a RAID card, and this may have disrupted the scratch storage for a couple of minutes. This should be solved now. Apologies for the inconvenience. To help us handle this situation better in the future: did your job die, or was it only blocked for a moment before resuming?


I am encountering issues on Baobab as well. For example:

(baobab)-[drozd@login2 HET_logbook_202209]$ cat /srv/beegfs/scratch/users/d/drozd/HET_logbook_202209/logs/FM-59701671-131.out
cat: /srv/beegfs/scratch/users/d/drozd/HET_logbook_202209/logs/FM-59701671-131.out: Remote I/O error

Hi, indeed. This is unrelated, but it is an issue as well :( I’m working on it.

It was a Python script, so it crashed when the OSError (a subclass of Exception) was raised. This could potentially be handled in the code, but that is unlikely to be done in code written for research purposes.
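For anyone who does want to handle this in a script, here is a minimal sketch of retrying a read when the file system reports a transient error. It is only an illustration: the function name read_with_retry, the set of error codes treated as transient, and the attempt count and delay are all assumptions rather than anything recommended by the cluster admins. On Linux, errno 70 is ECOMM ("Communication error on send") and errno 121 is EREMOTEIO ("Remote I/O error").

```python
import errno
import time

# Assumption: these error codes are treated as transient storage errors.
# The numeric fallbacks are the Linux values, used if the name is missing.
TRANSIENT_ERRNOS = {
    getattr(errno, "ECOMM", 70),       # Communication error on send
    getattr(errno, "EREMOTEIO", 121),  # Remote I/O error
}


def read_with_retry(path, attempts=5, delay=30.0, opener=open, sleep=time.sleep):
    """Read a file, retrying when a transient storage error is raised.

    Hypothetical helper: `attempts` and `delay` are arbitrary defaults.
    `opener` and `sleep` are parameters only so the helper can be tested.
    """
    for attempt in range(1, attempts + 1):
        try:
            with opener(path, "rb") as fh:
                return fh.read()
        except OSError as exc:
            # Re-raise anything that is not transient, or the last failure.
            if exc.errno not in TRANSIENT_ERRNOS or attempt == attempts:
                raise
            sleep(delay)
```

A call such as read_with_retry("<file_path>") would then wait and try again instead of crashing on the first error. Whether waiting a few minutes is enough depends on how long the storage outage lasts, so the job can still fail after the last attempt.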

Hi, yes, this is true, it is too complicated. What I’ll check is whether there is a way for BeeGFS to block access until the service is restored, like we do with NFS. That way, the application does not need to do anything special.

Hi, this is fixed. The RAID card had crashed, and all the scratch disks on this server were unavailable.