Yggdrasil storage issue 05/09/2022

maciej.falkiewicz · September 5, 2022, 12:35pm

I think there might be a problem with storage on Yggdrasil. My two jobs (12579916, 12580175) that read from scratch got the following error:

OSError: [Errno 70] Communication error on send: <file_path>

Also, when logging in to the login node, it takes a lot of time before bash is launched.

Kind regards,
Maciej Falkiewicz

Yann.Sagon · September 5, 2022, 1:56pm

Hi,

We had to restart a scratch storage server this afternoon because of an issue with a RAID card, and this may have disrupted the scratch storage for a couple of minutes. This should now be solved. Apologies for the inconvenience. So that we can handle this situation better in the future: did your job die, or was it only blocked for a moment before resuming?

Best

David.Droz · September 5, 2022, 2:54pm

I am encountering issues on Baobab as well. For example:

(baobab)-[drozd@login2 HET_logbook_202209]$ cat /srv/beegfs/scratch/users/d/drozd/HET_logbook_202209/logs/FM-59701671-131.out
cat: /srv/beegfs/scratch/users/d/drozd/HET_logbook_202209/logs/FM-59701671-131.out: Remote I/O error

Yann.Sagon · September 5, 2022, 3:46pm

Hi, indeed. It is unrelated, but it is an issue as well :( I’m working on it.

maciej.falkiewicz · September 5, 2022, 3:55pm

It was a Python script, so it crashed when the OSError (a subclass of Exception) was raised. This could in principle be handled in the code, but that is unlikely to be done in code written for research purposes.
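
For anyone who does want to handle this in their own script, here is a minimal sketch of a retry wrapper. It is only an illustration, not something recommended by the HPC team: the function name `read_with_retry`, its parameters, and the choice of which errno values count as transient (70 from the traceback above, plus 121 for "Remote I/O error" on Linux) are all assumptions.

```python
import errno
import time

# Assumption: these errno values are treated as transient storage errors.
# On Linux, 70 is ECOMM ("Communication error on send") and
# 121 is EREMOTEIO ("Remote I/O error").
TRANSIENT_ERRNOS = {
    getattr(errno, "ECOMM", 70),
    getattr(errno, "EREMOTEIO", 121),
}


def read_with_retry(path, attempts=5, delay=30.0, sleep=time.sleep, opener=open):
    """Read a file as bytes, retrying on transient storage errors.

    `sleep` and `opener` are hypothetical hooks so the function can be
    exercised without a real outage.
    """
    for attempt in range(1, attempts + 1):
        try:
            with opener(path, "rb") as handle:
                return handle.read()
        except OSError as exc:
            # Re-raise errors that are not in the transient set, and
            # give up once the last attempt has failed.
            if exc.errno not in TRANSIENT_ERRNOS or attempt == attempts:
                raise
            sleep(delay)
```

With the defaults above, a read that keeps failing is retried for about two minutes in total before the original OSError is raised again, which is in the range of the "couple of minutes" outage described earlier. Errors such as a missing file are not retried.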

Yann.Sagon · September 5, 2022, 4:09pm

Hi, yes, that is true, it would be too complicated. What I’ll check is whether there is a way for BeeGFS to block access until the service is restored, like we do with NFS. That way, the application would not need to do anything special.

Yann.Sagon · September 5, 2022, 4:10pm

Hi, this is fixed. The RAID card had crashed, and none of the scratch disks on this server were available.