We had to restart a scratch storage server this afternoon because of a problem with a RAID card, and this may have disrupted the scratch storage for a couple of minutes. It should now be resolved. Apologies for the inconvenience. To help us handle this situation better in the future: did your job die, or was it only blocked for a moment before resuming?
It was a Python script, so it crashed when an OSError (a type of Exception) was raised. This could potentially be handled, but that is unlikely to be done in code written for research purposes.
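For illustration only, handling it could look roughly like the sketch below. The helper name `retry_on_oserror` and its parameters (`attempts`, `delay`, `sleep`) are hypothetical and not part of the actual script; it assumes the failing I/O call can safely be repeated.

```python
import time


def retry_on_oserror(func, attempts=5, delay=1.0, sleep=time.sleep):
    """Call func(); if it raises OSError, wait and try again.

    Hypothetical helper: the OSError is re-raised once `attempts`
    calls have failed, so a long outage still stops the job.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except OSError:
            if attempt == attempts:
                raise  # give up after the last attempt
            # Wait a little longer after each failure (linear backoff).
            sleep(delay * attempt)
```

It would be used by wrapping each I/O call, for example `retry_on_oserror(lambda: open(path).read())` with `path` pointing at a file on scratch. Doing this for every read and write in a script is what makes it impractical for research code.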
Hi, yes, that is true, it is too complicated. What I'll check is whether there is a way for BeeGFS to block access until the service is restored, as we do with NFS. That way, nothing special needs to be done in the application.