It looks like the I/O issues are back.
According to the Slurm emails I received, the problem already occurred yesterday evening (Sep 19, between 21:30 and 21:45). Many jobs failed in this time window, including jobs that had only just started because the queue had free slots at that time.
In many cases, not even a Slurm output file was created. I guess Slurm itself ran into I/O errors when trying to write it.
At the moment, everything seems to be working again.
A colleague (user: bavera) has just told me that one job is missing from his Slurm array (Slurm ID: 28015777_6), and there is no Slurm output file for this job either. The other jobs in the array started this morning. He has now resubmitted the missing job.
Nevertheless, I’d like to ask the HPC team to have a look at the scratch on yggdrasil, to check whether there are more issues than people have noticed so far and whether any action is needed to make sure that no jobs get lost in the future.
Thanks,
Matthias