Issue with scratch storage on Yggdrasil

It looks like the I/O issues are back.

According to the slurm emails I got, it already happened yesterday evening (Sep 19, between 21:30 and 21:45). This caused a lot of jobs to fail in that time range, including jobs that had just been started because the queue was free at the time.
In many cases not even a slurm output file was created; I assume slurm itself ran into I/O errors when trying to write them.
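
In case it helps with narrowing things down, a rough way to list the jobs that failed in that window (the year is a placeholder here, and this is only a sketch) would be something like:

    # jobs of the current user that ended in a failed state during the window
    sacct -S <YEAR>-09-19T21:30 -E <YEAR>-09-19T21:45 \
          --state=FAILED,NODE_FAIL \
          --format=JobID,JobName,State,Start,End,NodeList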

At the moment everything looks to be working again.
A colleague (user: bavera) just told me that one job is missing from his slurm array (slurm ID: 28015777_6), again with no slurm output for this missing job either. The other jobs in the array started this morning. He has now resubmitted the missing job.
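
For reference, assuming the original submission script is still at hand (the script name below is just a placeholder), resubmitting only the missing array index can be done along these lines:

    # resubmit only index 6 of the array; the command-line option
    # overrides any --array line inside the script
    sbatch --array=6 my_array_job.sh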

Nevertheless, I'd ask the HPC team to have a look at the scratch storage on Yggdrasil: whether there are further issues, what has been noticed so far, and whether some action is needed to make sure that no jobs get lost in the future.

Thanks,
Matthias

I was using the cluster last night and can confirm that scratch on Yggdrasil was unavailable around that time: ls and cd in any scratch directory returned errors. After a couple of minutes everything went back to normal.

Dear All,

We are currently investigating this problem. We have seen a significant increase in I/O activity over this period, and we are working to identify the users responsible and contact them to understand what is going wrong.

We apologize for any inconvenience caused.

Best Regards,