Primary information
Username: serpolla
Cluster: baobab
Description
I noticed that some of my jobs were failing when they tried to write files to scratch.
They were all running on cpu331.baobab,
and indeed, after opening a shell on that compute node, I found that scratch is unavailable there:
(baobab)-[serpolla@login1 ~]$ srun -p shared-cpu --nodelist=cpu331 --pty bash -i
srun: job 12696695 queued and waiting for resources
srun: job 12696695 has been allocated resources
(baobab)-[serpolla@cpu331 ~]$ ls /srv/beegfs/scratch/
<Empty dir>
So far, excluding just that one node seems to be sufficient, but I do not know whether other nodes are affected by the same issue.
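
In case it helps others, here is a minimal sketch of a guard that could go at the top of a job script so the job stops early when scratch looks empty or missing on its node. The function name `scratch_ok` is hypothetical. Treating an empty directory as "unavailable" is an assumption based on the `ls` output above, so a scratch that is legitimately empty would also trip it.

```shell
#!/bin/bash
# Hypothetical helper (not part of any cluster tooling):
# returns 0 if the directory exists and contains at least one entry,
# 1 if it is missing or empty (which is how the broken mount appears above).
scratch_ok() {
    local dir="$1"
    [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

# Possible use at the top of a job script (path taken from the session above):
# if ! scratch_ok /srv/beegfs/scratch; then
#     echo "scratch looks unavailable on $(hostname)" >&2
#     exit 1
# fi
```

On the submission side, Slurm's `--exclude=cpu331` option for `sbatch` and `srun` keeps jobs off that node in the meantime.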