2021-04-27T10:34:00Z - no space left on login2.baobab:/.
This generates the following error: cannot create temp file for here-document: No space left on device.
Investigation in progress.
A 200GB file in /tmp was the culprit; it has been deleted.
Please remember that the login* nodes are only there to copy data to/from the cluster and launch jobs, not for calculations (cf. hpc:hpc_clusters [eResearch Doc]).
The storage on Baobab was unresponsive today, 2021-05-02T22:00:00Z. This impacts the user experience, for example when using the login node or for running jobs that access the storage.
The reason was a user who submitted a job array in which every job performed intensive IO operations, such as gzip and gunzip of big files. This is clearly something that should be avoided in compute jobs, especially when done dozens of times in parallel.
Best practice:
use the scratch space for temporary files; at least you won't disturb the user experience too much.
use the local scratch available on every node, or even the memory (tmpfs); see the sketch below.
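As a minimal sketch (the partition name, file names and local scratch path below are assumptions, adapt them to your setup), such a job could stage the compression work on node-local scratch and only touch the shared storage to copy data in and out:

#!/bin/bash
#SBATCH --job-name=gzip-local
#SBATCH --partition=shared-cpu    # assumed partition name
#SBATCH --time=00:30:00

# Assumed node-local scratch path; a tmpfs such as /dev/shm would also work.
WORKDIR="/scratch/${SLURM_JOB_ID}"
mkdir -p "$WORKDIR"

# Copy the input once, do the heavy IO locally, copy the result back once.
cp "$HOME/data/bigfile" "$WORKDIR/"
gzip "$WORKDIR/bigfile"
cp "$WORKDIR/bigfile.gz" "$HOME/results/"

# Clean up the node-local scratch.
rm -rf "$WORKDIR"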
System I/O error:
Error while reading file
Reason: Remote I/O error
(call to fgets() returned error code 121)
The files are OK and it doesn't happen 100% of the time, so I guess it must be some kind of system error.
(The error you see here is generated by GROMACS, but I got a similar one from bash; the script I am running is in fact a bash script that calls GROMACS multiple times.)
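Not a fix, but as a stop-gap sketch, assuming the failures really are transient: a small wrapper that retries a command a few times before giving up (the gmx invocation below is only a placeholder for the actual calls in the script):

#!/bin/bash
# Retry a command up to three times, pausing between attempts.
run_with_retry() {
    local tries=3 i
    for i in $(seq 1 "$tries"); do
        "$@" && return 0
        echo "attempt $i/$tries failed for: $*" >&2
        sleep 10
    done
    return 1
}

# Hypothetical GROMACS step; replace with the real invocations.
run_with_retry gmx mdrun -deffnm run1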
Hi,
Just to add to what Maurice has mentioned, we (@Manuel.Guth) are also seeing some odd behaviour on Baobab at the moment when accessing files on the scratch disk.
To check if the issue was storage, I just tried to call up the quota with beegfs-ctl and got the following error:
[raine@login2:~]$ beegfs-ctl --getquota --gid share_atlas --mount=/srv/beegfs/scratch --connAuthFile=/etc/beegfs/connauthfile
Quota information for storage pool Default (ID: 1):
(0) 16:13:12 DirectWorker1 [Messaging (RPC)] >> Communication error: Recv(): Hard disconnect from 192.168.104.14:8203. SysErr: Connection reset by peer; Peer: beegfs-storage scratch1.cluster [ID: 23014]. (Message type: GetQuotaInfo (2097))
(0) 16:13:12 Worker1 [Messaging (RPC)] >> Communication error: Recv(): Hard disconnect from 192.168.104.14:8203. SysErr: Connection reset by peer; Peer: beegfs-storage scratch1.cluster [ID: 23014]. (Message type: GetQuotaInfo (2097))
(0) 16:13:12 Worker3 [Messaging (RPC)] >> Communication error: Recv(): Hard disconnect from 192.168.104.14:8203. SysErr: Connection reset by peer; Peer: beegfs-storage scratch1.cluster [ID: 23014]. (Message type: GetQuotaInfo (2097))
(0) 16:13:12 Worker3 [Messaging (RPC)] >> Communication error: Recv(): Hard disconnect from 192.168.104.14:8203. SysErr: Connection reset by peer; Peer: beegfs-storage scratch1.cluster [ID: 23014]. (Message type: GetQuotaInfo (2097))
(0) 16:13:12 DirectWorker1 [Messaging (RPC)] >> Communication error: Recv(): Hard disconnect from 192.168.104.14:8203. SysErr: Connection reset by peer; Peer: beegfs-storage scratch1.cluster [ID: 23014]. (Message type: GetQuotaInfo (2097))
On a second try it worked without an issue, so I wonder if the issue is with some of the BeeGFS drives/InfiniBand?
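For reference (assuming the standard BeeGFS client utilities are installed on the login node, and mirroring the --connAuthFile used above), the reachability of the servers can be checked with:

# Ping all registered BeeGFS management, metadata and storage servers.
beegfs-check-servers

# List the registered storage nodes.
beegfs-ctl --listnodes --nodetype=storage --connAuthFile=/etc/beegfs/connauthfile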
The scratch partition on Baobab is incredibly slow.
I have also noticed some strange behaviour in runs I started this morning: some created the SLURM err and out files but they are empty and the run died there, others didn't even create the SLURM files, while others ran fine (it was a big job array of small runs).
Might the two things be linked (some kind of timeout in file I/O)?
It seems both scratch servers were in a bad state (errors in dmesg and no beegfs-storage logs). I've restarted the servers. Hopefully this will correct the issue.