2021-04-27T10:34:00Z - no space left on login2.baobab:/.
This generates the following error: cannot create temp file for here-document: No space left on device.
Investigation in progress.
A 200GB file in /tmp was the culprit; it has been deleted.
Please remember that the login* nodes are only there to copy data to/from the cluster and launch jobs, not for calculations (cf. hpc:hpc_clusters [eResearch Doc]).
The storage on Baobab was unresponsive today, 2021-05-02T22:00:00Z. This impacts the user experience, for example when using the login node or for running jobs that access the storage.
The reason was a user who submitted a job array in which every job performed intensive IO operations, such as gzip and gunzip of big files. This is clearly something that should be avoided in compute jobs, especially when done dozens of times in parallel.
Best practice:
use the scratch space for temporary files; at least you won't disturb the user experience too much.
use the local scratch available on every node, or even the memory (tmpfs); see the sketch below.
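As a minimal sketch (the partition name, file names and local scratch path below are assumptions, adapt them to your setup), such a job could stage the compression work on node-local scratch and only touch the shared storage to copy data in and out:

#!/bin/bash
#SBATCH --job-name=gzip-local
#SBATCH --partition=shared-cpu    # assumed partition name
#SBATCH --time=00:30:00

# Assumed node-local scratch path; a tmpfs such as /dev/shm would also work.
WORKDIR="/scratch/${SLURM_JOB_ID}"
mkdir -p "$WORKDIR"

# Copy the input once, do the heavy IO locally, copy the result back once.
cp "$HOME/data/bigfile" "$WORKDIR/"
gzip "$WORKDIR/bigfile"
cp "$WORKDIR/bigfile.gz" "$HOME/results/"

# Clean up the node-local scratch.
rm -rf "$WORKDIR"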
System I/O error:
Error while reading file
Reason: Remote I/O error
(call to fgets() returned error code 121)
The files are OK and it doesn't happen 100% of the time, so I guess it must be some kind of system error.
(The error you see here is generated by GROMACS, but I got a similar one from bash; the script I am running is in fact a bash script that calls GROMACS multiple times.)
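Not a fix, but as a stop-gap sketch, assuming the failures really are transient: a small wrapper that retries a command a few times before giving up (the gmx invocation below is only a placeholder for the actual calls in the script):

#!/bin/bash
# Retry a command up to three times, pausing between attempts.
run_with_retry() {
    local tries=3 i
    for i in $(seq 1 "$tries"); do
        "$@" && return 0
        echo "attempt $i/$tries failed for: $*" >&2
        sleep 10
    done
    return 1
}

# Hypothetical GROMACS step; replace with the real invocations.
run_with_retry gmx mdrun -deffnm run1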
Hi,
Just to add to what Maurice has mentioned, we (@Manuel.Guth) are also seeing some odd behaviour on Baobab at the moment when accessing files on the scratch disk.
To check if the issue was storage, I just tried to call up the quota with beegfs-ctl and got the following error:
[raine@login2:~]$ beegfs-ctl --getquota --gid share_atlas --mount=/srv/beegfs/scratch --connAuthFile=/etc/beegfs/connauthfile
Quota information for storage pool Default (ID: 1):
(0) 16:13:12 DirectWorker1 [Messaging (RPC)] >> Communication error: Recv(): Hard disconnect from 192.168.104.14:8203. SysErr: Connection reset by peer; Peer: beegfs-storage scratch1.cluster [ID: 23014]. (Message type: GetQuotaInfo (2097))
(0) 16:13:12 Worker1 [Messaging (RPC)] >> Communication error: Recv(): Hard disconnect from 192.168.104.14:8203. SysErr: Connection reset by peer; Peer: beegfs-storage scratch1.cluster [ID: 23014]. (Message type: GetQuotaInfo (2097))
(0) 16:13:12 Worker3 [Messaging (RPC)] >> Communication error: Recv(): Hard disconnect from 192.168.104.14:8203. SysErr: Connection reset by peer; Peer: beegfs-storage scratch1.cluster [ID: 23014]. (Message type: GetQuotaInfo (2097))
(0) 16:13:12 Worker3 [Messaging (RPC)] >> Communication error: Recv(): Hard disconnect from 192.168.104.14:8203. SysErr: Connection reset by peer; Peer: beegfs-storage scratch1.cluster [ID: 23014]. (Message type: GetQuotaInfo (2097))
(0) 16:13:12 DirectWorker1 [Messaging (RPC)] >> Communication error: Recv(): Hard disconnect from 192.168.104.14:8203. SysErr: Connection reset by peer; Peer: beegfs-storage scratch1.cluster [ID: 23014]. (Message type: GetQuotaInfo (2097))
On a second try it worked without an issue, so I wonder if the issue is with some of the BeeGFS drives/InfiniBand?
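For reference (assuming the standard BeeGFS client utilities are installed on the login node, and mirroring the --connAuthFile used above), the reachability of the servers can be checked with:

# Ping all registered BeeGFS management, metadata and storage servers.
beegfs-check-servers

# List the registered storage nodes.
beegfs-ctl --listnodes --nodetype=storage --connAuthFile=/etc/beegfs/connauthfile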
The scratch partition on Baobab is incredibly slow.
I have also noticed some strange behaviour in runs I started this morning: some created the SLURM err and out files but they are empty and the run died there, others didn't even create the SLURM files, while others ran fine (it was a big job array of small runs).
Might the two things be linked (some kind of timeout in file I/O)?
It seems both scratch servers were in a bad state (errors in dmesg and no beegfs-storage logs). I've restarted the servers. Hopefully this will correct the issue.