Current issues on Baobab and Yggdrasil

Yann.Sagon · March 13, 2020, 8:43am

This post will contain the current and past issues related to the HPC infrastructure Baobab and Yggdrasil.

Yann.Sagon · March 13, 2020, 8:47am

2020-03-13T00:29:00Z - Error accessing some files on home directories.

The storage for the home directories is composed of four servers running eight services. One of the servers crashed, and all the files hosted on it were no longer accessible.

2020-03-13T08:38:00Z - The service is restored

Luca.Capello · April 1, 2020, 7:54am

Hi there,

2020-03-31T22:13:15Z - SlurmDBD crashes.

Priority calculations and some accounting commands (e.g. sacct) give an error.

2020-04-01T07:51:19Z - The service is restored.

Thx, bye,
Luca

Luca.Capello · April 1, 2020, 12:37pm

Hi there,

2020-04-01T07:56:27Z - SlurmDBD crashes.

Priority calculations and some accounting commands (e.g. sacct) give an error.

2020-04-01T12:33:19Z - The service is restored.

Thx, bye,
Luca

Luca.Capello · April 1, 2020, 2:58pm

Hi there,

2020-04-01T13:06:22Z - SlurmDBD crashes.

Priority calculations and some accounting commands (e.g. sacct) give an error.

2020-04-01T14:56:40Z - The service is restored.

Thx, bye,
Luca

Yann.Sagon · April 5, 2020, 9:17am

2020-04-04T07:40:00Z → 2020-04-05T09:15:00Z - Baobab Login2 unreachable.

Baobab login node2 was powered off automatically on Saturday morning due to a thermal issue. It has now been restarted and is available again.

Yann.Sagon · April 6, 2020, 9:18am

2020-04-05T17:35:00Z - Baobab Login2 unreachable.

I restarted it this morning. We have a lot of warnings about high temperature on one of the CPU packages. We need to investigate, but this requires shutting down the server. We are currently preparing a replacement login node and will let you know when we do the swap.

Yann.Sagon · April 14, 2020, 3:04pm

2020-04-06T22:00:00Z → 2020-04-13T22:00:00Z - Some nodes had issues accessing Baobab and the private storage “space”.

slurmstepd: error: execve(): matlab: No such file or directory

It’s now fixed.

Best

Yann.Sagon · May 28, 2020, 7:31am

2020-05-28T07:20:00Z → 2020-05-28T09:45:00Z

Cooling issue in the datacentre. All nodes down.

Now fixed. Cluster in production again.

Yann.Sagon · June 2, 2020, 10:59am

2020-06-01T22:00:00Z

Some GPU and compute nodes are down due to a power outage in the DC. More on this soon.

Yann.Sagon · June 4, 2020, 2:24pm

A post was split to a new topic: Performance issue with GPU nodes on cui-gpu-EL7

Luca.Capello · June 18, 2020, 9:29am

Hi there,

2020-06-17T23:08:13Z - gpu[002,012] not responding.

The machines are not reachable remotely: either a power problem (maybe related to the recent GPU upgrade, cf. “ReqNodeNotAvail - normal behavior?”) or they are completely stuck at BIOS level.

Time to go to the UniDufour DC…

2020-06-18T15:41:00Z - gpu012 back into production, gpu002 waiting for PSU2 replacement.

2020-07-06T20:10:00Z - gpu002 back into production as well.

Thx, bye,
Luca

Luca.Capello · June 18, 2020, 12:54pm

Hi there,

2020-06-18T09:41:00Z - server3 power redundancy loss.

Reason still unknown: PSU2 failed 20 seconds after taking over the full load from PSU1.

Given that this machine is a member of the ${HOME} storage, some ${HOME} directories were not available during the issue.

2020-06-18T09:44:00Z - The service is restored.

Thx, bye,
Luca

Luca.Capello · June 18, 2020, 3:20pm

A post was split to a new topic: gpu[002,012] still down at 16:00 on 2020-06-18

Luca.Capello · June 18, 2020, 3:29pm

Hi there,

2020-06-18T13:59:00Z - leaf7 crashes.

Reason still unknown. The InfiniBand connection was lost for some nodes only; after a hard power cycle everything was back to normal.

2020-06-18T15:29:00Z - The service is restored.

Thx, bye,
Luca

Massimo.Brero · June 22, 2020, 9:34am

Dear all,

One of the metadata servers of the BeeGFS storing the $HOME directories crashed on Saturday 20 June 2020 at 22:44. It was restored this morning around 10:00.
During this time, some of the $HOME directories were unavailable, resulting in this message if you tried to log in:
Could not chdir to home directory /home/users/x/<USERNAME>: Communication error on send
or this one if you were already connected:
Communication error on send

If you had jobs running, they probably failed and need to be resubmitted to Slurm.
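To check which of your jobs were hit before resubmitting, something along these lines may help. It is only a sketch: the helper names (`sacct_command`, `run_sacct`, `failed_jobs`) and the list of states treated as failures are my own choices, and the outage window is assumed to be in the cluster's local time. It relies only on standard `sacct` options (`--allocations`, `--noheader`, `--parsable2`, `--starttime`, `--endtime`, `--format`).

```python
import subprocess

# Job states treated as "did not finish normally" in this sketch (an assumption;
# adjust the tuple to your needs).
FAILED_STATES = ("FAILED", "NODE_FAIL", "CANCELLED", "TIMEOUT")


def sacct_command(start, end):
    """Build an sacct call listing your own jobs that overlapped [start, end]."""
    return [
        "sacct",
        "--allocations",   # one line per job, not one per job step
        "--noheader",
        "--parsable2",     # fields separated by '|', no trailing delimiter
        "--starttime", start,
        "--endtime", end,
        "--format", "JobID,JobName,State",
    ]


def run_sacct(start, end):
    """Run sacct on a machine where Slurm is available and return its raw output."""
    result = subprocess.run(
        sacct_command(start, end), capture_output=True, text=True, check=True
    )
    return result.stdout


def failed_jobs(sacct_output):
    """Return (job_id, job_name, state) for every job that ended in a failure state."""
    jobs = []
    for line in sacct_output.splitlines():
        if not line.strip():
            continue
        # Split from both ends so a '|' inside the job name does not break parsing.
        job_id, rest = line.split("|", 1)
        job_name, state = rest.rsplit("|", 1)
        # The state may carry a suffix, e.g. "CANCELLED by 1000".
        if state.split(" ", 1)[0] in FAILED_STATES:
            jobs.append((job_id, job_name, state))
    return jobs
```

On a login node, `failed_jobs(run_sacct("2020-06-20T22:44", "2020-06-22T10:00"))` would then list the candidates for resubmission with `sbatch`.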

We are sorry for the inconvenience.

All the best,

Massimo

Luca.Capello · July 10, 2020, 12:33pm

Hi there,

2020-07-10T11:48:00Z - slurmctld crashes.

While adding a GPU card back to gpu005, I forgot to restart slurmctld even though the configuration files had changed, and this led to a segfault.

2020-07-10T12:25:00Z - The service is restored.

Thx, bye,
Luca

Luca.Capello · July 23, 2020, 2:24pm

Hi there,

2020-07-23T04:08:46Z - gpu[002] rebooted.

After a quick investigation, it is again a PSU problem (cf. Current issues on Baobab). The machine is nevertheless usable while we check with DALCO (our upstream supplier).

Thx, bye,
Luca

Massimo.Brero · August 4, 2020, 11:31am

3 posts were split to a new topic: Question about mono-EL7 and shared-EL7 partitions usage

Yann.Sagon · September 24, 2020, 7:57am

Dear all,

One of the metadata servers of the BeeGFS storing the $HOME directories crashed at around 2020-09-23T14:04:00Z. The service was restored at around 2020-09-23T19:06:00Z.

During this time, some of the files in your $HOME were unavailable, resulting in this message if you tried to log in or access the files:
Could not chdir to home directory /home/users/x/<USERNAME>: Communication error on send
or this one if you were already connected:
Communication error on send

If you had jobs running, they probably failed and need to be resubmitted to Slurm.

We are sorry for the inconvenience.

All the best,

HPC team

edit: as a side effect, all the nodes were put in drain, which prevented new jobs from being processed. This has also been fixed since 2020-09-23T22:00:00Z.
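If jobs stay pending after an incident like this, one way to see whether nodes are still in drain is to look at the node states reported by `sinfo`. The sketch below is an illustration only: the helper names are made up, and it assumes the output of `sinfo --Node --noheader --format "%N %T"` (node name and extended state, one node per line).

```python
import subprocess


def sinfo_command():
    """Build an sinfo call printing '<node> <state>' for every node."""
    return ["sinfo", "--Node", "--noheader", "--format", "%N %T"]


def run_sinfo():
    """Run sinfo on a machine where Slurm is available and return its raw output."""
    result = subprocess.run(
        sinfo_command(), capture_output=True, text=True, check=True
    )
    return result.stdout


def drained_nodes(sinfo_output):
    """Return the sorted names of nodes whose state is drained or draining."""
    nodes = set()
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, state = parts
        # Matches "drained" and "draining", including flagged forms such as "drained*".
        if state.startswith("drain"):
            nodes.add(name)  # a node listed in several partitions is counted once
    return sorted(nodes)
```

Calling `drained_nodes(run_sinfo())` on a login node returns the affected node names; an empty list means no node is currently in drain.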