This post contains the current and past issues related to the HPC infrastructures Baobab and Yggdrasil.
2020-03-13T00:29:00Z - Error accessing some files on home directories.
The home directory storage is composed of four servers running eight services. One of the servers crashed, and all the files it hosted became inaccessible.
2020-03-13T08:38:00Z - The service is restored.
Hi there,
2020-03-31T22:13:15Z - SlurmDBD crashes.
Priority calculations and some accounting commands (e.g. `sacct`) give an error.
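For context, these are typical commands that depend on slurmdbd and therefore error out while it is down (a hedged sketch; job ID 12345 is a placeholder, not a real job):

```shell
# Accounting queries are served via slurmdbd and fail while it is down:
sacct -j 12345 --format=JobID,State,Elapsed
# Fair-share data used in priority calculations also comes from slurmdbd:
sshare -l
# Per-job priority factors:
sprio -l
```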
2020-04-01T07:51:19Z - The service is restored.
Thx, bye,
Luca
Hi there,
2020-04-01T07:56:27Z - SlurmDBD crashes.
Priority calculations and some accounting commands (e.g. `sacct`) give an error.
2020-04-01T12:33:19Z - The service is restored.
Thx, bye,
Luca
Hi there,
2020-04-01T13:06:22Z - SlurmDBD crashes.
Priority calculations and some accounting commands (e.g. `sacct`) give an error.
2020-04-01T14:56:40Z - The service is restored.
Thx, bye,
Luca
2020-04-04T07:40:00Z → 2020-04-05T09:15:00Z - Baobab login2 unreachable.
Baobab login2 was powered off automatically due to a thermal issue on Saturday morning. It has now been restarted and is available again.
2020-04-05T17:35:00Z - Baobab login2 unreachable.
I restarted it this morning. We have a lot of warnings about high temperature on one of the CPU packages. We need to investigate, but for this we need to shut down the server. We are currently preparing a replacement login node and will let you know when we do the swap.
2020-04-06T22:00:00Z → 2020-04-13T22:00:00Z - Some nodes were unable to access the Baobab storage and the private storage “space”, which caused errors such as:
`slurmstepd: error: execve(): matlab: No such file or directory`
This is now fixed.
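For reference, this kind of `execve()` failure usually means the binary's path was not reachable from the node, e.g. because a shared mount was missing. A hedged sketch of how the symptom can be narrowed down from an affected node (not necessarily the procedure used here):

```shell
# Is the binary visible in PATH on this node at all?
command -v matlab || echo "matlab not found in PATH on this node"
# Is the shared filesystem actually mounted?
findmnt /home || echo "/home is not a mounted filesystem here"
```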
Best
2020-05-28T07:20:00Z → 2020-05-28T09:45:00Z - Cooling issue in the datacentre. All nodes down.
Now fixed. The cluster is back in production.
2020-06-01T22:00:00Z - Some GPU and compute nodes are down due to a power outage in the datacentre. More on this soon.
Hi there,
2020-06-17T23:08:13Z - gpu[002,012] not responding.
The machines are not reachable remotely: either a power problem (maybe related to the recent GPU upgrade, cf. ReqNodeNotAvail - normal behavior?) or completely stuck at the BIOS level.
Time to go to the UniDufour DC…
2020-06-18T15:41:00Z - gpu012 back into production, gpu002 waiting for PSU2 replacement.
2020-07-06T20:10:00Z - gpu002 back into production as well.
Thx, bye,
Luca
Hi there,
2020-06-18T09:41:00Z - server3 power redundancy loss.
Reason still unknown: PSU2 broke 20 seconds after taking the full load from PSU1.
Given that this machine is part of the ${HOME} storage, some ${HOME} directories were not available during the issue.
2020-06-18T09:44:00Z - The service is restored.
Thx, bye,
Luca
Hi there,
2020-06-18T13:59:00Z - leaf7 crashes.
Reason still unknown: the InfiniBand connection was lost for some nodes only; a hard power cycle brought everything back to normal.
2020-06-18T15:29:00Z - The service is restored.
Thx, bye,
Luca
Dear all,
One of the metadata servers for the BeeGFS storage holding the $HOME directories crashed on Saturday 20 June 2020 at 22:44. It was restored this morning around 10:00.
During this time, some of the $HOME directories were unavailable, resulting in this message if you tried to log in:
`Could not chdir to home directory /home/users/x/<USERNAME>: Communication error on send`
or this one if you were already connected:
`Communication error on send`
If you had jobs running, they probably failed and need to be re-submitted to Slurm.
We are sorry for the inconvenience.
All the best,
Massimo
Hi there,
2020-07-10T11:48:00Z - `slurmctld` crashes.
While adding a GPU card back on gpu005, I forgot to restart `slurmctld` after the configuration files changed, which led to a segfault.
2020-07-10T12:25:00Z - The service is restored.
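For the record, the step that was skipped can be sketched roughly as follows (a hedged sketch, assuming a systemd-managed Slurm controller; not necessarily the exact procedure used on Baobab):

```shell
# After editing the Slurm configuration (e.g. re-declaring a GPU on a node),
# the controller must pick up the new configuration.
scontrol reconfigure          # many parameters can be re-read live
# Some changes still require a full restart of the controller daemon:
systemctl restart slurmctld
scontrol ping                 # verify the controller responds again
```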
Thx, bye,
Luca
Hi there,
2020-07-23T04:08:46Z - gpu[002] rebooted.
After a quick investigation: again a PSU problem (cf. Current issues on Baobab). The machine remains usable in the meantime while we check with DALCO (our upstream supplier).
Thx, bye,
Luca
Dear all,
One of the metadata servers for the BeeGFS storage holding the $HOME directories crashed at around 2020-09-23T14:04:00Z. The service was restored at around 2020-09-23T19:06:00Z.
During this time, some of the files in your $HOME were unavailable, resulting in this message if you tried to log in or access the files:
`Could not chdir to home directory /home/users/x/<USERNAME>: Communication error on send`
or this one if you were already connected:
`Communication error on send`
If you had jobs running, they probably failed and need to be re-submitted to Slurm.
We are sorry for the inconvenience.
All the best,
HPC team
edit: a side effect is that all the nodes were put in drain state, preventing new jobs from being processed. This has been fixed as well since 2020-09-23T22:00:00Z.
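For reference, drained nodes can be listed and returned to service along these lines (a hedged sketch; `node001` is a placeholder, not an actual Baobab node name):

```shell
# List down/drained nodes together with the recorded drain reason:
sinfo -R
# Clear the drain on a node so it accepts jobs again:
scontrol update NodeName=node001 State=RESUME
```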