This post contains the current and past issues related to the HPC infrastructures Baobab and Yggdrasil.
2020-03-13T00:29:00Z - Error accessing some files on home directories.
The home directory storage is composed of four servers running eight services. One of the servers crashed, and all the files it hosted were no longer accessible.
2020-03-13T08:38:00Z - The service is restored
Hi there,
2020-03-31T22:13:15Z - SlurmDBD crashes.
Priority calculations and some accounting commands (e.g. sacct) give an error.
2020-04-01T07:51:19Z - The service is restored.
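For reference, a quick way to check whether the accounting service is answering again (the service unit name and the job ID below are just examples):
```
# On the management node: is the accounting daemon running?
systemctl status slurmdbd

# From a login node: a simple accounting query on a placeholder job ID
sacct -j 12345 --format=JobID,State,Elapsed
```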
Thx, bye,
Luca
Hi there,
2020-04-01T07:56:27Z - SlurmDBD crashes.
Priority calculations and some accounting commands (e.g. sacct) give an error.
2020-04-01T12:33:19Z - The service is restored.
Thx, bye,
Luca
Hi there,
2020-04-01T13:06:22Z - SlurmDBD crashes.
Priority calculations and some accounting commands (e.g. sacct) give an error.
2020-04-01T14:56:40Z - The service is restored.
Thx, bye,
Luca
2020-04-04T07:40:00Z → 2020-04-05T09:15:00Z - Baobab Login2 unreachable.
Baobab login2 was powered off automatically due to a thermal issue on Saturday morning. It has now been restarted and is available again.
2020-04-05T17:35:00Z - Baobab Login2 unreachable.
I restarted it this morning. We have a lot of warnings about high temperature on one of the CPU packages. We need to investigate, but for this we need to shut down the server. We are currently preparing a replacement login node and will let you know when we do the swap.
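For those curious, one way to see such temperature readings from the OS side (a sketch, assuming lm_sensors is installed on the node):
```
# Read the CPU package temperatures reported by the hardware sensors
sensors | grep -i 'package'
```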
2020-04-06T22:00:00Z → 2020-04-13T22:00:00Z - Some nodes had an issue accessing the Baobab and private storage spaces, resulting in errors such as:
slurmstepd: error: execve(): matlab: No such file or directory
It’s now fixed.
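For reference, a sketch of how you can check that the shared storage is visible from your allocated nodes (matlab here is just an example binary):
```
# Inside a job allocation: run one task per allocated node and
# report the nodes where the binary cannot be resolved
srun --ntasks-per-node=1 bash -c 'which matlab || echo "$(hostname): matlab not found"'
```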
Best
2020-05-28T07:20:00Z → 2020-05-28T09:45:00Z - Cooling issue in the datacentre; all nodes down.
Now fixed. Cluster in production again.
2020-06-01T22:00:00Z - Some GPU and compute nodes are down due to a power outage in the DC. More on this soon.
Hi there,
2020-06-17T23:08:13Z - gpu[002,012] not responding.
The machines are not reachable remotely: either a power problem (maybe related to the recent GPU upgrade, cf. ReqNodeNotAvail - normal behavior?) or they are completely stuck at BIOS level.
Time to go to the UniDufour DC…
2020-06-18T15:41:00Z - gpu012 back into production, gpu002 waiting for PSU2 replacement.
2020-07-06T20:10:00Z - gpu002 back into production as well.
Thx, bye,
Luca
Hi there,
2020-06-18T09:41:00Z - server3 power redundancy loss.
Reason still unknown; PSU2 broke 20 seconds after taking the full load from PSU1.
Since this machine is part of the ${HOME} storage, some ${HOME} directories were not available during the issue.
2020-06-18T09:44:00Z - The service is restored.
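For reference, a sketch of the kind of out-of-band check used to confirm such a PSU failure (requires ipmitool and access to the node's BMC):
```
# Power-supply sensor readings as seen by the BMC
ipmitool sdr type "Power Supply"

# Recent hardware events recorded in the BMC event log
ipmitool sel list
```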
Thx, bye,
Luca
Hi there,
2020-06-18T13:59:00Z - leaf7 crashes.
Reason still unknown; the InfiniBand connection was lost for some nodes only. After a hard power cycle, everything was back to normal.
2020-06-18T15:29:00Z - The service is restored.
Thx, bye,
Luca
Dear all,
One of the metadata servers for the BeeGFS storage hosting the $HOME directories crashed on Saturday 20 June 2020 at 22:44. It was restored this morning around 10:00.
During this time, some of the $HOME directories were unavailable, resulting in this message if you tried to log in:
Could not chdir to home directory /home/users/x/<USERNAME>: Communication error on send
or this one if you were already connected:
Communication error on send
If you had jobs running, they probably failed and need to be resubmitted to Slurm.
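A sketch of how you can list candidate jobs to resubmit (adjust the start date; the script name is just an example):
```
# Jobs that ended in FAILED or NODE_FAIL state since the crash
sacct -S 2020-06-20 -s FAILED,NODE_FAIL --format=JobID,JobName,State,End

# Resubmit the corresponding batch script
sbatch my_job.sh
```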
We are sorry for the inconvenience.
All the best,
Massimo
Hi there,
2020-07-10T11:48:00Z - slurmctld crashes.
While adding back a GPU card on gpu005, I forgot to restart slurmctld after the configuration files changed, which led to a segfault.
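For the record, a sketch of the step that was missed (exact commands depend on how Slurm is deployed):
```
# After editing slurm.conf, many parameter changes can be picked up live...
scontrol reconfigure

# ...but changes such as node or GRES definitions require a restart
systemctl restart slurmctld
```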
2020-07-10T12:25:00Z - The service is restored.
Thx, bye,
Luca
Hi there,
2020-07-23T04:08:46Z - gpu[002] rebooted.
After a quick investigation, it is again a PSU problem (cf. Current issues on Baobab). The machine is nevertheless usable while we check with DALCO (our upstream supplier).
Thx, bye,
Luca
Dear all,
One of the metadata servers for the BeeGFS storage hosting the $HOME directories crashed at around 2020-09-23T14:04:00Z. The service was restored at around 2020-09-23T19:06:00Z.
During this time, some of the files in your $HOME directory were unavailable, resulting in this message if you tried to log in or access the files:
Could not chdir to home directory /home/users/x/<USERNAME>: Communication error on send
or this one if you were already connected:
Communication error on send
If you had jobs running, they probably failed and need to be resubmitted to Slurm.
We are sorry for the inconvenience.
All the best,
HPC team
edit: a side effect is that all the nodes were put in drain, preventing them from processing new jobs. This has been fixed as well since 2020-09-23T22:00:00Z.
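For reference, a sketch of how drained nodes are listed and resumed (the node name is just an example):
```
# Show drained/down nodes together with the recorded reason
sinfo -R

# Put a node back into service once the underlying issue is fixed
scontrol update nodename=gpu002 state=resume
```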