Current issues on Baobab and Yggdrasil

Hi there,

2020-09-27T15:32:00Z - ${HOME} unavailable.

The metadata server that crashed last Thursday (cf. Current issues on Baobab) experienced another crash yesterday afternoon.

2020-09-28T07:12:00Z - The service is restored.

Thx, bye,
Luca


Hi there,

2020-10-08T10:20:00Z - ${HOME} slow.

The storage servers for the ${HOME} folders were under heavy write stress and are slowly recovering, so you may experience some latency when connecting to the login node.

The storage servers for the ${SCRATCH} folders have not experienced any significant write load and are therefore not affected.

2020-10-08T13:00:00Z - write operations on ${HOME} back to normal levels.

Thx, bye,
Luca

Dear all,

You might have had problems connecting to Baobab this morning.

The exact cause is still under investigation, but the first clues seem to indicate an IPv6 problem on UNIGE’s network.
Baobab is simply suffering collateral damage. Until this is solved, we have implemented a workaround that should allow you to connect and work as usual.

Cheers,

Massimo Brero

Hello,

we had a storage issue on the scratch file system on Baobab, 2020-12-04T11:00:00Z → 2020-12-04T11:05:00Z.

One of the servers was powered off for an unknown reason while work was being done in the datacentre.

The service is restored.

Hello,

many compute nodes are in drain for an unknown reason. We are investigating.

2021-01-22T09:00:00Z

Fixed. The reason was a user job involving a lot of I/O on the nodes.

2021-01-22T14:09:00Z

Dear all,

2021-01-27T14:32:00Z

login2.baobab.hpc.unige.ch is currently unavailable/unreachable.

We are working on the issue and will keep you posted when we know more about it.

Thanks for your patience.

2021-01-27T15:00:00Z

login2.baobab.hpc.unige.ch is back online.

We excluded a problem with BeeGFS, and the compute nodes were not affected.
login2 became completely unresponsive, and a user process seems to have been the cause of the failure. Unfortunately, we couldn’t identify a specific culprit.

However, this should serve as a reminder that you should NEVER run anything directly on the login node; tests should be done on the public-debug partition (see the example below).
People who were working on login2 when the crash occurred should be particularly careful.
We thank you for your cooperation.
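
As an illustration, here is a minimal example of running a quick test through Slurm instead of on the login node; the partition name is the one mentioned above, and ./my_test is a placeholder for your own program:

    # Request a short interactive allocation on the public-debug partition:
    salloc --partition=public-debug --ntasks=1 --time=00:15:00

    # Or run a single quick test command directly:
    srun --partition=public-debug --ntasks=1 --time=00:05:00 ./my_test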

Massimo Brero

Hi there,

2021-01-31T23:43:00Z - slurmctld unavailable on Yggdrasil.

An unexpected error with the Slurm scheduler on Yggdrasil prevented any job submission.

2021-02-01T10:20:00Z - The service is restored.

Thx, bye,
Luca

Dear all,

2021-02-02T07:41:00Z - scratch unavailable on Yggdrasil

The scratch storage (/srv/beegfs/scratch) on Yggdrasil is currently unavailable for an unknown reason. We are working on it. First clues indicate the failure might have occurred yesterday evening (2021-02-01T22:22:00Z).

We will keep you posted.

2021-02-02T08:17:00Z - scratch restored on Yggdrasil

The BeeGFS metadata service shut down yesterday evening (2021-02-01T22:22:00Z) as a fail-safe to avoid data corruption. The service is now restored.

Please be aware that if your jobs tried to read/write anything on the scratch space since last night, they have probably failed or raised errors. So double-check your results and make sure they make sense.

N.B. home on Baobab/Yggdrasil and scratch on Baobab were NOT affected by this issue.

Thank you for your understanding.

Massimo

Dear HPC users,

2021-02-09T09:45:00Z

We noticed a problem accessing the login node on Baobab.

The issue might be hardware related, but it’s too soon to tell. We are investigating and will let you know when the issue is solved.

2021-02-09T10:00:00Z

It seems one of the Ethernet switches in the datacentre has stopped working or is in an unstable state.

Please consider Baobab down for the time being.

2021-02-09T11:20:00Z
Baobab is up and running again. You can resume using Baobab normally.
You might want to check the results of your last running jobs, as most of them were killed.

The exact cause of the failure is still under investigation, but the first clues point at a network issue.

Yggdrasil was not affected by this incident.

2021-02-09T11:50:00Z
If you have any problems with the module command (load, spider, etc.), you need to exit your session and re-open it.
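
A minimal example, assuming a standard Lmod setup (the module name is just a placeholder):

    # Close the broken session, then open a fresh one:
    exit
    ssh login2.baobab.hpc.unige.ch

    # In the new session, the module environment should work again:
    module purge
    module avail
    module load GCC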

Thank you for your patience.

Massimo



Dear HPC users,

2021-02-10T14:54:00Z

Many of you already contacted us about this issue.

Every HPC user at UNIGE received hundreds of emails today between 11h00 and ~15h30. The subject was:

[Yggdrasil] Job XXXXX will never run

This is a mistake and we are very sorry for the inconvenience!

You can safely delete those emails.

Multiple reasons (and just a hint of Murphy’s law) caused this mass mailing this morning.

It seems the last Slurm update we installed this morning (during the Yggdrasil maintenance) introduced a new “reason” to explain why a job is pending. And while another Slurm service was not running (because it was being updated), our script that detects and notifies users of their pending jobs was launched at a very bad time.
This new “reason” was not filtered out by the script and triggered the mass mailing.
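
For the curious, here is a minimal sketch of the kind of filtering involved; the reason names, the allowlist approach, and the mail command are illustrative assumptions, not our actual script:

    # Only notify users for pending reasons on an explicit allowlist, so a
    # new reason introduced by a Slurm update is silently ignored by default.
    NOTIFY_REASONS='PartitionNodeLimit|PartitionTimeLimit'

    squeue --noheader --states=PENDING --format='%i %u %r' |
    while read -r jobid user reason; do
        if printf '%s\n' "$reason" | grep -Eqx "(${NOTIFY_REASONS})"; then
            echo "Job $jobid will never run" |
                mail -s "[Yggdrasil] Job $jobid will never run" "$user"
        fi
    done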

The script has been updated and everything is now corrected, so this hopefully shouldn’t happen again. However, all the emails have already been released from the mail server and there is nothing we can do to stop them.

2021-02-10T15:38:00Z
We understand everyone’s frustration with this flood of email spamming you.
Please understand that these emails left Yggdrasil this morning between 11h02 and 11h05; we no longer have any control over them at this point. We have already contacted UNIGE’s postmaster to ask them to stop whatever can be stopped.

2021-02-10T17:18:00Z
With the help of the Postmaster, we eventually managed to put an end to this mass mailing.

The emails left Yggdrasil between 11h02 and 11h05 this morning. It’s like sending an email by mistake: you can’t just “take it back”. The same thing happened here, but on a very large scale, so the HPC team no longer had any control or any way to stop them.

Most of the emails were stuck in a queue on the mail servers and were released around 15h. From that point on, they flooded the mail system for the next few hours.
Around 17h45, thousands of remaining emails were identified and blocked in the mail queues.
Eventually, at 18h18, they had all been deleted. You shouldn’t have received any spam since that time.

We thank all of you for your understanding and we apologize again for the inconvenience.

Massimo Brero


2021-02-09T23:00:00Z → 2021-02-16T23:00:00Z

Dear users,

we were contacted by some of you because some jobs stay in the queue with the Reason

(launch failed requeued held)

The cause was an incompatibility between two Slurm versions. We have now updated Slurm on Baobab as well, and this should fix the issue.

I have released the jobs. In case they didn’t work, please submit your jobs again.
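
To check whether your jobs were affected, you can list your pending jobs together with their Reason; the job ID below is a placeholder:

    # Show your pending jobs with their job ID, name and pending Reason:
    squeue -u "$USER" --states=PENDING --format='%i %j %r'

    # Held jobs are released once the underlying problem is fixed, e.g.:
    scontrol release 123456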

Best

Yann

Dear users,

2021-03-02T08:40:00Z
Baobab has encountered a problem. We are currently investigating the issue.
N.B. this is unrelated to tomorrow’s planned maintenance.

2021-03-02T08:45:00Z
One of our management servers had an issue. The problem is now fixed.
We are still investigating the cause of the issue.
Some folders might have been unavailable for a few minutes during the above-mentioned problem.
You can continue using Baobab normally until tomorrow’s maintenance.
N.B. running jobs were not necessarily affected: they might have hung on a read/write operation for a few minutes and appeared to be frozen, but they might simply have carried on when the problem was fixed.

All the best,

Massimo

Hi there,

2021-03-11T07:59:00Z - admin server erroneously rebooted on Baobab.

A wrong manipulation on my side, sorry for the inconvenience.

We experienced some glitches during the reboot, which caused the long downtime.

2021-03-11T09:59:00Z - The service is restored.

Thx, bye,
Luca

Dear HPC users,

2021-03-12T14:50:00Z we had an issue with one of the administration servers, which froze.

2021-03-12T14:58:00Z It has been rebooted and is now working normally.

Since this is not the first incident related to this (old) server, we are going to replace it in the coming days/weeks. We will of course try to minimize the inconvenience as much as possible.

Best regards,

Massimo

Hi there,

2021-03-15T14:01:00Z - /dpnc/beegfs not available on Baobab.

Side effect of an autofs configuration fix, see /dpnc/beegfs not mounted on Baobab nodes - #2 by Luca.Capello.

2021-03-16T16:31:00Z - /dpnc/beegfs accessible again as NFSv4.

Thx, bye,
Luca

Hi there,

2021-03-24T18:14:00Z - Cannot login to Baobab, communication error on send.

Investigation ongoing (cf. Cannot login, communication error on send), either a BeeGFS error or the administration server stuck again.

2021-03-25T08:48:00Z - administration server stuck, rebooting.

The problem was actually first reported yesterday at 19:04.

2021-03-25T08:54:00Z - administration server back in business, login available again.

The nodes were automatically put in DRAIN state and are being slowly RESUMEd.
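
For reference, this is roughly how it looks on the admin side; the node name is a placeholder:

    # List drained nodes together with the reason they were drained:
    sinfo -R

    # Resume a node once it is healthy again:
    scontrol update NodeName=cpu001 State=RESUME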

2021-03-25T09:07:00Z - all the affected nodes have been RESUMEd, Baobab cluster back to normal.

Thx, bye,
Luca

Hi there,

2021-03-25T13:58:00Z - network error on the administration server.

Restart in progress.

2021-03-25T14:06:00Z - Baobab cluster back to normal.

Thx, bye,
Luca

Hi there,

2021-04-04T22:48:00Z - no space left on login1.yggdrasil:/.

This generates the following error: cannot create temp file for here-document: No space left on device. (Bash stores here-document contents in a temporary file, usually under /tmp, hence the error when the filesystem is full.)

Investigation in progress.

Long-standing files in /tmp have been cleaned.
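
For reference, roughly how such a situation is diagnosed; the 30-day threshold is an illustrative assumption, not our actual cleanup policy:

    # Check free space on the root filesystem:
    df -h /

    # List files in /tmp untouched for more than 30 days (cleanup candidates):
    find /tmp -xdev -type f -mtime +30 -print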

2021-04-06T09:25:00Z - The service is restored.

Thx, bye,
Luca

Hi there,

2021-04-06T08:30:00Z - I/O error accessing baobab:/srv/beegfs/scratch.

Investigation in progress.

2021-04-06T09:57:00Z - One of the storage servers lost contact with the JBOD, restart in progress.

2021-04-06T10:37:00Z - Hardware error on the storage server, more investigation needed, cluster unavailable.

The diagnostic tests did not reveal any possible cause, and a full hardware reset (including unplugging the expansion cards) fixed it.

Nevertheless, we have already contacted the server supplier for further investigation.

2021-04-06T14:21:00Z - The service is restored and the Baobab cluster is again available.

Thx, bye,
Luca