Hi there,
2020-09-27T15:32:00Z - ${HOME} unavailable.
The metadata server that crashed last Thursday (cf. Current issues on Baobab) experienced another crash yesterday afternoon.
2020-09-28T07:12:00Z - The service is restored.
Thx, bye,
Luca
Hi there,
2020-10-08T10:20:00Z - ${HOME} slow.
The storage servers for the ${HOME} folders were under heavy write stress and are slowly recovering, so you may experience some latency when connecting to the login node.
The storage servers for the ${SCRATCH} folders have not been experiencing any significant write operations and are therefore not affected.
2020-10-08T13:00:00Z - write operations on ${HOME} back to normal levels.
Thx, bye,
Luca
Dear all,
You might have problems connecting to Baobab this morning.
The exact cause is still under investigation, but the first clues seem to indicate an IPv6 problem on UNIGE's network.
Baobab is simply affected as collateral damage. Until this is solved, we have implemented a workaround that should allow you to work and connect as usual.
Cheers,
Massimo Brero
Hello,
we had a storage issue on the scratch fs on Baobab from 2020-12-04T11:00:00Z to 2020-12-04T11:05:00Z:
one of the servers was powered off for an unknown reason while work was being done in the datacentre.
The service is restored.
Hello,
many compute nodes are in drain for an unknown reason. We are investigating.
2021-01-22T09:00:00Z
Fixed. The reason was a user job involving a lot of I/O on the nodes.
2021-01-22T14:09:00Z
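For reference, the list of drained nodes and the reason recorded by the scheduler can be checked with a standard Slurm command (a minimal illustration; the output format varies with the Slurm version):

  # Show down/drained nodes together with the reason Slurm recorded
  sinfo -R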
Dear all,
2021-01-27T14:32:00Z
login2.baobab.hpc.unige.ch is currently unavailable/unreachable.
We are working on the issue and will keep you posted when we know more about it.
Thanks for your patience.
2021-01-27T15:00:00Z
login2.baobab.hpc.unige.ch is back online.
We excluded a problem with BeeGFS, and the nodes were not affected.
login2 became completely unresponsive, and a user process seems to be the cause of this failure. Unfortunately, we couldn't identify a specific culprit.
However, this should serve as a reminder that you should NEVER run anything from the login node and that tests should be done on the public-debug partition.
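As an illustration of the latter, a short test can be run through the scheduler on the public-debug partition instead of on the login node (a sketch with placeholder resource values; my_test stands for your own executable):

  # Request a short interactive session on the public-debug partition
  salloc --partition=public-debug --ntasks=1 --time=00:15:00
  # Or run a single test command directly through the scheduler
  srun --partition=public-debug --ntasks=1 --time=00:15:00 ./my_test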
People who were working on login2 when the crash occurred should be particularly careful.
We thank you for your cooperation.
Massimo Brero
Hi there,
2021-01-31T23:43:00Z - slurmctld unavailable on Yggdrasil.
An unexpected error with the Slurm scheduler on Yggdrasil did not allow any job submission.
2021-02-01T10:20:00Z - The service is restored.
Thx, bye,
Luca
Dear all,
2021-02-02T07:41:00Z - scratch unavailable on Yggdrasil
The scratch storage (/srv/beegfs/scratch) on Yggdrasil is currently unavailable for an unknown reason. We are working on it. First clues indicate the failure might have occurred yesterday evening (2021-02-01T22:22:00Z).
We will keep you posted.
2021-02-02T08:17:00Z - scratch restored on Yggdrasil
The BeeGFS metadata service shut down yesterday (2021-02-01T22:22:00Z) as a fail-safe to avoid data corruption. The service is now restored.
Please be aware that if your jobs tried to read/write anything from the scratch space since last night, they would probably have failed or raised errors, so double-check your results and make sure they make sense.
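One way to do that check, sketched with the standard Slurm accounting command (the start time matches the incident; adjust as needed):

  # List your jobs that ran since the failure, with their final state
  sacct --starttime=2021-02-01T22:00:00 --format=JobID,JobName,State,ExitCode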
N.B. home on Baobab/Yggdrasil and scratch on Baobab were NOT affected by this issue.
Thank you for your understanding.
Massimo
Dear HPC users,
2021-02-09T09:45:00Z
We noticed a problem accessing the login node on Baobab.
The issue might be hardware related, but it's too soon to tell. We are investigating and will let you know when the issue is solved.
2021-02-09T10:00:00Z
It seems one of the Ethernet switches in the datacenter has stopped working or is in an unstable state.
Please consider Baobab down for the time being.
2021-02-09T11:20:00Z
Baobab is up and running again. You can resume using Baobab normally.
You might want to check the results of your last running jobs, as a majority of them were killed.
The exact cause of the failure is still under investigation, but the first clues point at a network issue.
Yggdrasil was not affected by this incident.
2021-02-09T11:50:00Z
If you have any problem with module (load, spider, etc.), you need to exit your session and re-open it.
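For example (a minimal sketch; 'username' is a placeholder for your own account):

  # Close the current session and log in again so that the module
  # environment is reinitialised
  exit
  ssh username@login2.baobab.hpc.unige.ch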
Thank you for your patience.
Massimo
Dear HPC users,
2021-02-10T14:54:00Z
Many of you already contacted us about this issue.
Every HPC user at UNIGE received hundreds of emails today between 11h00 and ~15h30. The subject was:
[Yggdrasil] Job XXXXX will never run
This is a mistake and we are very sorry for the inconvenience!
You can safely delete those emails.
Multiple reasons (and just a hint of Murphy's law) caused this mass mailing this morning.
It seems the last Slurm update we installed this morning (during the Yggdrasil maintenance) introduced a new "reason" to explain why a job is pending. And while another Slurm service was not running (because it was being updated), our script to detect and notify users of their pending jobs was launched… at a very bad time.
This new "reason" was not filtered by the script and triggered this mass mailing.
The script has been updated and everything is corrected now, so this hopefully shouldn't happen anymore. However, all the emails have been released from the mail server and there is nothing we can do to stop them.
2021-02-10T15:38:00Z
We understand everyone's frustration about this flood of emails spamming you.
Please understand these emails left Yggdrasil this morning between 11h02 and 11h05. We do not have any control over them at this point. We have already contacted UNIGE's postmaster to ask them to stop whatever can be stopped.
2021-02-10T17:18:00Z
With the help of the Postmaster, we eventually managed to put an end to this mass mailing.
The emails left Yggdrasil between 11h02 and 11h05 this morning. It's like sending an email by mistake: you can't just "take it back". The same thing happened here, but on a very large scale, so the HPC team no longer had any control over them or any way to stop them.
Most of the emails had been stuck in a queue on the mail servers and were released around 15h. From this point on, they flooded the mail system for the next hours.
Around 17h45, thousands of remaining emails were identified and blocked on the mail queues.
Eventually, at 18h18, they were all deleted. You shouldn't have received any spam since that time.
We thank all of you for your understanding and we apologize again for the inconvenience.
Massimo Brero
2021-02-09T23:00:00Z – 2021-02-16T23:00:00Z
Dear users,
we were contacted by some of you because some jobs stay in the queue with the Reason
(launch failed requeued held)
The cause was an incompatibility between two Slurm versions. We have now updated Slurm on Baobab as well, and this should fix the issue.
I have released the jobs. In case they didn't work, please submit your job again.
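For reference, a sketch using standard Slurm commands to check whether a job is still pending and to resubmit it (my_job.sh is a placeholder for your own submission script):

  # Show your pending jobs together with the reason they are pending
  squeue -u $USER --states=PD -o "%.10i %.20j %.2t %r"
  # Resubmit a job that did not run after being released
  sbatch my_job.sh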
Best
Yann
Dear users,
2021-03-02T08:40:00Z
Baobab has encountered a problem. We are currently investigating the issue.
N.B. this is unrelated to tomorrow's planned maintenance.
2021-03-02T08:45:00Z
One of our management servers had an issue. The problem is now fixed.
We are still investigating the cause of the issue.
Some folders might have been unavailable for a few minutes during the above mentioned problem.
You can continue using Baobab normally until tomorrow's maintenance.
N.B. running jobs were not necessarily affected. They might have hung on a read/write operation for a few minutes and appeared to be frozen, but they might have simply carried on when the problem was fixed.
All the best,
Massimo
Hi there,
2021-03-11T07:59:00Z - admin server erroneously rebooted on Baobab.
Wrong manipulation from my side, sorry for the inconvenience.
We experienced some glitches during the reboot that caused such a long downtime.
2021-03-11T09:59:00Z - The service is restored.
Thx, bye,
Luca
Dear HPC users,
2021-03-12T14:50:00Z we had an issue with one of the administration servers, which froze.
2021-03-12T14:58:00Z It has been rebooted and is now working normally.
Since this is not the first incident related to this (old) server, we are going to replace it in the coming days/weeks. We will of course try to minimize the inconvenience as much as possible.
Best regards,
Massimo
Hi there,
2021-03-15T14:01:00Z - /dpnc/beegfs not available on Baobab.
Side effect of an autofs configuration fix, see /dpnc/beegfs not mounted on Baobab nodes - #2 by Luca.Capello.
2021-03-16T16:31:00Z - /dpnc/beegfs accessible again as NFSv4.
Thx, bye,
Luca
Hi there,
2021-03-24T18:14:00Z - Cannot login to Baobab, communication error on send.
Investigation ongoing (cf. Cannot login, communication error on send), either a BeeGFS error or the administration server stuck again.
2021-03-25T08:48:00Z - administration server stuck, rebooting.
The problem was actually first reported yesterday at 19:04.
2021-03-25T08:54:00Z - administration server back in business, login available again.
The nodes were automatically put on DRAIN and are being slowly RESUMEd.
2021-03-25T09:07:00Z - all the affected nodes have been RESUMEd, Baobab cluster back to normal.
Thx, bye,
Luca
Hi there,
2021-03-25T13:58:00Z - network error on the administration server.
Restart in progress.
2021-03-25T14:06:00Z - Baobab cluster back to normal.
Thx, bye,
Luca
Hi there,
2021-04-04T22:48:00Z - no space left on login1.yggdrasil:/.
This generates the following error: cannot create temp file for here-document: No space left on device.
Investigation in progress.
Long-standing files in /tmp have been cleaned.
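For reference, this kind of situation can be spotted with standard tools (a sketch; the 30-day threshold is illustrative):

  # Check free space on the root filesystem
  df -h /
  # List files in /tmp older than 30 days
  find /tmp -type f -mtime +30 -ls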
2021-04-06T09:25:00Z - The service is restored.
Thx, bye,
Luca
Hi there,
2021-04-06T08:30:00Z - I/O error accessing baobab:/srv/beegfs/scratch.
Investigation in progress.
2021-04-06T09:57:00Z - One of the storage servers lost contact with the JBOD, restart in progress.
2021-04-06T10:37:00Z - Hardware error on the storage server, more investigation needed, cluster unavailable.
The diagnostic tests did not reveal any possible cause and a full hardware reset (including unplugging the current expansion cards) fixed it.
Nevertheless, we have already contacted the server supplier for further investigation.
2021-04-06T14:21:00Z - The service is restored and the Baobab cluster is again available.
Thx, bye,
Luca