Hi there,
2021-07-12T03:29:00Z - 1x GPU lost on gpu018.baobab.
Investigation in progress.
New monitoring test deployed to automatically put the node in DRAIN if this happens again, since this is the second occurrence on Baobab (cf. [Solved] Is there a problem with gpu014? Baobab - #2 by Yann.Sagon ).
2021-07-12T13:33:00Z - Waiting for the node to be fully DRAINed to run a gpu_burn test.
2021-07-21T09:15:00Z - Node rebooted, gpu_burn running.
2021-07-23T14:43:00Z - gpu_burn finished, gpu018.baobab back in business.
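A check of the kind described in this post could look roughly like the sketch below. It is not the monitoring test actually deployed on Baobab: the function names, the expected GPU count of 8, and the short node name `gpu018` are assumptions for illustration. It counts the devices listed by `nvidia-smi -L` and, if fewer than expected are visible, builds the `scontrol update` command that would put the node in DRAIN.

```python
import re


def count_gpus(nvidia_smi_listing: str) -> int:
    """Count devices in the text printed by `nvidia-smi -L`.

    That command prints one line per visible GPU, each starting with "GPU <index>:".
    """
    return sum(
        1
        for line in nvidia_smi_listing.splitlines()
        if re.match(r"GPU \d+:", line.strip())
    )


def drain_command(node: str, expected: int, found: int):
    """Return the scontrol argv that drains `node`, or None if no GPU is missing."""
    if found >= expected:
        return None
    reason = f"GPU lost: {found}/{expected} visible"
    # Each list element is passed as a single argument, so the spaces in the
    # reason need no extra quoting.
    return ["scontrol", "update", f"NodeName={node}", "State=DRAIN", f"Reason={reason}"]


if __name__ == "__main__":
    import subprocess

    # Assumption: this runs on the GPU node itself and 8 GPUs are expected.
    listing = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True
    ).stdout
    print(drain_command("gpu018", expected=8, found=count_gpus(listing)))
```

The sketch only prints the command so it can be tried without touching a real node; on a host with Slurm admin rights the returned list could be passed to `subprocess.run`.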
Thx, bye,
Luca
Hi there,
2021-03-22T13:35:00Z - Wrong CUDA architecture on gpu017.baobab (RTX instead of Ampere).
Fix in progress (cf. [Baobab] New nodes installed: gpu[017] - #2 by Giuseppe.Chindemi).
2021-07-22T09:56:00Z - Slurm configuration fixed, still wrong Gres=gpu:rtx:8.
Investigation in progress…
2021-07-22T15:52:00Z - slurmctld restarted and Gres=gpu:ampere:8 for gpu017.baobab.
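To confirm after a slurmctld restart that the controller reports the expected GPU type, the Gres field can be read from the output of `scontrol show node`. A minimal sketch, assuming that output contains a `Gres=...` field like the values quoted above; the function name and the short node name `gpu017` are hypothetical.

```python
import re


def parse_gres(scontrol_output: str):
    """Return the value of the first `Gres=` field in the given text, or None."""
    match = re.search(r"\bGres=(\S+)", scontrol_output)
    return match.group(1) if match else None


if __name__ == "__main__":
    import subprocess

    # Assumption: scontrol is on PATH and the node is known to Slurm as "gpu017".
    text = subprocess.run(
        ["scontrol", "show", "node", "gpu017"], capture_output=True, text=True
    ).stdout
    print(parse_gres(text))
```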
Thx, bye,
Luca
Hi there,
2021-07-15T14:58:00Z - No InfiniBand connection on gpu019.baobab.
2021-07-26T14:25:00Z - Port 0 possibly broken while port 1 works, hardware supplier contacted.
2021-07-27T12:32:00Z - Waiting for replacement InfiniBand card.
2021-07-29T12:03:00Z - Broken InfiniBand card replaced and gpu019.baobab back into production.
Thx, bye,
Luca
Hi,
master server down, all the nodes down, probably a welcome message from the cluster for my first day of work :( Maybe it wanted a vacation too.
start: 2021-08-16T02:00:00Z
fixed: 2021-08-16T07:00:00Z
same issue again at 2021-08-16T15:00:00Z
same issue again at 2021-08-17T08:00:00Z
Hi,
the cluster really doesn't like not being the centre of attention!
The server hosting /home has gone down.
Cheers,
Johnny
Hi, the reason is still the same: the master server is down. This server hosts all the critical services, and the storage is impacted as well when it is unavailable. We are in the process of migrating this server to new hardware and software, but we aren't ready yet.
Hi there,
2021-08-29T09:08:00Z - Remote I/O error on Baobab ${HOME}.
2021-08-30T07:33:00Z - No error from the BeeGFS management tools.
Investigation in progress.
2021-08-30T08:44:00Z - SCSI errors on home4 storage disks.
The machine must be restarted, thus more jobs could fail, ETA unknown.
2021-08-30T09:44:00Z - Baobab ${HOME} back to normal operations.
Thx, bye,
Luca
Hi there,
2021-11-15T11:00:00Z - login2.baobab rebooted (error from my side).
2021-11-15T11:05:00Z - login2.baobab back online.
Thx, bye,
Luca
Hi,
we had a storage issue on the Yggdrasil scratch storage. Even though this happened during the Baobab maintenance, it is totally unrelated, just bad luck to bother us.
I've restarted the scratch server and the issue seems to be solved. We are keeping an eye on it.
Best
Start: 2021-12-01T02:40:00Z
End: 2021-12-01T15:45:00Z
Hi,
we have an issue with the home storage on Yggdrasil. One of the servers isn't responding anymore. We are investigating.
start: 2021-12-16T14:30:00Z
end: 2021-12-16T15:55:00Z
We need to replace a RAM module on the server to fix the issue; this should be done on Monday.
Service restored with the faulty RAM for now.
edit: RAM DIMMs replaced by @Remy.Ressegaire this morning on both servers.
Hi, Baobab is really, really slow today whatever I'm trying to do, like accessing it, changing directory, etc.
Hi,
we had a storage issue on the home directory.
Start: 2021-12-20T20:30:00Z
Stop: 2021-12-21T16:00:00Z
The cause was a user who launched many jobs that were performing too many I/O operations on files. We killed this user's jobs and contacted them to ask for details about the jobs.
Best
Yann
We have an interactive job issue on Yggdrasil. The symptom is that some users cannot use srun or salloc. sbatch seems to work as usual.
Start: 2022-01-03T23:00:00Z
Stop: 2022-01-12T11:00:00Z
Issue filed at SchedMD: 13165 - srun: job xxx queued and waiting for resources
edit: solved. This was a routing issue.
Hi, Baobab is really, really slow at the moment whatever I'm trying to do, like accessing it, changing directory, etc.
Is it just me, or is Yggdrasil less reliable than Baobab these days?
I just got kicked out of Yggdrasil and can't get back in!
Was there maintenance planned for today? If yes, sorry if I missed the announcement!
I just got kicked out of Baobab and can't get back in either.
Yes, actually, I got kicked out of Baobab as well.
Same here, got kicked out of Baobab, file transfer down too.
While I was on Baobab …
Was maintenance planned for today?