Dear users,
As just announced on the baobab-announce@ mailing list, we will perform a software and hardware maintenance of the Yggdrasil HPC cluster from Wednesday 19 January until Friday 21 January 2022. This is a three-day maintenance.
The maintenance will start at 08:00 (UTC+1), and you will receive an email when it is over.
The cluster will be totally unavailable during this period, with no access at all (not even to retrieve files).
If you submit a job in the meantime, make sure the requested wall time (duration) does not overlap with the start of the maintenance; otherwise your job will only be scheduled after the maintenance.
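For example, if you submit on Monday and there are roughly two days left until the maintenance starts, a wall time within that window lets Slurm start the job right away. A minimal sketch; the script name is a placeholder:

  # Request a 2-day wall time (days-hours:minutes:seconds) so the job
  # finishes before the maintenance window opens.
  sbatch --time=2-00:00:00 my_job.sh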
What will be done during this maintenance:
- upgrade of the cluster job scheduler (Slurm 21.08, which ships a REST API; see the sketch after this list)
- upgrade of BeeGFS to the latest version
- security and bug-fix upgrades, re-installation of all the nodes
- BIOS update on the GPU servers
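Since the new scheduler ships a REST API, here is a hedged sketch of what querying it could look like once slurmrestd is exposed; the host, port, and the availability of JWT authentication are assumptions, not the cluster's actual configuration:

  # Obtain a JWT (scontrol prints SLURM_JWT=..., if JWT auth is configured)
  # and ping the REST daemon. login1.yggdrasil:6820 is a placeholder endpoint.
  export SLURM_JWT=$(scontrol token | cut -d= -f2)
  curl -s -H "X-SLURM-USER-NAME: $USER" -H "X-SLURM-USER-TOKEN: $SLURM_JWT" \
       http://login1.yggdrasil:6820/slurm/v0.0.37/ping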
Thanks for your understanding.
Best regards,
the HPC team
Dear admins,
What is the status? Can we expect Yggdrasil back during the weekend, or rather next week?
Best regards,
Maciej Falkiewicz
Dear users,
What was done during this maintenance:
- the same CPU generation identification is now used on Baobab and Yggdrasil (documented at hpc:hpc_clusters [eResearch Doc]); see the constraint example after this list
- GPU resource renamed on Yggdrasil: the rtx resource is gone, as this was not an architecture name but a model name that now exists across several GPU architectures (a good way to confuse customers). The new name is turing, i.e. the architecture name; see the submission example after this list. We'll do the same on Baobab soon.
- Slurm updated to version 21.08.2
- Mellanox OFED updated to 4.9 LTS: for this to work, we had to recompile UCX for every toolchain. If you notice warning messages related to IB (InfiniBand) or UCX, please let us know; a quick check is sketched after this list
- BeeGFS updated to 7.2.3
- public IPs of admin1.yggdrasil and login1.yggdrasil changed. This is not relevant for most users, as long as you connect to the cluster by hostname; if you pinned the old IPs somewhere, see the note after this list
- all the nodes reinstalled with the latest CentOS 7.9 bug and security fixes
- all the servers updated to the latest CentOS 7.9 bug and security fixes
- faulty RAM DIMM replaced on our admin server
- BIOS on the GPU nodes updated
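About the CPU generation identification: a hedged sketch of how a generation can be requested through a node feature constraint. The node and feature names below are hypothetical; check the linked documentation for the real ones:

  # See which features a node actually advertises (node name is a placeholder)
  scontrol show node cpu001 | grep AvailableFeatures
  # Request a specific CPU generation via a constraint; the feature name
  # SKYLAKE is an assumption for illustration only.
  sbatch --constraint=SKYLAKE my_job.sh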
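For the renamed GPU resource, a job that used to request the rtx type should now request turing; a minimal sketch, with a placeholder script name:

  # Request one GPU of the turing architecture (formerly requested as rtx)
  sbatch --gres=gpu:turing:1 my_gpu_job.sh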
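If you suspect UCX or InfiniBand trouble after the OFED update, a quick sanity check, assuming the ucx_info tool is available in your environment (e.g. after loading a UCX-enabled toolchain):

  # Print the UCX version and build configuration
  ucx_info -v
  # List the transports and devices UCX detects; the IB devices should show up
  ucx_info -d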
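Regarding the changed public IPs: if you hard-coded the old addresses in scripts or in ~/.ssh/known_hosts, switch back to the hostnames. A stale known_hosts entry keyed by the old IP can be removed as below; the address shown is a documentation placeholder, not the real old IP:

  # Drop the stale entry for the old address, then reconnect by hostname
  ssh-keygen -R 192.0.2.10
  ssh login1.yggdrasil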
The maintenance is still ongoing because we have to run multiple benchmarks, including benchmarks involving the disk storage. To get accurate results, we need to perform them without users connected to the cluster.
The numbers we get will help us buy the replacement for Baobab: we'll ask the vendors to perform the same benchmarks and compare their numbers with our results.
As soon as the results are OK, we'll open Yggdrasil to the users, at best Saturday night. Thanks for your understanding; this is for a good reason.
Your HPC team: Yann, Adrien and Rémy
As just announced to the mailing list, the maintenance is now over.
Best