Yggdrasil scheduled maintenance: 19th-21st of January 2022

Dear users,

as just announced on the baobab-announce@ mailing list, we will perform software and hardware maintenance on the Yggdrasil HPC cluster from Wednesday 19th of January until Friday 21st of January 2022. This is a 3-day maintenance.

The maintenance will start at 08:00 +0100, and you will receive an email when it is over.

The cluster will be totally unavailable during this period, with no access at all (not even to retrieve files).

If you submit a job in the meantime, make sure that its expected wall time (duration) does not overlap with the start of the maintenance; otherwise your job will only be scheduled after the maintenance.
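
To make the wall time rule concrete, here is a minimal sketch that checks whether a requested wall time still ends before the maintenance starts. It assumes the best case where the job starts running at the moment you submit it (a queued job starts later and has less room). The function names are hypothetical helpers for illustration, not part of Slurm; only the start date and time come from the announcement above.

```python
from datetime import datetime, timedelta, timezone

# Maintenance start from the announcement: 19 January 2022, 08:00 +0100.
MAINTENANCE_START = datetime(2022, 1, 19, 8, 0, tzinfo=timezone(timedelta(hours=1)))


def fits_before_maintenance(start_time, wall_time, maintenance=MAINTENANCE_START):
    """Return True if a job starting at start_time with the requested
    wall_time would finish no later than the start of the maintenance."""
    return start_time + wall_time <= maintenance


def max_wall_time(start_time, maintenance=MAINTENANCE_START):
    """Longest wall time that still ends before the maintenance
    (zero if the maintenance has already started)."""
    return max(maintenance - start_time, timedelta(0))


if __name__ == "__main__":
    # Assumption: the job starts right now, with a 12 hour wall time.
    now = datetime.now(timezone.utc)
    requested = timedelta(hours=12)
    print("fits before maintenance:", fits_before_maintenance(now, requested))
    print("longest wall time that still fits:", max_wall_time(now))
```

If the check fails, requesting a shorter wall time that still covers your job lets it run before the maintenance instead of waiting until the cluster is back.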

What will be done during this maintenance:

  • upgrade of the cluster job scheduler (Slurm 21.08 with REST API)
  • upgrade of BeeGFS to the latest version
  • security and bugfix upgrades, re-installation of all the nodes
  • BIOS update on the GPU servers

Thanks for your understanding.

Best regards,
the HPC team

Dear admins,

What is the status? Can we expect Yggdrasil to be back during the weekend, or rather next week?

Best regards,
Maciej Falkiewicz

Dear users,

What was done during this maintenance:

  • same CPU generation identification now used on Baobab and Yggdrasil (see hpc:hpc_clusters [eResearch Doc])
  • GPU resource renamed on Yggdrasil: the rtx resource no longer exists, as this was not an architecture name but a model name that now exists in various GPU architectures. How to confuse customers :upside_down_face:
    The new name is turing, i.e. the architecture name. We’ll do the same on Baobab soon.
  • update Slurm to version 21.08.2
  • update Mellanox and OFED to 4.9 LTS: for this to work, we had to recompile UCX for every toolchain. If you notice warning messages related to IB (InfiniBand) or UCX, please let us know
  • update BeeGFS to 7.2.3
  • public IP addresses of admin1.yggdrasil and login1.yggdrasil changed. This is not relevant for most users, as long as you use the hostname to connect to the cluster
  • all the nodes reinstalled with the latest CentOS 7.9 bug and security fixes
  • all the servers updated with the latest CentOS 7.9 bug and security fixes
  • faulty DIMM RAM replaced on our admin server
  • BIOS on GPU nodes updated

The maintenance is still ongoing because we had to launch multiple benchmarks, including benchmarks involving disk storage. To get accurate results, we need to run them without any users connected to the cluster.
The numbers we get will help us to buy the replacement for Baobab. We’ll ask the vendors to perform the same benchmarks and we’ll compare them with our results.

As soon as the results are OK, we’ll open Yggdrasil to the users, in the best case on Saturday night. Thanks for your understanding; this is for a good reason :innocent:

Your HPC team :nerd_face: Yann, Adrien and Rémy

As just announced to the mailing list, the maintenance is now over.