As just announced on the baobab-announce@ mailing list, we will perform software and hardware maintenance on the Baobab HPC cluster on Wednesday 26 August 2020 and Thursday 27 August 2020.
The maintenance will start at 08:00 +0100, and you will receive an email once it is over.
The cluster will be totally unavailable during this period, with no access at all (not even to retrieve files).
If you submit a job in the meantime, make sure its expected wall time (duration) does not overlap with the start of the maintenance; otherwise, your job will be scheduled after the maintenance.
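To illustrate, a minimal sketch of how to check for the maintenance window and size your wall time accordingly (the script name `myjob.sh` is a placeholder, and whether the maintenance appears as a Slurm reservation is an assumption):

```shell
# List existing reservations; a scheduled maintenance usually shows up here
# with its StartTime, so you can see how much time remains before it begins.
scontrol show reservation

# Request a wall time short enough for the job to finish before the
# maintenance starts, e.g. at most 8 hours if it begins in 8 hours:
sbatch --time=08:00:00 myjob.sh
```

If the requested wall time would overlap the reservation, Slurm simply holds the job as pending until after the maintenance.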
What will be done during this maintenance:
- hardware maintenance (electrical power and network)
- software upgrades (OS, Slurm plugins, etc.)
Thanks for your understanding.
the HPC team
As just announced on the baobab-announce@ mailing list, I initially forgot to mention a fundamental change in the Slurm configuration which will cause the loss of all PENDING jobs.
Slurm provides the --gpus option to request more GPUs than a single node has, and this option is provided by the select/cons_tres plugin (cf. Slurm Workload Manager - Generic Resource (GRES) Scheduling and Slurm Workload Manager - slurm.conf).
However, Baobab currently uses the select/cons_res plugin, so we must change the configuration.
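For reference, a sketch of what becomes possible once select/cons_tres is enabled (the partition name `shared-gpu` and script name `myjob.sh` are illustrative, adapt them to your setup):

```shell
# With select/cons_tres, --gpus requests a total GPU count for the whole job,
# which Slurm may satisfy across several nodes:
sbatch --partition=shared-gpu --nodes=2 --gpus=4 myjob.sh
```

With the older select/cons_res plugin, GPUs can only be requested per node via --gres=gpu:N, which cannot express a job-wide total spanning nodes.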
Sorry for the inconvenience.
the HPC team
The maintenance is now over! You can use Baobab again.
What is new on Baobab and what kept us busy during this maintenance:
- Slurm 20: upgrade from Slurm 19.05.7 to Slurm 20.02.4
- Slurm Trackable RESources (TRES) support. This allows a more flexible use of GPUs, enabling new options for your jobs such as --gpus-per-node, --gpus-per-task, etc.
- Slurm daemon and database now entirely migrated to CentOS 7
- HDF5 plugin for Slurm (SLURM - monitor resources during job - #3 by Luca.Capello)
- New software installed: HDFView/2.14-System-Java-centos7 (New software installed: HDFView/2.14-System-Java-centos7)
- Introduction of a “Health check” script that should help us detect performance problems on nodes
- Security patches and recommended updates on all servers and compute nodes
- Reinstallation of all the compute nodes
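As a quick illustration of the new TRES-based GPU options mentioned above, here is a minimal job script sketch (the partition name and executable are assumptions, adapt them to your own workflow):

```shell
#!/bin/sh
# Illustrative job script using the TRES GPU options available since Slurm 20.
#SBATCH --partition=shared-gpu   # assumed partition name, adapt as needed
#SBATCH --ntasks=2
#SBATCH --gpus-per-task=1        # bind one GPU to each task
#SBATCH --time=01:00:00

srun ./my_gpu_program            # hypothetical executable
```

Compared to the old per-node --gres=gpu:N syntax, --gpus-per-task lets Slurm handle the GPU-to-task mapping for you.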
Please also note that no pending jobs were lost during this maintenance, despite what was announced:

> I forgot in the first announcement that there will be a fundamental change in the Slurm configuration which will cause the loss of all PENDING jobs.
We will now keep working on the installation of Yggdrasil, and we will keep you posted when it is open for tests (we hope in the coming weeks)!
We wish you all the best,
Massimo, for The HPC team