Baobab scheduled special maintenance: 10-12 September 2024

Dear users,

As just announced on the baobab-announce@ mailing list, we will be performing special hardware maintenance on the Baobab HPC cluster from 10th to 12th September 2024 (inclusive).

The maintenance will start at 08:00 +0100 and you will receive an email when the maintenance is finished.

The cluster will be completely unavailable during this time, with no access whatsoever (not even to retrieve files).

If you submit a job in the meantime, make sure that the expected wall time (duration) does not overlap with the start of the maintenance, or your job will be scheduled after the maintenance.
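To make the wall-time check concrete, here is a minimal sketch that computes the longest wall time a job can request and still finish before the maintenance starts. It assumes the scheduler accepts a Slurm-style `D-HH:MM:SS` time limit; the function name `max_wall_time` and the 10-minute safety margin are illustrative choices, not part of the cluster's tooling. Note that what matters is when the job is expected to start, which may be later than when you submit it.

```python
from datetime import datetime, timedelta, timezone

# Maintenance start as stated in the announcement: 10 September 2024, 08:00 +0100.
MAINTENANCE_START = datetime(2024, 9, 10, 8, 0, tzinfo=timezone(timedelta(hours=1)))


def max_wall_time(start_time: datetime,
                  margin: timedelta = timedelta(minutes=10)) -> str:
    """Return the longest wall time (D-HH:MM:SS) ending before the maintenance.

    start_time must be timezone-aware; margin is an assumed safety buffer.
    """
    remaining = MAINTENANCE_START - start_time - margin
    if remaining <= timedelta(0):
        raise ValueError("no time left before the maintenance window")
    total_seconds = int(remaining.total_seconds())
    days, rest = divmod(total_seconds, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    tz = timezone(timedelta(hours=1))
    # A job expected to start on 9 September 2024 at 20:00 +0100.
    print(max_wall_time(datetime(2024, 9, 9, 20, 0, tzinfo=tz)))
```

For a job starting on 9 September at 20:00 +0100, this prints `0-11:50:00`, which is the largest time limit that still leaves the 10-minute margin before 08:00 the next morning.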

What will be done during this maintenance:

We’ll be upgrading our InfiniBand fast networking stack from 40 Gb/s QDR to 100 Gb/s EDR. We already replaced ~150 InfiniBand network cards in our compute nodes and servers a few weeks ago, and now we’ll replace all our InfiniBand switches.

This will allow us to keep the cluster up to date and improve performance.

This does not replace our routine maintenance, which is planned for 2nd to 3rd October 2024 (an email with details will be sent).

Thank you for your understanding.

Best regards,
The HPC Team

Dear users,

The maintenance is over. This was the longest maintenance we have ever done: three full days in the data centre with Adrien, Gaël and myself.

We replaced 13 InfiniBand switches with faster ones. We also took the opportunity to reroute all the Ethernet and power cables to ensure better cooling of the compute nodes.

The benefit for you: better bandwidth between nodes for access to storage and computation.

The benefit for us: compute nodes that are easier to manage and fewer network issues.

Thank you for your patience, best regards

Yann for the HPC team


Some random pictures of Baobab and us working in the DC




