New cluster Bamboo and important information about buying private nodes

Dear PI, dear users,

As you may have already seen on the forum, the new cluster’s name, chosen by the community, will be Bamboo!

This cluster will replace the legacy cluster Baobab, which was put into production back in 2013! Yes, time flies!

We ran a public tender to buy the new cluster and received three offers; the winning offer came from Dalco. Special thanks to Jean-Luc Falcone (CUI) for the help with the tender.

The new cluster will be similar to Baobab and Yggdrasil, but based on AMD EPYC 7742 CPUs, and will be physically hosted at Campus Biotech.

Summary of the cluster

  • 1 login node: 2 x AMD EPYC 7742 2.25GHz (64 cores each), 512GB RAM
  • 1 admin node: 2 x AMD EPYC 7742 2.25GHz (64 cores each), 512GB RAM
  • 43 compute nodes, each with 512GB RAM and 2 x AMD EPYC 7742 2.25GHz, for a total of 5504 CPU cores
  • 2 “bigmem” compute nodes with 1TB RAM and 2 x AMD EPYC 72F3 3.7GHz
  • 2 “single precision” GPU compute nodes with 8 x RTX 3090 24GB GPUs each
  • 1 “double precision” GPU compute node with 4 x A100 80GB GPUs
  • 1PB regular storage (spinning disks + SSD for metadata)
  • 400TB fast storage (SSD for metadata and storage)
  • fast interconnect between nodes: InfiniBand EDR 100Gb/s for storage and MPI
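For those already planning jobs: assuming Bamboo runs the same Slurm setup as Baobab and Yggdrasil, requesting the hardware above should look familiar. The sketch below is only an illustration; the partition name, GRES request and module name are placeholders, not confirmed Bamboo settings.

```bash
#!/bin/bash
# Illustrative sketch only: assumes Bamboo uses the same Slurm setup as
# Baobab/Yggdrasil. Partition and module names below are placeholders.
#SBATCH --job-name=bamboo-gpu-test
#SBATCH --partition=shared-gpu     # assumed partition name
#SBATCH --gres=gpu:1               # one GPU on an RTX 3090 or A100 node
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=01:00:00

module load CUDA                   # module name may differ on Bamboo
srun nvidia-smi                    # show the GPU(s) allocated to the job
```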

Investment in the new cluster

Throughout the year, we receive many requests, through the COINF or directly from research groups, to buy private compute nodes to add to Yggdrasil or Baobab.

We would like to streamline this process by offering two time slots per year during which extra nodes can be purchased. This would have several advantages:

  • better price negotiation
  • less overhead for us: quoting, ordering, storage, installation, configuration, organization, etc.

We propose that research groups interested in buying extra nodes for the next “slot” contact us by email (hpc@unige.ch) with the following details:

  • the amount of money you are willing to invest in the cluster
  • the type of hardware you are interested in buying: compute nodes, GPU nodes, storage, other
  • the deadline by which the money must be spent

Deadline for the request: 13th of May 2022

Roadmap

  • install the nodes already bought as a Baobab extension in the new racks: May 2022
  • collect research group purchase requests: deadline 13th of May 2022
  • order confirmation with private nodes included: end of May 2022
  • cluster installation in Biotech: October 2022
  • move recent enough nodes and storage from Baobab to Bamboo: end of 2022
  • decommission the rest of Baobab: during 2023

Pending issues

  • We need to move some nodes from Baobab to another location in our datacentre because too much heat is produced in a single spot. We are waiting for the new racks ordered by DiSTIC to be operational, probably by the end of April.
  • We need to negotiate with the Biotech foundation how much power and cooling we can use for Bamboo. We will probably have to put Bamboo in a new POD, as the heat produced will be high.

Your HPC team

Hi,
Thank you for this exciting news. For groups that have "private" nodes on Baobab, can you give (probably privately) more details on whether and/or when their nodes will be transferred, and whether there is room for negotiation (what counts as a "recent enough" node)?
Thanks,
Olivier

Dear PI, dear users,

Many of you have been asking about the status of our new HPC cluster Bamboo, so it’s time to give you some news. Here’s a brief summary of the key points:

Achievements:

  1. Agreement with Biotech: Obtaining the agreement to host Bamboo in the Biotech datacenter was a complex process (multiple entities involved, technical issues, fears :scream:, etc.), but it has been successfully completed. :tada:
  2. Datacenter Preparation: Several tasks have been completed to prepare the datacenter for Bamboo’s installation, including electrical preparation, physical preparation, and network setup.
  3. Rack Installation: Eight racks have been installed with minimal disturbance to the datacenter. The installation is nearly complete, with only the electrical door and roof remaining (see photos below).

Challenges:

  1. Smoke Detector Relocation: There is a need to relocate the smoke detector within the datacenter due to its current placement within the containment.

Upcoming Steps:

  1. Cluster Installation: The next steps include receiving and installing the cluster within the containment.
  2. Configuration and Production: Once installed, the cluster will be configured and put into production. A brief period will be allocated for monitoring and ensuring smooth operation.

Baobab Evolution:

  1. Node Relocation: Some nodes from Baobab have been moved to another location in the datacenter to address heat-related issues caused by concentrated server clusters.
  2. Decommissioning Old Nodes: Older nodes on Baobab have been decommissioned to make space for newer nodes, easing the pressure on the eventual decommissioning of Baobab.
  3. Future Migrations: After Bamboo has been operational for a few weeks, a decision will be made regarding the migration of nodes from Baobab and Yggdrasil to Bamboo. There are also plans to upgrade Baobab to extend its lifespan.

We will continue to keep you updated on the progress, and please feel free to reach out if you have any questions or concerns. We look forward to the successful deployment of Bamboo and the improved performance it will bring to our research endeavors.

Best regards,

HPC team

