Job array spread across multiple clusters

Ludovic.Dumoulin · October 10, 2023, 1:51pm

Hello,

I’m interested in determining whether it’s possible to initiate a job array that can utilize the GPUs available on both the Baobab and Bamboo clusters.

For instance, my simulations are optimized for A100 GPUs. If I were to start a job array on Baobab (as it has my lab’s private GPUs), would I be able to access the GPU nodes on Bamboo? I have not previously used Yggdrasil due to the absence of A100 GPUs, and I’m uncertain if there’s a straightforward approach to harnessing resources from both clusters. If there isn’t a straightforward method, splitting similar resources across multiple clusters could potentially pose challenges for users.

Thank you for your assistance.

(Additionally, utilizing the same type of resources enables more accurate predictions for the required processing time.)

Yann.Sagon · October 10, 2023, 2:47pm

Hi @Ludovic.Dumoulin, thanks for asking.

Right now this isn’t possible. We need to enable cluster federation first. We are evaluating how to do that in our environment: we need to “move” some of our physical servers outside of the clusters, which is quite challenging but interesting. Once this is done, the issue you’ll face is that your data must either be present on every cluster or be stored on shared storage reachable from all the clusters. Another challenge :)

Indeed, the issue is a physical constraint: we don’t have enough power in a single datacenter to host the three clusters, or one big one.

We’ll keep you posted, this is our next “big step”.
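
Until federation is available, one possible manual workaround is to split the array's index range and submit one slice from each cluster's login node. The sketch below is only an illustration under assumptions: the weights (for example, how many A100 GPUs you can reach on each cluster), the script name `job.sh`, and the helper names `split_array` and `sbatch_commands` are all hypothetical, and the input data must already be present on each cluster. Because `sbatch --array=75-99` keeps `SLURM_ARRAY_TASK_ID` in the range 75 to 99, each task still sees its global index.

```python
def split_array(n_tasks, weights):
    """Split indices 0..n_tasks-1 into contiguous ranges proportional to weights.

    `weights` maps a cluster name to a relative share (hypothetical values,
    e.g. the number of GPUs you expect to get on that cluster).
    Returns {cluster: (first_index, last_index)}; clusters with no tasks are omitted.
    """
    total = sum(weights.values())
    names = list(weights)
    ranges = {}
    start = 0
    for i, name in enumerate(names):
        if i == len(names) - 1:
            # The last cluster takes whatever remains after integer division.
            count = n_tasks - start
        else:
            count = n_tasks * weights[name] // total
        if count > 0:
            ranges[name] = (start, start + count - 1)
        start += count
    return ranges


def sbatch_commands(ranges, script="job.sh"):
    """Build one `sbatch --array=LO-HI` command per cluster (script name is an assumption)."""
    return {
        name: f"sbatch --array={lo}-{hi} {script}"
        for name, (lo, hi) in ranges.items()
    }


if __name__ == "__main__":
    # Hypothetical 3:1 split of 100 tasks between the two clusters.
    ranges = split_array(100, {"baobab": 3, "bamboo": 1})
    for cluster, cmd in sbatch_commands(ranges).items():
        print(f"# run on the {cluster} login node:")
        print(cmd)
```

Each printed command has to be run by hand on the corresponding cluster; nothing here submits jobs or moves data between clusters.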

Ludovic.Dumoulin · October 11, 2023, 8:50am

Thank you for your answer!
It could potentially be more user-friendly if the various clusters were designated for specific primary purposes. For instance, having one cluster tailored for GPUs and/or CPU-bigmem tasks, another focused on standard CPU usage, and a third dedicated to debug and interactive sessions. With such a setup, users could select the appropriate cluster based on their specific requirements, enhancing overall usability and convenience.

Yann.Sagon · October 13, 2023, 12:02pm

Thanks for the suggestion. The main issue is still that we are lacking a shared filesystem reachable from every cluster (and usable to do the computation of course).

Best