Are all the GPU nodes wired via a fast interconnect like InfiniBand?
Or are they limited to 10G / 1G? I wasn’t able to find any info about this here. If some are wired via InfiniBand, which gpu* nodes would those be?
I’m running a few PyTorch distributed data parallel (DDP) jobs via TCP and just want to know what to expect.
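For reference, a way to check this directly from a compute node would be something like the following (a sketch; the srun flags and partition name are just placeholders, and it assumes interactive jobs are allowed and the usual iproute2 / infiniband-diags tools are installed on the nodes):
# Get an interactive shell on a GPU node, then list its network interfaces.
srun --partition=shared-gpu-EL7 --gres=gpu:1 --pty bash
ip -br link   # brief list of interfaces (eth*, ib*, ...)
ibstat        # InfiniBand port state and link rate, if an HCA is present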
Thanks!
@Yann.Sagon answered this at the HPC lunch meetup. The nodes are connected by InfiniBand, but Singularity needs extra work to make use of it. Marking this thread as closed; I will follow up with another thread regarding distributed training.
Thanks Yann!
A quick hack may be to “talk” to the nodes through their InfiniBand TCP interface (IP over InfiniBand) instead of the Ethernet network. For example, from node001 you can “talk” to node002 (Ethernet, 1G) or node002i (InfiniBand, 40G). This should probably work out of the box; I think there is only extra work needed in Singularity if you want to support RDMA.
Ah neat! This would be a good workaround for not having the proper driver in Singularity. I do believe PyTorch supports RDMA since it is baked into NCCL 2.5+, but I’m not sure what sort of overhead that would have. I can benchmark and report back, though.
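Concretely, I’d try pointing the rendezvous at the IB hostnames, roughly like this (a rough sketch: the port and the ib0 interface name are guesses on my side; GLOO_SOCKET_IFNAME and NCCL_SOCKET_IFNAME are the variables PyTorch’s gloo and NCCL backends read to pick the TCP interface):
# Use the IPoIB hostname for the rendezvous instead of the 1G ethernet name.
export MASTER_ADDR=node002i   # IB-facing hostname of the rendezvous node, as suggested above
export MASTER_PORT=29500      # placeholder port
# Route gloo/NCCL TCP traffic over the IB interface too (assuming it is called ib0).
export GLOO_SOCKET_IFNAME=ib0
export NCCL_SOCKET_IFNAME=ib0
srun python train.py          # train.py is a placeholder; it calls torch.distributed.init_process_group(init_method="env://")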
Unfortunately I ran into another issue (I sent an email to HPC-support): --gpus-per-task is not currently supported, and the only way to get more than 8 GPUs is with job arrays.
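For completeness, the job-array pattern I mean would look roughly like this (an untested sketch: the rendezvous host, port and script name are placeholders, and since array tasks are not guaranteed to start together, the env:// rendezvous would block until all nine jobs are actually running):
#!/bin/bash -l
#SBATCH --job-name=ddp-array
#SBATCH --array=0-8
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1
#SBATCH --partition=shared-gpu-EL7
#SBATCH --time=12:00:00
#SBATCH --mem=16000

# torch.distributed's env:// init reads RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT.
export RANK=${SLURM_ARRAY_TASK_ID}
export WORLD_SIZE=9
export MASTER_ADDR=node001i   # placeholder rendezvous host (IPoIB name)
export MASTER_PORT=29500
srun python train.py          # train.py is a placeholder for the actual training script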
Hello, indeed, and supporting it is a change that would kill all the running and pending jobs on Baobab. So unless we hit a big issue before then, we don’t plan to make the change before launching Yggdrasil. Is there any other way to solve your issue using only gres?
Is there an equivalent --gres / <other_flag> option that can be used to request 1 GPU per task for a setup requiring more than 8 GPUs? I tried the following, but it only requested 1 GPU for all the tasks:
#!/bin/bash -l
#SBATCH --job-name=SOTAVAE
#SBATCH --ntasks=9
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1
#SBATCH --partition=shared-gpu-EL7
#SBATCH --time=12:00:00
#SBATCH --mem=16000
#SBATCH --constraint="COMPUTE_CAPABILITY_6_0|COMPUTE_CAPABILITY_6_1"
srun --ntasks=9 --exclusive --multi-prog distributed.conf
I also tried setting #SBATCH --gres=gpu:9, but as expected this threw a “Node does not exist” error, since --gres is a per-node request and no single node has 9 GPUs.
Moving this discussion to a new thread, since the original question of this topic has been answered.
Update to my post: it is now possible to use --gpus-per-task on Yggdrasil and Baobab.
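For example, the allocation from the script above could now be expressed directly with per-task GPUs (a short sketch; only the GPU-related lines change, and the exact srun behaviour for per-task GPU binding may depend on the Slurm version):
# Request one GPU bound to each of the nine tasks instead of a per-node --gres.
#SBATCH --ntasks=9
#SBATCH --cpus-per-task=1
#SBATCH --gpus-per-task=1
srun --ntasks=9 --exclusive --multi-prog distributed.conf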