[bamboo] gpu008 inter-GPU communication not working

Primary information

Username: calvogon
Cluster: bamboo

Description

I am trying to run a job that requires inter-GPU communication. I know that my code works, since it runs fine on any other node of bamboo. On gpu008, however, the job freezes and never makes progress. I suspect that inter-GPU communication is faulty on gpu008, since running on a single GPU there works fine.

Steps to Reproduce

I’m trying to run jobs that require all 4 GPUs at the same time.
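
A minimal sketch of a check that might help narrow this down. It assumes PyTorch with CUDA support is available on the node, which is an assumption and not necessarily what the original job uses. The helper names `device_pairs` and `check_pair` are hypothetical. The script copies a tensor between every ordered pair of GPUs and prints the pair before each copy, so if it hangs, the last line printed shows which pair stalled.

```python
import itertools


def device_pairs(n_gpus):
    """Return every ordered (src, dst) pair of distinct GPU indices."""
    return list(itertools.permutations(range(n_gpus), 2))


def check_pair(src, dst, n_elems=1 << 20):
    """Copy a tensor from GPU src to GPU dst and verify the contents."""
    import torch  # assumption: PyTorch with CUDA support is installed

    x = torch.arange(n_elems, dtype=torch.float32, device=f"cuda:{src}")
    y = x.to(f"cuda:{dst}")      # device-to-device copy
    torch.cuda.synchronize(dst)  # wait until the copy has finished
    return bool(torch.equal(x.cpu(), y.cpu()))


def main():
    try:
        import torch
    except ImportError:
        print("PyTorch is not available; cannot run the check.")
        return

    n_gpus = torch.cuda.device_count()
    print(f"Visible GPUs: {n_gpus}")
    for src, dst in device_pairs(n_gpus):
        # Print before the copy so a hang shows which pair stalled.
        print(f"GPU {src} -> GPU {dst}: ", end="", flush=True)
        p2p = torch.cuda.can_device_access_peer(src, dst)
        ok = check_pair(src, dst)
        print(f"peer access={p2p}, copy {'ok' if ok else 'MISMATCH'}")


if __name__ == "__main__":
    main()
```

Run inside an allocation that holds all 4 GPUs of gpu008, this should print 12 lines if every pair works. Running the same script on another bamboo node would give a baseline for comparison.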