Primary information
Username: calvogon
Cluster: bamboo
Description
I am trying to run a job that requires inter-GPU communication. I know that my code works, since it runs fine on the other nodes of bamboo. On gpu008, however, the job freezes and makes no progress. I suspect that inter-GPU communication on gpu008 is faulty, since running on a single GPU on that node works fine.
Steps to Reproduce
Run a job on gpu008 that uses all 4 GPUs at the same time. The job freezes instead of running.
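A minimal check along these lines might help narrow down which GPU pair is affected. It is only a sketch and assumes PyTorch is available on the node; the original job may use a different framework. The function names `gpu_pairs` and `check_inter_gpu` are hypothetical helpers, not part of my actual code.

```python
"""Hypothetical diagnostic: test peer access and small copies between GPUs."""
from itertools import permutations


def gpu_pairs(n_gpus):
    """Return every ordered (source, destination) pair of distinct GPU indices."""
    return list(permutations(range(n_gpus), 2))


def check_inter_gpu(n_gpus=4):
    """Copy a small tensor between every GPU pair and report the result."""
    # Assumption: PyTorch with CUDA support is installed in the job environment.
    import torch

    visible = torch.cuda.device_count()
    if visible < n_gpus:
        raise RuntimeError(f"expected {n_gpus} GPUs, found {visible}")

    for src, dst in gpu_pairs(n_gpus):
        # Whether the driver reports direct peer-to-peer access for this pair.
        p2p = torch.cuda.can_device_access_peer(src, dst)
        # Announce the pair first, so a hang identifies the pair being tested.
        print(f"GPU {src} -> GPU {dst}: peer_access={p2p} ...", flush=True)
        x = torch.arange(1024, device=f"cuda:{src}")
        y = x.to(f"cuda:{dst}")
        torch.cuda.synchronize(dst)
        # Compare on the host to confirm the data arrived intact.
        ok = torch.equal(x.cpu(), y.cpu())
        print(f"GPU {src} -> GPU {dst}: copy_ok={ok}", flush=True)
```

Calling `check_inter_gpu()` inside a 4-GPU allocation on gpu008 and on a known-good node would show whether a specific pair hangs or returns bad data. Because the pair is printed before each copy, the last line of output identifies where a freeze occurs.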