Jobs on gpu044 regularly fail

Hi HPC team,

I have been having issues running jobs on gpu044 specifically. These are my usual PyTorch-with-CUDA jobs that run fine on all other nodes. My CUDA version is 11.8, so it should satisfy the minimum version required by the node.

The exact error message I get is:

RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
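
For completeness, a check along these lines (run from inside the job allocation, after loading whatever environment provides PyTorch; the setup step is site-specific and omitted here) should show whether the allocated GPU is usable at all, independently of my training code:

```bash
# Minimal GPU sanity check from inside an allocation on gpu044.
# Environment setup (module load / conda activate) is site-specific and omitted.
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available()); print(torch.ones(1, device='cuda'))"
```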

The full path to the log file, in case it is useful:

/home/users/l/leighm/DiffBEIT/logs/5463965_2.out

I am not attempting to run on multiple GPUs; I am requesting and running on a single GPU.
After chatting with my colleagues, it seems they get the same error. We have taken to excluding the node from our submissions (using --exclude=gpu044), which then run fine on the other nodes.
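
For reference, a trimmed sketch of the kind of submission script we use with the workaround (the job name, resources, partition and entry point here are illustrative placeholders, not my exact job):

```bash
#!/bin/bash
#SBATCH --job-name=example            # illustrative name, not the real job
#SBATCH --partition=shared-gpu        # placeholder partition
#SBATCH --cpus-per-task=8
#SBATCH --mem=32GB
#SBATCH --time=12:00:00
#SBATCH --gres=gpu:1
#SBATCH --exclude=gpu044              # workaround: skip the problematic node

# environment setup is site-specific (module load / conda activate / venv)
srun python train.py                  # placeholder for the actual entry point
```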

It should also be noted that I did get one job running on gpu044 this morning, which was a surprise. However, all jobs after that failed in the same way described above.

Thanks for the help.
Matthew Leigh

Please share your sbatch script with us.

I have the same issue.
I allocate using:
salloc -c8 --partition=private-dpnc-gpu,shared-gpu --time=12:00:00 --mem=32GB --gres=gpu:1,VramPerGpu:20G
and get one of the GPUs on gpu044.
But when running I get the same CUDA error:

InstantiationException: Error in call to target 'run.evaluation.EvaluatePhysics':
RuntimeError('CUDA error: CUDA-capable device(s) is/are busy or unavailable\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n')
full_key: eval

It seems like some of the GPUs on gpu044 work and some do not. In the picture below you can see which GPU I am on, so you can test it yourself.
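
If the picture does not come through, the same information can be read from the command line inside the salloc session (a sketch; note that nvidia-smi may list every GPU on the node if device cgroups do not hide the others):

```bash
# Identify which physical GPU(s) this allocation got.
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
nvidia-smi --query-gpu=index,name,uuid,serial --format=csv
```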

Hello,

Running what? Please provide the details of exactly what you launch and how you launch it.

If you suspect that the allocated GPU is faulty, please run `nvidia-smi --query-gpu=gpu_name,gpu_bus_id,vbios_version,serial,uuid --format=csv` from the compute node to get the details of the allocated GPU and post the output here.
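
If the GPU details themselves look fine, it is also worth checking whether a stale process is still attached to the card or whether it sits in an exclusive compute mode, since either can produce this "busy or unavailable" error; something like the following (assuming a reasonably recent nvidia-smi):

```bash
# Compute mode of the visible GPUs, plus any processes still attached to them.
nvidia-smi --query-gpu=index,uuid,compute_mode --format=csv
nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name,used_memory --format=csv
```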

Hello @Malte.Algren and @Matthew.Leigh

I ran a burn test on all the GPUs of this node and GPU #3 did not work. I tried to reset it and that failed as well; the tool is asking for a reboot of the compute node. I have set the node to drain, and once all the running jobs on it have finished we will reboot it and test it again.

Thanks for the notification.

Best


If you run the burn test now, is GPU #3 healthy?

Dear Matthew,

It seems the GPU is not healthy; I have asked our provider to replace the part.

Best regards,

GPU replaced under warranty by @Gael.Rossignol and node back in production.

Best