Primary information
Username: calvogon
Cluster: bamboo
Description
The NVIDIA driver is in a bad state.
The kernel log reports Xid error 95, which is an uncontained ECC memory error.
Steps to Reproduce
dmesg showed Xid 95 uncontained errors.
Running cuInit(0) returns error code 999 (CUDA_ERROR_UNKNOWN).
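
For reference, the two checks above can be scripted roughly as follows. This is only a sketch: the helper names (find_xid_events, cu_init_status, read_kernel_log) are made up for illustration, the regular expression assumes the usual "NVRM: Xid (<device>): <code>, ..." layout of the kernel log line, and libcuda.so.1 is assumed to be the driver library name on the node.

```python
import ctypes
import re
import subprocess

# Assumed layout of the kernel log line: "NVRM: Xid (<device>): <code>, ..."
XID_PATTERN = re.compile(r"NVRM: Xid \(([^)]*)\): (\d+)")


def find_xid_events(log_text):
    """Return (device, xid_code) pairs for every Xid line found in log_text."""
    return [(dev, int(code)) for dev, code in XID_PATTERN.findall(log_text)]


def cu_init_status(lib_name="libcuda.so.1"):
    """Call cuInit(0) via the driver library; return None if it cannot be loaded."""
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError:
        return None
    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    return libcuda.cuInit(0)


def read_kernel_log():
    """Return dmesg output, or an empty string if dmesg is unavailable."""
    try:
        # dmesg may need elevated privileges on some systems.
        result = subprocess.run(["dmesg"], capture_output=True, text=True)
    except OSError:
        return ""
    return result.stdout


if __name__ == "__main__":
    for device, code in find_xid_events(read_kernel_log()):
        print(f"Xid {code} reported on {device}")

    status = cu_init_status()
    if status is None:
        print("CUDA driver library could not be loaded")
    else:
        # 0 is CUDA_SUCCESS; 999 is CUDA_ERROR_UNKNOWN.
        print(f"cuInit(0) returned {status}")
```

On the affected node I would expect this to list Xid 95 and report a cuInit(0) result of 999, matching what I saw by hand.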
I think the fix would be to either reboot the machine or try running nvidia-smi --gpu-reset.
On a separate note, what happened to gpu005 on bamboo? I can’t see it anymore.
Best,
Ramon.