[bamboo] gpu006 NVIDIA driver in a bad state

Primary information

Username: calvogon
Cluster: bamboo

Description

The NVIDIA driver is in a bad state.

It reports Xid error 95 (an uncontained ECC memory error).

Steps to Reproduce

dmesg showed Xid 95 uncontained errors.
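For anyone who wants to check a node's kernel log for the same symptom, here is a minimal sketch that pulls Xid codes out of log text. It assumes the driver's usual "NVRM: Xid (<PCI address>): <code>, ..." line layout; the function name and the sample line are illustrative only and were not copied from gpu006.

```python
import re

# Matches kernel log lines of the form "NVRM: Xid (<PCI address>): <code>, ...".
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]+)\): (?P<code>\d+)")


def find_xid_events(log_text):
    """Return (device, xid_code) pairs for every Xid line found in log_text."""
    return [
        (match.group("device"), int(match.group("code")))
        for match in XID_PATTERN.finditer(log_text)
    ]


# Illustrative line only, not taken from the real node.
sample = "[12345.678] NVRM: Xid (PCI:0000:3b:00): 95, pid=1234, name=python"
print(find_xid_events(sample))
```

The same function could be given the captured output of dmesg, which may require elevated privileges on some nodes.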

Running cuInit(0) returns error code 999 (CUDA_ERROR_UNKNOWN).
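The check can be reproduced without compiling anything. This is a rough sketch that calls cuInit(0) from the CUDA driver library through ctypes; it assumes the library is installed as libcuda.so.1, and the helper names and the small lookup table (a subset of the CUresult values in cuda.h) are my own.

```python
import ctypes

# Small subset of CUresult values from cuda.h; extend as needed.
CU_RESULT_NAMES = {
    0: "CUDA_SUCCESS",
    100: "CUDA_ERROR_NO_DEVICE",
    999: "CUDA_ERROR_UNKNOWN",
}


def describe_cu_result(code):
    """Translate a CUresult integer into its symbolic name when known."""
    return CU_RESULT_NAMES.get(code, f"unrecognized CUresult {code}")


def try_cu_init(library="libcuda.so.1"):
    """Call cuInit(0); return the CUresult code, or None if the library cannot be loaded."""
    try:
        libcuda = ctypes.CDLL(library)
    except OSError:
        return None
    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    return libcuda.cuInit(0)


if __name__ == "__main__":
    result = try_cu_init()
    print("libcuda not found" if result is None else describe_cu_result(result))
```

On a healthy node this should print CUDA_SUCCESS; on gpu006 it corresponds to the 999 result reported above.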

I think the fix would be either to reboot the machine or to try running nvidia-smi --gpu-reset.

On a separate note, what happened to gpu005 on bamboo? I can’t see it anymore.

Best,

Ramon.

Hello @Ramon.CalvoGonzalez

gpu006 has been put back into production and seems to be working again.

Regarding gpu005, this node is reserved for a central Unige project; for technical reasons, the node has been removed from Slurm.
