GPU044 CUDA Unavailable

Hello all,

I have been noticing a strange glitch on gpu044 this morning: I cannot get CUDA to work with my training scripts.

Here’s my (minimal example) sbatch script:

#!/bin/sh
#SBATCH --job-name=curtains
#SBATCH --cpus-per-task=2
#SBATCH --time=00-10:00:00
#SBATCH --partition=private-dpnc-gpu,shared-gpu
#SBATCH --output=/home/users/s/senguptd/UniGe/Anomaly/curtains/curtains2/jobs/slurm-%A-%x_%a.out
#SBATCH --chdir=/home/users/s/senguptd/UniGe/Anomaly/curtains/curtains2
#SBATCH --mem=10GB
#SBATCH --gres=gpu:1
#SBATCH -a 0-11

export XDG_RUNTIME_DIR=""
module load GCC/9.3.0 Singularity/3.7.3-Go-1.14
export PYTHONPATH=${PWD}:${PWD}/python_install:${PYTHONPATH}

srun singularity exec --nv -B /home/users/s/senguptd/UniGe/Anomaly/curtains/curtains2,/home/users/s/senguptd/UniGe/Anomaly/curtains/curtains2/curtains,/srv/beegfs/scratch/users/s/senguptd/:/scratch/,/srv/beegfs/scratch/groups/rodem/LHCO:/lhco_dir,/srv/beegfs/scratch/groups/rodem/anomalous_jets/:/srv/beegfs/scratch/groups/rodem/anomalous_jets/,/srv/beegfs/scratch/groups/rodem/RPVMultiJets/:/rpvmj_dir /home/users/s/senguptd/UniGe/Anomaly/curtains/curtains2/container/id_init.sif\
	python3 /home/users/s/senguptd/UniGe/Anomaly/curtains/curtains2/experiments/sb_to_sb.py

Disclaimer: I have also tried #SBATCH --gpus=1, with the same result on gpu044.

This is the output I get.

/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py:107: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /opt/conda/conda-bld/pytorch_1678411187366/work/c10/cuda/CUDAFunctions.cpp:109.) return torch._C._cuda_getDeviceCount() > 0
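To localize where initialization breaks, here is a minimal stdlib-only sketch that can be run on the node (or inside the container) even when torch cannot initialize CUDA at all. The helper name `cuda_env_report` is illustrative, not part of my job script:

```python
import os

def cuda_env_report():
    """Gather host-side facts relevant to a CUDA "unknown error".

    Uses only the standard library, so it works even when
    torch.cuda fails to initialize.
    """
    return {
        # Slurm sets this for jobs requesting --gres=gpu:N; an empty
        # or missing value means no GPU was exposed to the job.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"),
        # Present only when the NVIDIA kernel driver is loaded on the node.
        "driver_loaded": os.path.exists("/proc/driver/nvidia/version"),
        # Missing /dev/nvidia* device nodes can also trigger this error.
        "device_nodes": sorted(
            f for f in os.listdir("/dev") if f.startswith("nvidia")
        ) if os.path.isdir("/dev") else [],
    }

if __name__ == "__main__":
    for key, value in cuda_env_report().items():
        print(f"{key}: {value}")
```

If `driver_loaded` is False or `device_nodes` is empty while other nodes show them, the problem is on the node itself rather than in the job script.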

Jobs launched through the same script and running on other nodes are running as usual.
Is anyone else facing a similar issue?
Current jobs running on the node: 829513_0-7

Regards,
Deb

Hi @Debajyoti.Sengupta,

Did you try loading the CUDA module?

(baobab)-[alberta@login2 ~]$ ml spider cuda

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  CUDA:
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Description:
      CUDA (formerly Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA and implemented by the graphics processing units (GPUs) that they produce.
      CUDA gives developers access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs.

     Versions:
        CUDA/8.0.44
        CUDA/8.0.61
        CUDA/9.1.85
        CUDA/9.2.88
        CUDA/9.2.148.1
        CUDA/10.0.130
        [...]

Best Regards

Hi Adrien,

I just relaunched the jobs with a modified Slurm script that adds
module load CUDA/10.0.130
but the behaviour is unchanged. As I said, the same script works on all nodes other than gpu044.

Cheers,
Deb

Here’s what nvidia-smi shows when I ssh onto gpu044.

I have about 7 jobs running on the node

Hi @Debajyoti.Sengupta

gpu044 has RTX5000 cards. As explained here, the minimum CUDA version for these cards is 11.
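The version check Yann describes can be sketched as a small helper (`cuda_version_ok` is a hypothetical function, shown here only to illustrate the comparison against the CUDA 11 minimum stated above):

```python
def cuda_version_ok(loaded, required="11.0"):
    """Return True if the loaded CUDA toolkit meets the minimum version.

    Versions are dotted strings such as "10.0.130"; the comparison is
    numeric, component by component.
    """
    to_tuple = lambda v: tuple(int(x) for x in v.split("."))
    loaded_t, required_t = to_tuple(loaded), to_tuple(required)
    # Pad the shorter tuple with zeros so "11" compares equal to "11.0".
    width = max(len(loaded_t), len(required_t))
    pad = lambda t: t + (0,) * (width - len(t))
    return pad(loaded_t) >= pad(required_t)

# The module loaded earlier in the thread is too old for these cards:
print(cuda_version_ok("10.0.130"))  # False
print(cuda_version_ok("11.5.0"))    # True
```

So `module load CUDA/10.0.130` would fail this check on gpu044, consistent with the behaviour Deb reported.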
Best
Yann
