RuntimeError: CUDA unknown error

Hello all,

I hope this message finds you well. I am currently having trouble getting PyTorch to work on the GPU and was wondering if anyone here could provide some guidance or assistance.

Here’s my sbatch script:

#!/bin/sh

#SBATCH --partition=private-teodoro-gpu
#SBATCH --time=0-00:15:00
#SBATCH --gpus=2
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=16000

echo $CUDA_VISIBLE_DEVICES

module load Anaconda3/2022.05 CUDA/11.7.0

source activate nlu

echo $CUDA_VISIBLE_DEVICES

echo "python script"

srun python test.py

And here is my test.py script:

import torch

print("GPUs in python script", list(range(torch.cuda.device_count())))

import os

# Show which CUDA toolkit is visible inside the job
command = "nvcc --version"
os.system(command)

# Show the driver version and GPU status
command = "nvidia-smi"
os.system(command)

device = torch.device("cuda:0")

# Define the matrices
A = torch.randn(1000, 1000).to(device)
B = torch.randn(1000, 1000).to(device)

# Perform matrix multiplication on the GPU
C = torch.matmul(A, B)

# Move the result back to the CPU
C = C.cpu()

# Print the result
print(C)

And here is the output:

0,1

0,1

python script

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0

Wed Jun 14 14:22:48 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090         On | 00000000:01:00.0 Off |                  N/A |
|  0%   27C    P8               26W / 370W|      1MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090         On | 00000000:21:00.0 Off |                  N/A |
|  0%   27C    P8               30W / 370W|      1MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

GPUs in python script [0, 1]

Traceback (most recent call last):
  File "/home/users/y/yazdani0/NLU4EHR_cosim_ft/test.py", line 17, in <module>
    A = torch.randn(1000, 1000).to(device)
  File "/home/users/y/yazdani0/.conda/envs/nlu/lib/python3.10/site-packages/torch/cuda/__init__.py", line 229, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.
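A quick check along these lines, run in the job before the failing line, would show what the process actually sees (just a minimal sketch using standard PyTorch attributes; nothing here is cluster-specific):

echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
srun python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"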

Thank you.

Hi, please give Apptainer a try: [tutorial] Using PyTorch2 with GPU on Baobab
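Very roughly, the tutorial's approach replaces the conda environment with a container, something like the following (the image tag below is my assumption, not necessarily the one used in the tutorial):

apptainer pull pytorch.sif docker://pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
srun apptainer exec --nv pytorch.sif python test.py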

Hi Yann,
I am facing the same issue. My last job array (id: 3702791) also had jobs that failed with the same error. I use a Singularity image that has PyTorch 2 support.

#!/bin/sh

#SBATCH --job-name=coco_sweep
#SBATCH --time=0-12:00:00
#SBATCH --partition=private-dpnc-gpu,shared-gpu
#SBATCH --output=/home/users/s/senguptd/UniGe/combinatorial/combinatorics/jobs/slurm-%A-%x_%a.out
#SBATCH --chdir=/home/users/s/senguptd/UniGe/combinatorial/combinatorics
#SBATCH --mem=20GB
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1
#SBATCH --exclude=gpu012
#SBATCH -a 0-499%20

export XDG_RUNTIME_DIR=""
export PYTHONPATH=${PWD}:${PWD}/python_install:${PYTHONPATH}

module load GCC/9.3.0 Singularity/3.7.3-GCC-9.3.0-Go-1.14

srun singularity exec --nv -B /home/users/s/senguptd/UniGe/combinatorial/combinatorics/,/srv/beegfs/scratch/groups/rodem/ttbar_evt_reco/topographs/:/top_data/,/srv/beegfs/scratch/users/s/senguptd/:/scratch/ /home/users/s/senguptd/UniGe/combinatorial/combinatorics/container/coco.sif wandb agent harvious/CoCo_Sweep_inclusive/wiozwtcy --count=1

Dear @Debajyoti.Sengupta

As announced here, please do not use singularity from the module system anymore.

Please give Apptainer a try:
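For reference, the exec line from your script would look roughly like this with apptainer (a sketch, assuming apptainer is available on the compute nodes without any module load):

srun apptainer exec --nv -B /home/users/s/senguptd/UniGe/combinatorial/combinatorics/,/srv/beegfs/scratch/groups/rodem/ttbar_evt_reco/topographs/:/top_data/,/srv/beegfs/scratch/users/s/senguptd/:/scratch/ /home/users/s/senguptd/UniGe/combinatorial/combinatorics/container/coco.sif wandb agent harvious/CoCo_Sweep_inclusive/wiozwtcy --count=1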

What CUDA toolkit do you have in your image?

With nvcc --version:

Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0

This happens on a certain subset of nodes for me (gpu034, gpu007, gpu004).

The GPUs you are talking about don't have much in common: not the same generation, not the same CPUs. Weird.

(baobab)-[root@login2 ~]$ scontrol show node gpu[004,007,034] | grep gpu
NodeName=gpu004 Arch=x86_64 CoresPerSocket=10
   Gres=gpu:pascal:6,VramPerGpu:no_consume:12G
   NodeAddr=gpu004 NodeHostName=gpu004 Version=23.02.1
   Partitions=shared-gpu,private-kruse-gpu
   CfgTRES=cpu=20,mem=125G,billing=20,gres/gpu=6,gres/gpu:pascal=6
NodeName=gpu007 Arch=x86_64 CoresPerSocket=10
   Gres=gpu:pascal:4,VramPerGpu:no_consume:12G
   NodeAddr=gpu007 NodeHostName=gpu007 Version=23.02.1
   Partitions=shared-gpu,private-schaer-gpu
   CfgTRES=cpu=20,mem=257000M,billing=20,gres/gpu=4,gres/gpu:pascal=4
NodeName=gpu034 Arch=x86_64 CoresPerSocket=64
   Gres=gpu:ampere:8,VramPerGpu:no_consume:25G
   NodeAddr=gpu034 NodeHostName=gpu034 Version=23.02.1
   Partitions=shared-gpu,private-teodoro-gpu
   CfgTRES=cpu=128,mem=500G,billing=128,gres/gpu=8,gres/gpu:ampere=8
   AllocTRES=cpu=1,mem=100G,gres/gpu=2,gres/gpu:ampere=2

Are you using apptainer now?

So my submission script still uses singularity.


#SBATCH --job-name=coco_sweep
#SBATCH --time=0-10:00:00
#SBATCH --partition=private-dpnc-gpu,shared-gpu
#SBATCH --output=/home/users/s/senguptd/UniGe/combinatorial/combinatorics/jobs/slurm-%A-%x_%a.out
#SBATCH --chdir=/home/users/s/senguptd/UniGe/combinatorial/combinatorics
#SBATCH --mem=20GB
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1
#SBATCH --exclude=gpu012
#SBATCH -a 0-499

export XDG_RUNTIME_DIR=""
export PYTHONPATH=${PWD}:${PWD}/python_install:${PYTHONPATH}

# module load GCC/9.3.0 Singularity/3.7.3-GCC-9.3.0-Go-1.14

srun singularity exec --nv -B /home/users/s/senguptd/UniGe/combinatorial/combinatorics/,/srv/beegfs/scratch/groups/rodem/ttbar_evt_reco/topographs/:/top_data/,/srv/beegfs/scratch/users/s/senguptd/:/scratch/ /home/users/s/senguptd/UniGe/combinatorial/combinatorics/container/coco.sif wandb agent harvious/CoCo_Sweep_inclusive/omysqxh6 --count=1