Error when running GPU jobs on gpu002 (GPU #3 seems down)

Primary information

Username: rubinor
Cluster: baobab
Node: gpu002

Description

Error when running GPU jobs on gpu002.
Error messages:
– Python

RuntimeError: No CUDA GPUs are available

– When the job starts, a quick ssh to gpu002 and nvidia-smi gives

No devices were found
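One extra in-job check that can help triage this (my suggestion, not part of the original report): Slurm's GPU gres plugin normally exports CUDA_VISIBLE_DEVICES for GPU jobs, so printing it inside the batch script separates "allocation missing" from "card dead":

```shell
# If CUDA_VISIBLE_DEVICES is set but nvidia-smi still finds no device,
# the Slurm allocation worked and the card itself is the problem.
msg="CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"
echo "$msg"
```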

Current usage of gpu002:

CfgTRES=cpu=10,mem=257000M,billing=10,gres/gpu=6,gres/gpu:titan=6
AllocTRES=cpu=8,mem=94G,gres/gpu=3,gres/gpu:titan=3
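The TRES lines above (as reported by scontrol show node gpu002) say the node is configured with 6 TITANs, 3 of which were allocated at the time. A small sketch of extracting the free-GPU count from those exact strings (TRES field format assumed as shown):

```shell
# Sample TRES strings copied verbatim from the scontrol output above.
cfg='CfgTRES=cpu=10,mem=257000M,billing=10,gres/gpu=6,gres/gpu:titan=6'
alloc='AllocTRES=cpu=8,mem=94G,gres/gpu=3,gres/gpu:titan=3'

# Pull the generic gres/gpu count out of a TRES string ("gres/gpu:titan"
# does not match because of the ':' before the '=').
gpu_count() { printf '%s\n' "$1" | grep -o 'gres/gpu=[0-9]*' | cut -d= -f2; }

free=$(( $(gpu_count "$cfg") - $(gpu_count "$alloc") ))
echo "free GPUs on gpu002: $free"   # 6 configured - 3 allocated = 3
```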

Steps to Reproduce

The error only happens when gpu002 is already partially in use. To reproduce it, run the command below n times until all GPUs on the node are allocated.

Run some command on gpu002 requesting 1 GPU and 1 CPU, for instance:

sbatch run.sh

with “run.sh” containing:

#!/bin/bash
#SBATCH --partition=shared-gpu
#SBATCH --job-name=jobname
#SBATCH --time=00:10:00
#SBATCH --error=slurm.e%j 
#SBATCH --output=slurm.o%j
#SBATCH --gres=gpu:1
#SBATCH --nodelist=gpu002

nvidia-smi
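To fill the node as described in the reproduction step, the sbatch call can be wrapped in a small helper (a hypothetical sketch; the default of 6 is taken from gres/gpu=6 in CfgTRES above):

```shell
# Hypothetical helper: submit run.sh once per GPU so that every TITAN
# on gpu002 ends up allocated. Call as: submit_all 6
submit_all() {
    local n=${1:-6}   # default: the 6 GPUs listed in CfgTRES
    for i in $(seq 1 "$n"); do
        sbatch run.sh
    done
}
```

With all six GPUs requested, at least one job lands on the faulty card and fails with "No devices were found".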

Expected Result

When running the above steps, you should get an output file containing the usual “nvidia-smi” output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA TITAN X (Pascal)        On  |   00000000:83:00.0 Off |                  N/A |
| 23%   30C    P8              8W /  250W |       2MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Actual Result

The content of the slurm output file is:

No devices were found

Edit: it seems that GPU 3 is down on gpu002 (GPU IDs start at 0).
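A per-device probe can help confirm which index is dead (a sketch, assuming the 6 GPUs from CfgTRES and running directly on gpu002; nvidia-smi -i targets a single device, so a failed card errors out on its own query while healthy ones still answer):

```shell
# Query each GPU index separately; print a marker when a device
# fails to answer its own query.
probe_gpus() {
    for i in 0 1 2 3 4 5; do
        nvidia-smi -i "$i" --query-gpu=index,name --format=csv,noheader \
            || echo "GPU $i: query failed"
    done
}
```

Running probe_gpus on the node should list five TITANs and flag index 3 if that card is indeed down.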

Dear @Raphael.Rubino, many thanks for the notification. It is a pity we didn’t see it earlier (vacations, etc.). I’ve drained the node and we’ll fix that.

Best

Yann
