[HPC][baobab] CUDA error (code 100, CUDA_ERROR_NO_DEVICE)

Hello !

user = dumoulil

Another user from my lab, using the same batch script (except for the user-dependent parts), gets exactly the same error.

Since the maintenance, we have been randomly getting errors at the start of our GPU simulations.

The error is:

ERROR: LoadError: CUDA error (code 100, CUDA_ERROR_NO_DEVICE)

I use this batch script:

#!/usr/bin/env bash
#SBATCH --array=1-360%50
#SBATCH --partition=private-kruse-gpu,shared-gpu
#SBATCH --time=0-02:30:00
#SBATCH --gpus=ampere:1
#SBATCH --constraint=DOUBLE_PRECISION_GPU
#SBATCH --output=%J.out
#SBATCH --mem=3000

module load Julia

cd /home/users/d/dumoulil/scratch/Data/PQ-series/polyMrho/
srun julia --optimize=3 /home/users/d/dumoulil/Code/Mrho_poly_PQ/2D.jl

Out of my 360 array tasks, only 90 worked.

As we use the same batch script, the problem might be there. Do we have to update our script?

Thank you !

Ludovic Dumoulin

Hi,

Just to be sure, is this job launched on Baobab?

Could you give the Slurm JobID?

Yes, I log in at @login2.baobab.hpc.unige.ch

One of the JobID that crashed is 65202908
The first JobID (that worked) is 65201573

I also noticed that (for me) the errors appear on gpu027 and gpu032:

srun: error: gpu032: task 0: Exited with exit code 1

(a good 90% on gpu032)
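Until the faulty nodes are dealt with, a possible stopgap (a sketch using the node names observed above; see the sbatch man page for `--exclude`) is to keep the array job off them:

```shell
# Hypothetical addition to the batch script: skip the nodes that
# produced CUDA_ERROR_NO_DEVICE in this thread.
#SBATCH --exclude=gpu[027,032]
```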

Hi,

Issue

There are errors on nodes gpu[027,032-033]; I have drained them.

(baobab)-[root@gpu033 ~]$ nvidia-smi
Fri Feb  3 11:44:07 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  On   | 00000000:41:00.0 Off |                    0 |
| N/A   37C    P0   118W / 300W |    509MiB / 81920MiB |     39%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  On   | 00000000:81:00.0 Off |                    0 |
| N/A   46C    P0   132W / 300W |    511MiB / 81920MiB |     34%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  ERR!                On   | 00000000:C1:00.0 Off |                 ERR! |
|ERR!  ERR! ERR!    ERR! / ERR! |      0MiB / 81920MiB |    ERR!      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     41379      C   ...021.3/gromacs/bin/gmx_mpi      506MiB |
|    1   N/A  N/A     44120      C   ...021.3/gromacs/bin/gmx_mpi      508MiB |
+-----------------------------------------------------------------------------+

Possible root causes

  • Hardware/software issue on these nodes
  • Jobs leaving a GPU card in a failed state

Suggested solution

I tested CUDA with Julia and it works for me on another node (starting from an empty ~/.julia):

(baobab)-[alberta@gpu014 ~]$ cat test.julia 
# install the package
using Pkg
Pkg.add("CUDA")

# smoke test (this will download the CUDA toolkit)
using CUDA
CUDA.versioninfo()

(baobab)-[alberta@gpu014 ~]$ julia test.julia
  Installing known registries into `~/.julia`
    Updating registry at `~/.julia/registries/General.toml`
   Resolving package versions...
   Installed GPUArraysCore ────────── v0.1.3
   Installed CUDA_Driver_jll ──────── v0.2.0+0
   Installed TimerOutputs ─────────── v0.5.22
   Installed IrrationalConstants ──── v0.1.1
   Installed Adapt ────────────────── v3.5.0
   Installed Preferences ──────────── v1.3.0
   Installed SpecialFunctions ─────── v2.1.7
   Installed GPUCompiler ──────────── v0.17.1
   Installed AbstractFFTs ─────────── v1.2.1
   Installed CUDA_Runtime_Discovery ─ v0.1.1
   Installed LLVMExtra_jll ────────── v0.0.16+0
   Installed CEnum ────────────────── v0.4.2
   Installed BFloat16s ────────────── v0.4.2
   Installed Random123 ────────────── v1.6.0
   Installed Reexport ─────────────── v1.2.2
   Installed JLLWrappers ──────────── v1.4.1
   Installed ChainRulesCore ───────── v1.15.7
   Installed CUDA_Runtime_jll ─────── v0.2.3+2
   Installed GPUArrays ────────────── v8.6.2
   Installed LogExpFunctions ──────── v0.3.20
   Installed Requires ─────────────── v1.3.0
   Installed OpenSpecFun_jll ──────── v0.5.5+0
   Installed ExprTools ────────────── v0.1.8
   Installed RandomNumbers ────────── v1.5.3
   Installed InverseFunctions ─────── v0.1.8
   Installed ChangesOfVariables ───── v0.1.5
   Installed DocStringExtensions ──── v0.9.3
   Installed LLVM ─────────────────── v4.15.0
   Installed CUDA ─────────────────── v4.0.0
  Downloaded artifact: LLVMExtra
  Downloaded artifact: OpenSpecFun
    Updating `~/.julia/environments/v1.7/Project.toml`
  [052768ef] + CUDA v4.0.0
    Updating `~/.julia/environments/v1.7/Manifest.toml`
  [621f4979] + AbstractFFTs v1.2.1
  [79e6a3ab] + Adapt v3.5.0
  [ab4f0b2a] + BFloat16s v0.4.2
  [fa961155] + CEnum v0.4.2
  [052768ef] + CUDA v4.0.0
  [1af6417a] + CUDA_Runtime_Discovery v0.1.1
  [d360d2e6] + ChainRulesCore v1.15.7
  [9e997f8a] + ChangesOfVariables v0.1.5
  [34da2185] + Compat v4.6.0
  [ffbed154] + DocStringExtensions v0.9.3
  [e2ba6199] + ExprTools v0.1.8
  [0c68f7d7] + GPUArrays v8.6.2
  [46192b85] + GPUArraysCore v0.1.3
  [61eb1bfa] + GPUCompiler v0.17.1
  [3587e190] + InverseFunctions v0.1.8
  [92d709cd] + IrrationalConstants v0.1.1
  [692b3bcd] + JLLWrappers v1.4.1
  [929cbde3] + LLVM v4.15.0
  [2ab3a3ac] + LogExpFunctions v0.3.20
  [21216c6a] + Preferences v1.3.0
  [74087812] + Random123 v1.6.0
  [e6cf234a] + RandomNumbers v1.5.3
  [189a3867] + Reexport v1.2.2
  [ae029012] + Requires v1.3.0
  [276daf66] + SpecialFunctions v2.1.7
  [a759f4b9] + TimerOutputs v0.5.22
  [4ee394cb] + CUDA_Driver_jll v0.2.0+0
  Downloaded artifact: CUDA_Driver
β†’ [76a88914] + CUDA_Runtime_jll v0.2.3+2
  [dad2f222] + LLVMExtra_jll v0.0.16+0
  [efe28fd5] + OpenSpecFun_jll v0.5.5+0
  [56f22d72] + Artifacts
  [2a0f44e3] + Base64
  [ade2ca70] + Dates
  [f43a241f] + Downloads
  [b77e0a4c] + InteractiveUtils
  [4af54fe1] + LazyArtifacts
  [b27032c2] + LibCURL
  [76f85450] + LibGit2
  [8f399da3] + Libdl
  [37e2e46d] + LinearAlgebra
  [56ddb016] + Logging
  [d6f4376e] + Markdown
  [ca575930] + NetworkOptions
  [44cfe95a] + Pkg
  [de0858da] + Printf
  [3fa0cd96] + REPL
  [9a3f8284] + Random
  [ea8e919c] + SHA
  [9e88b42a] + Serialization
  [6462fe0b] + Sockets
  [2f01184e] + SparseArrays
  [10745b16] + Statistics
  [fa267f1f] + TOML
  [a4e569a6] + Tar
  [8dfed614] + Test
  [cf7118a7] + UUIDs
  [4ec0a83e] + Unicode
  [e66e0078] + CompilerSupportLibraries_jll
  [deac9b47] + LibCURL_jll
  [29816b5a] + LibSSH2_jll
  [c8ffd9c3] + MbedTLS_jll
  [14a3606d] + MozillaCACerts_jll
  [4536629a] + OpenBLAS_jll
  [05823500] + OpenLibm_jll
  [83775a58] + Zlib_jll
  [8e850b90] + libblastrampoline_jll
  [8e850ede] + nghttp2_jll
  [3f19e933] + p7zip_jll
        Info packages marked with β†’ not downloaded, use `instantiate` to download
  Downloaded artifact: CUDA_Runtime
Precompiling project...
  34 dependencies successfully precompiled in 91 seconds (3 already precompiled)
CUDA runtime 11.8, artifact installation
CUDA driver 12.0
NVIDIA driver 525.60.13

Libraries: 
- CUBLAS: 11.11.3
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0
- NVML: 12.0.0+525.60.13

Toolchain:
- Julia: 1.7.2
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

7 devices:
  0: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.750 GiB / 11.000 GiB available)
  1: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.750 GiB / 11.000 GiB available)
  2: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.750 GiB / 11.000 GiB available)
  3: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.750 GiB / 11.000 GiB available)
  4: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.750 GiB / 11.000 GiB available)
  5: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.750 GiB / 11.000 GiB available)
  6: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.750 GiB / 11.000 GiB available)

Thank you for your quick reply.
After moving my .julia folder aside, I tried:

using Pkg
Pkg.add("CUDA")
using CUDA
CUDA.versioninfo()

I got this error message:

β”Œ Error: No CUDA Runtime library found. This can have several reasons:
β”‚ * you are using an unsupported platform: CUDA.jl only supports Linux (x86_64, aarch64, ppc64le), and Windows (x86_64).
β”‚   refer to the documentation for instructions on how to use a custom CUDA runtime.
β”‚ * you precompiled CUDA.jl in an environment where the CUDA driver was not available.
β”‚   in that case, you need to specify (during pre compilation) which version of CUDA to use.
β”‚   refer to the documentation for instructions on how to use `CUDA.set_runtime_version!`.
β”‚ * you requested use of a local CUDA toolkit, but not all components were discovered.
β”‚   try running with JULIA_DEBUG=CUDA_Runtime_Discovery for more information.
β”” @ CUDA ~/.julia/packages/CUDA/ZX8mg/src/initialization.jl:77
ERROR: LoadError: CUDA runtime not found

The JobID is 65215938
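For what it's worth, the error message above points at `CUDA.set_runtime_version!` for the case where CUDA.jl was precompiled without a visible driver. A minimal sketch of that workaround (the exact signature depends on the CUDA.jl version, and v"11.8" is assumed from the runtime version reported earlier in this thread):

```shell
# Pin the CUDA runtime version so precompilation does not depend on
# querying a driver; run once, then resubmit the job.
julia -e 'using CUDA; CUDA.set_runtime_version!(v"11.8")'
```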

I tried again and got this:

   Updating registry at `~/.julia/registries/General.toml`
   Resolving package versions...
    Updating `~/.julia/environments/v1.8/Project.toml`
  [052768ef] + CUDA v4.0.0
    Updating `~/.julia/environments/v1.8/Manifest.toml`
  [79e6a3ab] + Adapt v3.5.0
  [ab4f0b2a] + BFloat16s v0.4.2
  [fa961155] + CEnum v0.4.2
  [052768ef] + CUDA v4.0.0
  [1af6417a] + CUDA_Runtime_Discovery v0.1.1
  [e2ba6199] + ExprTools v0.1.8
  [0c68f7d7] + GPUArrays v8.6.2
  [46192b85] + GPUArraysCore v0.1.3
  [61eb1bfa] + GPUCompiler v0.17.1
  [929cbde3] + LLVM v4.15.0
  [74087812] + Random123 v1.6.0
  [e6cf234a] + RandomNumbers v1.5.3
  [a759f4b9] + TimerOutputs v0.5.22
 βŒ… [4ee394cb] + CUDA_Driver_jll v0.2.0+0
β†’βŒ… [76a88914] + CUDA_Runtime_jll v0.2.3+2
  [dad2f222] + LLVMExtra_jll v0.0.16+0
        Info Packages marked with β†’ are not downloaded, use `instantiate` to download
        Info Packages marked with βŒ… have new versions available but compatibility constraints restrict them from upgrading. To see why use `status --outdated -m`
β”Œ Error: No CUDA Runtime library found. This can have several reasons:
β”‚ * you are using an unsupported platform: CUDA.jl only supports Linux (x86_64, aarch64, ppc64le), and Windows (x86_64).
β”‚   refer to the documentation for instructions on how to use a custom CUDA runtime.
β”‚ * you precompiled CUDA.jl in an environment where the CUDA driver was not available.
β”‚   in that case, you need to specify (during pre compilation) which version of CUDA to use.
β”‚   refer to the documentation for instructions on how to use `CUDA.set_runtime_version!`.
β”‚ * you requested use of a local CUDA toolkit, but not all components were discovered.
β”‚   try running with JULIA_DEBUG=CUDA_Runtime_Discovery for more information.
β”” @ CUDA ~/.julia/packages/CUDA/ZX8mg/src/initialization.jl:77
ERROR: LoadError: CUDA runtime not found
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] functional
   @ ~/.julia/packages/CUDA/ZX8mg/src/initialization.jl:24 [inlined]
 [3] versioninfo(io::Base.PipeEndpoint) (repeats 2 times)
   @ CUDA ~/.julia/packages/CUDA/ZX8mg/src/utilities.jl:32
 [4] top-level scope
   @ ~/Code/InstallCUDA/main.jl:4
in expression starting at /home/users/d/dumoulil/Code/InstallCUDA/main.jl:4
srun: error: gpu022: task 0: Exited with exit code 1

This is the first time I have seen this error; usually

using Pkg
Pkg.add("CUDA")
using CUDA
CUDA.versioninfo()

works well.

The problem seems to come with the DOUBLE_PRECISION_GPU constraint.
After multiple tries, I was able to install CUDA on another node without the constraint:
1 device:
0: NVIDIA GeForce RTX 3080 (sm_86, 9.771 GiB / 10.000 GiB available)

Hi @Ludovic.Dumoulin

Just wondering: why not use CUDA from the module?

 (baobab)-[alberta@cpu025 easybuild]$ ml spider cuda

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  CUDA:
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Description:
      CUDA (formerly Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA and implemented by the graphics processing units (GPUs) that they produce.
      CUDA gives developers access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs.

     Versions:
        CUDA/8.0.44
        CUDA/8.0.61
        CUDA/9.1.85
        CUDA/9.2.88
        CUDA/9.2.148.1
        CUDA/10.0.130
        CUDA/10.1.105
        CUDA/10.1.243
        CUDA/11.0.2
        CUDA/11.1.1
        CUDA/11.3.1
        CUDA/11.5.0
        CUDA/11.5.1
        CUDA/11.6.0
        CUDA/11.7.0
        CUDA/12.0.0
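For completeness, a sketch of what using the module could look like (module versions taken from the list above; `JULIA_DEBUG=CUDA_Runtime_Discovery` is the debug switch suggested by CUDA.jl's own error message, and whether CUDA.jl actually picks up the module toolkit depends on its discovery settings):

```shell
# Load the cluster-provided toolkit, then let CUDA.jl report what it discovers.
module load Julia CUDA/11.7.0
JULIA_DEBUG=CUDA_Runtime_Discovery julia -e 'using CUDA; CUDA.versioninfo()'
```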

Hi @Adrien.Albert ,

I don’t know, I just followed the README for CUDA.jl.
I can use the same code on every computer, and I don’t need to install the whole spider/anaconda thing.
If you think it would be better to use CUDA via the module, I can check how it works.

I don’t understand why everything was working fine before the update but not now.
On my computer I use Julia 1.8.4 with an old Quadro K620 or with a more recent A30 (roughly half of an A100), and everything works well.
So I don’t know whether it is a problem with Julia 1.8.5 or with the cluster.
I am surprised to have problems only with the Ampere double-precision GPUs (i.e. the A100s) on the cluster.

I am updating to v1.8.5 now to see.

Thank you,

It works well with Julia 1.8.5 on my computer.

Hi @Ludovic.Dumoulin ,

By any chance, did your jobs previously use (only) single precision nodes?

Hi @Ludovic.Dumoulin ,

It’s working for me on gpu033 (A100):

(baobab)-[alberta@login2 ~]$ srun --reservation=test_gpu --gpus=1 --partition=shared-gpu julia test.julia
    Updating registry at `~/.julia/registries/General.toml`
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.8/Project.toml`
  No Changes to `~/.julia/environments/v1.8/Manifest.toml`
CUDA runtime 11.8, artifact installation
CUDA driver 12.0
NVIDIA driver 525.60.13

Libraries: 
- CUBLAS: 11.11.3
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0
- NVML: 12.0.0+525.60.13

Toolchain:
- Julia: 1.8.5
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
  0: NVIDIA A100 80GB PCIe (sm_80, 79.182 GiB / 80.000 GiB available)

Hello !

It seems that everything is working fine now (no problems in my first 20 simulations).
I’ll check with the other members of the lab.

Thank you for your help !
