[HPC][baobab] CUDA error (code 100, CUDA_ERROR_NO_DEVICE)

Hi,

Issue

Some GPUs on nodes gpu[027,032-033] are reporting errors, so I have drained these nodes.

(baobab)-[root@gpu033 ~]$ nvidia-smi
Fri Feb  3 11:44:07 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  On   | 00000000:41:00.0 Off |                    0 |
| N/A   37C    P0   118W / 300W |    509MiB / 81920MiB |     39%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  On   | 00000000:81:00.0 Off |                    0 |
| N/A   46C    P0   132W / 300W |    511MiB / 81920MiB |     34%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  ERR!                On   | 00000000:C1:00.0 Off |                 ERR! |
|ERR!  ERR! ERR!    ERR! / ERR! |      0MiB / 81920MiB |    ERR!      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     41379      C   ...021.3/gromacs/bin/gmx_mpi      506MiB |
|    1   N/A  N/A     44120      C   ...021.3/gromacs/bin/gmx_mpi      508MiB |
+-----------------------------------------------------------------------------+
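
To spot this kind of failure on other nodes without reading the table by eye, the check can be scripted. The following is only a minimal sketch: it assumes the classic `nvidia-smi` table layout shown above (one block per GPU, blocks separated by `+---` lines, failed fields printed as `ERR!`), and the function name `find_faulty_gpus` is a hypothetical helper, not part of any NVIDIA tool.

```python
import re


def find_faulty_gpus(smi_output: str) -> list[int]:
    """Return indices of GPUs whose nvidia-smi table block contains 'ERR!'.

    Assumes the default table layout of `nvidia-smi`; other layouts
    (for example MIG-enabled output) are not handled by this sketch.
    """
    faulty: list[int] = []
    current = None  # index of the GPU block currently being read
    for raw in smi_output.splitlines():
        line = raw.strip()
        if line.startswith("| Processes:"):
            break  # the process table follows; GPU blocks are done
        if line.startswith("+"):
            current = None  # a separator line ends the current GPU block
            continue
        # The first row of a GPU block starts with the GPU index.
        match = re.match(r"\|\s+(\d+)\s+", line)
        if match and current is None:
            current = int(match.group(1))
        if current is not None and "ERR!" in line and current not in faulty:
            faulty.append(current)
    return faulty
```

For the output above this would report GPU 2 on gpu033. On a live node, the input text could be obtained with `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout`.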

Possible root cause

  • Hardware or software issue on these nodes
  • A job that caused the GPU card to fail

Suggested solution

I tested CUDA with Julia on another node, starting from an empty ~/.julia directory, and it works there:

(baobab)-[alberta@gpu014 ~]$ cat test.julia 
# install the package
using Pkg
Pkg.add("CUDA")

# smoke test (this will download the CUDA toolkit)
using CUDA
CUDA.versioninfo()

(baobab)-[alberta@gpu014 ~]$ julia test.julia
  Installing known registries into `~/.julia`
    Updating registry at `~/.julia/registries/General.toml`
   Resolving package versions...
   Installed GPUArraysCore ────────── v0.1.3
   Installed CUDA_Driver_jll ──────── v0.2.0+0
   Installed TimerOutputs ─────────── v0.5.22
   Installed IrrationalConstants ──── v0.1.1
   Installed Adapt ────────────────── v3.5.0
   Installed Preferences ──────────── v1.3.0
   Installed SpecialFunctions ─────── v2.1.7
   Installed GPUCompiler ──────────── v0.17.1
   Installed AbstractFFTs ─────────── v1.2.1
   Installed CUDA_Runtime_Discovery ─ v0.1.1
   Installed LLVMExtra_jll ────────── v0.0.16+0
   Installed CEnum ────────────────── v0.4.2
   Installed BFloat16s ────────────── v0.4.2
   Installed Random123 ────────────── v1.6.0
   Installed Reexport ─────────────── v1.2.2
   Installed JLLWrappers ──────────── v1.4.1
   Installed ChainRulesCore ───────── v1.15.7
   Installed CUDA_Runtime_jll ─────── v0.2.3+2
   Installed GPUArrays ────────────── v8.6.2
   Installed LogExpFunctions ──────── v0.3.20
   Installed Requires ─────────────── v1.3.0
   Installed OpenSpecFun_jll ──────── v0.5.5+0
   Installed ExprTools ────────────── v0.1.8
   Installed RandomNumbers ────────── v1.5.3
   Installed InverseFunctions ─────── v0.1.8
   Installed ChangesOfVariables ───── v0.1.5
   Installed DocStringExtensions ──── v0.9.3
   Installed LLVM ─────────────────── v4.15.0
   Installed CUDA ─────────────────── v4.0.0
  Downloaded artifact: LLVMExtra
  Downloaded artifact: OpenSpecFun
    Updating `~/.julia/environments/v1.7/Project.toml`
  [052768ef] + CUDA v4.0.0
    Updating `~/.julia/environments/v1.7/Manifest.toml`
  [621f4979] + AbstractFFTs v1.2.1
  [79e6a3ab] + Adapt v3.5.0
  [ab4f0b2a] + BFloat16s v0.4.2
  [fa961155] + CEnum v0.4.2
  [052768ef] + CUDA v4.0.0
  [1af6417a] + CUDA_Runtime_Discovery v0.1.1
  [d360d2e6] + ChainRulesCore v1.15.7
  [9e997f8a] + ChangesOfVariables v0.1.5
  [34da2185] + Compat v4.6.0
  [ffbed154] + DocStringExtensions v0.9.3
  [e2ba6199] + ExprTools v0.1.8
  [0c68f7d7] + GPUArrays v8.6.2
  [46192b85] + GPUArraysCore v0.1.3
  [61eb1bfa] + GPUCompiler v0.17.1
  [3587e190] + InverseFunctions v0.1.8
  [92d709cd] + IrrationalConstants v0.1.1
  [692b3bcd] + JLLWrappers v1.4.1
  [929cbde3] + LLVM v4.15.0
  [2ab3a3ac] + LogExpFunctions v0.3.20
  [21216c6a] + Preferences v1.3.0
  [74087812] + Random123 v1.6.0
  [e6cf234a] + RandomNumbers v1.5.3
  [189a3867] + Reexport v1.2.2
  [ae029012] + Requires v1.3.0
  [276daf66] + SpecialFunctions v2.1.7
  [a759f4b9] + TimerOutputs v0.5.22
  [4ee394cb] + CUDA_Driver_jll v0.2.0+0
  Downloaded artifact: CUDA_Driver
→ [76a88914] + CUDA_Runtime_jll v0.2.3+2
  [dad2f222] + LLVMExtra_jll v0.0.16+0
  [efe28fd5] + OpenSpecFun_jll v0.5.5+0
  [56f22d72] + Artifacts
  [2a0f44e3] + Base64
  [ade2ca70] + Dates
  [f43a241f] + Downloads
  [b77e0a4c] + InteractiveUtils
  [4af54fe1] + LazyArtifacts
  [b27032c2] + LibCURL
  [76f85450] + LibGit2
  [8f399da3] + Libdl
  [37e2e46d] + LinearAlgebra
  [56ddb016] + Logging
  [d6f4376e] + Markdown
  [ca575930] + NetworkOptions
  [44cfe95a] + Pkg
  [de0858da] + Printf
  [3fa0cd96] + REPL
  [9a3f8284] + Random
  [ea8e919c] + SHA
  [9e88b42a] + Serialization
  [6462fe0b] + Sockets
  [2f01184e] + SparseArrays
  [10745b16] + Statistics
  [fa267f1f] + TOML
  [a4e569a6] + Tar
  [8dfed614] + Test
  [cf7118a7] + UUIDs
  [4ec0a83e] + Unicode
  [e66e0078] + CompilerSupportLibraries_jll
  [deac9b47] + LibCURL_jll
  [29816b5a] + LibSSH2_jll
  [c8ffd9c3] + MbedTLS_jll
  [14a3606d] + MozillaCACerts_jll
  [4536629a] + OpenBLAS_jll
  [05823500] + OpenLibm_jll
  [83775a58] + Zlib_jll
  [8e850b90] + libblastrampoline_jll
  [8e850ede] + nghttp2_jll
  [3f19e933] + p7zip_jll
        Info packages marked with → not downloaded, use `instantiate` to download
  Downloaded artifact: CUDA_Runtime
Precompiling project...
  34 dependencies successfully precompiled in 91 seconds (3 already precompiled)
CUDA runtime 11.8, artifact installation
CUDA driver 12.0
NVIDIA driver 525.60.13

Libraries: 
- CUBLAS: 11.11.3
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0
- NVML: 12.0.0+525.60.13

Toolchain:
- Julia: 1.7.2
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

7 devices:
  0: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.750 GiB / 11.000 GiB available)
  1: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.750 GiB / 11.000 GiB available)
  2: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.750 GiB / 11.000 GiB available)
  3: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.750 GiB / 11.000 GiB available)
  4: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.750 GiB / 11.000 GiB available)
  5: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.750 GiB / 11.000 GiB available)
  6: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.750 GiB / 11.000 GiB available)