Hello!
user = dumoulil
But another user from my lab, using the same batch script (except for the user-dependent parts), gets exactly the same error.
Since the maintenance, we randomly get errors at the beginning of our simulations on GPU.
The error is:
ERROR: LoadError: CUDA error (code 100, CUDA_ERROR_NO_DEVICE)
I use this bash:
#!/usr/bin/env bash
#SBATCH --array=1-360%50
#SBATCH --partition=private-kruse-gpu,shared-gpu
#SBATCH --time=0-02:30:00
#SBATCH --gpus=ampere:1
#SBATCH --constraint=DOUBLE_PRECISION_GPU
#SBATCH --output=%J.out
#SBATCH --mem=3000
module load Julia
cd /home/users/d/dumoulil/scratch/Data/PQ-series/polyMrho/
srun julia --optimize=3 /home/users/d/dumoulil/Code/Mrho_poly_PQ/2D.jl
Out of my 360 jobs, only 90 worked.
As we use the same batch script, the problem might be there. I don't know if we have to update it?
Thank you!
Ludovic Dumoulin
Hi,
To be sure: was this job launched on Baobab?
Could you give the slurm JobID ?
I log in via login2.baobab.hpc.unige.ch
One of the JobID that crashed is 65202908
The first JobID (that worked) is 65201573
I also realized that (for me) the errors appear on gpu027 and gpu032:
srun: error: gpu032: task 0: Exited with exit code 1
(a good 90% of them on gpu032)
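While waiting for a diagnosis, one possible stopgap is to keep the array jobs off the suspect nodes with Slurm's `--exclude` option. This is only a sketch based on the nodes observed above; the node list may need adjusting:

```shell
#!/usr/bin/env bash
#SBATCH --array=1-360%50
#SBATCH --partition=private-kruse-gpu,shared-gpu
#SBATCH --gpus=ampere:1
# Temporarily avoid the nodes where the jobs crashed (observed above)
#SBATCH --exclude=gpu[027,032]

module load Julia
srun julia --optimize=3 2D.jl
```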
Hi,
Issue
There are some errors on these gpu[027,032-033]. I drained them.
(baobab)-[root@gpu033 ~]$ nvidia-smi
Fri Feb 3 11:44:07 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13 Driver Version: 525.60.13 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100 80G... On | 00000000:41:00.0 Off | 0 |
| N/A 37C P0 118W / 300W | 509MiB / 81920MiB | 39% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80G... On | 00000000:81:00.0 Off | 0 |
| N/A 46C P0 132W / 300W | 511MiB / 81920MiB | 34% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 ERR! On | 00000000:C1:00.0 Off | ERR! |
|ERR! ERR! ERR! ERR! / ERR! | 0MiB / 81920MiB | ERR! Default |
| | | ERR! |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 41379 C ...021.3/gromacs/bin/gmx_mpi 506MiB |
| 1 N/A N/A 44120 C ...021.3/gromacs/bin/gmx_mpi 508MiB |
+-----------------------------------------------------------------------------+
Possible root causes
- Hardware/software issue on these nodes
- Jobs crashing the GPU card
Suggested solution
I tested CUDA with Julia and it's working for me on another node
(starting from an empty .julia):
(baobab)-[alberta@gpu014 ~]$ cat test.julia
# install the package
using Pkg
Pkg.add("CUDA")
# smoke test (this will download the CUDA toolkit)
using CUDA
CUDA.versioninfo()
(baobab)-[alberta@gpu014 ~]$ julia test.julia
Installing known registries into `~/.julia`
Updating registry at `~/.julia/registries/General.toml`
Resolving package versions...
Installed GPUArraysCore ────────── v0.1.3
Installed CUDA_Driver_jll ──────── v0.2.0+0
Installed TimerOutputs ─────────── v0.5.22
Installed IrrationalConstants ──── v0.1.1
Installed Adapt ────────────────── v3.5.0
Installed Preferences ──────────── v1.3.0
Installed SpecialFunctions ─────── v2.1.7
Installed GPUCompiler ──────────── v0.17.1
Installed AbstractFFTs ─────────── v1.2.1
Installed CUDA_Runtime_Discovery ─ v0.1.1
Installed LLVMExtra_jll ────────── v0.0.16+0
Installed CEnum ────────────────── v0.4.2
Installed BFloat16s ────────────── v0.4.2
Installed Random123 ────────────── v1.6.0
Installed Reexport ─────────────── v1.2.2
Installed JLLWrappers ──────────── v1.4.1
Installed ChainRulesCore ───────── v1.15.7
Installed CUDA_Runtime_jll ─────── v0.2.3+2
Installed GPUArrays ────────────── v8.6.2
Installed LogExpFunctions ──────── v0.3.20
Installed Requires ─────────────── v1.3.0
Installed OpenSpecFun_jll ──────── v0.5.5+0
Installed ExprTools ────────────── v0.1.8
Installed RandomNumbers ────────── v1.5.3
Installed InverseFunctions ─────── v0.1.8
Installed ChangesOfVariables ───── v0.1.5
Installed DocStringExtensions ──── v0.9.3
Installed LLVM ─────────────────── v4.15.0
Installed CUDA ─────────────────── v4.0.0
Downloaded artifact: LLVMExtra
Downloaded artifact: OpenSpecFun
Updating `~/.julia/environments/v1.7/Project.toml`
[052768ef] + CUDA v4.0.0
Updating `~/.julia/environments/v1.7/Manifest.toml`
[621f4979] + AbstractFFTs v1.2.1
[79e6a3ab] + Adapt v3.5.0
[ab4f0b2a] + BFloat16s v0.4.2
[fa961155] + CEnum v0.4.2
[052768ef] + CUDA v4.0.0
[1af6417a] + CUDA_Runtime_Discovery v0.1.1
[d360d2e6] + ChainRulesCore v1.15.7
[9e997f8a] + ChangesOfVariables v0.1.5
[34da2185] + Compat v4.6.0
[ffbed154] + DocStringExtensions v0.9.3
[e2ba6199] + ExprTools v0.1.8
[0c68f7d7] + GPUArrays v8.6.2
[46192b85] + GPUArraysCore v0.1.3
[61eb1bfa] + GPUCompiler v0.17.1
[3587e190] + InverseFunctions v0.1.8
[92d709cd] + IrrationalConstants v0.1.1
[692b3bcd] + JLLWrappers v1.4.1
[929cbde3] + LLVM v4.15.0
[2ab3a3ac] + LogExpFunctions v0.3.20
[21216c6a] + Preferences v1.3.0
[74087812] + Random123 v1.6.0
[e6cf234a] + RandomNumbers v1.5.3
[189a3867] + Reexport v1.2.2
[ae029012] + Requires v1.3.0
[276daf66] + SpecialFunctions v2.1.7
[a759f4b9] + TimerOutputs v0.5.22
[4ee394cb] + CUDA_Driver_jll v0.2.0+0
Downloaded artifact: CUDA_Driver
→ [76a88914] + CUDA_Runtime_jll v0.2.3+2
[dad2f222] + LLVMExtra_jll v0.0.16+0
[efe28fd5] + OpenSpecFun_jll v0.5.5+0
[56f22d72] + Artifacts
[2a0f44e3] + Base64
[ade2ca70] + Dates
[f43a241f] + Downloads
[b77e0a4c] + InteractiveUtils
[4af54fe1] + LazyArtifacts
[b27032c2] + LibCURL
[76f85450] + LibGit2
[8f399da3] + Libdl
[37e2e46d] + LinearAlgebra
[56ddb016] + Logging
[d6f4376e] + Markdown
[ca575930] + NetworkOptions
[44cfe95a] + Pkg
[de0858da] + Printf
[3fa0cd96] + REPL
[9a3f8284] + Random
[ea8e919c] + SHA
[9e88b42a] + Serialization
[6462fe0b] + Sockets
[2f01184e] + SparseArrays
[10745b16] + Statistics
[fa267f1f] + TOML
[a4e569a6] + Tar
[8dfed614] + Test
[cf7118a7] + UUIDs
[4ec0a83e] + Unicode
[e66e0078] + CompilerSupportLibraries_jll
[deac9b47] + LibCURL_jll
[29816b5a] + LibSSH2_jll
[c8ffd9c3] + MbedTLS_jll
[14a3606d] + MozillaCACerts_jll
[4536629a] + OpenBLAS_jll
[05823500] + OpenLibm_jll
[83775a58] + Zlib_jll
[8e850b90] + libblastrampoline_jll
[8e850ede] + nghttp2_jll
[3f19e933] + p7zip_jll
Info packages marked with → are not downloaded, use `instantiate` to download
Downloaded artifact: CUDA_Runtime
Precompiling project...
34 dependencies successfully precompiled in 91 seconds (3 already precompiled)
CUDA runtime 11.8, artifact installation
CUDA driver 12.0
NVIDIA driver 525.60.13
Libraries:
- CUBLAS: 11.11.3
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0
- NVML: 12.0.0+525.60.13
Toolchain:
- Julia: 1.7.2
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80
7 devices:
0: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.750 GiB / 11.000 GiB available)
1: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.750 GiB / 11.000 GiB available)
2: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.750 GiB / 11.000 GiB available)
3: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.750 GiB / 11.000 GiB available)
4: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.750 GiB / 11.000 GiB available)
5: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.750 GiB / 11.000 GiB available)
6: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.750 GiB / 11.000 GiB available)
Thank you for your quick reply.
After moving my .julia folder I tried
using Pkg
Pkg.add("CUDA")
using CUDA
CUDA.versioninfo()
I got this error message:
┌ Error: No CUDA Runtime library found. This can have several reasons:
│ * you are using an unsupported platform: CUDA.jl only supports Linux (x86_64, aarch64, ppc64le), and Windows (x86_64).
│   refer to the documentation for instructions on how to use a custom CUDA runtime.
│ * you precompiled CUDA.jl in an environment where the CUDA driver was not available.
│   in that case, you need to specify (during precompilation) which version of CUDA to use.
│   refer to the documentation for instructions on how to use `CUDA.set_runtime_version!`.
│ * you requested use of a local CUDA toolkit, but not all components were discovered.
│   try running with JULIA_DEBUG=CUDA_Runtime_Discovery for more information.
└ @ CUDA ~/.julia/packages/CUDA/ZX8mg/src/initialization.jl:77
ERROR: LoadError: CUDA runtime not found
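For reference, the error message above points at `CUDA.set_runtime_version!` for the case where CUDA.jl was precompiled without a visible driver. A possible fix to try, as a sketch only (not verified on this cluster, and the exact call may depend on the installed CUDA.jl release), is to pin the runtime from a node where the GPU is visible:

```shell
# Hypothetical one-off session on a GPU node (e.g. via srun) where the driver is visible.
# Pin CUDA.jl to a specific runtime so precompilation no longer depends on driver detection:
julia -e 'using CUDA; CUDA.set_runtime_version!(v"11.8")'

# Restart Julia so the preference takes effect, then re-check:
julia -e 'using CUDA; CUDA.versioninfo()'
```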
The JobID is 65215938
I tried again, I got this:
Updating registry at `~/.julia/registries/General.toml`
Resolving package versions...
Updating `~/.julia/environments/v1.8/Project.toml`
[052768ef] + CUDA v4.0.0
Updating `~/.julia/environments/v1.8/Manifest.toml`
[79e6a3ab] + Adapt v3.5.0
[ab4f0b2a] + BFloat16s v0.4.2
[fa961155] + CEnum v0.4.2
[052768ef] + CUDA v4.0.0
[1af6417a] + CUDA_Runtime_Discovery v0.1.1
[e2ba6199] + ExprTools v0.1.8
[0c68f7d7] + GPUArrays v8.6.2
[46192b85] + GPUArraysCore v0.1.3
[61eb1bfa] + GPUCompiler v0.17.1
[929cbde3] + LLVM v4.15.0
[74087812] + Random123 v1.6.0
[e6cf234a] + RandomNumbers v1.5.3
[a759f4b9] + TimerOutputs v0.5.22
→ [4ee394cb] + CUDA_Driver_jll v0.2.0+0
→⌅ [76a88914] + CUDA_Runtime_jll v0.2.3+2
[dad2f222] + LLVMExtra_jll v0.0.16+0
Info Packages marked with → are not downloaded, use `instantiate` to download
Info Packages marked with ⌅ have new versions available but compatibility constraints restrict them from upgrading. To see why use `status --outdated -m`
┌ Error: No CUDA Runtime library found. This can have several reasons:
│ * you are using an unsupported platform: CUDA.jl only supports Linux (x86_64, aarch64, ppc64le), and Windows (x86_64).
│   refer to the documentation for instructions on how to use a custom CUDA runtime.
│ * you precompiled CUDA.jl in an environment where the CUDA driver was not available.
│   in that case, you need to specify (during precompilation) which version of CUDA to use.
│   refer to the documentation for instructions on how to use `CUDA.set_runtime_version!`.
│ * you requested use of a local CUDA toolkit, but not all components were discovered.
│   try running with JULIA_DEBUG=CUDA_Runtime_Discovery for more information.
└ @ CUDA ~/.julia/packages/CUDA/ZX8mg/src/initialization.jl:77
ERROR: LoadError: CUDA runtime not found
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:35
[2] functional
@ ~/.julia/packages/CUDA/ZX8mg/src/initialization.jl:24 [inlined]
[3] versioninfo(io::Base.PipeEndpoint) (repeats 2 times)
@ CUDA ~/.julia/packages/CUDA/ZX8mg/src/utilities.jl:32
[4] top-level scope
@ ~/Code/InstallCUDA/main.jl:4
in expression starting at /home/users/d/dumoulil/Code/InstallCUDA/main.jl:4
srun: error: gpu022: task 0: Exited with exit code 1
It is the first time I get this error, usually the
using Pkg
Pkg.add("CUDA")
using CUDA
CUDA.versioninfo()
works well.
The problem comes with the DOUBLE_PRECISION_GPU constraint.
After multiple tries, I was able to install CUDA on another node, without the constraint.
1 device:
0: NVIDIA GeForce RTX 3080 (sm_86, 9.771 GiB / 10.000 GiB available)
Hi @Ludovic.Dumoulin
Just to know, why not use CUDA via the module system?
(baobab)-[alberta@cpu025 easybuild]$ ml spider cuda
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
CUDA:
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Description:
CUDA (formerly Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA and implemented by the graphics processing units (GPUs) that they produce.
CUDA gives developers access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs.
Versions:
CUDA/8.0.44
CUDA/8.0.61
CUDA/9.1.85
CUDA/9.2.88
CUDA/9.2.148.1
CUDA/10.0.130
CUDA/10.1.105
CUDA/10.1.243
CUDA/11.0.2
CUDA/11.1.1
CUDA/11.3.1
CUDA/11.5.0
CUDA/11.5.1
CUDA/11.6.0
CUDA/11.7.0
CUDA/12.0.0
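If going the module route, a minimal sketch could look like this. The CUDA version is just an example from the list above, and whether CUDA.jl actually picks up the local toolkit depends on how it is configured; the debug flag is the one suggested in the earlier error message:

```shell
# Load a toolkit version compatible with the installed driver (example choice)
module load Julia CUDA/11.7.0
# Let CUDA.jl report what it discovers from the environment
JULIA_DEBUG=CUDA_Runtime_Discovery julia -e 'using CUDA; CUDA.versioninfo()'
```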
Hi @Adrien.Albert ,
I don't know, I just followed the README for CUDA.jl.
I can use the same code on every computer, and I don't need to install the whole spider/anaconda thing.
If you think it would be better to use CUDA via the module, I can check how that works.
I don't understand why everything was working fine before the update and not now.
I use Julia 1.8.4 on my computer with an old Quadro K620 or with a more recent A30 (approximately half of an A100), and everything works well.
So I don't know whether it is a problem with Julia 1.8.5 or with the cluster.
I am surprised to have problems only with the Ampere double-precision GPUs (so A100) on the cluster.
I am updating to v1.8.5 now to see.
Thank you,
It works well with Julia 1.8.5 on my computer.
Hi @Ludovic.Dumoulin ,
By any chance, did your jobs previously use (only) single precision nodes?
Hi @Ludovic.Dumoulin ,
It's working for me on gpu033 (A100):
(baobab)-[alberta@login2 ~]$ srun --reservation=test_gpu --gpus=1 --partition=shared-gpu julia test.julia
Updating registry at `~/.julia/registries/General.toml`
Resolving package versions...
No Changes to `~/.julia/environments/v1.8/Project.toml`
No Changes to `~/.julia/environments/v1.8/Manifest.toml`
CUDA runtime 11.8, artifact installation
CUDA driver 12.0
NVIDIA driver 525.60.13
Libraries:
- CUBLAS: 11.11.3
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0
- NVML: 12.0.0+525.60.13
Toolchain:
- Julia: 1.8.5
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86
1 device:
0: NVIDIA A100 80GB PCIe (sm_80, 79.182 GiB / 80.000 GiB available)
Hello!
It seems that everything is working fine now (no problems in my first 20 simulations).
I'll check with the other members of the lab.
Thank you for your help!