Hi,
Issue
Some GPUs on nodes gpu[027,032-033] are reporting errors. I have drained these nodes.
(baobab)-[root@gpu033 ~]$ nvidia-smi
Fri Feb 3 11:44:07 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13 Driver Version: 525.60.13 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100 80G... On | 00000000:41:00.0 Off | 0 |
| N/A 37C P0 118W / 300W | 509MiB / 81920MiB | 39% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80G... On | 00000000:81:00.0 Off | 0 |
| N/A 46C P0 132W / 300W | 511MiB / 81920MiB | 34% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 ERR! On | 00000000:C1:00.0 Off | ERR! |
|ERR! ERR! ERR! ERR! / ERR! | 0MiB / 81920MiB | ERR! Default |
| | | ERR! |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 41379 C ...021.3/gromacs/bin/gmx_mpi 506MiB |
| 1 N/A N/A 44120 C ...021.3/gromacs/bin/gmx_mpi 508MiB |
+-----------------------------------------------------------------------------+
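To spot a card in this state without reading the whole table, the output can be filtered for device blocks that contain ERR!. The following is only a rough sketch: the function name failed_gpus is made up for illustration, and it assumes the default table layout shown above (a device row starting with the GPU index and carrying a PCI bus ID, with each device block closed by a "+---" border).

```bash
# Hypothetical helper (not part of nvidia-smi): reads the default nvidia-smi
# table on stdin and prints the index of every GPU whose block contains ERR!.
failed_gpus() {
    awk '
        # Device header row: "|", GPU index, ..., PCI bus id (e.g. 00000000:C1:00.0).
        /^[|][ \t]*[0-9]+[ \t]/ && /[0-9A-Fa-f]+:[0-9A-Fa-f]+:[0-9A-Fa-f]+[.][0-9A-Fa-f]/ { gpu = $2 }
        # A "+---" border closes the current device block.
        /^[+]-/ { gpu = "" }
        # Report each device once if any row of its block contains ERR!.
        gpu != "" && /ERR!/ && !(gpu in seen) { seen[gpu] = 1; print gpu }
    '
}

# Usage on a GPU node (assumes nvidia-smi is in PATH):
#   nvidia-smi | failed_gpus
```

Parsing the human-readable table is fragile across driver versions, so this is best treated as a quick check and not as a monitoring tool.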
Possible root cause
- Hardware or software issue on these nodes
- Jobs causing the GPU card to fail
Suggested solution
- Try again with the newly installed version of Julia:
New software installed: julia version 1.8.5-linux-x86_64
- Move your .julia directory aside; you may need to rebuild it:
mv .julia{,old}
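The brace expansion above renames .julia to .juliaold. If a .juliaold from an earlier attempt already exists, a plain mv would move the directory inside it. A slightly more defensive variant could look like the sketch below; the function name backup_julia_depot and the timestamp suffix are my own choices for illustration, not something provided on the cluster.

```bash
# Hypothetical helper: move a Julia depot aside without touching an earlier backup.
backup_julia_depot() {
    depot="${1:-$HOME/.julia}"
    backup="${depot}old"          # same name that `mv .julia{,old}` produces

    if [ ! -d "$depot" ]; then
        echo "no depot at $depot" >&2
        return 1
    fi

    # If a previous backup exists, add a timestamp instead of reusing the name.
    if [ -e "$backup" ]; then
        backup="${depot}old.$(date +%Y%m%d-%H%M%S)"
    fi

    # Print the backup path on success so the caller knows where the old depot went.
    mv -- "$depot" "$backup" && echo "$backup"
}

# Usage (run from a login shell; Julia recreates ~/.julia on the next start):
#   backup_julia_depot
```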
I tested CUDA with Julia and it works for me on another node (starting with an empty .julia):
(baobab)-[alberta@gpu014 ~]$ cat test.julia
# install the package
using Pkg
Pkg.add("CUDA")
# smoke test (this will download the CUDA toolkit)
using CUDA
CUDA.versioninfo()
(baobab)-[alberta@gpu014 ~]$ julia test.julia
Installing known registries into `~/.julia`
Updating registry at `~/.julia/registries/General.toml`
Resolving package versions...
Installed GPUArraysCore ────────── v0.1.3
Installed CUDA_Driver_jll ──────── v0.2.0+0
Installed TimerOutputs ─────────── v0.5.22
Installed IrrationalConstants ──── v0.1.1
Installed Adapt ────────────────── v3.5.0
Installed Preferences ──────────── v1.3.0
Installed SpecialFunctions ─────── v2.1.7
Installed GPUCompiler ──────────── v0.17.1
Installed AbstractFFTs ─────────── v1.2.1
Installed CUDA_Runtime_Discovery ─ v0.1.1
Installed LLVMExtra_jll ────────── v0.0.16+0
Installed CEnum ────────────────── v0.4.2
Installed BFloat16s ────────────── v0.4.2
Installed Random123 ────────────── v1.6.0
Installed Reexport ─────────────── v1.2.2
Installed JLLWrappers ──────────── v1.4.1
Installed ChainRulesCore ───────── v1.15.7
Installed CUDA_Runtime_jll ─────── v0.2.3+2
Installed GPUArrays ────────────── v8.6.2
Installed LogExpFunctions ──────── v0.3.20
Installed Requires ─────────────── v1.3.0
Installed OpenSpecFun_jll ──────── v0.5.5+0
Installed ExprTools ────────────── v0.1.8
Installed RandomNumbers ────────── v1.5.3
Installed InverseFunctions ─────── v0.1.8
Installed ChangesOfVariables ───── v0.1.5
Installed DocStringExtensions ──── v0.9.3
Installed LLVM ─────────────────── v4.15.0
Installed CUDA ─────────────────── v4.0.0
Downloaded artifact: LLVMExtra
Downloaded artifact: OpenSpecFun
Updating `~/.julia/environments/v1.7/Project.toml`
[052768ef] + CUDA v4.0.0
Updating `~/.julia/environments/v1.7/Manifest.toml`
[621f4979] + AbstractFFTs v1.2.1
[79e6a3ab] + Adapt v3.5.0
[ab4f0b2a] + BFloat16s v0.4.2
[fa961155] + CEnum v0.4.2
[052768ef] + CUDA v4.0.0
[1af6417a] + CUDA_Runtime_Discovery v0.1.1
[d360d2e6] + ChainRulesCore v1.15.7
[9e997f8a] + ChangesOfVariables v0.1.5
[34da2185] + Compat v4.6.0
[ffbed154] + DocStringExtensions v0.9.3
[e2ba6199] + ExprTools v0.1.8
[0c68f7d7] + GPUArrays v8.6.2
[46192b85] + GPUArraysCore v0.1.3
[61eb1bfa] + GPUCompiler v0.17.1
[3587e190] + InverseFunctions v0.1.8
[92d709cd] + IrrationalConstants v0.1.1
[692b3bcd] + JLLWrappers v1.4.1
[929cbde3] + LLVM v4.15.0
[2ab3a3ac] + LogExpFunctions v0.3.20
[21216c6a] + Preferences v1.3.0
[74087812] + Random123 v1.6.0
[e6cf234a] + RandomNumbers v1.5.3
[189a3867] + Reexport v1.2.2
[ae029012] + Requires v1.3.0
[276daf66] + SpecialFunctions v2.1.7
[a759f4b9] + TimerOutputs v0.5.22
[4ee394cb] + CUDA_Driver_jll v0.2.0+0
Downloaded artifact: CUDA_Driver
→ [76a88914] + CUDA_Runtime_jll v0.2.3+2
[dad2f222] + LLVMExtra_jll v0.0.16+0
[efe28fd5] + OpenSpecFun_jll v0.5.5+0
[56f22d72] + Artifacts
[2a0f44e3] + Base64
[ade2ca70] + Dates
[f43a241f] + Downloads
[b77e0a4c] + InteractiveUtils
[4af54fe1] + LazyArtifacts
[b27032c2] + LibCURL
[76f85450] + LibGit2
[8f399da3] + Libdl
[37e2e46d] + LinearAlgebra
[56ddb016] + Logging
[d6f4376e] + Markdown
[ca575930] + NetworkOptions
[44cfe95a] + Pkg
[de0858da] + Printf
[3fa0cd96] + REPL
[9a3f8284] + Random
[ea8e919c] + SHA
[9e88b42a] + Serialization
[6462fe0b] + Sockets
[2f01184e] + SparseArrays
[10745b16] + Statistics
[fa267f1f] + TOML
[a4e569a6] + Tar
[8dfed614] + Test
[cf7118a7] + UUIDs
[4ec0a83e] + Unicode
[e66e0078] + CompilerSupportLibraries_jll
[deac9b47] + LibCURL_jll
[29816b5a] + LibSSH2_jll
[c8ffd9c3] + MbedTLS_jll
[14a3606d] + MozillaCACerts_jll
[4536629a] + OpenBLAS_jll
[05823500] + OpenLibm_jll
[83775a58] + Zlib_jll
[8e850b90] + libblastrampoline_jll
[8e850ede] + nghttp2_jll
[3f19e933] + p7zip_jll
Info packages marked with → not downloaded, use `instantiate` to download
Downloaded artifact: CUDA_Runtime
Precompiling project...
34 dependencies successfully precompiled in 91 seconds (3 already precompiled)
CUDA runtime 11.8, artifact installation
CUDA driver 12.0
NVIDIA driver 525.60.13
Libraries:
- CUBLAS: 11.11.3
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0
- NVML: 12.0.0+525.60.13
Toolchain:
- Julia: 1.7.2
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80
7 devices:
0: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.750 GiB / 11.000 GiB available)
1: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.750 GiB / 11.000 GiB available)
2: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.750 GiB / 11.000 GiB available)
3: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.750 GiB / 11.000 GiB available)
4: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.750 GiB / 11.000 GiB available)
5: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.750 GiB / 11.000 GiB available)
6: NVIDIA GeForce RTX 2080 Ti (sm_75, 10.750 GiB / 11.000 GiB available)