Hello,
Sorry for the delay, I was on vacation yesterday. Here is the sacct output for the job:
sacct -j 6429959 -o node,jobid,state
NodeList JobID State
--------------- ------------ ----------
gpu020 6429959_1 CANCELLED+
gpu020 6429959_1.b+ CANCELLED
gpu020 6429959_1.e+ COMPLETED
gpu020 6429959_1.0 FAILED
gpu020 6429959_2 CANCELLED+
gpu020 6429959_2.b+ CANCELLED
gpu020 6429959_2.e+ COMPLETED
gpu020 6429959_2.0 FAILED
gpu030 6429959_3 FAILED
gpu030 6429959_3.b+ FAILED
gpu030 6429959_3.e+ COMPLETED
gpu030 6429959_4 FAILED
gpu030 6429959_4.b+ FAILED
gpu030 6429959_4.e+ COMPLETED
gpu030 6429959_5 FAILED
gpu030 6429959_5.b+ FAILED
gpu030 6429959_5.e+ COMPLETED
gpu030 6429959_6 FAILED
gpu030 6429959_6.b+ FAILED
gpu030 6429959_6.e+ COMPLETED
gpu030 6429959_7 FAILED
gpu030 6429959_7.b+ FAILED
gpu030 6429959_7.e+ COMPLETED
gpu030 6429959_8 FAILED
gpu030 6429959_... FAILED
gpu030 6429959_39 FAILED
gpu030 6429959_39.+ FAILED
gpu030 6429959_39.+ COMPLETED
gpu030 6429959_40 FAILED
gpu030 6429959_40.+ FAILED
gpu030 6429959_40.+ COMPLETED
gpu028 6429959_41 CANCELLED+
gpu028 6429959_41.+ CANCELLED
gpu028 6429959_41.+ COMPLETED
gpu028 6429959_41.0 FAILED
gpu029 6429959_42 CANCELLED+
gpu029 6429959_42.+ CANCELLED
gpu029 6429959_42.+ COMPLETED
gpu029 6429959_42.0 CANCELLED
gpu032 6429959_43 CANCELLED+
gpu032 6429959_43.+ CANCELLED
gpu032 6429959_43.+ COMPLETED
gpu032 6429959_43.0 FAILED
gpu030 6429959_44 FAILED
gpu030 6429959_44.+ FAILED
gpu030 6429959_44.+ COMPLETED
gpu030 6429959_45 FAILED
gpu030 6429959_... FAILED
gpu030 6429959_571 FAILED
gpu030 6429959_571+ FAILED
gpu030 6429959_571+ COMPLETED
gpu027 6429959_572 CANCELLED+
gpu027 6429959_572+ CANCELLED
gpu027 6429959_572+ COMPLETED
gpu027 6429959_572+ FAILED
gpu030 6429959_573 FAILED
gpu030 6429959_573+ FAILED
gpu030 6429959_573+ COMPLETED
gpu030 6429959_... FAILED
gpu030 6429959_764 FAILED
gpu030 6429959_764+ FAILED
...
It seems that only the jobs on gpu030 are failing.
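To double-check this impression over the full (non-elided) listing, the sacct output could be tallied per node and state. The following is only a rough Python sketch, not something I have run on the cluster: the function name `tally_by_node` and the `SAMPLE` text are hypothetical, and it assumes the three-column `node,jobid,state` layout shown above. It also assumes that any JobID containing a "." or ending in "+" is a job step (or a truncated step ID) and can be skipped, so that each array task is counted once.

```python
from collections import Counter

def tally_by_node(sacct_text):
    """Count top-level array-task states per node.

    Assumes whitespace-separated `node jobid state` rows, as printed by
    `sacct -o node,jobid,state`. This is a sketch, not a general sacct parser.
    """
    counts = Counter()
    for line in sacct_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank lines, "..." lines, and anything unexpected
        node, jobid, state = fields
        if node == "NodeList" or set(node) == {"-"}:
            continue  # skip the header and the dashed separator row
        # Assumption: IDs with "." or a trailing "+" are job steps
        # (.batch, .extern, .0) or truncated step IDs, not array tasks.
        if "." in jobid or jobid.endswith("+"):
            continue
        # A trailing "+" on the state marks a truncated field (e.g. CANCELLED+).
        counts[(node, state.rstrip("+"))] += 1
    return counts

# A few rows copied from the listing above, for illustration only.
SAMPLE = """\
NodeList JobID State
--------------- ------------ ----------
gpu020 6429959_1 CANCELLED+
gpu020 6429959_1.b+ CANCELLED
gpu020 6429959_1.0 FAILED
gpu030 6429959_3 FAILED
gpu030 6429959_3.b+ FAILED
gpu028 6429959_41 CANCELLED+
"""

if __name__ == "__main__":
    for (node, state), n in sorted(tally_by_node(SAMPLE).items()):
        print(node, state, n)
```

If the truncated IDs become a problem, running sacct with a wider JobID column (for example `-o node,jobid%30,state`) or with `-X` to list only the allocations should make the step-skipping heuristic unnecessary.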
The CUDA version information reported by Julia is:
CUDA runtime 11.8, artifact installation
CUDA driver 12.3
NVIDIA driver 545.23.8
Libraries:
- CUBLAS: 11.11.3
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0
- NVML: 12.0.0+545.23.8
Toolchain:
- Julia: 1.8.5
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86
4 devices:
0: NVIDIA A100-PCIE-40GB (sm_80, 39.550 GiB / 40.000 GiB available)
1: NVIDIA A100-PCIE-40GB (sm_80, 39.550 GiB / 40.000 GiB available)
2: NVIDIA A100-PCIE-40GB (sm_80, 39.550 GiB / 40.000 GiB available)
3: NVIDIA A100-PCIE-40GB (sm_80, 39.550 GiB / 40.000 GiB available)
(tested on gpu030)
I'll try to run my simulation on the reserved node.
Thank you for your help.