Is something going on with gpu014 on Baobab?
I had problems with two different runs that landed on gpu014 (shared-gpu partition):
1. An OpenMM run failed with "Error initializing CUDA: CUDA_ERROR_NOT_INITIALIZED (3)", although I usually have no problems using CUDA on shared-gpu.
2. A GROMACS run that should take 3-4 hours at most wasn’t finished after more than 10 hours.
Could it be that the GPUs are not working on that node?
@Yann.Sagon
Hi Maurice,
indeed:
[root@gpu014 ~]# nvidia-smi
Unable to determine the device handle for GPU 0000:01:00.0: GPU is lost. Reboot the system to recover this GPU
This is the first time I have seen this error!
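For anyone who wants to catch this state from inside a job before losing hours of compute, here is a minimal sketch in Python. It only assumes that `nvidia-smi` is on the PATH of the compute node; the function names `classify_nvidia_smi` and `check_node` and the status strings are made up for illustration, and the marker strings are taken from the error message above.

```python
import subprocess

# Substrings taken from the error shown above; other driver versions
# may word this differently (assumption).
LOST_MARKERS = ("GPU is lost", "Unable to determine the device handle")


def classify_nvidia_smi(returncode, output):
    """Classify one nvidia-smi run as 'lost', 'error' or 'ok'."""
    if any(marker in output for marker in LOST_MARKERS):
        return "lost"
    if returncode != 0:
        return "error"
    return "ok"


def check_node(timeout=60):
    """Run nvidia-smi once and return (status, combined output)."""
    try:
        proc = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=timeout
        )
    except (FileNotFoundError, subprocess.TimeoutExpired) as exc:
        # nvidia-smi missing or hanging is also treated as an unhealthy node.
        return "error", str(exc)
    output = proc.stdout + proc.stderr
    return classify_nvidia_smi(proc.returncode, output), output


if __name__ == "__main__":
    status, _ = check_node()
    print(status)
```

Calling something like this at the start of a job script and exiting early when the status is not "ok" would at least make the failure visible right away.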
After reboot:
[root@gpu014 ~]# nvidia-smi
Thu May 6 11:50:10 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 455.45.01    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:21:00.0 Off |                  N/A |
| 27%   32C    P8     1W / 250W |      1MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:22:00.0 Off |                  N/A |
| 27%   33C    P8    10W / 250W |      1MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  On   | 00000000:41:00.0 Off |                  N/A |
| 27%   34C    P8    21W / 250W |      1MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  On   | 00000000:81:00.0 Off |                  N/A |
| 27%   33C    P8     9W / 250W |      1MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  GeForce RTX 208...  On   | 00000000:A1:00.0 Off |                  N/A |
| 27%   33C    P8     2W / 250W |      1MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  GeForce RTX 208...  On   | 00000000:C1:00.0 Off |                  N/A |
| 27%   34C    P8    20W / 250W |      1MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  GeForce RTX 208...  On   | 00000000:C2:00.0 Off |                  N/A |
| 27%   33C    P8    21W / 250W |      1MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Thanks for the notification.
Best
A post was split to a new topic: Issue with gpu020
Hi, I didn’t check gpu014 carefully enough. After the reboot, one of the GPU cards disappeared. We are checking what’s going on.
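A missing card like this is easy to overlook in the full table, so here is a rough Python sketch that counts the GPUs the driver reports and compares the count against what the node should have. It assumes `nvidia-smi` is available and uses its `--query-gpu` CSV output; the helper names and the expected count of 8 for gpu014 are my assumptions (7 cards are listed after the reboot and one is missing), not something confirmed here.

```python
import subprocess

# CSV query: one line per visible GPU, e.g. "0, 00000000:21:00.0"
QUERY = ["nvidia-smi", "--query-gpu=index,pci.bus_id", "--format=csv,noheader"]


def parse_gpu_list(csv_text):
    """Parse 'index, bus_id' lines into a list of (int, str) tuples."""
    gpus = []
    for line in csv_text.splitlines():
        index, sep, bus_id = line.partition(",")
        index = index.strip()
        # Skip blank lines and anything that is not a GPU row (e.g. error text).
        if not sep or not index.isdigit():
            continue
        gpus.append((int(index), bus_id.strip()))
    return gpus


def missing_gpu_count(csv_text, expected):
    """Number of GPUs that are expected but not reported by the driver."""
    return max(0, expected - len(parse_gpu_list(csv_text)))


def query_gpus(timeout=60):
    """Run the query on the local node and return the parsed GPU list."""
    proc = subprocess.run(QUERY, capture_output=True, text=True, timeout=timeout)
    return parse_gpu_list(proc.stdout)


# Bus IDs copied from the nvidia-smi table posted after the reboot.
SAMPLE = """0, 00000000:21:00.0
1, 00000000:22:00.0
2, 00000000:41:00.0
3, 00000000:81:00.0
4, 00000000:A1:00.0
5, 00000000:C1:00.0
6, 00000000:C2:00.0
"""

EXPECTED_GPUS = 8  # assumption for gpu014, see the note above
print(missing_gpu_count(SAMPLE, EXPECTED_GPUS))
```

On a real node, `query_gpus()` would replace the pasted sample; a non-zero result means at least one card is not visible to the driver.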
After powering the node off and on again, we ran a GPU burn test for two hours on gpu014 and everything was fine. The node has been resumed.
Hi Yann,
I have a similar problem with gpu002.