[Solved] Is there a problem with gpu014? Baobab

Is something going on with gpu014 on Baobab?

I had problems with 2 different runs that landed on gpu014 (shared-gpu partition):

1. One OpenMM run failed with "Error initializing CUDA: CUDA_ERROR_NOT_INITIALIZED (3)", although I usually have no problems using CUDA on shared-gpu.

2. One GROMACS run that should take 3-4 hours at most was still not finished after more than 10 hours.

Could it be that the GPUs are not working on that node?

@Yann.Sagon

Hi Maurice,

indeed:

[root@gpu014 ~]# nvidia-smi
Unable to determine the device handle for GPU 0000:01:00.0: GPU is lost.  Reboot the system to recover this GPU

This is the first time I have seen this error!
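
For anyone who wants a job to fail fast on a node in this state, here is a minimal sketch (not the Baobab admins' tooling) that scans the text printed by nvidia-smi for the "GPU is lost" message quoted above. The function names gpu_health and run_nvidia_smi are hypothetical, and matching on the message text is an assumption based only on the error shown in this thread.

```python
import subprocess


def gpu_health(output: str) -> str:
    """Classify nvidia-smi text output as 'lost', 'error' or 'ok'.

    Assumption: the driver reports a dropped card with the wording seen
    in this thread ("GPU is lost"). Other failure modes are not covered.
    """
    lowered = output.lower()
    if "gpu is lost" in lowered:
        return "lost"
    if "unable to determine the device handle" in lowered:
        return "error"
    return "ok"


def run_nvidia_smi() -> str:
    """Run nvidia-smi and return stdout and stderr combined.

    Raises FileNotFoundError if nvidia-smi is not installed on the node.
    """
    result = subprocess.run(
        ["nvidia-smi"], capture_output=True, text=True, check=False
    )
    return result.stdout + result.stderr
```

At the top of a job script one could call gpu_health(run_nvidia_smi()) and stop the job if the result is not "ok", instead of letting the simulation run for hours on a broken node.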

After reboot:

[root@gpu014 ~]# nvidia-smi
Thu May  6 11:50:10 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 455.45.01    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:21:00.0 Off |                  N/A |
| 27%   32C    P8     1W / 250W |      1MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:22:00.0 Off |                  N/A |
| 27%   33C    P8    10W / 250W |      1MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  On   | 00000000:41:00.0 Off |                  N/A |
| 27%   34C    P8    21W / 250W |      1MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  On   | 00000000:81:00.0 Off |                  N/A |
| 27%   33C    P8     9W / 250W |      1MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  GeForce RTX 208...  On   | 00000000:A1:00.0 Off |                  N/A |
| 27%   33C    P8     2W / 250W |      1MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  GeForce RTX 208...  On   | 00000000:C1:00.0 Off |                  N/A |
| 27%   34C    P8    20W / 250W |      1MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  GeForce RTX 208...  On   | 00000000:C2:00.0 Off |                  N/A |
| 27%   33C    P8    21W / 250W |      1MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Thanks for the notification.

Best

A post was split to a new topic: Issue with gpu020

Hi, I didn’t check gpu014 properly. After the reboot, one of the GPU cards disappeared. We are checking what’s going on.
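
A missing card like this is easy to overlook in the full nvidia-smi table. As a rough sketch, the check below counts the lines printed by nvidia-smi -L (one "GPU N: ..." line per visible card) and compares the count with the number of cards the node is supposed to have. The function names and the expected count are assumptions for illustration; the thread does not state how many GPUs gpu014 should have.

```python
import re


def count_listed_gpus(listing: str) -> int:
    """Count the 'GPU <index>: ...' lines in `nvidia-smi -L` output."""
    return sum(
        1 for line in listing.splitlines()
        if re.match(r"GPU \d+:", line.strip())
    )


def missing_gpus(listing: str, expected: int) -> int:
    """Return how many cards are missing compared with `expected`.

    `expected` is supplied by the caller (for example from the node's
    hardware inventory); it is not something nvidia-smi reports.
    """
    return max(expected - count_listed_gpus(listing), 0)
```

The listing text can be obtained with subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout, and a non-zero result from missing_gpus is a reason to drain the node before users land on it.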

After powering the node off and on again, we ran a GPU burn test for two hours on gpu014 and everything was fine. The node has been resumed.

Thank you very much!

Hi Yann,

I have a similar problem with gpu002.