Hello,
There is a problem with CUDA + Singularity on gpu010.
Running my simple test script, I obtain the following error:
[strassp6@login1 ~]$ srun -p cui-gpu-EL7 --gres=gpu:1 singularity exec --nv /home/strassp6/scratch/pytorch.simg python /home/strassp6/pytorchCheck.py
srun: job 18319242 queued and waiting for resources
srun: job 18319242 has been allocated resources
THCudaCheck FAIL file=…/aten/src/THC/THCGeneral.cpp line=51 error=999 : unknown error
Traceback (most recent call last):
File "/home/strassp6/pytorchCheck.py", line 3, in <module>
a = torch.zeros(10,device=cuda)
File "/opt/conda/lib/python3.6/site-packages/torch/cuda/__init__.py", line 163, in _lazy_init
torch._C._cuda_init()
RuntimeError: cuda runtime error (999) : unknown error at …/aten/src/THC/THCGeneral.cpp:51
srun: error: gpu010: task 0: Exited with exit code 1
The test code is the following:
import torch
cuda = torch.device('cuda')
a = torch.zeros(10,device=cuda)
print(a)
The Singularity image was built with:
singularity build pytorch.simg docker://pablostrasser/pytorch:latest
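For reference, here is a slightly more verbose variant of the check (a hypothetical sketch, not the script I actually ran; it only assumes the standard torch API) that prints version and device information before attempting the allocation, which can help narrow down where CUDA initialisation fails:
import torch
# Report the PyTorch build and the CUDA version it was compiled against.
print('torch:', torch.__version__)
print('CUDA (as built):', torch.version.cuda)
# Only try to allocate on the GPU if the runtime can actually see one.
if torch.cuda.is_available():
    print('visible GPUs:', torch.cuda.device_count())
    cuda = torch.device('cuda')
    a = torch.zeros(10, device=cuda)
    print(a)
else:
    print('CUDA is not available inside the container')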
Thanks in advance for your help.
Sorry, I used the wrong terminal command; in fact, the problem is still there:
srun -p shared-gpu-EL7 --nodelist=gpu010 --gres=gpu:1 singularity exec --nv /home/strassp6/scratch/pytorch.simg python /home/strassp6/pytorchCheck.py
THCudaCheck FAIL file=…/aten/src/THC/THCGeneral.cpp line=51 error=999 : unknown error
Traceback (most recent call last):
File "/home/strassp6/pytorchCheck.py", line 3, in <module>
a = torch.zeros(10,device=cuda)
File "/opt/conda/lib/python3.6/site-packages/torch/cuda/__init__.py", line 163, in _lazy_init
torch._C._cuda_init()
RuntimeError: cuda runtime error (999) : unknown error at …/aten/src/THC/THCGeneral.cpp:51
srun: error: gpu010: task 0: Exited with exit code 1
Hi there,
NB: this seems to be exactly the same problem as Issue with GPU on CentOS7.
The CUDA upstream deviceQuery does not report any error; test case available at https://gitlab.unige.ch/hpc/softs/tree/ff8b7626113206871ad380ad496b327bc8fa7aa8/c/cuda (launched on gpu010/Slurm-18517239, gpu009/Slurm-18517272 and gpu008/Slurm-18517273).
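For the record, a minimal Python stand-in for that check could look like the sketch below; it asks the CUDA runtime directly via ctypes, assumes an unversioned libcudart.so is resolvable inside the --nv container (it may be e.g. libcudart.so.9.0 in practice), and a status of 999 there would point at the runtime/driver rather than at PyTorch itself:
import ctypes
# Hypothetical minimal stand-in for deviceQuery: ask the CUDA runtime itself
# how many devices it can see.
cudart = ctypes.CDLL('libcudart.so')
count = ctypes.c_int()
status = cudart.cudaGetDeviceCount(ctypes.byref(count))
# Status 0 means cudaSuccess; 999 is the "unknown error" from the traceback.
print('cudaGetDeviceCount status:', status, 'devices:', count.value)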
Now back to PyTorch:
- a simple torch.cuda.device_count works with module:PyTorch/0.3.0-Python-3.6.4 (see the sketch after this list); test case available at https://gitlab.unige.ch/hpc/softs/tree/3de4a730f5d8c617e2586fda7058bb7ae0eeb66b/p/pytorch (launched on gpu010/Slurm-18574803, gpu009/Slurm-18574804 and gpu008/Slurm-18574805).
- your PyTorch-in-Singularity test works as well; test case available at https://gitlab.unige.ch/hpc/softs/commit/b9973e982654776742faefd79f016777e9ad56e6 (launched on gpu010/Slurm-18693217, gpu009/Slurm-18693287 and gpu008/Slurm-18693288, after having built the image as you suggested).
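For completeness, the device_count check boils down to something like this minimal sketch (it only assumes the standard torch.cuda API, not the exact contents of the linked test cases):
import torch
# Count and name the GPUs the CUDA runtime exposes to this Slurm job.
n = torch.cuda.device_count()
print('device count:', n)
for i in range(n):
    print(i, torch.cuda.get_device_name(i))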
@Pablo.Strasser, can you please test again with a clean build?
Thx, bye,
Luca
I have a test job in the queue. I will tell you if it works.
Pablo
OK, I just confirmed it works now. The problem is solved.