Hello,
There is a problem with CUDA + Singularity on gpu010.
Running my simple test script, I obtain the following error:
[strassp6@login1 ~]$ srun -p cui-gpu-EL7 --gres=gpu:1 singularity exec --nv /home/strassp6/scratch/pytorch.simg python /home/strassp6/pytorchCheck.py
srun: job 18319242 queued and waiting for resources
srun: job 18319242 has been allocated resources
THCudaCheck FAIL file=…/aten/src/THC/THCGeneral.cpp line=51 error=999 : unknown error
Traceback (most recent call last):
File "/home/strassp6/pytorchCheck.py", line 3, in <module>
a = torch.zeros(10,device=cuda)
File "/opt/conda/lib/python3.6/site-packages/torch/cuda/__init__.py", line 163, in _lazy_init
torch._C._cuda_init()
RuntimeError: cuda runtime error (999) : unknown error at …/aten/src/THC/THCGeneral.cpp:51
srun: error: gpu010: task 0: Exited with exit code 1
The test code is the following:
import torch
cuda = torch.device('cuda')
a = torch.zeros(10,device=cuda)
print(a)
The Singularity image was built with:
singularity build pytorch.simg docker://pablostrasser/pytorch:latest
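For reference, here is a slightly more verbose variant of the check (a hypothetical sketch, not the script I actually ran; it only assumes the standard torch API) that prints version and device information before attempting the allocation, which can help narrow down where CUDA initialisation fails:
import torch
# Report the PyTorch build and the CUDA version it was compiled against.
print('torch:', torch.__version__)
print('CUDA (as built):', torch.version.cuda)
# Only try to allocate on the GPU if the runtime can actually see one.
if torch.cuda.is_available():
    print('visible GPUs:', torch.cuda.device_count())
    cuda = torch.device('cuda')
    a = torch.zeros(10, device=cuda)
    print(a)
else:
    print('CUDA is not available inside the container')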
Thanks in advance for your help.
Sorry, I used the wrong terminal command; in fact, the problem is still there:
srun -p shared-gpu-EL7 --nodelist=gpu010 --gres=gpu:1 singularity exec --nv /home/strassp6/scratch/pytorch.simg python /home/strassp6/pytorchCheck.py
THCudaCheck FAIL file=…/aten/src/THC/THCGeneral.cpp line=51 error=999 : unknown error
Traceback (most recent call last):
File "/home/strassp6/pytorchCheck.py", line 3, in <module>
a = torch.zeros(10,device=cuda)
File "/opt/conda/lib/python3.6/site-packages/torch/cuda/__init__.py", line 163, in _lazy_init
torch._C._cuda_init()
RuntimeError: cuda runtime error (999) : unknown error at …/aten/src/THC/THCGeneral.cpp:51
srun: error: gpu010: task 0: Exited with exit code 1
Hi there,
NB: this seems to be exactly the same problem as Issue with GPU on CentOS7.
The CUDA upstream deviceQuery does not report any error; test case available at https://gitlab.unige.ch/hpc/softs/tree/ff8b7626113206871ad380ad496b327bc8fa7aa8/c/cuda (launched on gpu010/Slurm-18517239, gpu009/Slurm-18517272 and gpu008/Slurm-18517273).
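For the record, a minimal Python stand-in for that check could look like the sketch below; it asks the CUDA runtime directly via ctypes, assumes an unversioned libcudart.so is resolvable inside the --nv container (it may be e.g. libcudart.so.9.0 in practice), and a status of 999 there would point at the runtime/driver rather than at PyTorch itself:
import ctypes
# Hypothetical minimal stand-in for deviceQuery: ask the CUDA runtime itself
# how many devices it can see.
cudart = ctypes.CDLL('libcudart.so')
count = ctypes.c_int()
status = cudart.cudaGetDeviceCount(ctypes.byref(count))
# Status 0 means cudaSuccess; 999 is the "unknown error" from the traceback.
print('cudaGetDeviceCount status:', status, 'devices:', count.value)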
Now back to PyTorch:
- a simple torch.cuda.device_count works with module:PyTorch/0.3.0-Python-3.6.4 (see the sketch after this list); test case available at https://gitlab.unige.ch/hpc/softs/tree/3de4a730f5d8c617e2586fda7058bb7ae0eeb66b/p/pytorch (launched on gpu010/Slurm-18574803, gpu009/Slurm-18574804 and gpu008/Slurm-18574805).
- your PyTorch-in-Singularity test works as well; test case available at https://gitlab.unige.ch/hpc/softs/commit/b9973e982654776742faefd79f016777e9ad56e6 (launched on gpu010/Slurm-18693217, gpu009/Slurm-18693287 and gpu008/Slurm-18693288, after having built the image as you suggested).
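For completeness, the device_count check boils down to something like this minimal sketch (it only assumes the standard torch.cuda API, not the exact contents of the linked test cases):
import torch
# Count and name the GPUs the CUDA runtime exposes to this Slurm job.
n = torch.cuda.device_count()
print('device count:', n)
for i in range(n):
    print(i, torch.cuda.get_device_name(i))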
@Pablo.Strasser, can you please test again with a clean build?
Thx, bye,
Luca
I have a test job in the queue. I will tell you if it works.
Pablo
OK, I just confirmed it works now. The problem is solved.