Hi,
To help you find the problem, here is a minimal example.
- First, log in to login node 1 (I just discovered a problem with login node 2; I will cover it in a separate message).
- Build the official PyTorch image. For ease of use, I uploaded the image to Docker Hub.
singularity build pytorch.simg docker://pablostrasser/pytorch:latest
- Create a basic Python script:
import torch
cuda = torch.device('cuda')
a=torch.zeros(10,device=cuda)
print(a)
- Execute the script:
srun -p kalousis-gpu-EL7 --gres=gpu:1 singularity exec --nv pytorch.simg python /home/strassp6/scratch/pytorchTest.py
The script fails with:
THCudaCheck FAIL file=…/aten/src/THC/THCGeneral.cpp line=51 error=999 : unknown error
Traceback (most recent call last):
  File "/home/strassp6/scratch/pytorchTest.py", line 3, in <module>
    a=torch.zeros(10,device=cuda)
  File "/opt/conda/lib/python3.6/site-packages/torch/cuda/__init__.py", line 163, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (999) : unknown error at …/aten/src/THC/THCGeneral.cpp:51
srun: error: gpu008: task 0: Exited with exit code 1
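In case it helps to narrow this down, a slightly more defensive version of the test script could report what the container sees before and while touching the GPU. This is only a sketch: the function name `cuda_report` and the decision to look at `/dev/nvidia*` and `nvidia-smi` are my own assumptions about what might be useful, not something the error message points to.

```python
import glob
import shutil


def cuda_report():
    """Collect basic facts about GPU visibility without raising."""
    report = {
        # NVIDIA device nodes visible from inside the container (assumed useful to inspect).
        "device_nodes": sorted(glob.glob("/dev/nvidia*")),
        # Full path to nvidia-smi if it is on PATH, otherwise None.
        "nvidia_smi": shutil.which("nvidia-smi"),
        "torch_installed": False,
        "cuda_available": None,
        "init_error": None,
    }
    try:
        import torch
    except ImportError:
        return report

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # CUDA version PyTorch was built against (None for CPU-only builds).
    report["torch_cuda_build"] = torch.version.cuda
    try:
        report["cuda_available"] = torch.cuda.is_available()
        if report["cuda_available"]:
            report["device_count"] = torch.cuda.device_count()
            # Same allocation as in the failing script.
            torch.zeros(10, device=torch.device("cuda"))
    except RuntimeError as exc:
        # Keep the message instead of crashing, so the rest of the report is printed.
        report["init_error"] = str(exc)
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

It can be run with the same srun and singularity exec --nv command as above, pointing at this file instead of pytorchTest.py.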
As a reminder, we had the same problem after the last update.
I hope this helps.
I’m available if you have more questions.
Pablo Strasser