Issue with GPU on CentOS7

Pablo.Strasser · May 8, 2019, 11:37am

Hi,

To help you find the problem, here is a minimal example.

First, log in to login node 1 (I just discovered a problem with login node 2, which I will cover in another message).
Then build the official PyTorch image. For ease of use, I uploaded the image to Docker Hub.

singularity build pytorch.simg docker://pablostrasser/pytorch:latest

Create a basic python script:

import torch
cuda = torch.device('cuda')
a=torch.zeros(10,device=cuda)
print(a)

Execute the script:

srun -p kalousis-gpu-EL7 --gres=gpu:1 singularity exec --nv pytorch.simg python /home/strassp6/scratch/pytorchTest.py

The script fails with:

THCudaCheck FAIL file=…/aten/src/THC/THCGeneral.cpp line=51 error=999 : unknown error
Traceback (most recent call last):
  File "/home/strassp6/scratch/pytorchTest.py", line 3, in <module>
    a=torch.zeros(10,device=cuda)
  File "/opt/conda/lib/python3.6/site-packages/torch/cuda/__init__.py", line 163, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (999) : unknown error at …/aten/src/THC/THCGeneral.cpp:51
srun: error: gpu008: task 0: Exited with exit code 1
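
If it helps narrow things down, here is a minimal sketch of a more verbose check that could be run through the same srun/singularity command as the script above. The function name cuda_report is hypothetical, and the script assumes that the NVIDIA device files show up under /dev/nvidia* inside the container; it only uses standard PyTorch calls (torch.cuda.is_available, torch.cuda.device_count, torch.version.cuda) and does not allocate a tensor on the GPU.

```python
import glob
import os


def cuda_report():
    """Collect basic facts about the GPU environment (hypothetical helper)."""
    report = {
        # Slurm normally sets this for the job when --gres=gpu:N is requested.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        # Assumption: the driver exposes its device files under /dev/nvidia*.
        "nvidia_device_files": sorted(glob.glob("/dev/nvidia*")),
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment; report what we have.
        report["torch_version"] = None
        return report

    report["torch_version"] = torch.__version__
    # CUDA version PyTorch was built against (None for CPU-only builds).
    report["torch_cuda_version"] = torch.version.cuda
    try:
        report["cuda_available"] = torch.cuda.is_available()
        report["device_count"] = torch.cuda.device_count()
    except Exception as exc:
        # Record the failure instead of aborting, so the rest is still printed.
        report["cuda_error"] = repr(exc)
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

Comparing the output on a working node and on a failing node might show whether the device files or the visible-devices variable differ between them.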

As a reminder, we had the same problem after the last update.

I hope this helps.
I’m available if you have more questions.

Pablo Strasser