Dear HPC community,
I wanted to try out the PyTorch modules ("PyTorch/1.6.0-Python-3.7.4" and "PyTorch/1.4.0-Python-3.7.4") on Yggdrasil, but unfortunately they did not work as expected: I got the following output on the debug-gpu partition (NVIDIA Titan RTX):
Output file content (PyTorch160.o) coming from Yggdrasil
Hostname: gpu001.yggdrasil

Python 3.7.4
THCudaCheck FAIL file=…/aten/src/THC/THCGeneral.cpp line=47 error=100 : no CUDA-capable device is detected
x: cpu
tensor([[0.8530, 0.0887, 0.4857],
[0.4920, 0.5917, 0.1314],
[0.6559, 0.5153, 0.5836]])
y: cpu
tensor([[0.7562, 0.6895, 0.2346],
[0.7555, 0.7269, 0.7955],
[0.1314, 0.5913, 0.6127]])
z=x+y: cpu
tensor([[1.6092, 0.7782, 0.7203],
[1.2475, 1.3187, 0.9269],
[0.7873, 1.1065, 1.1963]])
Is CUDA available?: False
Traceback (most recent call last):
  File "PyTorchTest.py", line 18, in <module>
    a = torch.rand(3,3, device='cuda:0'); b = torch.rand(3,3, device='cuda:0')
  File "/opt/ebsofts/PyTorch/1.6.0-fosscuda-2019b-Python-3.7.4/lib/python3.7/site-packages/torch/cuda/__init__.py", line 190, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (100) : no CUDA-capable device is detected at …/aten/src/THC/THCGeneral.cpp:47
srun: error: gpu001: task 0: Exited with exit code 1
My Python script has the following form:
Script (PyTorchTest.py)
import torch

# Testing PyTorch
x = torch.rand(3,3); y = torch.rand(3,3)
print('x:', x.device, sep='\t'); print(x)
print('y:', y.device, sep='\t'); print(y)
z = x+y
print('z=x+y:', z.device, sep='\t'); print(z)

# Testing the presence of CUDA
print('Is CUDA available?: ', torch.cuda.is_available())
a = torch.rand(3,3, device='cuda:0'); b = torch.rand(3,3, device='cuda:0')
print('a:', a.device, sep='\t'); print(a)
print('b:', b.device, sep='\t'); print(b)
d = a+b
print('d=a+b:', d.device, sep='\t'); print(d)
I am submitting the job with the modules indicated in the documentation:
Job submission (Run160.sh)
#!/bin/sh
#SBATCH --job-name=PyTorchTest
#SBATCH --output=PyTorch160.o
#SBATCH --time=0-00:01:00
#SBATCH --partition=debug-gpu
#SBATCH --ntasks=1
module load GCC/8.3.0 CUDA/10.1.243 OpenMPI/3.1.4
module load PyTorch/1.6.0-Python-3.7.4
echo "Hostname: $(hostname -f)"
echo $CUDA_VISIBLE_DEVICES
srun python --version
srun python PyTorchTest.py
I tried the same scripts on Baobab and they worked without any problem, even on an NVIDIA RTX 2080 Ti, which has the same architecture (Turing) and compute capability (7.5) as the debug GPU on Yggdrasil (NVIDIA Titan RTX) [Source]:
Output file content (PyTorch160.o) coming from Baobab
Hostname: gpu013.cluster
0
Python 3.7.4
x: cpu
tensor([[0.0963, 0.5734, 0.7547],
[0.6056, 0.9441, 0.3672],
[0.7199, 0.7243, 0.1161]])
y: cpu
tensor([[0.3504, 0.9175, 0.3713],
[0.0679, 0.1985, 0.0360],
[0.4479, 0.4846, 0.9477]])
z=x+y: cpu
tensor([[0.4466, 1.4909, 1.1260],
[0.6735, 1.1426, 0.4031],
[1.1678, 1.2090, 1.0638]])
Is CUDA available?: True
a: cuda:0
tensor([[0.3627, 0.1694, 0.3453],
[0.2654, 0.1715, 0.5954],
[0.9009, 0.9110, 0.6349]], device='cuda:0')
b: cuda:0
tensor([[0.4165, 0.4420, 0.2125],
[0.3598, 0.3475, 0.7647],
[0.4621, 0.9076, 0.7042]], device='cuda:0')
d=a+b: cuda:0
tensor([[0.7793, 0.6114, 0.5578],
[0.6252, 0.5190, 1.3601],
[1.3630, 1.8185, 1.3391]], device='cuda:0')
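For completeness, the compute capability can also be read directly through the standard torch.cuda API on a node where CUDA initialises correctly (a quick sketch; I would expect (7, 5) for both cards):

import torch

# On a node where CUDA works (e.g. on Baobab), this should report the
# Turing generation, i.e. compute capability (7, 5), for both the
# RTX 2080 Ti and the Titan RTX.
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))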
Since the NVIDIA Titan RTX GPU is CUDA-capable [Source], maybe there is a problem with the drivers, or maybe PyTorch simply needs to be rebuilt on Yggdrasil?
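In case it helps to narrow this down, here is a small diagnostic I could run on the GPU node (only a sketch: it assumes the same module environment as above and that nvidia-smi is available on the node):

import os
import subprocess
import torch

# Versions PyTorch itself was built against
print('PyTorch version:', torch.__version__)
print('CUDA version PyTorch was compiled with:', torch.version.cuda)

# What the job actually sees
print('CUDA_VISIBLE_DEVICES:', os.environ.get('CUDA_VISIBLE_DEVICES'))
print('Device count:', torch.cuda.device_count())

# Driver version as reported by the node (assumes nvidia-smi is on the PATH)
print(subprocess.run(['nvidia-smi'], capture_output=True, text=True).stdout)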
Thank you in advance for your help.
Best regards,
Y. ABIPOUR