cuDNN/libgpuarray issues on RTX nodes?

David.Droz · March 2, 2021, 9:57am

Dear HPC team,

I’m training a simple neural network using the Keras, Theano, CUDA and libgpuarray libraries. My “module” calls look like this:

module load GCC/6.4.0-2.28 OpenMPI/2.1.2 Python/3.6.4
module load Theano/1.0.1-Python-3.6.4
module load Keras/2.1.6-Python-3.6.4
module load cuDNN/7.0.5-CUDA-9.1.85 libgpuarray/0.7.5
module load matplotlib/2.1.2-Python-3.6.4

which you can find under /home/drozd/analysis/runs/run_23Feb21_XinWeightNoCorr/testGPU

I tried running on the three kinds of GPU resources available: titan, pascal and rtx (you can find the logs in the same directory). On the first two it works well, and a message in the logs confirms that the GPU was found:

Mapped name None to device cuda: Tesla P100-PCIE-12GB (0000:05:00.0)

However, running on RTX, I get a large error message:

ERROR (theano.gpuarray): Could not initialize pygpu, support disabled
Traceback (most recent call last):
  File "/opt/ebsofts/MPI/GCC/6.4.0-2.28/OpenMPI/2.1.2/Theano/1.0.1-Python-3.6.4/lib/python3.6/site-packages/Theano-1.0.1-py3.6.egg/theano/gpuarray/__init__.py", line 227, in <module>
    use(config.device)
  File "/opt/ebsofts/MPI/GCC/6.4.0-2.28/OpenMPI/2.1.2/Theano/1.0.1-Python-3.6.4/lib/python3.6/site-packages/Theano-1.0.1-py3.6.egg/theano/gpuarray/__init__.py", line 214, in use
    init_dev(device, preallocate=preallocate)
  File "/opt/ebsofts/MPI/GCC/6.4.0-2.28/OpenMPI/2.1.2/Theano/1.0.1-Python-3.6.4/lib/python3.6/site-packages/Theano-1.0.1-py3.6.egg/theano/gpuarray/__init__.py", line 159, in init_dev
    pygpu.blas.gemm(0, tmp, tmp, 0, tmp, overwrite_c=True)
  File "pygpu/blas.pyx", line 149, in pygpu.blas.gemm
  File "pygpu/blas.pyx", line 47, in pygpu.blas.pygpu_blas_rgemm
pygpu.gpuarray.GpuArrayException: (b'nvrtcCompileProgram: NVRTC_ERROR_INVALID_OPTION', 3)

This is followed by some more messages; see /home/drozd/analysis/runs/run_23Feb21_XinWeightNoCorr/testGPU/log_rtx_43589994.out

The immediate workaround would be to not use the RTX nodes. On that topic, is there a way to ask SLURM for either Titan or Pascal? I know you can use e.g. gres=gpu:pascal:1, and I tried adding Titan as a comma-separated value, but SLURM didn’t recognise the option.

I’m not sure whether later versions of CUDA and libgpuarray would solve the issue, and in any case they are not compatible with the installed Theano version.
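
One plausible reading of the NVRTC error, not confirmed anywhere in this thread: the loaded CUDA 9.1 toolkit predates the Turing architecture of the RTX cards (compute capability 7.5), for which support arrived with CUDA 10.0, so its runtime compiler may reject the architecture option requested for that device. The sketch below only encodes that version rule; the function names `min_cuda_for_capability` and `cuda_supports` are hypothetical helpers, and the table is limited to the capabilities relevant here.

```shell
#!/bin/bash
# Minimum CUDA toolkit release able to target a given compute capability.
# Assumption: only the capabilities on these GPU nodes (plus 7.0) are listed.
min_cuda_for_capability() {
    case "$1" in
        6.0|6.1) echo "8.0" ;;
        7.0)     echo "9.0" ;;
        7.5)     echo "10.0" ;;
        *)       return 1 ;;
    esac
}

# Succeeds if toolkit version $1 is at least the minimum for capability $2.
# Returns 2 if the capability is not in the table above.
cuda_supports() {
    local min
    min="$(min_cuda_for_capability "$2")" || return 2
    # sort -V orders version strings; the minimum must come out first.
    [ "$(printf '%s\n' "$min" "$1" | sort -V | head -n 1)" = "$min" ]
}

cuda_supports 9.1 6.0 && echo "CUDA 9.1 can target capability 6.0"
cuda_supports 9.1 7.5 || echo "CUDA 9.1 cannot target capability 7.5"
```

If this reading is right, the failure comes from the toolkit version and not from the RTX nodes themselves, which is consistent with the job running fine on the Titan and Pascal cards.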

Yann.Sagon · March 2, 2021, 6:10pm

Hello,

It’s not possible to put two types in the GRES request, but you can ask for a specific compute capability instead.
To do so, select nodes by feature with the --constraint option; several features can be combined with | (OR).
Example:

#SBATCH --gres=gpu:1
#SBATCH --constraint="COMPUTE_CAPABILITY_6_0|COMPUTE_CAPABILITY_6_1"
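
The same OR expression can also be passed on the sbatch command line. A small sketch, assuming the feature names defined on this cluster; `build_constraint` is a hypothetical helper (not a SLURM command) and `job.sh` is a placeholder script name:

```shell
#!/bin/bash
# Join feature names with '|', the OR operator of SLURM's --constraint.
build_constraint() {
    local IFS='|'
    printf '%s\n' "$*"
}

# Avoid the RTX nodes by asking for compute capability 6.0 or 6.1:
# sbatch --gres=gpu:1 \
#     --constraint="$(build_constraint COMPUTE_CAPABILITY_6_0 COMPUTE_CAPABILITY_6_1)" \
#     job.sh
```

The COMPUTE_TYPE_* features reported by scontrol should be usable in the same way, e.g. COMPUTE_TYPE_TITAN and COMPUTE_TYPE_PASCAL.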

[sagon@login2 ~] $ scontrol show node gpu[001-017] | grep AvailableFeatures
   AvailableFeatures=E5-2643V3,V5,COMPUTE_CAPABILITY_6_1,COMPUTE_TYPE_TITAN
   AvailableFeatures=E5-2630V4,V6,COMPUTE_CAPABILITY_6_0,COMPUTE_TYPE_PASCAL
   AvailableFeatures=E5-2630V4,V6,COMPUTE_CAPABILITY_6_0,COMPUTE_TYPE_PASCAL
   AvailableFeatures=E5-2630V4,V6,COMPUTE_CAPABILITY_6_0,COMPUTE_TYPE_PASCAL
   AvailableFeatures=E5-2630V4,V6,COMPUTE_CAPABILITY_6_0,COMPUTE_TYPE_PASCAL
   AvailableFeatures=E5-2630V4,V6,COMPUTE_CAPABILITY_6_0,COMPUTE_TYPE_TITAN
   AvailableFeatures=E5-2630V4,V6,COMPUTE_CAPABILITY_6_1,COMPUTE_TYPE_TITAN
   AvailableFeatures=E5-2630V4,V6,COMPUTE_CAPABILITY_6_1,COMPUTE_TYPE_TITAN
   AvailableFeatures=EPYC-7601,V7,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_RTX
   AvailableFeatures=E5-2643V3,V5,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_RTX
   AvailableFeatures=EPYC-7742,V8,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_RTX
   AvailableFeatures=EPYC-7742,V8,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_RTX
   AvailableFeatures=EPYC-7742,V8,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_RTX
   AvailableFeatures=EPYC-7742,V8,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_RTX
   AvailableFeatures=EPYC-7742,V8,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_RTX
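
To see at a glance which capability and type combinations exist before writing a constraint, the listing above can be condensed. A minimal sketch, assuming the scontrol output format shown above; the function name `summarise_gpu_features` is hypothetical:

```shell
#!/bin/bash
# Count nodes per (compute capability, GPU type) pair from
# `scontrol show node` output read on stdin.
summarise_gpu_features() {
    grep 'AvailableFeatures=' \
        | sed 's/.*AvailableFeatures=//' \
        | awk -F, '{
            cap = ""; gtype = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^COMPUTE_CAPABILITY_/) cap = $i
                if ($i ~ /^COMPUTE_TYPE_/) gtype = $i
            }
            count[cap " " gtype]++
        }
        END { for (k in count) print count[k], k }' \
        | sort -k2
}

# On the cluster:
# scontrol show node gpu[001-017] | summarise_gpu_features
```

Each output line holds a node count followed by the capability and type features, which makes it easy to pick the feature names for --constraint.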