Hi Hugues and Luca,
I created a Python environment (3.7.4) with tensorflow / tensorflow-gpu == 2.4.1, but when I run my script on the debug-gpu partition
after module load fosscuda/2019b
I get the following error. How do I set up the environment correctly? Here is the log output:
2021-07-25 16:36:20.099390: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-25 16:36:28.150631: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-07-25 16:36:28.153085: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-07-25 16:36:28.194712: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:1a:00.0 name: NVIDIA TITAN RTX computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 23.65GiB deviceMemoryBandwidth: 625.94GiB/s
2021-07-25 16:36:28.194749: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-25 16:36:28.199869: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-07-25 16:36:28.200015: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-07-25 16:36:28.203326: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-07-25 16:36:28.205168: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-07-25 16:36:28.208032: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-07-25 16:36:28.209667: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-07-25 16:36:28.210380: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/ebsofts/ScaLAPACK/2.0.2-gompic-2019b/lib:/opt/ebsofts/FFTW/3.3.8-gompic-2019b/lib:/opt/ebsofts/OpenBLAS/0.3.7-GCC-8.3.0/lib:/opt/ebsofts/OpenMPI/3.1.4-gcccuda-2019b/lib:/opt/ebsofts/hwloc/1.11.12-GCCcore-8.3.0/lib:/opt/ebsofts/libpciaccess/0.14-GCCcore-8.3.0/lib:/opt/ebsofts/libxml2/2.9.9-GCCcore-8.3.0/lib:/opt/ebsofts/Compiler/GCCcore/8.3.0/XZ/5.2.4/lib:/opt/ebsofts/numactl/2.0.12-GCCcore-8.3.0/lib:/opt/ebsofts/Compiler/GCC/8.3.0/CUDA/10.1.243/nvvm/lib64:/opt/ebsofts/Compiler/GCC/8.3.0/CUDA/10.1.243/extras/CUPTI/lib64:/opt/ebsofts/Compiler/GCC/8.3.0/CUDA/10.1.243/lib64:/opt/ebsofts/Compiler/GCCcore/8.3.0/binutils/2.32/lib:/opt/ebsofts/Compiler/GCCcore/8.3.0/zlib/1.2.11/lib:/opt/ebsofts/Core/GCCcore/8.3.0/lib/gcc/x86_64-pc-linux-gnu/8.3.0:/opt/ebsofts/Core/GCCcore/8.3.0/lib64:/opt/ebsofts/Core/GCCcore/8.3.0/lib
2021-07-25 16:36:28.210457: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
edit: I have created a new thread to avoid mixing content in the same thread. YS
See ^ (not sure if you got notified)
Hi Simone,
could you please post your sbatch script?
Best
Yann
According to Build from source | TensorFlow you need CUDA 11 or later for TF 2.4.
Please see here to determine which module you need to load to have CUDA 11 or later.
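If you want to double-check what your installed wheel expects, TensorFlow (2.3 and later, as far as I know) can report the CUDA and cuDNN versions it was built against through tf.sysconfig.get_build_info(). Here is a small sketch; describe_tf_build is a hypothetical helper name, and it returns None when TensorFlow cannot be imported in the current environment.

```python
def describe_tf_build():
    """Return the CUDA/cuDNN versions the installed TensorFlow was built against.

    Returns None if TensorFlow is not installed in this environment.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None
    info = tf.sysconfig.get_build_info()
    # CPU-only builds may not define the CUDA keys, hence .get().
    keys = ("is_cuda_build", "cuda_version", "cudnn_version")
    return {key: info.get(key) for key in keys}


print(describe_tf_build())
```

Run it inside the same Python environment as your job, then pick the CUDA and cuDNN modules that match the reported versions.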
The documentation you linked says to use fosscuda/2020a
for CUDA 11.0, which is compatible with TensorFlow 2.4. However, as before, when I submit this script with sbatch
#!/bin/bash
#SBATCH --account=meynet
#SBATCH --partition=debug-gpu
#SBATCH -N 1
#SBATCH --gpus=1
#SBATCH --cpus-per-task 1
#SBATCH --ntasks-per-node 1
#SBATCH --time=00:01:00
#SBATCH --job-name="psygrid_\${SLURM_ARRAY_TASK_ID}"
#SBATCH --output=/srv/beegfs/scratch/users/b/bavera/g2net/modelling/logs/gw_modelling_%a.out
#SBATCH --mail-type=ALL
#SBATCH --mail-user=simone.bavera@unige.ch
#SBATCH --mem-per-cpu=32G
module load fosscuda/2020a
srun python /srv/beegfs/scratch/users/b/bavera/g2net/modelling/script.py
I get the following error
2021-07-26 11:59:27.383879: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-26 11:59:33.625158: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-07-26 11:59:33.628068: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-07-26 11:59:33.671992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:1a:00.0 name: NVIDIA TITAN RTX computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 23.65GiB deviceMemoryBandwidth: 625.94GiB/s
2021-07-26 11:59:33.672030: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-26 11:59:33.677107: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-07-26 11:59:33.677228: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-07-26 11:59:33.679671: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-07-26 11:59:33.681759: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-07-26 11:59:33.685812: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-07-26 11:59:33.687730: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-07-26 11:59:33.688662: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/ebsofts/ScaLAPACK/2.1.0-gompic-2020a/lib:/opt/ebsofts/FFTW/3.3.8-gompic-2020a/lib:/opt/ebsofts/OpenBLAS/0.3.9-GCC-9.3.0/lib:/opt/ebsofts/OpenMPI/4.0.3-gcccuda-2020a/lib:/opt/ebsofts/PMIx/3.1.5-GCCcore-9.3.0/lib:/opt/ebsofts/libfabric/1.11.0-GCCcore-9.3.0/lib:/opt/ebsofts/UCX/1.8.0-GCCcore-9.3.0-CUDA-11.0.2/lib:/opt/ebsofts/GDRCopy/2.1-GCCcore-9.3.0-CUDA-11.0.2/lib64:/opt/ebsofts/Check/0.15.2-GCCcore-9.3.0/lib:/opt/ebsofts/libevent/2.1.11-GCCcore-9.3.0/lib:/opt/ebsofts/hwloc/2.2.0-GCCcore-9.3.0/lib:/opt/ebsofts/libpciaccess/0.16-GCCcore-9.3.0/lib:/opt/ebsofts/libxml2/2.9.10-GCCcore-9.3.0/lib:/opt/ebsofts/XZ/5.2.5-GCCcore-9.3.0/lib:/opt/ebsofts/numactl/2.0.13-GCCcore-9.3.0/lib:/opt/ebsofts/CUDAcore/11.0.2/nvvm/lib64:/opt/ebsofts/CUDAcore/11.0.2/extras/CUPTI/lib64:/opt/ebsofts/CUDAcore/11.0.2/lib64:/opt/ebsofts/binutils/2.34-GCCcore-9.3.0/lib:/opt/ebsofts/zlib/1.2.11-GCCcore-9.3.0/lib:/opt/ebsofts/GCCcore/9.3.0/lib64:/opt/ebsofts/GCCcore/9.3.0/lib
2021-07-26 11:59:33.688745: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
Hi,
It seems you forgot to load cuDNN version 8.x:
ml cuDNN/8.0.4.30-CUDA-11.0.2
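To confirm that the module is actually visible to your job, you can look for libcudnn.so.8 in the directories listed in LD_LIBRARY_PATH, which is the variable the log message prints. Below is a minimal Python sketch using only the standard library; the helper name find_library is my own and not part of any package. It only scans LD_LIBRARY_PATH, whereas the dynamic loader also consults other locations, so a None result is a hint rather than proof.

```python
import os
from pathlib import Path


def find_library(name, search_path=None):
    """Return the first path in search_path that contains `name`, else None.

    search_path is a colon-separated list of directories; it defaults to
    the current value of LD_LIBRARY_PATH.
    """
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a trailing ':'
        candidate = Path(directory) / name
        if candidate.exists():
            return candidate
    return None


print(find_library("libcudnn.so.8"))
```

Call it at the top of your script, after the module load lines in the sbatch file have run: if it prints None, the cuDNN module was probably not loaded in that job.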
Thanks Yann, you are a hero, this solved my problem!
Follow-up question: is there a way to get rid of the following message about CPU optimisation?
2021-07-26 14:29:50.647183: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-07-26 14:29:50.647723: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
Yeah
For the XLA stuff, you should check the documentation, such as XLA | TensorFlow. Or maybe someone here has more experience and can help you.
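As for the CPU message itself: it is logged at INFO level (the "I" after the timestamp), so it is informational rather than an error. It says the binary already uses AVX2, AVX512F and FMA in the oneDNN-backed operations, and that rebuilding from source would only extend this to other operations. If you simply want it out of your logs, one option is the TF_CPP_MIN_LOG_LEVEL environment variable, which has to be set before TensorFlow is imported. A minimal sketch, assuming you set it at the very top of your script:

```python
import os

# Must be set before TensorFlow is imported.
# "0" shows everything, "1" hides INFO lines, "2" also hides WARNING lines,
# "3" also hides ERROR lines.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "1"

# import tensorflow as tf  # import TensorFlow only after this point
```

Note that level "1" hides all INFO lines, including the useful "Successfully opened dynamic library" ones, so you may prefer to leave it unset while debugging the GPU setup.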