Getting started with TensorFlow on Baobab (ImportError: libcuda.so.1)

Tamas.Krivachy · October 7, 2019, 3:22pm

Dear all,
I’m just starting to use Baobab and I’m having difficulty running even the TensorFlow helloworld.py example ( https://gitlab.unige.ch/hpc/softs/tree/master/t/tensorflow/hello ). Ideally I would like to use TensorFlow with GPU support (and Python 3). The most fitting module I found on Baobab was:

TensorFlow/1.7.0-Python-3.6.4

The module info page tells me to load GCC/6.4.0-2.28 and OpenMPI/2.1.2.
So I tried to do:
module load GCC/6.4.0-2.28 OpenMPI/2.1.2 TensorFlow/1.7.0-Python-3.6.4
srun python helloworld.py
which gave:

ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

So I thought it’s a CUDA issue. I looked at the examples on gitlab and tried running both
testTensorFlow_1.7.0.sh
testTensorFlow_1.10.1.sh

but they both gave me the same error.

Could someone help me out with just doing the TensorFlow ‘Hello World!’? Thanks!

Pablo.Strasser · October 9, 2019, 9:35am

Hi,
I just tested the example and pushed a fix that works here (t/tensorflow/hello · master · Pablo.Strasser / softs · GitLab ). It would be nice if one of the admins could merge this commit.

Note also that these examples should not be run directly as a bash script, i.e.

./testTensorFlow_1.10.1.sh

but submitted with sbatch:

sbatch testTensorFlow_1.10.1.sh

Running the script directly ignores some important options that are written as comments for sbatch (such as the request for a partition with GPUs).
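
To illustrate the point: scheduler options sit on lines starting with #SBATCH, which a shell treats as ordinary comments, so only sbatch acts on them. Here is a minimal Python sketch that lists those directives for a script. The sample script content, the partition name and the helper name `sbatch_directives` are hypothetical and not taken from the repository.

```python
def sbatch_directives(script_text):
    """Return the options found on '#SBATCH' lines of a batch script."""
    directives = []
    for line in script_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines do not end the directive header
        if not stripped.startswith("#"):
            # sbatch stops reading directives at the first real command.
            break
        if stripped.startswith("#SBATCH"):
            option = stripped[len("#SBATCH"):].strip()
            if option:
                directives.append(option)
    return directives


# Hypothetical script content; the real test scripts may differ.
SAMPLE_SCRIPT = """#!/bin/sh
#SBATCH --partition=some-gpu-partition
#SBATCH --gres=gpu:1

srun python helloworld.py
"""

if __name__ == "__main__":
    for option in sbatch_directives(SAMPLE_SCRIPT):
        print("ignored when run as ./script.sh:", option)
```

Everything the helper returns is what the scheduler would see with sbatch and what is silently skipped when the file is executed directly.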

I hope this helps.

Tamas.Krivachy · October 9, 2019, 12:21pm

Hi Pablo, thanks a lot for the fix and the extra note for sbatch. Works perfectly now!

Pablo.Strasser · October 9, 2019, 1:00pm

Note also that the TensorFlow installed on Baobab requires CUDA and, to avoid mixing problems between libraries, I think CUDA is only installed on the GPU nodes. This means that it is not possible to use this TensorFlow in CPU-only mode on non-GPU nodes.
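
A quick way to check this on a given node, before importing TensorFlow, is to try loading the driver library named in the error message. This is only a sketch using Python's standard ctypes module; the function name `cuda_driver_available` is made up for illustration, and a successful load only shows that the library is present, not that a GPU is usable.

```python
import ctypes


def cuda_driver_available(library_name="libcuda.so.1"):
    """Return True if the CUDA driver library can be loaded on this node."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # Same underlying failure as the ImportError reported above.
        return False
    return True


if __name__ == "__main__":
    if cuda_driver_available():
        print("libcuda.so.1 found: this looks like a GPU node")
    else:
        print("libcuda.so.1 missing: the TensorFlow import is expected to fail here")
```

Running it through srun on the same partition as the job shows what the job itself will see.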

Luca.Capello · October 9, 2019, 2:06pm

Hi there,

Thank you for the patch, merged (cf. Merge remote-tracking branch 'gitlab.unige.ch_Pablo.Strasser/master' (d10d722d) · Commits · hpc / softs · GitLab ).

FWIW, to ease this workflow, next time please fork the hpc / softs project on GitLab and send a proper merge request.

You are right about the CUDA system libraries, which are available on GPU nodes only. Here is the reason: while moving to CentOS 7 (cf. Baobab migration from CentOS6 to CentOS7 ) we decided to install as little extra software as possible or, in other words, to have a basic installation shared between all the nodes and the servers as well. CUDA is obviously not part of a basic installation…

From a quick look, it seems that the CUDA application libraries loaded via module do not include libcuda.so* , but only a stub, which should refer to the corresponding system library (my guess is that the latter communicates with the NVIDIA kernel driver).

Moreover, according to the upstream documentation (cf. Build from source | TensorFlow ) while compiling, TensorFlow creates symbolic links to the CUDA system libraries, which de facto renders the compiled TensorFlow not portable.

Thx, bye,
Luca

Pablo.Strasser · October 9, 2019, 2:39pm

In their releases, they normally provide two versions, tensorflow-gpu and tensorflow. I believe that the CPU-only tensorflow version does not link to CUDA or other GPU-only libraries.
I don’t know if there is a demand for having a CPU-only TensorFlow installed on the cluster. In any case, I think the documentation (https://baobab.unige.ch/enduser/src/enduser/applications.html#tensorflow) should explicitly state that this module only works on GPU nodes.
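
For anyone unsure which kind of build a given module provides, TensorFlow can report it through `tf.test.is_built_with_cuda()`. The wrapper below is only a sketch: the function name `tensorflow_cuda_build` is hypothetical, and it returns None when the import itself fails, which is what is expected on a non-GPU node with the current module.

```python
def tensorflow_cuda_build():
    """Return True/False for a CUDA/non-CUDA build, or None if the import fails."""
    try:
        import tensorflow as tf
    except ImportError:
        # Covers "not installed" as well as the missing libcuda.so.1 case.
        return None
    return tf.test.is_built_with_cuda()


if __name__ == "__main__":
    print("TensorFlow built with CUDA:", tensorflow_cuda_build())
```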

Luca.Capello · October 14, 2019, 12:32pm

Hi there,

Done, here is the current text:

Thx, bye,
Luca