Getting started with TensorFlow on Baobab (ImportError: libcuda.so.1)

Dear all,
I’m just starting to use Baobab. I’m having difficulty running even the TensorFlow helloworld.py example ( https://gitlab.unige.ch/hpc/softs/tree/master/t/tensorflow/hello ). Ideally I would like to use TensorFlow with GPU support (and Python 3). The most fitting module I found on Baobab was:

TensorFlow/1.7.0-Python-3.6.4

In the module info page it tells me to load GCC/6.4.0-2.28 and OpenMPI/2.1.2.
So I tried to do:
module load GCC/6.4.0-2.28 OpenMPI/2.1.2 TensorFlow/1.7.0-Python-3.6.4
srun python helloworld.py
which gave:

ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

So I thought it’s a CUDA issue. I looked at the examples on gitlab and tried running both
testTensorFlow_1.7.0.sh
testTensorFlow_1.10.1.sh

but they both gave me the same error.

Could someone help me out with just doing the TensorFlow ‘Hello World!’? Thanks!

Hi,
I just tested the example and pushed a fix that works here (t/tensorflow/hello · master · Pablo.Strasser / softs · GitLab ). It would be nice if one of the admins could merge this commit.

Note also that these examples should not be run directly as a bash script, i.e. not as:

./testTensorFlow_1.10.1.sh

But submitted with sbatch instead:

sbatch testTensorFlow_1.10.1.sh

Running the script directly ignores some important options that are written as comments for sbatch (for example, the request for a partition with GPUs).
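To illustrate why this matters: to bash, every `#SBATCH` line is an ordinary comment, and only sbatch reads the options from them. The sketch below collects those options from a script text. The function name, the example job script, its partition name and its GPU request are all assumptions made for illustration, not copied from the repository.

```python
def sbatch_directives(script_text):
    """Collect the #SBATCH options from the leading comment block of a script.

    sbatch stops looking for directives at the first line that is a real
    command, so this sketch does the same.
    """
    options = []
    for line in script_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#SBATCH"):
            options.append(stripped[len("#SBATCH"):].strip())
        elif stripped and not stripped.startswith("#"):
            break  # first command reached: later directives are ignored
    return options


# Hypothetical job script: the partition name and the GPU request are
# assumptions for illustration only.
EXAMPLE = """#!/bin/sh
#SBATCH --partition=shared-gpu
#SBATCH --gres=gpu:1
module load GCC/6.4.0-2.28 OpenMPI/2.1.2 TensorFlow/1.7.0-Python-3.6.4
srun python helloworld.py
"""

print(sbatch_directives(EXAMPLE))
```

When the same file is started with `./script.sh`, none of these options is applied, so the job may end up on a node without GPUs.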

I hope this helps.

Hi Pablo, thanks a lot for the fix and the extra note for sbatch. Works perfectly now!

Note also that the TensorFlow installed on Baobab requires CUDA, and I think that, to avoid problems with mixed libraries, CUDA is only installed on the GPU nodes. This means that it is not possible to use this TensorFlow in CPU-only mode on non-GPU nodes.

Hi there,

Thank you for the patch, merged (cf. Merge remote-tracking branch 'gitlab.unige.ch_Pablo.Strasser/master' (d10d722d) · Commits · hpc / softs · GitLab ).

FWIW, to ease this workflow, next time please fork the hpc / softs GitLab project and send a proper merge request.

You are right about the CUDA system libraries, which are available on GPU nodes only. Here is the reason: while moving to CentOS 7 (cf. Baobab migration from CentOS6 to CentOS7 ) we decided to install as little extra software as possible, or, in other words, to have a basic installation shared between all the nodes and the servers as well. CUDA is obviously not part of a basic installation…

From a quick look, it seems that the CUDA application libraries loaded via module do not include libcuda.so*, but only a stub, which should refer to the corresponding system library (my guess is that the latter communicates with the NVIDIA kernel driver).
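A quick way to check whether the node you are on provides the system library is to ask the dynamic loader for it, which is the same lookup that fails in the ImportError above. This is a minimal sketch using only the Python standard library; the function name is hypothetical, and a successful load only suggests, not proves, that a working GPU driver is present.

```python
import ctypes


def cuda_driver_available(libname="libcuda.so.1"):
    """Return True if the dynamic loader can open the given shared library."""
    try:
        ctypes.CDLL(libname)
    except OSError:
        # Same situation as "cannot open shared object file"
        return False
    return True


if __name__ == "__main__":
    if cuda_driver_available():
        print("libcuda.so.1 found: this is probably a GPU node")
    else:
        print("libcuda.so.1 not found: importing this TensorFlow build would fail here")
```

Run through srun with and without a GPU partition, it should report different results, matching the behaviour described in this thread.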

Moreover, according to the upstream documentation (cf. Build from source  |  TensorFlow ) while compiling, TensorFlow creates symbolic links to the CUDA system libraries, which de facto renders the compiled TensorFlow not portable.

Thx, bye,
Luca

In their releases, they normally provide two versions, tensorflow-gpu and tensorflow. I believe that the CPU-only tensorflow version does not link to CUDA or other GPU-only libraries.
I don’t know if there is demand for a CPU-only TensorFlow on the cluster. In any case, I think the documentation (https://baobab.unige.ch/enduser/src/enduser/applications.html#tensorflow) should explicitly state that this module only works on GPU nodes.

Hi there,

Done, here is the current text:

Thx, bye,
Luca