New software installed: TensorFlow 2.0 for CUDA

Dear users,

finally, it’s here on Baobab! TensorFlow 2.0 for CUDA is installed:

module load fosscuda/2019b TensorFlow/2.0.0-Python-3.7.4
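
For a quick sanity check from a GPU node, something along these lines should print the devices TensorFlow can see (an illustrative one-liner, not part of the module itself):

python -c "import tensorflow as tf; print(tf.config.experimental.list_physical_devices('GPU'))"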

Best


Dear Yann,

thank you for installing this milestone release.
However, I have come across a problem with it under Slurm.

When requesting an interactive GPU node with salloc and importing TensorFlow in Python, the following error occurs:

Python 3.7.4 (default, Nov  7 2019, 17:45:19) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2020-01-15 08:46:56.071056: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
[gpu002.cluster:28041] OPAL ERROR: Not initialized in file pmix2x_client.c at line 112
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[gpu002.cluster:28041] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

When performing the same task with an sbatch job submission, the problem does not occur, though Open MPI prints a warning to the screen concerning a call to fork().
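
For reference, the sbatch submissions that work for me look roughly like this (the script name is a placeholder; partition, GPU type, time, memory and modules are the same as in the interactive test below):

#!/bin/bash
#SBATCH --partition=dpnc-gpu-EL7
#SBATCH --gres=gpu:titan:1
#SBATCH --time=12:00:00
#SBATCH --mem=20G

module load GCC/8.3.0 CUDA/10.1.243 OpenMPI/3.1.4
module load TensorFlow/2.0.0-Python-3.7.4

# run the Python script directly (no srun); only the fork() warning appears
python my_script.py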

For now I can run things in batch mode or with Singularity; however, for debugging it would be great to understand the issue with srun and whether there is a workaround.

Cheers
Johnny


Interactive node tested with:

[login2] $> salloc -n1 --partition=dpnc-gpu-EL7 --time=12:00:00 --mem=20G --gres=gpu:titan:1 srun -n1 -N1 --pty $SHELL
[gpu002] $> module load GCC/8.3.0  CUDA/10.1.243  OpenMPI/3.1.4
[gpu002] $> module load TensorFlow/2.0.0-Python-3.7.4
[gpu002] $> python
>>> import tensorflow as tf

Dear Johnny,

it seems OpenMPI/3.1.4 was built without PMI support. I’m not sure why it was working with sbatch. In any case, I have rebuilt it with PMI support and your interactive workflow now works for me. Please give it a try. The fork() warning is still present, but that is related to TensorFlow.
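
For the record, the relevant part of the rebuild is the Open MPI configure step, roughly as below (the PMI prefix and install prefix are placeholders and depend on where Slurm's PMI library lives on the build host):

./configure --with-slurm --with-pmi=/usr --prefix=$HOME/openmpi-3.1.4-pmi   # paths are placeholders
make -j8 && make install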

Best

Dear Yann,

thanks for looking into this.
I have now come across a peculiar problem when using TensorFlow in an interactive GPU session.

When I load the modules and open Python, I can import TensorFlow; the fork() warning is printed, but everything is perfectly usable.
However, if I exit Python and start a new Python session, importing TensorFlow crashes with the previous error about OpenMPI being compiled without PMI support.

Any insight you might have would be fantastic. Let me know if there is anything I can clarify; the printouts are below. I loaded TensorFlow/2.0.0-Python-3.7.4 and its prerequisites.

Cheers,
Johnny

[raine@gpu003:~]$ python
Python 3.7.4 (default, Nov  7 2019, 17:45:19) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2020-01-28 14:48:35.219878: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
--------------------------------------------------------------------------
A process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:          [[53312,0],0] (PID 31888)

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
>>> 
[raine@gpu003:~]$ python
Python 3.7.4 (default, Nov  7 2019, 17:45:19) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2020-01-28 14:48:47.660170: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
--------------------------------------------------------------------------
PMI2_Init failed to intialize.  Return code: 14

--------------------------------------------------------------------------
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[gpu003.cluster:31904] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
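
(Side note for posterity: as the warning text above says, the fork() message itself can apparently be silenced by setting the mpi_warn_on_fork MCA parameter to 0, for example via the environment; this does not address the PMI2_Init failure, though.)

export OMPI_MCA_mpi_warn_on_fork=0   # silences only the Open MPI fork() warning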

To follow up on this, it appears other users outside of Baobab have had the same issue. I don’t know the solution.
Maybe the next release of “something” will fix it!

Best

Hi Yann,

good to know, hopefully a fix comes soon!
I have also noticed this behaviour when using a Jupyter notebook (with a tunnel) and trying to import this version of TensorFlow. With an older version of TensorFlow, interactive notebooks run smoothly and I can see that I have access to the GPUs.
I will see whether I can get this to work with Singularity.
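
For reference, what I have in mind with Singularity is something along these lines (the image path is a placeholder; --nv is needed so the container sees the GPUs):

singularity exec --nv /path/to/tensorflow-2.0.sif python -c "import tensorflow as tf; print(tf.__version__)"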

Cheers,
Johnny

For posterity, here is the error that crashes the kernel when importing TensorFlow in a Jupyter notebook.

[I 12:05:52.146 NotebookApp] Kernel started: f9daf158-7b87-4891-8f9b-b1504337938e
2020-03-06 12:05:54.115890: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
--------------------------------------------------------------------------
PMI2_Init failed to intialize.  Return code: 14

--------------------------------------------------------------------------
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
(from slurm-jupyter-30443630.out)