Should I be worried about the fork() warning?

Hi all,

I’m running some TensorFlow and Keras code and I get the following warning, though the code seems to run fine. Is this something I should be worried about?

----------------------------------------------------------------------------------------
A process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:          [[40761,0],0] (PID 9964)

If you are absolutely sure that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
----------------------------------------------------------------------------------------

The script I run with sbatch is based on https://gitlab.unige.ch/hpc/softs/blob/master/t/tensorflow/hello/testTensorFlow_1.7.0.sh and looks like:

#!/bin/sh

#SBATCH --cpus-per-task=1
#SBATCH --job-name=testTensorFlow
#SBATCH --ntasks=1
#SBATCH --time=11:58:00
#SBATCH --output=slurm-%J.out
#SBATCH --gres=gpu:titan:1
#SBATCH --constraint="V5|V6"
#SBATCH --partition=shared-gpu-EL7

## TensorFlow
module load GCC/6.4.0-2.28 OpenMPI/2.1.2 TensorFlow/1.7.0-Python-3.6.4 matplotlib/2.1.2-Python-3.6.4 Keras/2.1.6-Python-3.6.4
## CUDA
module load cuDNN/7.0.5-CUDA-9.1.85

srun python loop.py

Thanks!

Hi,
I’m not an expert in using TensorFlow together with MPI.
However, I can give you the following guideline: writing a correct multithreaded program can be difficult, and a program can seem to run correctly 95% of the time but corrupt data the rest of the time. The main difficulty of multithreaded programming is that the order in which different threads execute is not guaranteed unless the programmer enforces it, which can lead to various problems such as deadlocks, data races, and data corruption. These problems may become likely only when the number of threads is high and the load is heavy. I think this warning was added to make users aware of the “risk” of using this kind of call.
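
To illustrate the ordering problem in general terms, here is a minimal Python sketch using only the standard library. It is not related to Open MPI or TensorFlow internals; run_counter and its parameters are names I made up for the example. Several threads increment a shared counter, and without a lock some updates may be lost depending on how the threads happen to interleave.

```python
import threading


def run_counter(num_threads=8, increments=10_000, use_lock=True):
    """Increment a shared counter from several threads; return the final value."""
    state = {"count": 0}
    lock = threading.Lock()

    def worker():
        for _ in range(increments):
            if use_lock:
                # The lock makes the read-modify-write below atomic.
                with lock:
                    state["count"] += 1
            else:
                # Unsynchronized read-modify-write: an update made by another
                # thread between these two lines is overwritten and lost.
                current = state["count"]
                state["count"] = current + 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["count"]


if __name__ == "__main__":
    expected = 8 * 10_000
    print("with lock:   ", run_counter(use_lock=True), "expected", expected)
    print("without lock:", run_counter(use_lock=False), "expected", expected)
```

The locked version always reaches the expected total. The unlocked version may or may not, and it can look correct on many runs, which is exactly what makes this kind of bug hard to spot.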

For this kind of operation, I normally recommend using supported, premade functions as much as possible.
Could you also give us more information about what you are trying to accomplish, as well as the code of loop.py?

Hi Pablo,

Thanks for the insight!
I use mostly premade functions. Basically, loop.py contains a for loop. In each iteration I run model.fit_generator on a Keras neural network called model. Typically I have, say, 30 different models that I loop through like this. Once I have finished looping through the 30, I go a few more rounds and train more to refine the parameters. So basically:

for i in range(times_to_run):
    for j in range(no_of_models):
        load model j as saved in pass i-1 (or initialize random weights if i == 0)
        train model j
        save model parameters for next pass

I have a custom loss function, but I only use Keras backend functions in it, so I guess it should be alright.
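
A slightly more concrete sketch of that structure is below. The Keras-specific steps are passed in as callables; load_or_init, train and save are placeholder names for this example, not Keras functions, and the stand-in "models" in the demo are plain dictionaries rather than real networks.

```python
def run_passes(times_to_run, no_of_models, load_or_init, train, save):
    """Train every model once per pass, carrying state from pass to pass.

    load_or_init(i, j), train(model) and save(model, i, j) are hypothetical
    hooks; in loop.py they would wrap the Keras loading, model.fit_generator
    and saving steps.
    """
    for i in range(times_to_run):
        for j in range(no_of_models):
            # Pass 0 starts from fresh weights; later passes reload pass i-1.
            model = load_or_init(i, j)
            train(model)
            save(model, i, j)


if __name__ == "__main__":
    # Stand-in "models": dictionaries counting how often each was trained.
    store = {}

    def load_or_init(i, j):
        return store.get(j, {"id": j, "rounds": 0})

    def train(model):
        model["rounds"] += 1

    def save(model, i, j):
        store[j] = model

    run_passes(3, 30, load_or_init, train, save)
    print(store[0]["rounds"])
```

Written this way, everything runs strictly one step after another in a single Python process.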

Given your use case, if the calls you make are asynchronous, make sure to put a barrier (https://en.wikipedia.org/wiki/Barrier_(computer_science)) before or after the inner for loop, to ensure that the computations for every model in the previous pass have finished before moving on to the next pass.
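
As a rough illustration of what I mean, here is a sketch using threading.Barrier from the Python standard library. It assumes the models are trained concurrently, one thread per model; a plain sequential for loop already runs one step after another and would not need this. train_in_passes and train_one are hypothetical names for the example, and train_one stands in for whatever trains model j in pass i.

```python
import threading


def train_in_passes(times_to_run, no_of_models, train_one):
    """Run train_one(i, j) in one thread per model, with a barrier per pass."""
    barrier = threading.Barrier(no_of_models)

    def worker(j):
        try:
            for i in range(times_to_run):
                train_one(i, j)
                # Block here until every model has finished pass i.
                barrier.wait()
        except threading.BrokenBarrierError:
            # Another worker failed; stop instead of waiting forever.
            return
        except Exception:
            # Release the other workers before re-raising the error.
            barrier.abort()
            raise

    threads = [
        threading.Thread(target=worker, args=(j,)) for j in range(no_of_models)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    log = []
    train_in_passes(3, 4, lambda i, j: log.append((i, j)))
    # Within a pass the model order may vary, but passes never overlap.
    print(log)
```

The order of models inside one pass is not fixed, but no thread can start pass i+1 until all of them have finished pass i.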
