Run-time problem with an MPI program

Hi,

I have a C program with a simple parallelization of a “for loop” (without any communication). My program compiles and starts running correctly but after a few hours, the program crashes with the following error message:

"mpirun noticed that process rank 17 with PID 187374 on node node066 exited on signal 9 (Killed)

/var/spool/slurmd/job39940963/slurm_script: line 13: /home/moradine: Is a directory"

Can you give me advice on what can be the cause of the error and how I can go about sorting it out?

Thanks,
Azadeh

Dear Azadeh,

Yes, of course! No, just joking, sorry: we need more information to help you.

Please share your sbatch script here. Are you using OpenMPI to parallelize a loop across more than one node? If this is not the case, maybe you should use OpenMP instead.

Feel free to share snippets of your code as well if it could help us understand the issue.

Best

Hi Yann,

Yes, I am using OpenMPI on more than one node. Since my code has several global variables, I didn't manage to get it to work when I previously tried parallelizing it with OpenMP. So I went for zero-communication MPI.

Here is what I have in my batch script:

#!/bin/bash
#SBATCH -J Mnu               #Jobname
#SBATCH -e Mnu-err_%j.error  #Jobname error file
#SBATCH -o Mnu-out_%j.out    #Jobname out file  
#SBATCH -N 5                     #number of nodes 
#SBATCH -n 60                    #total number of tasks
#SBATCH -t 01-00:00              # Runtime in D-HH:MM
#SBATCH --mem-per-cpu=4000
#SBATCH -p dpt-EL7              # Partition to submit to

module load GCC/7.3.0-2.30  OpenMPI/3.1.1 Valgrind/3.14.0 GSL/2.5 OpenBLAS/0.3.1 FFTW/3.3.8
mpirun -n 60 run.exe

The part of my code that is parallelized is the following:

#include <mpi.h>
#include <math.h>

/* make_1Darray(), make_2Darray(), nlines and InvFisher_matrix_integrated_zsummed()
   are defined in other modules of the code */

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int i, j;
    int n1 = 10;
    int n2 = 10;
    double *fsky_sf  = make_1Darray(n1);
    double *noise_sf = make_1Darray(n2);

    fsky_sf[0]  = 1.;
    noise_sf[0] = 1.;
    for(i = 1; i < n1; i++)
        fsky_sf[i] = 2. * fsky_sf[i-1];
    for(i = 1; i < n2; i++)
        noise_sf[i] = 1./pow(3., i);
    int JJ[6] = {2, 3, 4, 5, 6, 0};

    /* build the full table of parameter combinations on every rank */
    int p = 0, m;
    int num_elem = n1 * n2 * nlines;
    double **input_mat = make_2Darray(num_elem, 3);
    for(m = 0; m < 6; m++){
        for(i = 0; i < n1; i++){
            for(j = 0; j < n2; j++)
            {
                input_mat[p][0] = fsky_sf[i];
                input_mat[p][1] = noise_sf[j];
                input_mat[p][2] = (double) JJ[m];
                p += 1;
            }
        }
    }

    /* each rank processes its own chunk; assumes num_elem is divisible by size */
    int start = rank * num_elem/size;
    int end   = (rank+1) * num_elem/size;

    for(j = start; j < end; j++)
        InvFisher_matrix_integrated_zsummed(x, input_mat[j], rank);  /* x stands for several other inputs, see note below */

    MPI_Finalize();

    return 0;
}

The function `InvFisher_matrix_integrated_zsummed()` is defined in another module and has several input variables, which I have shown here as x.

Hello,

thanks for posting your sbatch script.

Your sbatch script has only 12 lines. Maybe you modified it in the meantime?

Do not specify the number of nodes; only specify the number of tasks. Not every node has 12 CPUs.

Replace mpirun -n 60 with srun, without specifying the number of tasks.
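For example, adapting the script you posted (same partition, modules and memory request; this is only a sketch, adjust it to your needs):

#!/bin/bash
#SBATCH -J Mnu               #Jobname
#SBATCH -e Mnu-err_%j.error  #Jobname error file
#SBATCH -o Mnu-out_%j.out    #Jobname out file
#SBATCH -n 60                #total number of tasks; no -N, let Slurm choose the nodes
#SBATCH -t 01-00:00          # Runtime in D-HH:MM
#SBATCH --mem-per-cpu=4000
#SBATCH -p dpt-EL7           # Partition to submit to

module load GCC/7.3.0-2.30  OpenMPI/3.1.1 Valgrind/3.14.0 GSL/2.5 OpenBLAS/0.3.1 FFTW/3.3.8
srun run.exe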

You can check why the job crashed:

[root@master pillar]# sacct --format=Start,AveCPU,State%20,MaxRSS,JobID,NodeList%30,ReqMem --units=G -j 39940963
              Start     AveCPU                State     MaxRSS        JobID                       NodeList     ReqMem
------------------- ---------- -------------------- ---------- ------------ ------------------------------ ----------
2020-11-04T19:22:31                          FAILED            39940963          node[064,066,123,134,147]     3.91Gc
2020-11-04T19:22:31 1-17:08:20               FAILED     47.98G 39940963.ba+                        node064     3.91Gc
2020-11-04T19:22:34 1-13:11:29        OUT_OF_MEMORY     48.38G 39940963.0            node[066,123,134,147]     3.91Gc

Let us know if you still have the issue. If this is the case, please give me the full path of your sbatch script.

Best

Yann

Hi,

Thanks for the suggestions. After changing my batch script as you suggested and using “srun run.exe”, I get a different error message:

slurmstepd: error: Detected 1 oom-kill event(s) in step 40135269.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: node084: task 16: Out Of Memory
srun: First task exited 30s ago
srun: step:40135269.0 tasks 0-15,17-59: running
srun: step:40135269.0 task 16: exited abnormally
srun: Terminating job step 40135269.0
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
/var/spool/slurmd/job40135269/slurm_script: line 12: /home/moradine: Is a directory

Here is the full path to my batch script: /home/moradine/LIM_proposal_codes/relics_neutrino/TH_ours/batch.slurm

Thanks,
Azadeh

Hi,

Did you fully read my previous post? I showed you the output of sacct, which clearly shows that you are running out of memory. You should ask for more cores or more memory per core. You can check the memory usage of your job with sacct or sstat.
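For example, while the job is still running (using your job ID here just as an illustration):

sstat --format=JobID,MaxRSS,AveCPU -j 40135269.0

and once it has finished:

sacct --format=JobID,State,MaxRSS,ReqMem --units=G -j 40135269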

Please try to use correct formatting when posting; it’s easier to read.

Best

Hi Yann,

Thanks for the response. It is indeed clear that I am running out of memory. However, the question is what the reason behind this is. I have checked that my code doesn’t have any memory leaks by running it on 2 cores on my local machine. So it seems the issue is in the resources I am asking for on Baobab. I have checked that I get the same problem when requesting a larger or smaller number of tasks. Can you give advice on what I should change in my batch script? Another question I have is why, when I change mpirun to srun as you suggested, I get a different error message.

Thanks,
Azadeh

On Baobab you are given 3 GB per core by default. If you are able to run the code locally, you can check the memory usage with htop and then request the right amount of memory. You can also increase the memory to a higher value such as 16 GB and decrease it if it turns out to be too much.
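For example, in your sbatch script (a sketch; tune the value to what htop reports):

#SBATCH --mem-per-cpu=16000      # in MB, i.e. roughly 16 GB per core instead of the 4000 MB requested before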