I have a C program with a simple parallelization of a “for loop” (without any communication). My program compiles and starts running correctly, but after a few hours it crashes with the following error message:
"mpirun noticed that process rank 17 with PID 187374 on node node066 exited on signal 9 (Killed)
/var/spool/slurmd/job39940963/slurm_script: line 13: /home/moradine: Is a directory"
Can you give me advice on what the cause of the error might be and how I can go about sorting it out?
Yes, of course! No, just joking, sorry, we need more information to help you.
Please share your sbatch script here. Are you using OpenMPI to parallelize a loop across more than one node? If not, maybe you should use OpenMP instead.
Feel free to share snippets of your code as well if it could help us understand the issue.
Yes, I am using OpenMPI on more than one node. Since my code has several global variables, I didn't manage to get it to work when I previously tried parallelizing it with OpenMP, so I went for zero-communication MPI.
Here is what I have in my batch script:
#!/bin/bash
#SBATCH -J Mnu                # job name
#SBATCH -e Mnu-err_%j.error   # error file
#SBATCH -o Mnu-out_%j.out     # output file
#SBATCH -N 5                  # number of nodes
#SBATCH -n 60                 # total number of tasks
#SBATCH -t 01-00:00 # Runtime in D-HH:MM
#SBATCH --mem-per-cpu=4000
#SBATCH -p dpt-EL7 # Partition to submit to
module load GCC/7.3.0-2.30 OpenMPI/3.1.1 Valgrind/3.14.0 GSL/2.5 OpenBLAS/0.3.1 FFTW/3.3.8
mpirun -n 60 run.exe
The part of my code that is parallelized is the following:
int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int i, j;
    int n1 = 10;
    int n2 = 10;
    double *fsky_sf = make_1Darray(n1);
    double *noise_sf = make_1Darray(n2);

    fsky_sf[0] = 1.;
    noise_sf[0] = 1.;
    for(i = 1; i < n1; i++)
        fsky_sf[i] = 2. * fsky_sf[i-1];
    for(i = 1; i < n2; i++)
        noise_sf[i] = 1./pow(3., i);

    int JJ[6] = {2, 3, 4, 5, 6, 0};
    int p = 0, m;
    int num_elem = n1 * n2 * nlines;
    double **input_mat = make_2Darray(num_elem, 3);
    for(m = 0; m < 6; m++){
        for(i = 0; i < n1; i++){
            for(j = 0; j < n2; j++)
            {
                input_mat[p][0] = fsky_sf[i];
                input_mat[p][1] = noise_sf[j];
                input_mat[p][2] = (double) JJ[m];
                p += 1;
            }
        }
    }

    int start = rank * num_elem/size;
    int end = (rank+1) * num_elem/size;
    for(j = start; j < end; j++)
        InvFisher_matrix_integrated_zsummed(x, input_mat[j], rank);

    MPI_Finalize();
    return 0;
}
The function `InvFisher_matrix_integrated_zsummed()` is defined in another module and has several input variables, which I have shown here as x.
Thanks for the suggestions. After changing my batch script as you suggested and using “srun run.exe” instead of “mpirun -n 60 run.exe”, I get a different error message:
slurmstepd: error: Detected 1 oom-kill event(s) in step 40135269.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: node084: task 16: Out Of Memory
srun: First task exited 30s ago
srun: step:40135269.0 tasks 0-15,17-59: running
srun: step:40135269.0 task 16: exited abnormally
srun: Terminating job step 40135269.0
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
/var/spool/slurmd/job40135269/slurm_script: line 12: /home/moradine: Is a directory
Here is the full path to my batch script: /home/moradine/LIM_proposal_codes/relics_neutrino/TH_ours/batch.slurm
Hello, did you fully read my previous post? The output clearly shows that you are running out of memory. You should request more cores or more memory per core. You can check your job's memory usage with sacct or sstat.
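For example, with the job from your output (job ID 40135269; pick whatever format fields you find useful):

sstat -j 40135269.0 --format=JobID,MaxRSS,MaxRSSTask          # while the step is still running
sacct -j 40135269 --format=JobID,Elapsed,State,MaxRSS,ReqMem  # once the job has finished

MaxRSS shows the peak resident memory of the biggest task, which you can compare with the 4000 MB per CPU you are currently requesting.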
Please try to use correct formatting when posting; it's easier to read.
Thanks for the response. It is indeed clear that I am running out of memory; the question is what the reason behind this is. I have checked that my code doesn't have any memory leaks by running it on 2 cores on my local machine, so it seems the issue is in the resources I am requesting on Baobab. I get the same problem on Baobab when requesting a larger or smaller number of tasks. Can you give me advice on what I should change in my batch script? Another question: why do I get a different error message when I change mpirun to srun, as you suggested?
On Baobab you are given 3 GB per core by default. If you are able to run the code locally, you can check the memory usage with htop and then request the correct amount of memory. You can also increase the memory to a higher value like 16 GB and decrease it if that turns out to be too much.
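For example (a minimal change, assuming the rest of your script stays as it is), you could raise the memory request and resubmit:

#SBATCH --mem-per-cpu=16000   # 16 GB per task; lower it again once you know the real usage

Once a run goes through, compare the MaxRSS reported by sacct with this value and trim the request so you are not blocking memory you don't need.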