Strange crash using OpenMPI, help with error message?

Hi everyone,

I have a code that uses OpenMPI (and OpenMP) to run a parallel simulation of deforming three-dimensional meshes. Until recently the code was working fine, but I have just encountered a situation (a specific input file) for which it crashes reproducibly. The error message comes from OpenMPI and is complete gibberish to me. I link it below, in the hope that someone understands what it means. Also, here is the bash script I use to launch the simulation (as you can see, it requires many CPUs):

#!/bin/sh
#SBATCH --job-name cyst_ad2
#SBATCH --error out_err
#SBATCH --output out_stdout
#SBATCH --ntasks 42
#SBATCH --cpus-per-task 2
#SBATCH --partition shared-cpu
# Only run on CPUs of generation 3 or later, because gen 2 does not have AVX
#SBATCH --constraint="V3|V4|V5|V6|V7|V8|V9"
#SBATCH --time 12:00:00
#SBATCH --mem-per-cpu=1000

srun ./iasTissueSimulation config.cfg

Here is a link to the error message, which is too big for the forum character limit: https://pastebin.com/raw/xbbVSKBm

Does this ring a bell for anyone?

Many thanks,
Quentin

Hi,

This is not included in your sbatch script: which module line are you using? And are you sure the module was loaded before you submitted the job?

As a best practice, include the module load line in your sbatch script, so you can be sure the module is loaded when the job runs.

Best

Hi, yes, sorry, I did not mention the modules. For this simulation I only need to load foss/2020b, which consists of:

  1) GCCcore/10.2.0   4) GCC/10.2.0       7) libxml2/2.9.10     10) libevent/2.1.12   13) PMIx/3.1.5       16) FFTW/3.3.8
  2) zlib/1.2.11      5) numactl/2.0.13   8) libpciaccess/0.16  11) UCX/1.9.0         14) OpenMPI/4.0.5    17) ScaLAPACK/2.1.0
  3) binutils/2.35    6) XZ/5.2.5         9) hwloc/2.2.0        12) libfabric/1.11.0  15) OpenBLAS/0.3.12  18) foss/2020b

My simulation also relies on two external libraries, Trilinos and VTK, both of which I ended up compiling myself instead of loading as modules. I don’t know if this helps.
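
For reference, here is a minimal sketch of the launch script with the module load made explicit, as suggested above. It assumes the cluster provides the usual `module` command inside batch jobs. The `module purge` and `module list` lines are added assumptions, not something from the thread, and `job.sh` is a hypothetical file name. The script is written out with a here-document so it can be inspected before it is submitted:

```shell
#!/bin/sh
# Write the job script to a file (job.sh is a hypothetical name).
cat > job.sh <<'EOF'
#!/bin/sh
#SBATCH --job-name cyst_ad2
#SBATCH --error out_err
#SBATCH --output out_stdout
#SBATCH --ntasks 42
#SBATCH --cpus-per-task 2
#SBATCH --partition shared-cpu
#SBATCH --constraint="V3|V4|V5|V6|V7|V8|V9"
#SBATCH --time 12:00:00
#SBATCH --mem-per-cpu=1000

# Start from a clean environment, then load the toolchain explicitly,
# so the job does not depend on what was loaded in the submitting shell.
module purge
module load foss/2020b
# Record the loaded modules in the job output for later debugging.
module list 2>&1

srun ./iasTissueSimulation config.cfg
EOF

# Submit with: sbatch job.sh
```

With the module lines inside the script, the job output also records which toolchain was actually active when the crash happened.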

Hi,

You said it crashes in a reproducible way: can you make it crash on a couple of identified nodes, ideally the smallest possible number of nodes?
As the nodes are heterogeneous, we should then be able to see whether the software crashes on specific hardware.
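
One possible way to set this up, as a sketch: resubmit the same job once per CPU generation, then look at which nodes each run landed on. The feature names V3 to V9 come from the launch script earlier in the thread. `job.sh` is a hypothetical file name for that script, and the job and output names are made up for illustration. The loop only prints the `sbatch` commands, so they can be checked before running them:

```shell
#!/bin/sh
# Print one sbatch command per CPU generation taken from the original
# --constraint line. Options given on the command line take precedence
# over the #SBATCH lines inside the script.
# job.sh is a hypothetical name for the launch script.
for gen in V3 V4 V5 V6 V7 V8 V9; do
    echo "sbatch --constraint=$gen --job-name=cyst_$gen --output=out_$gen --error=err_$gen job.sh"
done > submit_cmds.txt
cat submit_cmds.txt

# After the jobs finish, sacct shows where each one ran and how it ended:
#   sacct -j <jobid> --format=JobID,NodeList,State,ExitCode
# A crashing job can then be pinned to the same nodes with:
#   sbatch --nodelist=<nodes> job.sh
```

Whether 42 tasks fit on nodes of a single generation depends on the partition, so some of these submissions may stay pending.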

Yeah, that’s a good idea! I also had in mind to launch the same code on a different cluster, to see whether that changes the outcome.

I found a way around the bug for now, so it’s not the highest priority to solve it, but I’ll post here again if I am able to do some of these tests.

Thanks!