Strange crash using OpenMPI, help with error message?

Hi everyone,

I have a code that uses OpenMPI (and OpenMP) to run parallel simulations of deforming three-dimensional meshes. Until recently the code was working fine, but I have now hit a situation (a specific input file) where it crashes reproducibly. The error message comes from OpenMPI and is complete gibberish to me; I link it below, in the hope that someone understands what it means. Here is the bash script I use to launch the simulation (as you can see, it requires many CPUs):

#SBATCH --job-name cyst_ad2
#SBATCH --error out_err
#SBATCH --output out_stdout
#SBATCH --ntasks 42
#SBATCH --cpus-per-task 2
#SBATCH --partition shared-cpu
# Restrict execution to CPUs of generation 3 or later, because gen 2 does not have AVX
#SBATCH --constraint="V3|V4|V5|V6|V7|V8|V9"
#SBATCH --time 12:00:00
#SBATCH --mem-per-cpu=1000

srun ./iasTissueSimulation config.cfg

Here is a link to the error message, since it is too big for the forum character limit -->

Does this ring a bell for anyone?

Many thanks,


As this is not included in your sbatch: what module line are you using? And are you sure the module was loaded before submitting the job?

As a best practice, it is good to include the module load line in the sbatch script itself, to be sure it is loaded.
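For instance, something like this at the top of the script (the module name here is just an example; use whatever your code actually needs):

```shell
#!/bin/bash
#SBATCH --job-name cyst_ad2
# ... other SBATCH directives ...

# Load the toolchain explicitly so the job does not depend on
# whatever happened to be loaded in the submission shell
module purge
module load foss/2020b   # example; replace with the module(s) your code needs

srun ./iasTissueSimulation config.cfg
```

That way the job environment is reproducible regardless of the shell you submit from.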


Hi, yes, sorry I did not mention the modules. For this simulation I simply need to load foss/2020b, which is composed of:

 1) GCCcore/10.2.0   4) GCC/10.2.0       7) libxml2/2.9.10     10) libevent/2.1.12   13) PMIx/3.1.5       16) FFTW/3.3.8
 2) zlib/1.2.11      5) numactl/2.0.13   8) libpciaccess/0.16  11) UCX/1.9.0         14) OpenMPI/4.0.5    17) ScaLAPACK/2.1.0
 3) binutils/2.35    6) XZ/5.2.5         9) hwloc/2.2.0        12) libfabric/1.11.0  15) OpenBLAS/0.3.12  18) foss/2020b

My simulation also relies on two external libraries, Trilinos and VTK, both of which I ended up compiling myself instead of loading as modules. I don't know if this helps.
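In case it matters, this is roughly how I sanity-check that the binary and my self-built libraries resolve to the OpenMPI from the module rather than some stray system MPI (standard diagnostic commands, nothing specific to my code):

```shell
# With the toolchain loaded, inspect which MPI/UCX libraries the binary links against
module load foss/2020b
ldd ./iasTissueSimulation | grep -i -E 'mpi|ucx'

# Confirm the compiler wrapper and MPI runtime come from the module
which mpicxx
ompi_info --version   # should report the module's OpenMPI, i.e. 4.0.5
```

If `ldd` showed a library path outside the module tree, that could explain a mismatch between the runtime and what Trilinos/VTK were built against.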


You said it crashes in a reproducible way: can you make it crash on a couple of identified nodes, using the smallest possible number of nodes?
As the nodes aren't homogeneous, we should then be able to see whether this software crashes on specific hardware.
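A minimal way to pin the job to identified nodes in Slurm (the node names below are placeholders; pick real ones from `sinfo -N`):

```shell
#SBATCH --ntasks 42
#SBATCH --cpus-per-task 2
# Force the job onto two specific nodes to test for hardware-dependent crashes
#SBATCH --nodes=2
#SBATCH --nodelist=node001,node002   # placeholder names; substitute nodes from `sinfo -N`
```

Repeating the run while varying `--nodelist` across hardware generations should show whether the crash follows a particular node or CPU generation.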

Yeah, that's a good idea! I was also thinking of launching the same code on a different cluster, to see whether that changes anything.

I found a way around the bug for now, so solving it is not the highest priority, but I'll post here again if I manage to run some of these tests.