[ Solved ] Srun: error: gpu014: task 0: Segmentation fault

Dear @Yann.Sagon

I am using Gromacs-2019.4 (compiled in my $HOME).

Loaded modules: foss/2018b and CUDA/10.1.243 (I know these are a bit old, but I had to use them).

Gromacs is patched with Plumed v2.6.

I am doing some test runs to check that everything works. The first very short runs (100 MD steps) completed with no problem, but when I tried a longer one (1000000 MD steps), after many successful MD steps I got a segmentation fault on the GPU node:

[gpu014:142146:2:142172] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffe027ae508)
[gpu014:142146:1:142173] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffe027ae508)
[gpu014:142146:0:142170] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffe027ae508)
==== backtrace ====
 0 0x0000000000c4fb6b do_pairs()  ???:0
 1 0x0000000000c46fce calcBondedForces()  listed-forces.cpp:0
 2 0x0000000000016c1e gomp_thread_start()  /dev/shm/ebbuild/GCCcore/7.3.0/dummy-/gcc-7.3.0/stage3_obj/x86_64-pc-linux-gnu/libgomp/../../../libgomp/team.c:120
 3 0x0000000000007ea5 start_thread()  pthread_create.c:0
 4 0x00000000000fe8dd __clone()  ???:0
srun: error: gpu014: task 0: Segmentation fault

The system is very small (6 atoms), so there should be no system-dependent problems.

The sbatch script is:

#!/bin/env bash
#SBATCH -J test_gromacs
#SBATCH -e %j.e
#SBATCH -o %j.o
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=6
#SBATCH -p gervasio-gpu-EL7
#SBATCH -t 10:15:00
#SBATCH --gres=gpu:rtx:1
#SBATCH --mem=MaxMemPerNode
source $HOME/gromacs/bin/GMXRC
export GMX=$HOME/bin/gmx_mpi
srun $GMX mdrun -deffnm $HOME/scratch/Test_gromacs/output/vdwgrow -s $HOME/scratch/Test_gromacs/ch2cl2_vdwgrow.tpr -mp $HOME/scratch/Test_gromacs/ch2cl2.top

I can’t tell whether the problem is in how I compiled (or patched) the MD code, or whether it is SLURM and srun that cause it.

I’m not an expert in the tools you use. A segmentation fault is generally caused by an invalid memory access. Such an access can be due to a bug in your code or in one of the libraries, or to improper use of the libraries.

A few steps can help solve the problem:

  1. Does your code work on another system under the same conditions?
  2. Does it work if you turn off all parallelization?
  3. Does it work with other versions?
  4. Try to create the shortest and simplest version of your code that crashes.
  5. Check whether valgrind and/or ASan (https://en.wikipedia.org/wiki/AddressSanitizer) find a problem in your code.

I hope these steps help you.
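
For step 2, a minimal sketch of what a serial, CPU-only test run could look like for Gromacs. It assumes the `-ntomp` and `-nb` options of `gmx mdrun` as available in Gromacs 2019; the function name and the file names in the usage comment are hypothetical, and the rest of the job script would stay as it is.

```shell
#!/usr/bin/env bash
# Sketch only: print the mdrun arguments for a serial, CPU-only debug run,
# one argument per line.
#   -ntomp 1 : a single OpenMP thread per rank
#   -nb cpu  : non-bonded interactions on the CPU instead of the GPU
# serial_mdrun_args is a hypothetical helper, not part of Gromacs or SLURM.
serial_mdrun_args() {
    local tpr="$1" deffnm="$2"
    printf '%s\n' mdrun -ntomp 1 -nb cpu -s "$tpr" -deffnm "$deffnm"
}

# Possible use inside the job script (file names are placeholders):
#   mapfile -t args < <(serial_mdrun_args run.tpr debug_serial)
#   srun --ntasks=1 --cpus-per-task=1 "$GMX" "${args[@]}"
```

If the crash disappears in this configuration, that points towards the threaded or GPU code paths; if it persists, the input itself is the more likely suspect.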

Thank you very much for your kind reply. In this case I was more interested in whether SLURM could have messed something up, or whether my sbatch script was wrong somewhere, because Gromacs is a huge open-source code and dealing with bugs in it is quite annoying. So I wanted to rule out the quicker-to-fix things first.

In any case, I have seen that the same run works with different settings, so the problem must be in Gromacs or in how I have set up the system :(

Thank you very much again and have a nice day!

Could you tell us which setting you changed to make it work?
That would give us an indication of where the problem is.

Oh yes, sorry.

I am doing alchemical transformations, and to test them I am creating a dichloroethylene molecule in an empty box with PBC. If I create the charges, everything works (although rather slowly), probably because the molecule can feel its periodic images and therefore stays stable enough. If instead I create the vdW interactions (no charge), the molecule starts acting strangely and gets sort of “disintegrated”. Gromacs first gives some warnings about the molecule rotating too much, and at a certain point I get the segmentation fault on the GPU node.
This means that the problem is not in how I set things up at the SLURM level, but somewhere inside Gromacs, or in my Gromacs input file (.mdp), which doesn’t cope well with quickly moving and rotating molecules.
I still have to figure out what it is, but at least I know it is not my sbatch script or something similar.

Thanks for the information.

For all future Gromacs users: in the end I was able to understand why Gromacs acted in this strange way. I wanted to create a molecule using an alchemical transformation, so the molecule started with no interactions with the rest of the system. As a starting position I used the result of an equilibration run of the same molecule in vacuum, so both the positions and the velocities in the gro file should have been stable. But as you already know, I got segmentation faults and warnings about some crazy rotations. The problem disappeared when I removed the velocities from the gro file and let Gromacs generate new random ones.
So if in the future you get segmentation faults in Gromacs that make no sense to you, try removing the velocities from the input gro file you are using.
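
For anyone who wants to try this, here is a minimal sketch of how the velocities could be stripped. It assumes the standard fixed-width gro layout at default precision, where each atom line holds 20 characters of residue/atom fields and three 8-character position columns (44 characters in total) before the optional velocity columns. The function name and file names are hypothetical; a gro file written with non-default precision would need a different cut-off.

```shell
#!/usr/bin/env bash
# Sketch only: print a copy of a .gro file without the velocity columns.
# Assumes default-precision .gro formatting (atom lines: 44 characters
# up to and including the z coordinate).
# strip_gro_velocities is a hypothetical helper, not a Gromacs tool.
strip_gro_velocities() {
    awk '
        NR == 1 { print; next }                            # title line
        NR == 2 { natoms = $1 + 0; print; next }           # number of atoms
        NR <= natoms + 2 { print substr($0, 1, 44); next } # atom lines: keep positions only
        { print }                                          # box vectors (last line)
    ' "$1"
}

# Possible use (file names are placeholders):
#   strip_gro_velocities equil.gro > start_novel.gro
```

New velocities can then be generated when preparing the run, for example by setting `gen_vel = yes` together with a `gen_temp` in the .mdp file before calling grompp; whether that matches the exact setup used above is an assumption.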