Gracefully quit when job is cancelled

Jean-Francois.Burdet · October 15, 2020, 11:21am

Dear all,

My MPI applications are all configured to checkpoint and quit when some flag file (let’s call it ‘abort’) is found in the application’s working directory. Checkpointing is quick, but takes about 1-2 seconds.

I want my application to checkpoint and quit when it hits the Slurm wall-time limit.

I tried the following batch script, but it doesn’t work: the ‘abort’ file gets created, but by then all my processes have already been killed, leaving them no time to checkpoint. How can I get some grace time to quit cleanly? Should I trap another signal? If so, which one?

I’m asking because of this Slurm message, which shows up in the log when the wall-time limit is hit: “srun: Job step aborted: Waiting up to 32 seconds for job step to finish.” As I understand it, this suggests that some grace delay exists when jobs are terminated.

Thanks for your help,

JF

#!/bin/bash
#
#SBATCH -J sfdm
#SBATCH -e sfdm-error.e%j
#SBATCH -o sfdm-out.o%j
#
#
#SBATCH -p debug-EL7
#SBATCH --time=3:00
#SBATCH --ntasks=8

module purge
module load GCCcore/8.2.0 Singularity/3.4.0-Go-1.12

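term_handler()
{
        echo "function term_handler called.  Exiting"
        # create the flag file that tells the application to checkpoint and quit
        touch "$HOME/patient2/abort"
        # leave the application some time to checkpoint
        sleep 10
        exit 1
}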

# associate the function "term_handler" with the TERM signal
trap 'term_handler' TERM

srun singularity run -B $HOME/scratch:/scratch -B $HOME/patient2:/biomed/mount $HOME/scratch/biomed-pub.simg stent sfdmsim

Yann.Sagon · October 15, 2020, 12:05pm

Hello,

I think this is what you want:

--signal=[B:]<sig_num>[@<sig_time>]
              When a job is within sig_time seconds of its end time, send it the signal sig_num.
              Due to the resolution of event handling by Slurm, the signal may be sent up to 60 seconds earlier than specified.
              sig_num may either be a signal number or name (e.g. "10" or "USR1").
              sig_time must have an integer value between 0 and 65535.
              By default, no signal is sent before the job’s end time.
              If a sig_num is specified without any sig_time, the default time will be 60 seconds.
              Use the "B:" option to signal only the batch shell, none of the other processes will be signaled.
              By default all job steps will be signaled, but not the batch shell itself.
              To have the signal sent at preemption time see the preempt_send_user_signal SlurmctldParameter.
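
For the flag-file approach from the original script, the "B:" form could be used roughly as in the sketch below. It is only an illustration, not a tested job script: the worker function, the temporary directory and the self-sent signal are stand-ins so that it runs without Slurm. In a real job you would add "#SBATCH --signal=B:USR1@60", drop the simulated kill, and put the actual srun line in place of the worker. The important detail is that bash only runs a trap once the foreground command has finished, so the step has to be started in the background and waited for.

```shell
#!/bin/bash
# Sketch only: the batch shell traps USR1 and creates the flag file.
# Real job: add "#SBATCH --signal=B:USR1@60" and replace worker with "srun ... &".

workdir=$(mktemp -d)             # stand-in for the application's working directory
abort_file="$workdir/abort"

usr1_handler() {
    echo "USR1 received, creating $abort_file"
    touch "$abort_file"
}
trap usr1_handler USR1

# Stand-in for the MPI application: polls for the flag file, then "checkpoints".
worker() {
    while [ ! -e "$abort_file" ]; do sleep 1; done
    echo "checkpoint done" > "$workdir/checkpoint"
}

worker &                         # background, so the trap can run while it is alive
worker_pid=$!

# Simulates Slurm signalling the batch shell shortly before the time limit.
( sleep 2; kill -USR1 $$ ) &

# 'wait' returns early when a trapped signal arrives, so loop until the
# worker has really exited.
while kill -0 "$worker_pid" 2>/dev/null; do
    wait "$worker_pid"
done
```

With the step in the foreground, as in the original script, the handler would only run after srun had returned, which matches the behaviour described in the first post.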

Jean-Francois.Burdet · October 27, 2020, 2:21pm

Ok, so after a few tries and a few holidays, I have something working.

I made my code catch SIGUSR1 and checkpoint when it receives it. Trying to intercept the signal from the sbatch bash script was too cumbersome.
Then I used the sbatch/srun option “--signal=SIGUSR1” so that my processes are sent SIGUSR1 one minute before the wall-time limit.
I also had to take care of forwarding the SIGUSR1 signal to child processes in my Singularity entry point bash script.
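
The forwarding step could look roughly like the sketch below. This is not the entry.sh from the project linked in this post, just a minimal illustration of the idea: the app function, the log file and the self-sent signal are stand-ins so that the sketch runs on its own. In a real entry point the child would be the actual application started with "&".

```shell
#!/bin/bash
# Sketch only: an entry point that forwards USR1 to its child process.

log=$(mktemp)                    # stand-in for the application's output

# Stand-in for the real application: "checkpoints" when it receives USR1.
app() {
    trap 'echo "checkpoint written" >> "$log"; exit 0' USR1
    while true; do sleep 1; done
}

app &                            # start the child in the background
child=$!

# Pass the signal on to the child instead of handling it here.
forward_usr1() {
    kill -USR1 "$child" 2>/dev/null
}
trap forward_usr1 USR1

# Simulates Slurm delivering USR1 to the entry point.
( sleep 2; kill -USR1 $$ ) &

# 'wait' is interrupted by the trapped signal, so keep waiting until the
# child has finished its checkpoint and exited.
while kill -0 "$child" 2>/dev/null; do
    wait "$child"
done
```

If the entry point has nothing to do after launching the application, starting it with "exec" instead is a simpler option: the shell is replaced by the application, which then receives the signal directly.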

I updated my GitLab toy project to leave some documentation for the community:

. GitLab project: https://gitlab.com/jfburdet/mpi-sandbox
. Sample implementation of catching USR1: https://gitlab.com/jfburdet/mpi-sandbox/-/blob/master/mpi_hello_world.c
. Sample Singularity entry point with signal forwarding: https://gitlab.com/jfburdet/mpi-sandbox/-/blob/master/entry.sh