[tutorial] How to automatically restart a Slurm job after the time limit

Yann.Sagon · October 22, 2019, 9:57am

Because Baobab enforces a maximum wall time (for example, 12 hours on the shared partitions), it can be useful to restart a job several times. For this, we can rely on Slurm job dependencies.

Here is an example with a fictitious job whose task is to write lines to a file. We don’t know in advance how many times the job needs to run, but we decide that it is complete once three lines have been written to the result file. We therefore submit the job many times, 10 times in our case.

The trick is to use the Slurm option --dependency=singleton, which ensures that only one instance of the job runs at a time. Slurm treats jobs as identical when they have the same job name and owner.

Once the job is finished, we cancel all the remaining jobs.

Example sbatch script:

#!/bin/sh

#SBATCH --time=15:00
#SBATCH --partition=debug-EL7
#SBATCH --job-name=test-restart
#SBATCH --dependency=singleton


OUTPUT=res

# if the result file doesn't exist, we create it.
if [ ! -f "$OUTPUT" ]
then
   touch "$OUTPUT"
fi

# we do the job: write one line to the result file!
srun echo "I'm job $SLURM_JOB_ID" >> "$OUTPUT"

# we check if the full job is finished
# we say that once the result file contains at least three lines, it's the case.
NB_LINES=$(wc -l < "$OUTPUT")
if [ "$NB_LINES" -ge 3 ]
then
   echo "Job finished, cancel the remaining."
   # this cancels every job of ours with this name, including the current one
   scancel --jobname "$SLURM_JOB_NAME"
   exit
fi

We can launch the job n times (10 times in this example):

[sagon@login2 restart] $ for i in {1..10} ; do sbatch run.sh $i; done
Submitted batch job 21584187
Submitted batch job 21584188
Submitted batch job 21584189
Submitted batch job 21584190
Submitted batch job 21584191
Submitted batch job 21584192
Submitted batch job 21584194
Submitted batch job 21584198
Submitted batch job 21584200
Submitted batch job 21584201
[sagon@login2 restart] $ ls -la
total 3
drwxr-xr-x  2 sagon unige   5 Oct 22 11:19 .
drwxr-xr-x 17 sagon unige  20 Oct 22 09:00 ..
-rw-r--r--  1 sagon unige  51 Oct 22 11:19 res
-rw-r--r--  1 sagon unige 595 Oct 22 11:15 run.sh
-rw-r--r--  1 sagon unige   0 Oct 22 11:19 slurm-21584187.out
-rw-r--r--  1 sagon unige   0 Oct 22 11:19 slurm-21584188.out
-rw-r--r--  1 sagon unige  36 Oct 22 11:19 slurm-21584189.out
[sagon@login2 restart] $ cat res 
I'm job 21584187
I'm job 21584188
I'm job 21584189
[sagon@login2 restart]

You can see that the job ran three times. For this to work with a real workload, it is your responsibility to make your job use its checkpoint/restart file.
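To make the checkpoint/restart idea concrete, here is a minimal sketch of a resumable workload in plain shell. The file name checkpoint.txt, the step counter and TOTAL_STEPS are assumptions made up for illustration; a real application would save and reload its own state instead.

```shell
#!/bin/sh
# Sketch only: CHECKPOINT and TOTAL_STEPS are hypothetical names, not part of the tutorial.
CHECKPOINT=checkpoint.txt
TOTAL_STEPS=5

# Resume from the last completed step if a checkpoint exists, otherwise start from 0.
if [ -f "$CHECKPOINT" ]; then
    STEP=$(cat "$CHECKPOINT")
else
    STEP=0
fi

while [ "$STEP" -lt "$TOTAL_STEPS" ]; do
    STEP=$((STEP + 1))
    echo "working on step $STEP"    # stand-in for the real computation

    # Write to a temporary file and rename it, so that a job killed
    # mid-write does not leave a truncated checkpoint behind.
    echo "$STEP" > "$CHECKPOINT.tmp" && mv "$CHECKPOINT.tmp" "$CHECKPOINT"
done

echo "all $TOTAL_STEPS steps done"
```

If a run is killed partway through, the next instance picks up at the step after the last one recorded. Once all steps are done, the script could cancel the remaining queued jobs with scancel, as in the example script above.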

Bonus: if your software can write a checkpoint on demand, you can use the Slurm flag --signal= to have a custom signal sent shortly before the time limit is reached, that is, before the job is actually killed.
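Here is a rough sketch of how --signal can be combined with a shell trap. It assumes the form --signal=B:USR1@60, which asks Slurm to send SIGUSR1 to the batch shell only, about 60 seconds before the time limit. The marker file name and the sleep stand-in for the workload are hypothetical. Outside of Slurm the script sends itself the signal so that you can try it on a login node.

```shell
#!/bin/sh
#SBATCH --time=15:00
#SBATCH --job-name=test-restart
#SBATCH --dependency=singleton
# Send SIGUSR1 to the batch shell only (B:), about 60 seconds before the time limit.
#SBATCH --signal=B:USR1@60

MARKER=checkpoint.requested    # hypothetical marker file, for illustration only

# Handler executed when the batch shell receives SIGUSR1.
on_usr1() {
    echo "time limit approaching" > "$MARKER"
    # A real job would ask its software to write a checkpoint here.
}
trap on_usr1 USR1

# Run the workload in the background: the shell runs traps between commands,
# so a foreground command would delay the handler until it ends.
sleep 1 &                      # stand-in for: srun ./my_program &
WORK_PID=$!

# Outside Slurm no signal will arrive, so simulate it to try the sketch.
if [ -z "${SLURM_JOB_ID:-}" ]; then
    kill -USR1 $$
fi

# wait returns early when a trapped signal arrives, so keep waiting
# until the workload has really exited.
while kill -0 "$WORK_PID" 2>/dev/null; do
    wait "$WORK_PID"
done
```

Without the B: prefix, Slurm signals the job steps rather than the batch shell, which is the better choice if your software handles the signal itself.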

Pablo.Strasser · October 22, 2019, 10:45am

Thanks, very useful. This was on my to-do list of things to add to my code.