[tutorial] How to automatically restart a Slurm job after the time limit

As there is a maximum allowed wall time on Baobab (for example 12 h on the shared partitions), it may be useful to be able to restart a job multiple times. For this, we can rely on Slurm job dependencies.

Example with a fictitious job whose task is to write lines to a file. We don't know in advance how many times we need to launch the job, but we decide that the job is finished once three lines have been written to the result file. We therefore submit the job many times, 10 times in our case.
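The "three lines means we are done" test can be tried on its own in a plain shell before putting it in the batch script (a throwaway sketch; res_demo is just a scratch file name used here for illustration):

```shell
# scratch result file standing in for the real one
OUTPUT=res_demo
printf 'line 1\nline 2\nline 3\n' > $OUTPUT

# the completion test: count the lines in the result file
NB_LINES=$(wc -l < $OUTPUT)
if [ $NB_LINES -eq 3 ]; then
    echo "job finished"
fi

# clean up the scratch file
rm $OUTPUT
```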

The trick is to use the Slurm option --dependency=singleton to ensure that only one instance of the job is running at a time. To decide whether two jobs are "the same", Slurm checks the job name and the owner.
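For instance (a command-line sketch to be run on a login node, using the same job name and run.sh script as below), submitting the script twice under the same name leaves the second job pending with reason "Dependency" until the first one has finished:

```shell
sbatch --job-name=test-restart --dependency=singleton run.sh
sbatch --job-name=test-restart --dependency=singleton run.sh

# only one of the two jobs is RUNNING; the other is PENDING (Dependency)
squeue --user=$USER --name=test-restart
```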

Once the job is finished, we cancel all the remaining jobs.

Example sbatch script:


#!/bin/sh

#SBATCH --time=15:00
#SBATCH --partition=debug-EL7
#SBATCH --job-name=test-restart
#SBATCH --dependency=singleton

# the result file, shared by every run of the job
OUTPUT=res

# if the result file doesn't exist, we create it.
if [ ! -f $OUTPUT ]; then
    touch $OUTPUT
fi

# we do the job: write one line to the result file!
srun echo "I'm job $SLURM_JOB_ID" >> $OUTPUT

# we check if the full job is finished:
# we say it's the case once the result file contains three lines.
NB_LINES=$(wc -l < $OUTPUT)
if [ $NB_LINES -eq 3 ]; then
    echo "Job finished, cancelling the remaining jobs."
    scancel --jobname $SLURM_JOB_NAME
fi

We can launch the job n times (10 times in the example):

[sagon@login2 restart] $ for i in {1..10} ; do sbatch run.sh $i; done
Submitted batch job 21584187
Submitted batch job 21584188
Submitted batch job 21584189
Submitted batch job 21584190
Submitted batch job 21584191
Submitted batch job 21584192
Submitted batch job 21584194
Submitted batch job 21584198
Submitted batch job 21584200
Submitted batch job 21584201
[sagon@login2 restart] $ ls -la
total 3
drwxr-xr-x  2 sagon unige   5 Oct 22 11:19 .
drwxr-xr-x 17 sagon unige  20 Oct 22 09:00 ..
-rw-r--r--  1 sagon unige  51 Oct 22 11:19 res
-rw-r--r--  1 sagon unige 595 Oct 22 11:15 run.sh
-rw-r--r--  1 sagon unige   0 Oct 22 11:19 slurm-21584187.out
-rw-r--r--  1 sagon unige   0 Oct 22 11:19 slurm-21584188.out
-rw-r--r--  1 sagon unige  36 Oct 22 11:19 slurm-21584189.out
[sagon@login2 restart] $ cat res 
I'm job 21584187
I'm job 21584188
I'm job 21584189
[sagon@login2 restart]

You can see that the job ran three times. It is your duty to make your job use a checkpoint/restart file, so that each restart resumes where the previous run stopped.

Bonus: if you can trigger your job to write a checkpoint when the time limit is reached, you can use the Slurm flag --signal= to send a custom signal to your software before the job is actually killed.
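A minimal sketch of this, assuming your software can checkpoint on SIGUSR1 (the checkpoint function and the sleep payload below are placeholders, not real software). The "B:" prefix asks Slurm to deliver the signal to the batch shell itself rather than to the job steps; running the payload in the background and using wait lets the shell handle the signal while the payload is still running:

```shell
#!/bin/bash
# Ask Slurm to send SIGUSR1 to the batch shell 120 seconds
# before the time limit is reached.
#SBATCH --signal=B:USR1@120

# Hypothetical checkpoint step: replace with your software's own
# save-state command.
checkpoint() {
    echo "checkpoint written" > checkpoint.log
}
trap checkpoint USR1

# Run the payload in the background so the shell stays free to
# handle the signal; "sleep 30" stands in for the real computation.
sleep 30 &
PAYLOAD=$!

# Outside Slurm, you can simulate the warning signal yourself:
( sleep 1; kill -USR1 $$ ) &

wait $PAYLOAD               # interrupted when USR1 arrives; the trap runs
kill $PAYLOAD 2>/dev/null   # stop the payload once checkpointed
cat checkpoint.log
```

When USR1 arrives, the trap writes the checkpoint and wait returns, so the script can shut down cleanly before Slurm kills the job.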


Thanks, very useful. This was on my todo list of things to add to my code.