This is a quick post explaining what happens when you submit a job from within an sbatch script. I'm posting it because I received a question by email on this topic.
Let's say we have a master script, master.sh, that calls a slave script, slave.sh.
The master script does almost nothing: it only launches the slave script(s).
master.sh =>
#!/bin/sh
#SBATCH --cpus-per-task=2
# No srun in front of it if slave.sh itself calls srun!
./slave.sh
slave.sh =>
#!/bin/sh
#SBATCH --cpus-per-task=4
srun echo $(hostname)
We submit the master script using sbatch:
[sagon@master sruninsidesbatch] $ sbatch master.sh
Submitted batch job 49500150
[sagon@master sruninsidesbatch] $ ls -la slurm-49500150.out
-rw-r--r-- 1 sagon unige 1535 Aug 20 10:13 slurm-49500150.out
In this case, the slave script runs as a job step of the master job. Once the step finishes, the master job finishes as well and releases its resources. The sbatch pragma #SBATCH --cpus-per-task=4 in slave.sh is ignored, because we aren't submitting the slave script with sbatch.
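One way to see this from inside the slave script is to print the allocation it inherits. The sketch below assumes the standard SLURM environment variables SLURM_JOB_ID and SLURM_CPUS_PER_TASK (the latter is only set when --cpus-per-task was requested); the describe_allocation helper is a hypothetical name used for illustration.

```sh
#!/bin/sh
# Hypothetical helper: report the allocation this script is running in.
# Falls back to placeholders when run outside of a SLURM job.
describe_allocation() {
    echo "job=${SLURM_JOB_ID:-none} cpus_per_task=${SLURM_CPUS_PER_TASK:-unset}"
}

describe_allocation
```

Called from slave.sh when it is started by master.sh, this should report the master's job ID and the master's 2 CPUs per task, not the 4 from the slave's own pragma.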
A job step is executed inside the resources allocated to the master job. A job step cannot request more resources than the job was allocated.
For example, we try to use 4 CPUs in a job that requested only 2 CPUs.
slave.sh =>
#!/bin/sh
srun -c 4 echo $(hostname)
[sagon@master sruninsidesbatch] $ sbatch master.sh
Submitted batch job 49500168
[sagon@master sruninsidesbatch] $ cat slurm-49500168.out
srun: Job step's --cpus-per-task value exceeds that of job (4 > 2). Job step may never run.
srun: error: Unable to create step for job 49500168: More processors requested than permitted
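To avoid this error, the slave script can cap its request at what the job actually has. This is only a minimal sketch: it assumes SLURM_CPUS_PER_TASK is set in the job environment (SLURM sets it when --cpus-per-task was given), and clamp_cpus is a hypothetical helper, not a SLURM command.

```sh
#!/bin/sh
# Hypothetical helper: cap a requested CPU count at the job's allocation.
# Assumption: SLURM_CPUS_PER_TASK holds the job's --cpus-per-task value;
# if it is unset (e.g. outside a job) we fall back to 1.
clamp_cpus() {
    requested=$1
    allocated=${SLURM_CPUS_PER_TASK:-1}
    if [ "$requested" -gt "$allocated" ]; then
        echo "$allocated"
    else
        echo "$requested"
    fi
}

# Only start a job step when we are really inside a SLURM job.
if [ -n "${SLURM_JOB_ID:-}" ] && command -v srun >/dev/null 2>&1; then
    srun -c "$(clamp_cpus 4)" hostname
fi
```

Inside the master job above, which requested 2 CPUs, clamp_cpus 4 yields 2, so the step stays within the allocation.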
We can also call sbatch from within the master script:
master.sh =>
#!/bin/sh
#SBATCH --cpus-per-task=2
sbatch slave.sh
slave.sh =>
#!/bin/sh
#SBATCH --cpus-per-task=4
srun -c 4 echo $(hostname)
[sagon@master sruninsidesbatch] $ sbatch master.sh
Submitted batch job 49500173
[sagon@master sruninsidesbatch] $ cat slurm-49500173.out
Submitted batch job 49500174
As you can see, this results in two separate jobs with two distinct resource allocations. The second job is allocated 4 CPUs, and the first one 2 CPUs.
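If you need a master process that controls some slave jobs, it is better not to run the master itself as a SLURM job: run it outside SLURM (for example, from the login node) and let it submit the slave jobs with sbatch.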
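As a rough illustration of a master that lives outside SLURM, here is a sketch of a driver script meant to be run directly on a submit host. It relies on the sbatch options --parsable and --dependency=afterok; the file name driver.sh, the job_id_only helper, and the choice of chaining two slave jobs are assumptions made for the example.

```sh
#!/bin/sh
# driver.sh (hypothetical name): run directly on a submit host, not via sbatch.

# `sbatch --parsable` prints "jobid" or "jobid;cluster"; keep only the job ID.
job_id_only() {
    echo "${1%%;*}"
}

if command -v sbatch >/dev/null 2>&1; then
    # Submit the first slave job and remember its ID.
    first=$(job_id_only "$(sbatch --parsable slave.sh)")
    # Submit a second slave job that starts only if the first one succeeds.
    sbatch --dependency=afterok:"$first" slave.sh
else
    echo "sbatch not found: this does not look like a SLURM submit host" >&2
fi
```

Each slave job then gets its own allocation from the #SBATCH pragmas in slave.sh, and the driver itself holds no resources.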