Running MPI script with sbatch

Hello HPC team,

I am currently trying to run a script with MPI on the debug-cpu partition of Yggdrasil, using 2 nodes with 36 cores each. I am facing some problems which I hope someone can help me solve.

The script is located at /srv/beegfs/scratch/shares/astro/posydon/harmrad/ and works with the newly installed module intel/2021a. When I run it, the sbatch call

sbatch -n $NTOT -N $NODE --ntasks-per-node=$CORENUM --cpus-per-task=1 -p $partition -t $T -J $Job --exclusive --mail-type=END,FAIL --constraint="V9" ./harmrad.sub $partition $NCPUX1 $NCPUX2 $NCPUX3 $restart $CORENUM $dir

configured with

SLURM_NTASKS          : 72

is supposed to run one simulation with MPI on the two nodes, with a total of 72 cores. However, what actually happens is that the script runs 36 MPI tasks with one core on each node. The MPI library is not the problem, as I can run the tasks on single cores. I also know that the script should work fine, as it is adapted from a collaborator who uses it without this issue. The only difference is that this collaborator runs it on a private node. So we suspect that the issue might arise from the fact that the nodes on Yggdrasil are shared between users, though even after adding the Slurm option --exclusive the code does not seem to run correctly.
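To illustrate, the allocation can be checked from inside the job with a snippet like the following (a hypothetical check, not part of the actual harmrad.sub; these are standard Slurm output environment variables):

```shell
#!/bin/bash
# Hypothetical check, not part of harmrad.sub: print what Slurm actually
# allocated to this job, to compare against the requested sbatch options.
# Outside a Slurm job these variables are unset, hence the "unset" fallback.
echo "SLURM_NTASKS          : ${SLURM_NTASKS:-unset}"
echo "SLURM_JOB_NUM_NODES   : ${SLURM_JOB_NUM_NODES:-unset}"
echo "SLURM_NTASKS_PER_NODE : ${SLURM_NTASKS_PER_NODE:-unset}"
echo "SLURM_CPUS_PER_TASK   : ${SLURM_CPUS_PER_TASK:-unset}"
```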

Here is the output of htop while the script is running.

And here is a screenshot from my collaborator running the same script on his institution's cluster.

Thank you for the help!


Hi there,

First, please always provide the Slurm Job number for easy tracking.

Second, your sbatch call is missing the srun command, as described in the UNIGE internal documentation (cf. hpc:slurm [eResearch Doc]) and in the upstream Slurm Intel-MPI documentation (cf. Slurm Workload Manager - MPI Users Guide).
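For example, a minimal batch script along these lines would launch the MPI ranks through srun (a sketch only: the real harmrad.sub is not shown in this thread, and the binary name and arguments are placeholders):

```shell
#!/bin/bash
# Sketch of a batch script; "harmrad" and its arguments are placeholders.
module load intel/2021a

# Running the binary directly (./harmrad ...) starts a single process and
# ignores the rest of the allocation. srun instead starts one MPI rank per
# Slurm task, across all the nodes allocated to the job:
srun ./harmrad "$@"
```

With srun, the 72 tasks requested at submission time are actually spawned as 72 MPI ranks spread over both nodes.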

Does your collaborator use the same sbatch call? In other words, what exactly have you adapted from your collaborator's original script?

Indeed, when Slurm allocates resources to a job, those resources are not shared between users; the --exclusive option only ensures that the remaining resources on the nodes you were assigned will not be given to other users.

The above does not mean that your job can use all the resources of the assigned nodes: you must explicitly request them.
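For instance, explicitly requesting both full nodes would look like this (a sketch reusing the values visible in this thread, with the shell variables expanded):

```shell
# Sketch: explicitly request 2 full 36-core nodes for 72 MPI ranks
# (values taken from the thread: NTOT=72, NODE=2, CORENUM=36).
sbatch --ntasks=72 --nodes=2 --ntasks-per-node=36 --cpus-per-task=1 \
       -p debug-cpu --exclusive ./harmrad.sub
```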

NB, the job above was run by user fragkos, not by yourself, and there is still no Slurm job number.

EDIT 2021-08-05 17:10, after private discussions with @Simone.Bavera, the following statements are wrong, my fault, sorry!

Now, here is what I would change in your sbatch call (cf. Slurm Workload Manager - sbatch):

  • -n $NTOT => --ntasks, i.e. the total number of MPI workers; in your case this must be 2
  • -N $NODE => --nodes, either you specify it as 2 or, even better, you let Slurm calculate it from the --ntasks and --cpus-per-task options
  • --ntasks-per-node=$CORENUM => either you specify it as 1 or, even better, you do not specify it at all, since --ntasks takes precedence
  • --cpus-per-task=1 => this should be 36 instead

Thx, bye,