I am currently trying to run a script with MPI on the debug-cpu partition of Yggdrasil, using 2 nodes with 36 cores each. I am facing a problem which I hope someone can help me solve.
The script is located at /srv/beegfs/scratch/shares/astro/posydon/harmrad/submit.sh and works with the newly installed module intel/2021a. When I run it, the sbatch call
is supposed to run one simulation with MPI on the two nodes with a total of 72 cores. However, what actually happens is that the script runs 36 MPI tasks with one core on each node. The MPI library is not the problem, as I can run the tasks on single cores. I also know that the script should work, since it is adapted from a collaborator who uses it without this issue; the only difference is that this collaborator runs it on a private node. We therefore suspect that the issue might arise from the fact that the nodes on Yggdrasil are shared between users, but even after adding the Slurm option --exclusive the code does not seem to run correctly.
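As a side note, a quick way to check what Slurm actually granted the job (independent of the details of submit.sh, which I am not reproducing here) is to print the standard Slurm environment variables from inside the job script. The expected values in the comments below just reflect the 2 nodes and 72 tasks I am asking for:

```
# Print what Slurm actually allocated (standard Slurm environment variables):
echo "Nodes allocated: $SLURM_JOB_NUM_NODES"   # expected: 2
echo "Total tasks:     $SLURM_NTASKS"          # expected: 72
echo "Node list:       $SLURM_JOB_NODELIST"
# One line per launched task, grouped by node, to see how tasks are spread:
srun hostname | sort | uniq -c
```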
Here is the output of htop while the script is running.
Does your collaborator use the same sbatch call? In other words, what exactly did you adapt from your collaborator's original script?
Indeed, when Slurm allocates resources to a job, those resources are not shared between users; the --exclusive option therefore only ensures that the remaining resources on the nodes you were assigned will not be given to other users.
The above does not mean that your job can use all the resources of the assigned nodes: you have to request them explicitly.
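For example, a minimal sbatch header for this kind of job could look like the sketch below. The partition, node/core counts and module come from your description; the time limit and the executable name ./harmrad are only placeholders for whatever submit.sh actually runs:

```
#!/bin/sh
#SBATCH --partition=debug-cpu     # partition from the question
#SBATCH --nodes=2                 # explicitly request 2 nodes
#SBATCH --ntasks-per-node=36      # 36 MPI tasks per node -> 72 tasks in total
#SBATCH --cpus-per-task=1         # one core per MPI task
#SBATCH --time=00:15:00           # placeholder time limit

module load intel/2021a           # module mentioned in the question

# srun launches one MPI task per allocated slot; ./harmrad is a placeholder
# for the actual executable started by submit.sh.
srun ./harmrad
```

The exact header of course depends on what submit.sh already contains, so please compare it with the directives above.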
NB: the job shown above was run by user fragkos, not by you, and there is still no Slurm job number.
EDIT 2021-08-05 17:10: after private discussions with @Simone.Bavera, the following statements are wrong; my fault, sorry!