Dear all,
Today, I launched some jobs on the partitions mono-EL7 and shared-EL7.
When I do “squeue -p shared-EL7”, I see only one job running, when there are 400 cores available. Is that normal?
Less strikingly, but still, I see that on mono-EL7 there are only ~100 nodes in use. It seems that not all resources are used/available. Is that right?
Thank you for your help!
Emeline
PS: When will we be able to use Yggdrasil?
Hi,
A more useful command to view and verify your jobs is squeue -u USERNAME; for you, that would be squeue -u bolmonte. This lists all your jobs together with the nodes they are scheduled on and the reason they have not started. In addition, the --start flag shows the scheduled start time of each job. This is a worst-case estimate: your job may start sooner if other jobs finish early.
Yann has given some explanation of the scheduler (Job priority explanation and Good usage of resources on HPC cluster ).
In summary, Slurm assigns a priority to every job and schedules the jobs on nodes. A simplified list of the rules follows (read the documentation for the full rules):
- Once a job is scheduled on a node (--start gives a time), the job may only start earlier, never later (except in case of an unexpected shutdown of the cluster).
- When a node has holes (periods during which resources are free), jobs that fit into a hole are scheduled into it in priority order. In no case can these jobs delay previously scheduled jobs: to be considered, a job must fit into the hole.
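To make the "holes" rule more concrete, here is a toy sketch in Python. It is not Slurm's actual algorithm, only a simplified illustration under assumptions of my own: a single node, time cut into equal slots, and jobs given as hypothetical (name, cores, duration) tuples already sorted by priority. Each job is placed at the earliest slot where it fits without moving anything placed before it.

```python
def earliest_start(usage, total_cores, cores, duration):
    """Return the first slot where `cores` are free for `duration` consecutive slots."""
    t = 0
    while True:
        if all(usage.get(t + k, 0) + cores <= total_cores for k in range(duration)):
            return t
        t += 1


def schedule(jobs, total_cores):
    """Place jobs in priority order; earlier placements are never moved."""
    usage = {}  # time slot -> cores already reserved in that slot
    plan = {}   # job name -> planned start slot
    for name, cores, duration in jobs:
        if cores > total_cores:
            raise ValueError(f"job {name} can never fit on this node")
        start = earliest_start(usage, total_cores, cores, duration)
        for k in range(duration):
            usage[start + k] = usage.get(start + k, 0) + cores
        plan[name] = start
    return plan


# Hypothetical jobs on a 4-core node, highest priority first.
jobs = [("A", 4, 2), ("B", 3, 3), ("C", 2, 5), ("D", 1, 2)]
plan = schedule(jobs, total_cores=4)
print(plan)  # D starts at slot 2, ahead of C at slot 5
```

Job D has the lowest priority but starts before C, because it fits into the one core left free next to B and delays nobody. C does not fit into that hole, so it waits until B is done.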
There are mainly two explanations of why jobs are not scheduled:
- No node is currently available with enough free CPU, memory or GPU resources. The status of the nodes can be found with sinfo.
- None of the jobs to schedule fit in the holes, because they would conflict with another job that is already scheduled.
The best way to get your job started quickly is to request resources (CPU, memory, time and GPU) that are somewhat higher than what you need (to avoid crashes) but not much higher. A tighter request is easier to schedule: it leaves more resources for other people and also lets your job start earlier. To estimate the time needed, a short test run on the debug partition can help.
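As a rough illustration of turning a test run into a time request, here is a small Python sketch. The function name, the assumption that run time grows linearly with the workload, and the 20% default margin are all mine and not a cluster recommendation. It extrapolates from the measured test time, adds the margin, and formats the result as days-hours:minutes:seconds, one of the formats accepted by the --time option of sbatch.

```python
def slurm_time_limit(test_seconds, scale_factor, margin_percent=20):
    """Extrapolate a test run to the full workload and add a safety margin.

    test_seconds   -- measured run time of the short test, in seconds (int)
    scale_factor   -- how many times larger the real workload is (int),
                      assuming run time scales linearly (an assumption)
    margin_percent -- safety margin added on top, in percent (int)
    """
    padded = test_seconds * scale_factor * (100 + margin_percent)
    total = -(-padded // 100)  # integer ceiling division, avoids float rounding
    days, rest = divmod(total, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


# A 10-minute test on one tenth of the data, with the default 20% margin.
print(slurm_time_limit(600, 10))  # 0-02:00:00
```

If your code does not scale linearly, replace the extrapolation with whatever you measure; the point is only to request a time close to what is needed plus a margin.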
A small remark I forgot to add: squeue -p shared-EL7 is very misleading for finding the real usage of nodes. The partitions on Baobab overlap, so the same nodes can belong to several partitions. It is entirely possible to have no job running on shared-EL7 while jobs are running on parallel-EL7, mono-EL7 or mono-shared-EL7.
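To check the overlap for yourself, one option is to list nodes per partition with sinfo -N -h -o "%N %P" (one line per node and partition) and group the lines by node. Below is a minimal Python sketch that does the grouping on text you paste in. The node names and the layout in SAMPLE are made up for illustration and do not reflect the real Baobab configuration.

```python
from collections import defaultdict


def partitions_by_node(sinfo_text):
    """Group 'node partition' lines into a dict: node -> set of partitions."""
    nodes = defaultdict(set)
    for line in sinfo_text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        node, partition = fields
        # sinfo marks the default partition with a trailing '*'
        nodes[node].add(partition.rstrip("*"))
    return dict(nodes)


# Made-up sample in the shape of: sinfo -N -h -o "%N %P"
SAMPLE = """\
node001 mono-EL7
node001 shared-EL7
node002 mono-EL7
node002 parallel-EL7
node002 shared-EL7
"""

overlap = partitions_by_node(SAMPLE)
for node, parts in sorted(overlap.items()):
    print(node, sorted(parts))
```

Any node that appears with more than one partition can be busy with a job submitted to a different partition than the one you are looking at.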
Move this question to a new topic.
Thank you @Pablo.Strasser for this detailed reply!
@Emeline.Bolmont does this answer your question?
Regarding Yggdrasil, we are working on it… we will let you know through the mailing-list when it is ready for beta testing, but don’t expect too much before the next couple of weeks!
Yes, thank you!
I think it answers my question.
Ok for Yggdrasil, looking forward to using it!
Thanks for your work! Have a good summer!