Question about mono-EL7 and shared-EL7 partitions usage

Pablo.Strasser · August 3, 2020, 1:44pm

Hi,
A more useful command to view and verify your jobs is squeue -u USERNAME; in your case, squeue -u bolmonte. It lists all your jobs together with the node each one is scheduled on and, for jobs that have not started yet, the reason. In addition, the --start flag shows the scheduled start time of each pending job. This is a worst-case estimate: your job may start sooner if other jobs finish early.
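As a minimal sketch of the two commands above, the helper below wraps them in one shell function. The name show_my_jobs is hypothetical, not part of Slurm, and the sketch assumes the Slurm client tools are on your PATH, as they normally are on a cluster login node.

```shell
# Hypothetical helper: list a user's jobs, then their estimated start times.
# Assumes the Slurm client tools are on PATH (e.g. on a cluster login node).
show_my_jobs() {
    if ! command -v squeue >/dev/null 2>&1; then
        echo "squeue not found: run this on a cluster login node" >&2
        return 1
    fi
    # State, assigned node and, for pending jobs, the reason for waiting
    squeue -u "${1:-$USER}"
    # Expected start time of pending jobs (they may start earlier)
    squeue -u "${1:-$USER}" --start
}
```

Calling show_my_jobs bolmonte would run both commands for that user; without an argument it falls back to $USER.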

Yann has explained how the scheduler works in two other posts (Job priority explanation and Good usage of resources on HPC cluster).

In summary, Slurm assigns a priority to every job and schedules jobs onto nodes accordingly. A simplified list of the rules follows (read the documentation for the full rules):

Once a job is scheduled on a node (--start gives a time), it may only start earlier, never later (except in case of an unexpected shutdown of the cluster).
When a node has holes (periods during which some resources are free), jobs that fit into a hole are scheduled into it in priority order. Note that these jobs can never delay previously scheduled jobs: to be considered, a job must fit entirely into the hole.

There are mainly two explanations of why jobs are not scheduled:

There are no nodes currently available with enough free CPU, memory or GPU resources. The status of the nodes can be checked with sinfo.
None of the jobs waiting to be scheduled fits into the holes, because each of them would conflict with another job that is already scheduled.
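To check the first explanation, a node-by-node listing from sinfo is usually the quickest view. The sketch below is only an illustration: show_nodes is a hypothetical helper name, and shared-EL7 in the usage note is taken from the title of this thread.

```shell
# Hypothetical helper: one line per node with state, CPUs and memory.
# Optional argument: restrict the listing to a single partition.
show_nodes() {
    if ! command -v sinfo >/dev/null 2>&1; then
        echo "sinfo not found: run this on a cluster login node" >&2
        return 1
    fi
    if [ -n "${1:-}" ]; then
        # -N: one line per node, -l: long format, -p: partition filter
        sinfo -N -l -p "$1"
    else
        sinfo -N -l
    fi
}
```

For example, show_nodes shared-EL7 lists only the nodes of that partition; nodes in the idle or mixed state still have free resources.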

The best way to get your job started quickly is to request resources (CPU, memory, time and GPU) that are somewhat higher than what you actually need (to avoid crashes), but not much higher. Tight requests leave more resources for other people and also make it easier for your job to fit into a hole, so it can start earlier. To estimate the time needed, a short test run on the debug partition can help.
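One way to compare what a finished job requested with what it actually used is sacct. This is a sketch under the assumption that job accounting is enabled on the cluster; job_usage is a hypothetical helper name.

```shell
# Hypothetical helper: requested vs. used resources for a finished job.
# Assumes Slurm job accounting is enabled on the cluster.
job_usage() {
    if [ -z "${1:-}" ]; then
        echo "usage: job_usage JOBID" >&2
        return 2
    fi
    if ! command -v sacct >/dev/null 2>&1; then
        echo "sacct not found: run this on a cluster login node" >&2
        return 1
    fi
    # Elapsed vs. Timelimit shows how much of the requested time was used;
    # MaxRSS vs. ReqMem shows how much of the requested memory was used.
    sacct -j "$1" --format=JobID,State,Elapsed,Timelimit,AllocCPUS,ReqMem,MaxRSS
}
```

If Elapsed is far below Timelimit, or MaxRSS far below ReqMem, the next submission can probably ask for less.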