Getting more details about pending jobs

Hi, I have a question (out of curiosity) about pending jobs, especially when Slurm shows (Resources) in the NODELIST(REASON) column.

If I understood correctly, it means that there are not enough resources available right now for the job to start. I'm wondering if there is a way to get more information about which resource specifically is limiting the job. Is it memory? The total number of CPUs requested? Too many CPUs per task?
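
For reference, I am just looking at the last column of the plain squeue output for my own jobs:

squeue -u $USER    # the last column, NODELIST(REASON), shows (Resources) for these jobs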

It would be great if there were a way to know more precisely why a job is pending, so we can react and maybe adjust what we ask for. Do you know if such information can be found?

Thanks,
Quentin

Hi Quentin,

The reason is indeed not enough resources, i.e. you need to wait for another job to finish before your job can start. If the reason is Priority, it means that other jobs are ahead of yours in the queue.
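
If the reason is Priority and you want to see where your job stands in the queue, sprio shows the priority factors (the job ID below is just a placeholder):

sprio -j 12345678                                        # priority breakdown for one pending job
squeue -u $USER -t PENDING -o "%.10i %.9P %.10r %.10Q"   # reason and priority of your pending jobs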

So the Resources reason may refer to anything accountable (memory, GPUs, CPUs, licenses). Unless you asked for a lot of memory per CPU or for GPUs, it is almost always related to the number of CPUs you requested. If the cluster is full, your job will be pending even if you asked for a single CPU.
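
To double-check what a pending job actually requested, and the reason Slurm records for it, you can look at the job record (again a placeholder job ID):

scontrol show job 12345678 | grep -E "Reason|NumCPUs|TRES"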

If you ask, for example, for a job with 20 CPUs per task, this forces your job onto a compute node with at least 20 CPUs, ruling out all the nodes with 12 or 16 CPUs. If your job can run with 12 or 16 CPUs, it is better to ask for 12, as your job will start faster. The fewer resources you ask for, the sooner your job will start.
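
As a sketch (script and program names are made up), the difference is just the --cpus-per-task value in the batch script:

#!/bin/bash
#SBATCH --partition=shared-cpu
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=12    # 12 instead of 20, so 12- and 16-core nodes are eligible too
#SBATCH --mem-per-cpu=3000    # keep the per-CPU memory request modest as well
srun ./my_program             # hypothetical program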

More about this here:


To follow up on this, if I may: when I want more detailed info on a specific partition, for example shared-bigmem, I run for example sinfo -Nel --partition=shared-bigmem and get this as output:

Wed Jul 28 11:58:57 2021
NODELIST   NODES     PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON              
node056        1 shared-bigmem       mixed 16      2:8:1 256000   150000     30 E5-2660V none                
node154        1 shared-bigmem        idle 16      2:8:1 256000   150000     10 E5-2650V none                
node186        1 shared-bigmem   allocated 32      4:8:1 768000  4000000     30 E5-4640V none                
node203        1 shared-bigmem       mixed 28     2:14:1 512000  2500000     10 E5-2680V none                
node218        1 shared-bigmem        idle 8       2:4:1 512000   150000     10 E5-2637V none                
node219        1 shared-bigmem        idle 8       2:4:1 512000   150000     10 E5-2637V none                
node245        1 shared-bigmem        idle 20     2:10:1 256000   150000     10 E5-2630V none                
node246        1 shared-bigmem       mixed 20     2:10:1 256000   150000     10 E5-2630V none

Is there another way of asking this that shows how many nodes/how much memory are still available? This gets more relevant when you go to, for example, a 4:8:1 configuration (like node186) and it is listed as “mixed”.


Hi, I found one command that shows information like that, but unfortunately it seems it has to be called for each node separately:

scontrol show node node287

This returns interesting information about how “busy” the node is:

CPUAlloc=7 CPUTot=128 CPULoad=7.13 etc…
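
If you do want this for every node of a partition, a small loop over sinfo seems to do the trick (partition name as in the earlier examples):

# print CPUAlloc/CPUTot/CPULoad for each node of the partition
for n in $(sinfo -h -N -p shared-bigmem -o "%n"); do
    echo -n "$n: "
    scontrol show node "$n" | grep -oE "CPUAlloc=[0-9]+|CPUTot=[0-9]+|CPULoad=[0-9.]+" | tr '\n' ' '
    echo
done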

EDIT: But wait, I’ve found something better. The sinfo command can be used to display the allocated/idle nodes:

sinfo -Ne --partition=shared-cpu --format="%N %C"

This gives a list of nodes with their names and the allocated/idle/other/total number of CPUs!

Example of output for the shared-cpu partition:

NODELIST CPUS(A/I/O/T)
node005 7/9/0/16
node009 11/5/0/16
node010 6/10/0/16
node011 7/9/0/16
node012 10/6/0/16
node013 0/16/0/16
node014 5/11/0/16
node015 0/16/0/16
node016 0/16/0/16

etc...
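
As far as I can tell, if you drop the -N the same %C counters are summed up for the whole partition, which gives a quick one-line overview:

sinfo -p shared-cpu -h -o "%P %C"    # partition-wide CPUs as allocated/idle/other/total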

Something like that?

[sagon@login2 ~] $ spart
                  QUEU STA   FREE  TOTAL RESORC  OTHER   FREE  TOTAL ||   MAX    DEFAULT    MAXIMUM  CORES   NODE    QOS
              PARTITIO TUS  CORES  CORES PENDNG PENDNG  NODES  NODES || NODES   JOB-TIME   JOB-TIME  /NODE MEM-GB   NAME
             debug-cpu   *     48     64      0      0      3      4 ||     2    15 mins    15 mins     16     64      -
public-interactive-cpu         10     16      0      0      0      1 ||     -     2 mins     8 hour     16     64 interactive
    public-longrun-cpu         10     16      0      0      0      1 ||     -     2 mins    14 days     16     64 longrun
            public-cpu        443    752    877      0      0     47 ||     -     1 mins     4 days     16     64      -
         public-bigmem          6     16      0      0      0      1 ||     1     1 mins     4 days     16    256      -
            shared-cpu       2010   4532 129711     49      1    188 ||     -     1 mins    12 hour     12     40      -
         shared-bigmem         42    148     48      0      0      8 ||     -     1 mins    12 hour      8    256      -
            shared-gpu        629   1316      1      0      0     20 ||     -     1 mins    12 hour     12    128      -

                  YOUR PEND PEND YOUR
                   RUN  RES OTHR TOTL
   COMMON VALUES:    0    0    0    0

See here: hpc:slurm [eResearch Doc]

Hi Yann, this is really nice, but not node-specific. I don’t want to pin jobs to a specific node, but it helps to know how many nodes are available before submitting multiple jobs.
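
For a quick count before submitting, something like this seems to work:

sinfo -p shared-bigmem -t idle -h -o "%D"    # number of currently idle nodes in the partition
sinfo -p shared-bigmem -h -o "%P %F"         # nodes as allocated/idle/other/total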

I was thinking more along the lines of what Quentin suggested above, but even a bit more extensive. I hope this is helpful to others as well:

sinfo -Ne --partition=shared-bigmem --format="%N %.6a %.6t %.8z %.15C %.8O %.8m %.8e"

This yields:

NODELIST  AVAIL  STATE    S:C:T   CPUS(A/I/O/T) CPU_LOAD   MEMORY FREE_MEM
node056     up    mix    2:8:1       10/6/0/16    46.00   256000   158255
node154     up   idle    2:8:1       0/16/0/16     0.02   256000   237367
node186     up    mix    4:8:1      16/16/0/32    15.48   768000   495311
node203     up    mix   2:14:1      16/12/0/28    15.90   512000   414726
node218     up   idle    2:4:1         0/8/0/8     0.01   512000   419831
node219     up   idle    2:4:1         0/8/0/8     0.06   512000   493150
node245     up    mix   2:10:1       1/19/0/20    20.11   256000   192944
node246     up   idle   2:10:1       0/20/0/20     0.01   256000   224643

Edit: looking more at the spart output, this is also really useful for predicting how long it will take before your job starts. Very insightful discussion, this 🙂
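
On the same topic, Slurm itself can sometimes give an estimated start time for a pending job (only when the scheduler has computed one):

squeue --start -u $USER    # expected start time of your pending jobs, where available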


Hi there,

If you are looking for node-oriented information, please see pestat (cf. [howto] check overall partition usage - #3 by Luca.Capello and hpc:slurm [eResearch Doc]):

capello@login2:~$ pestat -p shared-bigmem
Print only nodes in partition shared-bigmem
Hostname       Partition     Node Num_CPU  CPUload  Memsize  Freemem  Joblist
                            State Use/Tot              (MB)     (MB)  JobId User ...
 node056   public-bigmem    alloc  16  16    1.06*   256000   165568  48379605 saini7
 node154   shared-bigmem     idle   0  16    0.01    256000   234650
 node186   shared-bigmem    alloc  32  32    0.17*   768000   542652
 node203   shared-bigmem     idle   0  28    0.01    512000   471081
 node218   shared-bigmem     idle   0   8    0.04    512000   417186
 node219   shared-bigmem     idle   0   8    0.01    512000   490720
 node245   shared-bigmem      mix  15  20   15.04    256000   162063  48576297 liur 48576272 liur 48576274 liur [...]
 node246   shared-bigmem     idle   0  20    0.02    256000   222061
capello@login2:~$ 

Thx, bye,
Luca


so much more elegant, haha, thanks!