[howto] check overall partition usage

Hi there,

we have just added spart to the login node, a new tool to check the overall partition usage/description (cf. https://github.com/mercanca/spart ).

Here is the default output:

capello@login2:~$ spart
            QUEUE STA   FREE  TOTAL RESORC  OTHER   FREE  TOTAL ||   MAX    DEFAULT    MAXIMUM  CORES   NODE 
        PARTITION TUS  CORES  CORES PENDNG PENDNG  NODES  NODES || NODES   JOB-TIME   JOB-TIME  /NODE MEM-GB 
        debug-EL7   *     64     64      0      8      4      4 ||     2    15 mins    15 mins     16     64 
         mono-EL7        131    784   3256     24      0     49 ||     -     1 mins     4 days     16     64 
     parallel-EL7        131    784    932    114      0     49 ||     -     1 mins     4 days     16     64 
       shared-EL7        324   3584   1235      9      4    225 ||     -     1 mins    12 hour     12     40 
  mono-shared-EL7        324   3584   1112      0      4    225 ||     -     1 mins    12 hour     12     40 
       bigmem-EL7          9     16     60      0      0      1 ||     1     1 mins     4 days     16    256 
shared-bigmem-EL7         39    212    271      0      0     10 ||     -     1 mins    12 hour      8    256 
   shared-gpu-EL7        182    228     16      0      4     10 ||     -     1 mins    12 hour     12    128 
        admin-EL7   g     16     16      0      0      1      1 ||     -     1 mins     7 days     16     64 

                  YOUR YOUR YOUR YOUR 
                   RUN PEND OTHR TOTL 
   COMMON VALUES:    0    0    0    0 
capello@login2:~$ 

The -h option gives you plenty of explanations, plus further options to tweak the reported information.
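
For instance, a couple of quick sketches (nothing spart-specific below apart from the -h flag mentioned above, the rest is plain shell filtering of its output; "shared" simply matches some of the partition names in the table):

spart -h               # full list of options
spart | grep shared    # plain shell filter: keep only the "shared*" partitions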

Thx, bye,
Luca

Hi there,

And we have just added pestat as well (cf. Slurm_tools/pestat at master · OleHolmNielsen/Slurm_tools · GitHub), from the upstream “Slurm related software” section (cf. Slurm Workload Manager - Download Slurm).

pestat is another tool to check cluster usage, this time focusing on single nodes. @Yann.Sagon and @Pablo.Strasser have already talked about it on this forum (cf. [Suggestion] Quick overview of gpu usage for nodes and partitions - #2 by Yann.Sagon and [Suggestion] Reserve one core per gpu on gpu nodes, respectively).

The default output includes all ~250 nodes; here are a few lines:

capello@login2:~$ pestat | head -n 20
Hostname       Partition     Node Num_CPU  CPUload  Memsize  Freemem  Joblist
                            State Use/Tot              (MB)     (MB)  JobId User ...
  gpu002   dpnc-gpu-EL7+      mix   3  12    3.01    257820   232204  32558797 salamda0 32585454 drozd 32585438 drozd  
  gpu003   dpnc-gpu-EL7+      mix   3  12    3.02    257820   234949  32571724 salamda0 32587282 krivachy 32585455 drozd  
  gpu004  shared-gpu-EL7      mix   6  20    6.01    128820   121712   
  gpu005  shared-gpu-EL7      mix   4  20    3.98    128820   102896  32585432 drozd 32585429 drozd 32585428 drozd 32585427 drozd  
  gpu006  shared-gpu-EL7      mix   8  20    8.02    128820   120871   
  gpu007  shared-gpu-EL7    alloc  20  20   13.60*   257840   230270  32585443 drozd 32585442 drozd 32585440 drozd 32585439 drozd  
  gpu008  shared-gpu-EL7      mix   8  20    9.74*   256000   225026  32577812 krivachy 32577822 krivachy 32577832 krivachy 32577842 krivachy 32577852 krivachy 32584204 krivachy 32585453 drozd 32585449 drozd  
  gpu009  shared-gpu-EL7      mix   8  20    8.00    256000   206983  32585450 drozd 32585436 drozd 32585437 drozd 32585435 drozd 32585433 drozd 32585434 drozd 32585431 drozd 32585430 drozd  
  gpu010  shared-gpu-EL7      mix   8  20    8.01    256000   210982  32585451 drozd 32585452 drozd 32585448 drozd 32585447 drozd 32585444 drozd 32585445 drozd 32585446 drozd 32585441 drozd  
  gpu011  shared-gpu-EL7     idle   0  64    0.01    256000   253407   
 node001      debug-EL7*     idle   0  16    0.01     64000    47962   
 node002      debug-EL7*     idle   0  16    0.01     64000    58047   
 node003      debug-EL7*     idle   0  16    0.01     64000    52885   
 node004      debug-EL7*     idle   0  16    0.01     64000    55427   
 node005 mono-shared-EL7      mix   3  16    3.40     64000    15341  32587732 blanchme 32593378 drozd 32593379 drozd  
 node007 mono-shared-EL7      mix  13  16   19.80*    64000    40268  32592912 weijiah7 32584737 drozd 32584738 drozd 32584739 drozd 32592853 drozd 32592765 drozd 32593395 blanchme 32593193 blanchme  
 node008       mono-EL7+      mix   8  16    5.01*    64000    56183  32578161 proix 32589019 blanchme 32589020 blanchme 32589021 blanchme  
 node009       mono-EL7+    alloc  16  16   16.02     64000    46580  32560799 cantoni 32591518 blanchme 32591519 blanchme 32591520 blanchme 32591521 blanchme 32591522 blanchme 32591523 blanchme 32584795 drozd 32584796 drozd 32584797 drozd 32584798 drozd 32584799 drozd 32584740 drozd 32584741 drozd 32584742 drozd 32584743 drozd  
capello@login2:~$ 

Again, the -h option gives you plenty of explanations, plus further options to tweak the reported information.
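
A few combinations I find handy, from memory (please double-check the flags below against -h, they are my recollection rather than a guarantee):

pestat -p shared-gpu-EL7   # restrict the report to the nodes of one partition
pestat -u $USER            # only the nodes where your own jobs are running
pestat -f                  # only the nodes flagged with * (unexpected CPU load or memory values)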

Thx, bye,
Luca

Hi, I like using spart to check usage. I just tried to use it on Yggdrasil and I get Segmentation fault (core dumped). Any idea why?

Dear Genevieve,

This is a known issue with the latest revision of Slurm. We have already sent the information to the provider and we are waiting for a fix in a future version.

Details are logged on this page: https://github.com/mercanca/spart/issues/17
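
In the meantime, the standard Slurm sinfo command still gives a rough per-partition overview (a workaround sketch only, clearly less detailed than spart):

sinfo -s               # one line per partition: availability, time limit and A/I/O/T node counts
sinfo -o '%P %C %l'    # per partition: allocated/idle/other/total CPUs plus the time limit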

Best regards,
