[Suggestion] Quick overview of GPU usage for nodes and partitions

Hi,
Would it be possible to have a command-line tool or web interface that quickly shows how many GPUs are free on a given node and partition?
Something similar to the salloc command, which is not very useful for GPUs because it will indicate mixed usage even if all the GPUs of the node are in use.
In addition, it would be nice to have some extra information for the person launching the job, such as the maximum GPU memory usage (as given by nvidia-smi).

Hello,

Maybe this tool (pestat) could achieve what you are looking for.

Example output:

[sagon@master pestat] $ ./pestat -w gpu[001-015] -G
Select only nodes in hostlist=gpu[001-015]
GRES (Generic Resource) is printed after each jobid
Hostname       Partition     Node Num_CPU  CPUload  Memsize  Freemem  GRES/   Joblist
                            State Use/Tot              (MB)     (MB)  node    JobId User GRES/job ...
  gpu002      shared-gpu      mix   6  12    2.82*   258215   242688  gpu:titan:3 18880364 ramapur0 gpu:1 18880343 ramapur0 gpu:1 18880342 ramapur0 gpu:1  
  gpu003      shared-gpu      mix   6  12    2.76*   258215   243408  gpu:titan:3 18880365 ramapur0 gpu:1 18880344 ramapur0 gpu:1 18880341 ramapur0 gpu:1  
  gpu004      shared-gpu      mix   8  20    4.05*   128941   112504  gpu:pascal:6 18880354 ramapur0 gpu:1 18880355 ramapur0 gpu:1 18880356 ramapur0 gpu:1 18880357 ramapur0 gpu:1  
  gpu005      shared-gpu      mix   8  20    3.90*   128941   111722  gpu:pascal:5 18880358 ramapur0 gpu:1 18880359 ramapur0 gpu:1 18880360 ramapur0 gpu:1 18880361 ramapur0 gpu:1  
  gpu006      shared-gpu      mix   9  20    6.40*   128941   112404  gpu:pascal:8 18880363 ramapur0 gpu:1 18880346 ramapur0 gpu:1 18880347 ramapur0 gpu:1  
  gpu007      shared-gpu      mix   8  20    3.83*   258215   240136  gpu:pascal:4 18880350 ramapur0 gpu:1 18880351 ramapur0 gpu:1 18880352 ramapur0 gpu:1 18880353 ramapur0 gpu:1  
  gpu008  shared-gpu-EL7      mix   6  20    2.79*   256000   243154  gpu:titan:8  
  gpu009  shared-gpu-EL7    alloc  20  20   19.09*   256000   250173  gpu:titan:8 18826767 sosnowsp gpu:4  
  gpu010  shared-gpu-EL7    alloc  20  20   19.27*   256000   250081  gpu:titan:8 18826767 sosnowsp gpu:4
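
If you only need the number of free GPUs per node, the output above can be post-processed. Here is a minimal sketch in Python, assuming the column layout shown in the example (GRES/node in the ninth column, per-job GRES fields of the form gpu:N or gpu:type:N after it). The function name free_gpus is hypothetical and not part of pestat, and other GRES formats would need extra handling.

```python
def free_gpus(pestat_output):
    """Return {hostname: (total, used, free)} parsed from `pestat -G` text.

    Assumption: rows look like the example output, i.e. the 9th
    whitespace-separated column is the node GRES (e.g. gpu:titan:3) and
    every later field starting with "gpu:" is the GRES of one job.
    """
    result = {}
    for line in pestat_output.splitlines():
        fields = line.split()
        # Skip the prompt, informational lines and the two header lines.
        if len(fields) < 9 or not fields[8].startswith("gpu:"):
            continue
        # The GPU count is the last colon-separated component.
        total = int(fields[8].rsplit(":", 1)[1])
        used = sum(
            int(f.rsplit(":", 1)[1])
            for f in fields[9:]
            if f.startswith("gpu:")
        )
        result[fields[0]] = (total, used, total - used)
    return result
```

To use it, pass the text printed by pestat -G to free_gpus(), for example after capturing it with subprocess.run(..., capture_output=True, text=True). With the example above, gpu002 would come out with no free GPU, even though its state is "mix".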

Yeah, this looks very much like what I had in mind: it shows who is using GPUs and how many are in use.
I think adding that tool to the cluster would be nice.

Thanks in advance.

We will probably install it “one day” if it’s useful to others as well. In the meantime, feel free to install it to your home directory.

Best

Hi there,

FWIW, pestat has been available on the login* nodes since Apr 2020 (cf. [howto] check overall partition usage - #3 by Luca.Capello).

Thx, bye,
Luca