Hi,
Would it be possible to have a command-line tool or web interface that quickly shows how many GPUs are free on a given node and partition?
Something similar to the salloc command, which is not very useful for GPUs because it reports mixed usage even when all the GPUs of a node are in use.
In addition, if it's possible to give the person launching the job some extra information, such as the maximum GPU memory usage (as reported by nvidia-smi), that would be nice.
Hello,
maybe this tool (pestat) could do what you are looking for.
Example output:
[sagon@master pestat] $ ./pestat -w gpu[001-015] -G
Select only nodes in hostlist=gpu[001-015]
GRES (Generic Resource) is printed after each jobid
Hostname  Partition       Node   Num_CPU  CPUload  Memsize  Freemem  GRES/         Joblist
                          State  Use/Tot              (MB)     (MB)  node          JobId User GRES/job ...
gpu002    shared-gpu      mix      6  12    2.82*   258215   242688  gpu:titan:3   18880364 ramapur0 gpu:1  18880343 ramapur0 gpu:1  18880342 ramapur0 gpu:1
gpu003    shared-gpu      mix      6  12    2.76*   258215   243408  gpu:titan:3   18880365 ramapur0 gpu:1  18880344 ramapur0 gpu:1  18880341 ramapur0 gpu:1
gpu004    shared-gpu      mix      8  20    4.05*   128941   112504  gpu:pascal:6  18880354 ramapur0 gpu:1  18880355 ramapur0 gpu:1  18880356 ramapur0 gpu:1  18880357 ramapur0 gpu:1
gpu005    shared-gpu      mix      8  20    3.90*   128941   111722  gpu:pascal:5  18880358 ramapur0 gpu:1  18880359 ramapur0 gpu:1  18880360 ramapur0 gpu:1  18880361 ramapur0 gpu:1
gpu006    shared-gpu      mix      9  20    6.40*   128941   112404  gpu:pascal:8  18880363 ramapur0 gpu:1  18880346 ramapur0 gpu:1  18880347 ramapur0 gpu:1
gpu007    shared-gpu      mix      8  20    3.83*   258215   240136  gpu:pascal:4  18880350 ramapur0 gpu:1  18880351 ramapur0 gpu:1  18880352 ramapur0 gpu:1  18880353 ramapur0 gpu:1
gpu008    shared-gpu-EL7  mix      6  20    2.79*   256000   243154  gpu:titan:8
gpu009    shared-gpu-EL7  alloc   20  20   19.09*   256000   250173  gpu:titan:8   18826767 sosnowsp gpu:4
gpu010    shared-gpu-EL7  alloc   20  20   19.27*   256000   250081  gpu:titan:8   18826767 sosnowsp gpu:4
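If you want the number of free GPUs directly, it can be derived from this output by subtracting the per-job GRES counts from the per-node GRES count. Below is a rough Python sketch of that idea, not part of pestat itself. The function name free_gpus is made up for illustration, and the sketch assumes that the node GRES always looks like gpu:<type>:<count>, that each job is printed as the three fields JobId, User and GRES/job, and that the GRES/job value counts GPUs used on that node.

```python
import re

# Per-node GRES entry, e.g. "gpu:titan:3" (assumed format: gpu:<type>:<count>).
NODE_GRES = re.compile(r"^gpu:[A-Za-z0-9_]+:(\d+)$")
# Per-job GRES entry, e.g. "gpu:1" or "gpu:titan:1" (type is optional).
JOB_GRES = re.compile(r"^gpu(?::[A-Za-z0-9_]+)?:(\d+)$")


def free_gpus(line):
    """Return (hostname, total, used, free) for one pestat -G data line.

    Returns None for lines without a node GRES field (headers, messages).
    """
    fields = line.split()
    for index, field in enumerate(fields):
        match = NODE_GRES.match(field)
        if match:
            total = int(match.group(1))
            break
    else:
        return None

    # Everything after the node GRES is assumed to be a sequence of
    # (JobId, User, GRES/job) triples; every third field is the job GRES.
    job_fields = fields[index + 1:]
    used = 0
    for gres in job_fields[2::3]:
        job_match = JOB_GRES.match(gres)
        if job_match:
            used += int(job_match.group(1))
    return fields[0], total, used, total - used


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): ./pestat -G | python3 free_gpus.py
    for raw_line in sys.stdin:
        result = free_gpus(raw_line)
        if result is not None:
            host, total, used, free = result
            print(f"{host}: {free} of {total} GPUs free")
```

With the lines above this would report gpu002 as having 0 of 3 GPUs free and gpu008 as having 8 of 8 free, even though both nodes are shown in the "mix" state.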
Yeah, this looks very much like what I had in mind: it shows who is using the GPUs and how many are in use.
I think adding that tool to the cluster would be nice.
Thanks in advance.
We will probably install it “one day” if it’s useful to others as well. In the meantime, feel free to install it in your home directory.
Best
Hi there,
FWIW pestat has been available on the login* nodes since Apr 2020 (cf. [howto] check overall partition usage - #3 by Luca.Capello).
Thx, bye,
Luca