Responsible GPU Usage

Hi HPC team,

I wanted to ask about best practices for requesting many GPUs for large job arrays.
A few users on the cluster request, and are granted, very large amounts of resources, blocking many others from using the cluster for days at a time.

I understand the logic that an idle GPU is a wasted GPU, but perhaps we could strike a better balance than what is currently happening.

My main issue is that the allocated GPU nodes are themselves barely being utilized.

As an example, all of the GPUs on gpu017 are currently in use and have been for the past 10 hours.

Using nvidia-smi we can see that only around 15% of the compute is actually being used and only about 300 MB of these 24 GB cards is allocated. The jobs on gpu017 are a few of the 400 in a job array by a single user who often occupies the cluster for weeks at a time. I understand that a priority queue is at play, but if you miss the window between one of these jobs finishing and the next one starting, it is another ~12 hour wait.
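
For anyone who wants to reproduce this kind of check, here is a minimal sketch that shells out to nvidia-smi from Python on the node itself (the query fields are standard nvidia-smi options; how you get onto the node is up to you):

```python
import subprocess

# Ask nvidia-smi for per-GPU compute utilization and memory usage.
# Run this on the GPU node itself (e.g. inside the job or an interactive shell).
result = subprocess.run(
    [
        "nvidia-smi",
        "--query-gpu=index,utilization.gpu,memory.used,memory.total",
        "--format=csv,noheader,nounits",
    ],
    capture_output=True,
    text=True,
    check=True,
)

for line in result.stdout.strip().splitlines():
    index, util, mem_used, mem_total = (field.strip() for field in line.split(","))
    print(f"GPU {index}: {util}% compute, {mem_used} MiB / {mem_total} MiB memory in use")
```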

If the concern is that resources are being wasted, can we please take a look at this kind of cluster usage?

Hi,

Thanks for sharing your thoughts with us.

Well, I'm checking gpu017 right now, and you have been using all the GPUs on it for 7 days. What do you suggest we do with those jobs? :innocent:

It is true that your jobs are using the GPUs more intensively than what you showed in the screenshot. Still, there is no point in preventing a user from using a GPU just because the job isn't using it at maximum capacity. What would be good is to ensure that a user with low GPU resource needs ends up on low-end GPUs, but this isn't easy to set up, as there is currently no option to request GPU memory in SLURM.

By the way, a job array isn't treated as a single job: each array task has its own lifecycle and priority. This means there is no 12-hour slot that you might miss. If you have enough priority, your job will start as soon as one of the jobs in the array finishes.
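
If you want to see this for yourself, here is a small sketch (the user name is a placeholder) that lists every array task as its own job, each with its own state and priority:

```python
import subprocess

# "--array" tells squeue to print one line per array task instead of a
# collapsed "1234_[0-399]" summary line. The user name is a placeholder.
user = "someuser"
result = subprocess.run(
    [
        "squeue",
        "--array",
        "--user", user,
        "--format=%i %T %Q %R",  # job id, state, priority, reason/nodelist
    ],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)
```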

About wasting resources: this is true for CPU and memory as well. I checked your jobs: you request 12 CPUs and 16 GB of RAM per task but actually use 1 CPU and ~4 GB. That said, this isn't a real issue here, since other users are prevented from using the compute node anyway while all of its GPUs are in use.
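
You can check this yourself once a job has finished, for example with a quick sacct query like the sketch below (assuming accounting is enabled on the cluster; the job ID is a placeholder):

```python
import subprocess

# Compare requested vs. actually used resources for a finished job.
# Requires Slurm accounting; the job ID is a placeholder.
job_id = "1234567"
result = subprocess.run(
    [
        "sacct",
        "-j", job_id,
        "--format=JobID,ReqCPUS,ReqMem,TotalCPU,Elapsed,MaxRSS",
        "--parsable2",
    ],
    capture_output=True,
    text=True,
    check=True,
)
# MaxRSS vs. ReqMem and TotalCPU vs. (Elapsed * ReqCPUS) show how much of the
# request was actually used.
print(result.stdout)
```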

edit: I asked SchedMD, the company that develops Slurm, whether there is a best practice for GPU allocation based on the job type.