"Fair" GPU use in Bamboo

Hi, I have been using Bamboo for a couple of weeks and the short queues have been quite a relief. I understand it is only a matter of time before more users migrate their research to Bamboo and the queues fill up. Still, I want to raise the issue of fair GPU usage, especially since it is quite easy for a single user (as seems to be mostly the case at this very moment) to occupy all GPU nodes, effectively blocking other people from doing any work.

My particular use case: I am still developing and debugging my GPU code, but I am unable to run a short (5 min) test because other people are using all the GPU nodes and have 40 or so jobs pending (as was the case yesterday). Shouldn’t there be a safeguard against this? Would it be possible to at least ensure that one debug GPU remains available, so that the maximum wait time for a single test is 15 min?
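For concreteness, the kind of test I mean looks roughly like the sketch below. The file name `gpu_smoke_test.sbatch` and the commented-out `./my_gpu_test` binary are placeholders of mine, and I am assuming the cluster accepts `--gpus=1` (some sites prefer `--gres=gpu:1`).

```bash
#!/bin/bash
# Write a minimal 5-minute GPU test job script.
# The file name is a placeholder, not something defined by the cluster.
cat > gpu_smoke_test.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=gpu-smoke-test
# Partition name as used in this thread.
#SBATCH --partition=debug-gpu
# One GPU and one CPU core are enough for a short test.
#SBATCH --gpus=1
#SBATCH --cpus-per-task=1
# Hard limit of 5 minutes.
#SBATCH --time=00:05:00

# Show which GPU was allocated.
nvidia-smi
# Placeholder for the real test binary:
# srun ./my_gpu_test
EOF

# Submit only on a machine where Slurm is installed.
if command -v sbatch >/dev/null 2>&1; then
    sbatch gpu_smoke_test.sbatch
fi
```

A job this small should in principle fit into any free slot, which is why a single reserved debug GPU would be enough for it.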

I am aware there are fair-use policies and a limit on queued jobs (10k). A limit of 10k pending jobs may be reasonable for CPUs, but not really for GPUs, since it is far greater than the total number of GPUs available.

Thanks in advance for your help.

Dear @daniel.forerosanchez, thanks for your feedback.

In fact, gpu001 is available from two partitions, debug-gpu and shared-gpu, which is obviously a problem. As GPUs are quite expensive, it is not a good idea to dedicate a whole GPU node to debugging. What we’ll try to do in the next maintenance is to split the node in two, i.e. reserve 4 of the 8 GPUs for debugging purposes.

In the meantime, you can use debug-gpu on Yggdrasil, which is dedicated solely to debugging.

Best

Hi, I am still unable to get any time on the debug GPU on Bamboo. I hoped it would be easier since the last maintenance, but that doesn’t seem to be the case. Was the node split in the end? Which other solutions are available? Moving between clusters this often is not really viable and makes it very hard to keep my work under version control.

Thanks,

Hello,

Sorry, I am using a lot of resources on Bamboo right now; I will kill my jobs.
About your job, are 32 CPUs per GPU necessary? Lowering that number would increase your job priority, I assume.
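As a rough sketch of what I mean, something like the following requests fewer cores per GPU and then lets you check where the job stands. The value of 4 cores and the script name `my_job.sbatch` are just assumptions for illustration, and whether this actually changes the priority depends on how the scheduler is configured.

```bash
#!/bin/bash
# Placeholder job script name; replace with your own.
JOB_SCRIPT=my_job.sbatch
# Assumed value for illustration: 4 CPU cores per GPU instead of 32.
CPUS_PER_GPU=4

# Build the submission command; the partition name is the one from this thread.
SUBMIT_CMD=(sbatch --partition=shared-gpu --gpus=1 --cpus-per-gpu="$CPUS_PER_GPU" "$JOB_SCRIPT")
echo "${SUBMIT_CMD[@]}"

# Submit and inspect the queue only where Slurm and the job script exist.
if command -v sbatch >/dev/null 2>&1 && [ -f "$JOB_SCRIPT" ]; then
    "${SUBMIT_CMD[@]}"
    # Estimated start times of your pending jobs.
    squeue -u "$USER" --start
    # Priority factors of your pending jobs.
    sprio -u "$USER"
fi
```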

Best regards

Thanks. Asking for fewer cores seems to make little difference in any case. I am still unable to get onto the debug node, but it seems to be down or something now.

Dear @daniel.forerosanchez, sorry, it is still on our long to-do list :face_exhaling: