I have questions about priority and resource availability. Concretely, my jobs don’t start and I don’t know why.
My personal story:
Yesterday I used the cluster (GPU nodes of the shared-gpu partition, because my private partition wasn’t “idle”) for approximately 4-6 h in total (if I remember correctly).
Today I wanted to run 18 jobs of 30 min each on Ampere GPUs with the double-precision constraint. But after a few hours none of these jobs is running, even though gpu020 of “my” private partition was idle, as were gpu[026-028] of shared-gpu.
When I type squeue --me the reasons are Priority and Resources.
So I don’t know whether this is a bug or what the “Resources” reason actually means.
Also I have questions about the priority:
Does it take into account the time requested, or only the time actually used?
Does it take into account the FLOPS of the GPUs? Or does a given amount of time on a P100 have the same impact on my priority as the same time on an A100?
What if a job is cancelled before execution? And during execution?
Sorry for all these questions, but this “priority” thing is a bit mysterious to me, and I don’t understand why my jobs sometimes start immediately and sometimes stay pending for a long time.
This is not an official answer or anything, but I remember having similar questions before. If you want, you can check out this post: questions about pending jobs
There are some commands you can use to check more precisely how “free” a partition is and what might be limiting your job. One possible cause, for instance, is requesting too much memory.
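As a rough illustration of the kind of check I mean, here is a small Python sketch that counts the pending reasons reported by squeue. The format fields (%i job ID, %P partition, %T state, %r reason) are standard squeue ones, but the helper names and the sample output are made up for this example, so treat it as a sketch rather than a ready-made tool. As far as I understand Slurm, “Resources” means the job is waiting for the requested resources to become free, and “Priority” means other jobs with higher priority are ahead of it in the partition.

```python
import subprocess
from collections import Counter

# Standard squeue format fields: %i = job ID, %P = partition, %T = state, %r = reason.
SQUEUE_CMD = ["squeue", "--me", "--noheader", "--format=%i|%P|%T|%r"]


def fetch_squeue_output():
    """Run squeue on a login node and return its raw text (not called below)."""
    return subprocess.run(SQUEUE_CMD, capture_output=True, text=True, check=True).stdout


def parse_squeue(text):
    """Turn 'id|partition|state|reason' lines into a list of dicts."""
    jobs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        job_id, partition, state, reason = line.split("|", 3)
        jobs.append({"id": job_id, "partition": partition, "state": state, "reason": reason})
    return jobs


def pending_reasons(jobs):
    """Count the reasons reported for jobs that are still pending."""
    return Counter(job["reason"] for job in jobs if job["state"] == "PENDING")


# Hypothetical sample output (made-up job IDs) so the sketch runs without Slurm.
SAMPLE = """\
1001|shared-gpu|PENDING|Resources
1002|shared-gpu|PENDING|Priority
1003|shared-gpu|PENDING|Priority
1004|private-kruse-gpu|RUNNING|None
"""

jobs = parse_squeue(SAMPLE)
for reason, count in pending_reasons(jobs).most_common():
    print(f"{reason}: {count}")
```

On the cluster you would pass fetch_squeue_output() to parse_squeue instead of SAMPLE. For a single job, scontrol show job followed by the job ID gives more detail on what was requested, and sprio shows the priority factors of pending jobs.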
Hi,
Thank you !
In my case it is a bit different because I only need GPU.
It seems I can’t start a job on the private-kruse-gpu partition.
I submitted a 30 min job 14 h ago and it still hasn’t started…
The two nodes were free (most of the time).
I used the pestat -p command from your previous post. It reports the state as idle, 0 CPUs in use, and a CPU load shown in red: 0.75* for gpu020 and 0.88* for gpu031.
It would be good if the negative impact on priority for the P100 could be set 4-8 times smaller than for the A100, because the P100s are much slower… That is why I don’t use the P100s even though I could. I think the P100 should be almost “free” (~1/10) compared to the A100, otherwise people will not use it (at least I will not).
I have absolutely no idea if this kind of priority weighting is possible.
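To make the idea concrete, here is a minimal Python sketch of what I have in mind. The weights are hypothetical (I picked 1/8 for the P100, inside the 4-8x range above) and are not the cluster’s real configuration. Slurm does have a partition option called TRESBillingWeights for weighting resources in billed usage, but whether it is used here, and with which values, is something only the admins can say.

```python
# Hypothetical per-GPU-type weights, NOT the cluster's real configuration.
# The P100 weight is 1/8 of the A100 weight, within the 4-8x range suggested above.
GPU_WEIGHTS = {"a100": 1.0, "p100": 0.125}


def billed_gpu_hours(usage, weights=GPU_WEIGHTS):
    """Return weighted GPU-hours for a list of (gpu_type, n_gpus, hours) records."""
    return sum(weights[gpu_type] * n_gpus * hours for gpu_type, n_gpus, hours in usage)


# Example workload: 18 jobs of 30 minutes, one GPU each.
on_a100 = [("a100", 1, 0.5)] * 18
on_p100 = [("p100", 1, 0.5)] * 18

print(billed_gpu_hours(on_a100))  # 9.0
print(billed_gpu_hours(on_p100))  # 1.125
```

With weights like these, the same 9 GPU-hours would count eight times less against my usage on P100s than on A100s, which would make the slower cards worth using for short jobs.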