I have submitted several jobs to the shared-gpu partition. When I run squeue, the NODELIST(REASON) column shows the following reason:
(ReqNodeNotAvail, UnavailableNodes:gpu[002-004,008-010])
Usually I don’t get this; it just shows (Priority) and the job waits until it can run.
I checked the other jobs under squeue and nobody is using these nodes on the shared gpu at the moment (though other nodes are being used).
Is this normal behavior? I think I already saw this message once in the past, but after some time it just went away and the jobs started running.
P.S.: I tried cancelling and resubmitting the jobs, and the same thing happens. Here is the script I use with sbatch:
Update: the reason under nodelist has now changed to (Priority), but the jobs are not running yet. So I guess there won’t be a problem and they will run soon. Still, I don’t understand the difference between waiting in the queue with Priority and waiting with ReqNodeNotAvail.
I think part of the problem is that gpu[002-004] are in DRAIN/DRAINING state. I think you get ReqNodeNotAvail when all the nodes your job could use are down or drained, as opposed to being occupied by other jobs. So I would assume some of the other nodes that you requested are now available, and thus the reason is now Priority.
However, the three nodes above are still down and have been for quite some time (I was experiencing the same behaviour as you, and was waiting to see if they came back online overnight).
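To see exactly which nodes a pending job is waiting on, the node list can be pulled out of the reason string that squeue prints. This is only a minimal sketch, assuming the reason has the same shape as the one quoted above; the function name unavailable_nodes is made up for illustration.

```shell
# Extract the node list from a squeue reason string such as
#   (ReqNodeNotAvail, UnavailableNodes:gpu[002-004,008-010])
# Reads lines on stdin and prints only the part after "UnavailableNodes:".
# Lines without that field (e.g. "(Priority)") produce no output.
unavailable_nodes() {
    sed -n 's/^.*UnavailableNodes:\([^)]*\)).*$/\1/p'
}

# Possible use on the cluster (needs Slurm, so left commented out here):
#   squeue -u "$USER" -t PENDING -h -o "%r" | unavailable_nodes | sort -u
# The resulting list can then be passed to sinfo to show state and reason:
#   sinfo -N -n 'gpu[002-004,008-010]' -o "%N %T %E"
```

If the function prints nothing for a job, its reason does not name any unavailable nodes, which is what you would expect once it has switched to (Priority).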
gpu[002-003] were put in DRAIN yesterday afternoon because I was expecting to install the 8 new RTX GPUs today (I am at UniDufour). Unfortunately, we received a parcel, but not these GPUs.
gpu[004] has been in DRAIN since last Thursday because the CPU frequency dropped after one PSU broke. We got the replacement in the parcel above and I am replacing it.
FYI, the reason a specific node is in DRAIN, and since when, is available to everyone:
capello@login2:~$ scontrol show Node=gpu002 | grep Reason
Reason=change-GPU [root@2020-06-09T14:53:22]
capello@login2:~$ scontrol show Node=gpu003 | grep Reason
Reason=change-GPU [root@2020-06-09T14:53:25]
capello@login2:~$ scontrol show Node=gpu004 | grep Reason
Reason=issues-1298 [root@2020-06-04T17:47:06]
capello@login2:~$
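If you want to check several nodes at once, the Reason line can be split into its parts with a small helper. This is a rough sketch, assuming the `Reason=text [user@timestamp]` layout shown in the output above; the function name parse_drain_reason and the `|` separator are my own choices, not anything Slurm provides. As far as I know, `sinfo -R` also lists the reasons for all down or drained nodes in one go.

```shell
# Split a "Reason=..." line from `scontrol show node` into
#   reason|user|timestamp
# Reads lines on stdin; lines without a matching Reason field are skipped.
parse_drain_reason() {
    sed -n 's/^.*Reason=\(.*\) \[\([^@]*\)@\([^]]*\)].*$/\1|\2|\3/p'
}

# Possible use on the cluster (needs Slurm, so left commented out here):
#   for n in gpu002 gpu003 gpu004; do
#       printf '%s: ' "$n"
#       scontrol show node="$n" | parse_drain_reason
#   done
```

For the gpu002 line above this should give the reason change-GPU, the user root and the timestamp 2020-06-09T14:53:22 as three fields.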