ReqNodeNotAvail - normal behavior?

Dear community,

I have submitted several jobs on the shared-gpu partition. When running squeue, under the NODELIST(REASON) column I get the following:
(ReqNodeNotAvail, UnavailableNodes:gpu[002-004,008-010])
Usually I don’t get this; it just shows (Priority) and the job waits until it can run.

I checked the other jobs under squeue and nobody is using these nodes on the shared-gpu partition at the moment (though other nodes are in use).

Is this normal behavior? I think I ran into this message once before, but after some time it just went away and the jobs started running.
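
For completeness, this is roughly how I look at the reason column (the exact format string is just my habit; %R prints the reason for pending jobs):

squeue -u $USER -p shared-gpu-EL7 -o "%.10i %.15j %.2t %R"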

p.s.: I tried cancelling and resubmitting the jobs; the same thing happens. Here is the script I submit with sbatch:

#!/bin/sh

#SBATCH --cpus-per-task=1
#SBATCH --job-name=v3primal
#SBATCH --ntasks=1
#SBATCH --time=11:59:00
#SBATCH --output=slurm-%J.out
#SBATCH --gres=gpu:titan:1
#SBATCH --constraint="V5|V6"
#SBATCH --partition=shared-gpu-EL7

module load fosscuda/2019b TensorFlow/2.0.0-Python-3.7.4 SciPy-bundle/2019.10-Python-3.7.4

srun python run.py
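
One thing that may be relevant: the --constraint="V5|V6" line means the job can only run on nodes carrying the V5 or V6 feature, so if all of those nodes happen to be down, ReqNodeNotAvail seems expected. This is how I would list which nodes carry which features (standard sinfo format options; %f is the feature list, %T the node state):

sinfo -p shared-gpu-EL7 -N -o "%N %f %T"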

update: the reason under NODELIST(REASON) has now changed to (Priority), but the jobs are not running yet. So I guess there won’t be a problem and they will start soon. Still, I don’t understand what the difference is between waiting in the queue with Priority and with ReqNodeNotAvail.
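
In case it is useful to anyone else hitting this, the per-job reason can also be read directly (JOBID being a placeholder for the actual job ID):

scontrol show Job=JOBID | grep -E 'JobState|Reason'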

Hi Tamas,

I think part of the problem is that gpu002-gpu004 are in DRAIN/DRAINING state. My understanding is that when all of the nodes your job could use are down, as opposed to being occupied by other jobs, you get the reason you saw previously. So I would assume some of the other nodes you requested are available again, which is why the reason is now Priority.

However, the three nodes above are still down and have been for quite some time (I was experiencing the same behaviour as you, and was waiting to see if they came back online overnight).
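
For what it is worth, this is how I spotted that (standard sinfo options; -t drain filters for drained/draining nodes and %E prints the reason the admins set):

sinfo -p shared-gpu-EL7 -t drain -o "%N %T %E"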

Johnny

Hi there,

That is right.

To be clearer:

  1. gpu[002-003] were put in DRAIN yesterday afternoon because I was expecting to add the 8x new RTX GPUs today (I am at UniDufour); unfortunately, we received a parcel, but not these GPUs.
  2. gpu[004] has been in DRAIN since last Thursday because the CPU frequency dropped after one PSU broke; we got the replacement in the parcel above and I am installing it.

FYI, the reason why, and since when, a specific node is in DRAIN is visible to everyone:

capello@login2:~$ scontrol show Node=gpu002 | grep Reason
   Reason=change-GPU [root@2020-06-09T14:53:22]
capello@login2:~$ scontrol show Node=gpu003 | grep Reason
   Reason=change-GPU [root@2020-06-09T14:53:25]
capello@login2:~$ scontrol show Node=gpu004 | grep Reason
   Reason=issues-1298 [root@2020-06-04T17:47:06]
capello@login2:~$ 
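
The same information for all down/drained nodes at once is available via:

sinfo -R    # lists REASON, USER, TIMESTAMP and NODELIST for each unavailable node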

Thx, bye,
Luca

Hi there,

  • gpu002 has been back in production since Thursday 2020-06-11 afternoon, now with 6x TITAN X.
  • gpu003 has been renamed to gpu012 since it now has 8x RTX GPUs, and it has been back in production since Thursday 2020-06-11 evening.

gpu004 is back in production since yesterday evening as well.

Thx, bye,
Luca
