Hi, I have a question (out of curiosity) about pending jobs, especially when Slurm shows (Resources) in the NODELIST(REASON) column.
If I understood correctly, it means there are not enough resources available right now for the job to start. I'm wondering if there is a way to get more information about which resource specifically is limiting the job. Is it memory? The total number of CPUs requested? Or too many CPUs per task?
It would be great if there were a way to know more precisely why a job is pending, so we can react and maybe adjust what we ask for. Do you know if such information can be found?
The reason is indeed not enough resources, i.e., you need to wait for another job to finish before your job can start. If the reason is Priority, it means other jobs are ahead of you in the queue.
So the Resources reason may refer to anything accountable (memory, GPUs, CPUs, licenses), I guess. Unless you asked for a lot of memory per CPU or for GPUs, the reason is almost always related to the number of CPUs you requested. If the cluster is full, your job will be pending even if you asked for a single CPU.
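To check the reason per job, something along these lines should work (the format string is just one example; %R prints the pending reason for queued jobs):

squeue -u $USER -t PENDING -o "%.10i %.9P %.12j %.4t %.20R"

For a single job, scontrol show job <jobid> also prints a Reason= field together with the requested resources (CPUs, memory, nodes), which helps to see what the scheduler is actually waiting for.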
If you ask, for example, for a job with 20 CPUs per task, this forces your job onto a compute node with at least 20 CPUs, ruling out all the nodes with 12 or 16 CPUs. If your job can run with 12 or 16 CPUs, it is better to ask for 12, as your job will start faster. The fewer resources you ask for, the faster your job will start.
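As an illustration, a minimal job script header (the memory value is only a placeholder, adjust to your actual needs):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=12   # fits on 12-core and larger nodes; 20 would exclude the 12- and 16-core nodes
#SBATCH --mem-per-cpu=4000M  # placeholder value

Requesting 12 instead of 20 CPUs per task keeps the smaller nodes in play, so the scheduler has more candidate nodes and the job typically starts sooner.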
To follow up on this, if I may. When I want more detailed info on a specific partition, for example shared-bigmem, I run for example sinfo -Nel --partition=shared-bigmem and get this as output:
Is there another way of asking this that shows how many nodes and how much memory are still available? This becomes more relevant when you go to, for example, a 4:8:1 configuration (like node 186) and the node is listed as "mixed".
Hi Yann, this is really nice, but not node specific. I don't want to assign the job to a specific node, but it helps to know how many nodes are available before submitting multiple jobs.
I was thinking more along the lines of what Quentin suggested above, but even a bit more extensive. I hope this is helpful to others as well (a sinfo command that can produce columns like these is sketched below the table):
NODELIST   AVAIL  STATE  S:C:T   CPUS(A/I/O/T)  CPU_LOAD  MEMORY  FREE_MEM
node056    up     mix    2:8:1   10/6/0/16      46.00     256000  158255
node154    up     idle   2:8:1   0/16/0/16      0.02      256000  237367
node186    up     mix    4:8:1   16/16/0/32     15.48     768000  495311
node203    up     mix    2:14:1  16/12/0/28     15.90     512000  414726
node218    up     idle   2:4:1   0/8/0/8        0.01      512000  419831
node219    up     idle   2:4:1   0/8/0/8        0.06      512000  493150
node245    up     mix    2:10:1  1/19/0/20      20.11     256000  192944
node246    up     idle   2:10:1  0/20/0/20      0.01      256000  224643
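For reference, a command along these lines can produce those columns (my guess at the exact invocation; the format specifiers are documented in man sinfo, and the partition name is just this example):

sinfo -N -p shared-bigmem -o "%.12N %.6a %.8t %.8z %.14C %.9O %.8m %.9e"

Here %C prints allocated/idle/other/total CPUs per node and %e the free memory, so for mixed nodes like node186 you can see immediately how much is still available.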
Edit: looking more at the spart output, this is also really useful for predicting how long it will take before your job starts. Very insightful discussion, this.