gpu032/gpu033 down?

Both of these nodes show as alloc in the shared-gpu partition, but when I run squeue -w gpu032, no jobs appear.

Are they taken by private-partition jobs? If so, would it be possible to know how long those jobs have left to run? I need a GPU with 80 GB of memory, and if these nodes are unavailable for several days I will need to find another solution.

OK, it seems they are indeed occupied by private allocations. I didn't realize that sinfo --all would show even the private partitions.
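In case anyone else trips over this: squeue takes the same flag, so the jobs that looked invisible show up once you add it (this is just the standard --all option, nothing cluster-specific):

squeue --all -w gpu032,gpu033

Without --all, squeue only lists jobs in the partitions visible to you, which is why my first query came back empty.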

You can find more information by running these commands:

(baobab)-[alberta@admin1 ~]$ sinfo -n gpu[032-033]
PARTITION              AVAIL  TIMELIMIT  NODES  STATE NODELIST
admin                     up 7-00:00:00      0    n/a 
debug-cpu*                up      15:00      0    n/a 
public-interactive-cpu    up    8:00:00      0    n/a 
public-longrun-cpu        up 14-00:00:0      0    n/a 
public-cpu                up 4-00:00:00      0    n/a 
public-short-cpu          up    1:00:00      0    n/a 
public-bigmem             up 4-00:00:00      0    n/a 
shared-cpu                up   12:00:00      0    n/a 
shared-bigmem             up   12:00:00      0    n/a 
shared-gpu                up   12:00:00      2  alloc gpu[032-033]
(baobab)-[alberta@admin1 ~]$ sacct -N gpu[032-033]
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
4342029      LM_TRAINI+ private-r+       toto         64    RUNNING      0:0 
4342029.bat+      batch                  toto         64    RUNNING      0:0 
4342029.ext+     extern                  toto         64    RUNNING      0:0 
4342030      LM_TRAINI+ private-r+       toto         64    RUNNING      0:0 
4342030.bat+      batch                  toto         64    RUNNING      0:0 
4342030.ext+     extern                  toto         64    RUNNING      0:0
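To answer the remaining-run-time question directly: squeue can report it per job. A minimal sketch using the job IDs from the sacct output above (--all is needed because those jobs run in a hidden private partition):

squeue --all -j 4342029,4342030 --Format=JobID:10,Partition:16,TimeLimit:12,TimeLeft:12

Keep in mind that TimeLeft is derived from the job's time limit, so it is an upper bound; the jobs may well finish earlier.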

Be careful: a node's status can change very quickly. One second it may appear idle, and the next it's allocated.
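If you want to catch the nodes the moment they free up, simple polling works; assuming watch is available on the login node, something like:

watch -n 60 "sinfo -n gpu[032-033]"

reruns the query every 60 seconds (the quotes keep the brackets away from shell globbing).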

Best regards :slight_smile:

For others who might hit the same issue, I created an alias for squeue that by default shows some extended formatting options, which help figure out what's going to be running on a node and for how long:

alias sqlong="squeue --all --Format=UserName:12,Partition:16,NumCPUs:6,TimeLimit:11,TimeUsed:10,StateCompact:3,Reason:15,NodeList"

which, for example on gpu029, produces:

gercek@login2:~$ sqlong -w gpu029
USER        PARTITION       CPUS  TIME_LIMIT TIME      ST REASON         NODELIST
kinakh      private-sip-gpu 16    7-00:00:00 15:22:50  R  None           gpu029
drozdova    private-sip-gpu 16    5-20:00:00 15:22:50  R  None           gpu029
drozdova    private-sip-gpu 8     5-20:00:00 15:22:50  R  None           gpu029
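Since the original question was about remaining rather than elapsed time, a variant that swaps TimeUsed for TimeLeft may be handier (sqleft is just a name I picked for this tweak):

alias sqleft="squeue --all --Format=UserName:12,Partition:16,NumCPUs:6,TimeLimit:11,TimeLeft:10,StateCompact:3,Reason:15,NodeList"

That shows, at worst, how much longer each job can occupy the node.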

Edit: I forgot to note that you should save this in your ~/.bashrc if you'd like it to be available at every login.
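For example:

echo 'alias sqlong="squeue --all --Format=UserName:12,Partition:16,NumCPUs:6,TimeLimit:11,TimeUsed:10,StateCompact:3,Reason:15,NodeList"' >> ~/.bashrc
source ~/.bashrc

The outer single quotes keep the inner double quotes intact.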