Idle GPU nodes unavailable on Yggdrasil

Hello,

Nodes gpu[004-006] on Yggdrasil are idle, but I cannot get an allocation if I request a GPU (everything works fine without one).

I also noticed that scontrol still reports Gres=gpu:rtx:8.

Are there any problems with these nodes?

Thank you!

Hello,

I cannot find rtx in the report, and job allocation works for me. Can you check on your side and send me the exact command if you still have the issue?

(yggdrasil)-[alberta@login1 ~]$ scontrol show node gpu[004-006] |egrep -i "rtx|turing"
   AvailableFeatures=SILVER-4208,XEON_SILVER_4208,V9,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_TURING
   ActiveFeatures=SILVER-4208,XEON_SILVER_4208,V9,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_TURING
   Gres=gpu:turing:8
   CfgTRES=cpu=16,mem=385499M,billing=16,gres/gpu=8,gres/gpu:turing=8
   AvailableFeatures=SILVER-4208,XEON_SILVER_4208,V9,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_TURING
   ActiveFeatures=SILVER-4208,XEON_SILVER_4208,V9,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_TURING
   Gres=gpu:turing:6
   CfgTRES=cpu=16,mem=385499M,billing=16,gres/gpu=6,gres/gpu:turing=6
   AvailableFeatures=SILVER-4208,XEON_SILVER_4208,V9,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_TURING
   ActiveFeatures=SILVER-4208,XEON_SILVER_4208,V9,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_TURING
   Gres=gpu:turing:4
   CfgTRES=cpu=16,mem=385499M,billing=16,gres/gpu=4,gres/gpu:turing=4

(yggdrasil)-[alberta@login1 ~]$ srun --partition=shared-gpu --gpus=turing:1 hostname
gpu005.yggdrasil

Hello,

Today I don’t see the “RTX” resource anymore and I can get a GPU node.

Here is what I got yesterday:

[chindemi@login1.yggdrasil ~]$ scontrol show node gpu004
NodeName=gpu004 Arch=x86_64 CoresPerSocket=8 
   CPUAlloc=16 CPUTot=16 CPULoad=0.90
   AvailableFeatures=SILVER-4208,XEON_SILVER_4208,V9,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_TURING
   ActiveFeatures=SILVER-4208,XEON_SILVER_4208,V9,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_TURING
   Gres=gpu:rtx:8
   NodeAddr=gpu004 NodeHostName=gpu004 Version=21.08.2
   OS=Linux 3.10.0-1160.49.1.el7.x86_64 #1 SMP Tue Nov 30 15:51:32 UTC 2021 
   RealMemory=385499 AllocMem=48000 FreeMem=378807 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=150000 Weight=10 Owner=N/A MCS_label=N/A
   Partitions=public-gpu,shared-gpu 
   BootTime=2022-01-21T15:32:15 SlurmdStartTime=2022-01-21T15:33:05
   LastBusyTime=2022-01-24T13:47:48
   CfgTRES=cpu=16,mem=385499M,billing=16,gres/gpu=8,gres/gpu:turing=8
   AllocTRES=cpu=16,mem=48000M,gres/gpu=8
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Same command today:

[chindemi@login1.yggdrasil ~]$ scontrol show node gpu004
NodeName=gpu004 Arch=x86_64 CoresPerSocket=8 
   CPUAlloc=1 CPUTot=16 CPULoad=1.62
   AvailableFeatures=SILVER-4208,XEON_SILVER_4208,V9,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_TURING
   ActiveFeatures=SILVER-4208,XEON_SILVER_4208,V9,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_TURING
   Gres=gpu:turing:8
   NodeAddr=gpu004 NodeHostName=gpu004 Version=21.08.2
   OS=Linux 3.10.0-1160.49.1.el7.x86_64 #1 SMP Tue Nov 30 15:51:32 UTC 2021 
   RealMemory=385499 AllocMem=385499 FreeMem=374573 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=150000 Weight=10 Owner=N/A MCS_label=N/A
   Partitions=public-gpu,shared-gpu 
   BootTime=2022-01-21T15:32:16 SlurmdStartTime=2022-01-21T15:33:05
   LastBusyTime=2022-01-25T11:15:15
   CfgTRES=cpu=16,mem=385499M,billing=16,gres/gpu=8,gres/gpu:turing=8
   AllocTRES=cpu=1,mem=385499M,gres/gpu=1,gres/gpu:turing=1
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
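
In yesterday's output the Gres line names the GPU type rtx while CfgTRES names it turing; today both say turing. In case it helps anyone else spot the same situation, here is a minimal Python sketch that flags nodes where the two type names differ. The function name gres_mismatches and the regular expressions are my own assumptions, based only on the output format pasted above, and the thread does not confirm that this mismatch is what blocked the allocations.

```python
import re

def gres_mismatches(scontrol_text):
    """Return (node, gres_type, cfgtres_type) for nodes whose GPU type names differ.

    Hypothetical helper: it assumes the `scontrol show node` layout seen in
    this thread (Gres=gpu:TYPE:N and gres/gpu:TYPE=N inside CfgTRES).
    """
    mismatches = []
    # Each node record starts with "NodeName="; split just before each one.
    for record in re.split(r"(?=NodeName=)", scontrol_text):
        name = re.search(r"NodeName=(\S+)", record)
        gres = re.search(r"\bGres=gpu:([^:\s,(]+):\d+", record)
        tres = re.search(r"CfgTRES=\S*gres/gpu:([^=\s,]+)=\d+", record)
        if name and gres and tres and gres.group(1) != tres.group(1):
            mismatches.append((name.group(1), gres.group(1), tres.group(1)))
    return mismatches

# Trimmed copy of yesterday's output for gpu004, taken from this thread.
yesterday = """NodeName=gpu004 Arch=x86_64 CoresPerSocket=8
   Gres=gpu:rtx:8
   CfgTRES=cpu=16,mem=385499M,billing=16,gres/gpu=8,gres/gpu:turing=8
"""
print(gres_mismatches(yesterday))  # [('gpu004', 'rtx', 'turing')]
```

To use it on a live system, save the output of scontrol show node gpu[004-006] to a file and pass the file contents to the function. With today's output it returns an empty list.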