Hello,
Nodes gpu[004-006] on Yggdrasil are idle, but I cannot get an allocation if I request a GPU (everything works fine without).
I also noticed that scontrol still reports Gres=gpu:rtx:8.
Are there any problems with these nodes?
Thank you!
Hello,
I do not see rtx in the report, and job allocation works for me. Can you check on your side and send me the exact command if you still have the issue?
(yggdrasil)-[alberta@login1 ~]$ scontrol show node gpu[004-006] | egrep -i "rtx|turing"
AvailableFeatures=SILVER-4208,XEON_SILVER_4208,V9,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_TURING
ActiveFeatures=SILVER-4208,XEON_SILVER_4208,V9,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_TURING
Gres=gpu:turing:8
CfgTRES=cpu=16,mem=385499M,billing=16,gres/gpu=8,gres/gpu:turing=8
AvailableFeatures=SILVER-4208,XEON_SILVER_4208,V9,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_TURING
ActiveFeatures=SILVER-4208,XEON_SILVER_4208,V9,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_TURING
Gres=gpu:turing:6
CfgTRES=cpu=16,mem=385499M,billing=16,gres/gpu=6,gres/gpu:turing=6
AvailableFeatures=SILVER-4208,XEON_SILVER_4208,V9,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_TURING
ActiveFeatures=SILVER-4208,XEON_SILVER_4208,V9,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_TURING
Gres=gpu:turing:4
CfgTRES=cpu=16,mem=385499M,billing=16,gres/gpu=4,gres/gpu:turing=4
(yggdrasil)-[alberta@login1 ~]$ srun --partition=shared-gpu --gpus=turing:1 hostname
gpu005.yggdrasil
Hello,
Today I don’t see the “RTX” resource anymore and I can get a GPU node.
Here is what I got yesterday:
[chindemi@login1.yggdrasil ~]$ scontrol show node gpu004
NodeName=gpu004 Arch=x86_64 CoresPerSocket=8
CPUAlloc=16 CPUTot=16 CPULoad=0.90
AvailableFeatures=SILVER-4208,XEON_SILVER_4208,V9,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_TURING
ActiveFeatures=SILVER-4208,XEON_SILVER_4208,V9,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_TURING
Gres=gpu:rtx:8
NodeAddr=gpu004 NodeHostName=gpu004 Version=21.08.2
OS=Linux 3.10.0-1160.49.1.el7.x86_64 #1 SMP Tue Nov 30 15:51:32 UTC 2021
RealMemory=385499 AllocMem=48000 FreeMem=378807 Sockets=2 Boards=1
State=ALLOCATED ThreadsPerCore=1 TmpDisk=150000 Weight=10 Owner=N/A MCS_label=N/A
Partitions=public-gpu,shared-gpu
BootTime=2022-01-21T15:32:15 SlurmdStartTime=2022-01-21T15:33:05
LastBusyTime=2022-01-24T13:47:48
CfgTRES=cpu=16,mem=385499M,billing=16,gres/gpu=8,gres/gpu:turing=8
AllocTRES=cpu=16,mem=48000M,gres/gpu=8
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Same command today:
[chindemi@login1.yggdrasil ~]$ scontrol show node gpu004
NodeName=gpu004 Arch=x86_64 CoresPerSocket=8
CPUAlloc=1 CPUTot=16 CPULoad=1.62
AvailableFeatures=SILVER-4208,XEON_SILVER_4208,V9,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_TURING
ActiveFeatures=SILVER-4208,XEON_SILVER_4208,V9,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_TURING
Gres=gpu:turing:8
NodeAddr=gpu004 NodeHostName=gpu004 Version=21.08.2
OS=Linux 3.10.0-1160.49.1.el7.x86_64 #1 SMP Tue Nov 30 15:51:32 UTC 2021
RealMemory=385499 AllocMem=385499 FreeMem=374573 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=150000 Weight=10 Owner=N/A MCS_label=N/A
Partitions=public-gpu,shared-gpu
BootTime=2022-01-21T15:32:16 SlurmdStartTime=2022-01-21T15:33:05
LastBusyTime=2022-01-25T11:15:15
CfgTRES=cpu=16,mem=385499M,billing=16,gres/gpu=8,gres/gpu:turing=8
AllocTRES=cpu=1,mem=385499M,gres/gpu=1,gres/gpu:turing=1
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
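
For anyone hitting the same symptom: the only difference between the two outputs that concerns GPUs is that yesterday the Gres line said gpu:rtx:8 while CfgTRES already said gres/gpu:turing=8. Whether that mismatch is what blocked the allocation is an assumption on my part, it is not confirmed in this thread. Below is a minimal Python sketch that flags this kind of mismatch in pasted `scontrol show node` text. The function names `gres_types` and `gres_mismatch` and the two sample strings are made up for illustration; the script only parses text and does not call Slurm.

```python
import re

# Sample lines copied from the two outputs in this thread (trimmed to the
# two fields that matter here).
YESTERDAY = """\
Gres=gpu:rtx:8
CfgTRES=cpu=16,mem=385499M,billing=16,gres/gpu=8,gres/gpu:turing=8
"""
TODAY = """\
Gres=gpu:turing:8
CfgTRES=cpu=16,mem=385499M,billing=16,gres/gpu=8,gres/gpu:turing=8
"""


def gres_types(node_text):
    """Return (type named in Gres=, set of types named in CfgTRES=)."""
    gres = re.search(r"^\s*Gres=gpu:([^:\s(]+):\d+", node_text, re.MULTILINE)
    tres = re.search(r"^\s*CfgTRES=(\S+)", node_text, re.MULTILINE)
    tres_types = set()
    if tres:
        # Typed entries look like "gres/gpu:turing=8"; the untyped
        # "gres/gpu=8" entry is deliberately not matched.
        tres_types = set(re.findall(r"gres/gpu:([^=,]+)=", tres.group(1)))
    return (gres.group(1) if gres else None), tres_types


def gres_mismatch(node_text):
    """True when Gres= names a GPU type that CfgTRES= does not list."""
    gres_type, tres_types = gres_types(node_text)
    return gres_type is not None and bool(tres_types) and gres_type not in tres_types


if __name__ == "__main__":
    for label, text in (("yesterday", YESTERDAY), ("today", TODAY)):
        print(label, gres_types(text), "mismatch:", gres_mismatch(text))
```

To use it on a live node, the output of `scontrol show node gpu004` could be saved to a file and its contents passed to `gres_mismatch`.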