[Baobab] New nodes installed: gpu[017]

Hi there,

We have installed a new node on Baobab:

  • gpu[017] (member of the shared-gpu partition)
capello@login2:~$ scontrol show Node=gpu017
NodeName=gpu017 Arch=x86_64 CoresPerSocket=64
   CPUAlloc=16 CPUTot=128 CPULoad=3.88
   AvailableFeatures=EPYC-7742,V8,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_RTX
   ActiveFeatures=EPYC-7742,V8,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_RTX
   Gres=gpu:rtx:8
   NodeAddr=gpu017 NodeHostName=gpu017 Version=20.11.3
   OS=Linux 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019
   RealMemory=512000 AllocMem=131072 FreeMem=464598 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=1500000 Weight=10 Owner=N/A MCS_label=N/A
   Partitions=shared-gpu
   BootTime=2021-03-22T13:22:39 SlurmdStartTime=2021-03-22T13:26:22
   CfgTRES=cpu=128,mem=500G,billing=128
   AllocTRES=cpu=16,mem=128G
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)

capello@login2:~$ 
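
For anyone who wants to target this node, here is a minimal job-script sketch (the partition, GRES and feature names are taken from the scontrol output above; the time limit is just an example, adjust it to your needs):

    #!/bin/bash
    #SBATCH --partition=shared-gpu
    #SBATCH --gres=gpu:rtx:1                 # one of the node's GPUs
    #SBATCH --constraint=COMPUTE_TYPE_RTX    # feature tag advertised by gpu017
    #SBATCH --time=00:10:00

    srun nvidia-smi -L    # print the GPU model the job actually got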

Thx, bye,
Luca


Hi Luca,

I am a bit confused by the specs of this node.

In particular, everything seems to suggest that the node is equipped with eight RTX 2080 Ti cards, whereas they are actually RTX 3090s.

To my knowledge these cards require CUDA >= 11, so I think it would be important to correct the Slurm configuration.

Please let me know if I am missing something here.
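
As a quick sanity check, something along these lines should print the actual card model from the node itself (only a sketch; the GRES name follows what Slurm currently advertises for gpu017):

    srun --partition=shared-gpu --nodelist=gpu017 --gres=gpu:rtx:1 --time=00:05:00 nvidia-smi -L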

Thank you!

Giuseppe

Hi there,

Thank you for your vigilance!

Indeed, my fault: I misread the shipping notices and was misled by the "RTX" branding (cf. Current issues on Baobab and Yggdrasil - #62 by Luca.Capello), while the cards are actually Ampere ones (cf. 3090 & 3090 Ti-Grafikkarten). The documentation has been fixed (cf. hpc:hpc_clusters [eResearch Doc]).

I have also fixed the Slurm configuration, adding the new COMPUTE_CAPABILITY_8_6 feature (cf. https://developer.nvidia.com/cuda-gpus#compute), but for some unknown reason scontrol does not yet show the correct Gres:

capello@login2:~$ scontrol show Node=gpu017 | grep -E '(Features|Gres)'
   AvailableFeatures=EPYC-7742,V8,COMPUTE_CAPABILITY_8_6,COMPUTE_TYPE_AMPERE
   ActiveFeatures=EPYC-7742,V8,COMPUTE_CAPABILITY_8_6,COMPUTE_TYPE_AMPERE
   Gres=gpu:rtx:8
capello@login2:~$ 
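
For reference, the intent of the change is roughly the following (a sketch of the relevant slurm.conf/gres.conf entries, not the actual cluster configuration; hardware values copied from the scontrol output):

    # slurm.conf (sketch): node entry with the new feature tags and GRES type
    NodeName=gpu017 CPUs=128 RealMemory=512000 Gres=gpu:ampere:8 Feature=EPYC-7742,V8,COMPUTE_CAPABILITY_8_6,COMPUTE_TYPE_AMPERE

    # gres.conf on gpu017 (sketch): eight devices exported as type "ampere"
    NodeName=gpu017 Name=gpu Type=ampere File=/dev/nvidia[0-7]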

More investigation needed, sorry for the inconvenience.

Thx, bye,
Luca

Great, thanks! I found it by chance while debugging a script…

Hi there,

Everything is fine now; it turns out slurmctld needs to be started after the node's slurmd:

capello@login2:~$ scontrol show Node=gpu017
NodeName=gpu017 Arch=x86_64 CoresPerSocket=64
   CPUAlloc=57 CPUTot=128 CPULoad=0.32
   AvailableFeatures=EPYC-7742,V8,COMPUTE_CAPABILITY_8_6,COMPUTE_TYPE_AMPERE
   ActiveFeatures=EPYC-7742,V8,COMPUTE_CAPABILITY_8_6,COMPUTE_TYPE_AMPERE
   Gres=gpu:ampere:8
   NodeAddr=gpu017 NodeHostName=gpu017 Version=20.11.7
   OS=Linux 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019
   RealMemory=512000 AllocMem=128672 FreeMem=488685 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=1500000 Weight=10 Owner=N/A MCS_label=N/A
   Partitions=shared-gpu
   BootTime=2021-07-01T17:25:34 SlurmdStartTime=2021-07-22T11:34:12
   CfgTRES=cpu=128,mem=500G,billing=128
   AllocTRES=cpu=57,mem=128672M
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)

capello@login2:~$ 
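
In other words, the fix boiled down to the restart order (a sketch, assuming the usual systemd unit names for Slurm):

    # on gpu017: restart slurmd first, so it re-reads gres.conf
    systemctl restart slurmd

    # then on the controller: restart slurmctld, so it registers the new GRES type
    systemctl restart slurmctld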

Thx, bye,
Luca