Hi there,
we have installed a new node on Baobab:
gpu[017] (member of the shared-gpu partition)
capello@login2:~$ scontrol show Node=gpu017
NodeName=gpu017 Arch=x86_64 CoresPerSocket=64
CPUAlloc=16 CPUTot=128 CPULoad=3.88
AvailableFeatures=EPYC-7742,V8,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_RTX
ActiveFeatures=EPYC-7742,V8,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_RTX
Gres=gpu:rtx:8
NodeAddr=gpu017 NodeHostName=gpu017 Version=20.11.3
OS=Linux 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019
RealMemory=512000 AllocMem=131072 FreeMem=464598 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=1500000 Weight=10 Owner=N/A MCS_label=N/A
Partitions=shared-gpu
BootTime=2021-03-22T13:22:39 SlurmdStartTime=2021-03-22T13:26:22
CfgTRES=cpu=128,mem=500G,billing=128
AllocTRES=cpu=16,mem=128G
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Comment=(null)
capello@login2:~$
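For users who want to try the new node, here is a minimal job-script sketch. The partition and Gres names are taken from the scontrol output above; the wall time, GPU count, and script name are placeholders:

```shell
# Write a minimal batch script requesting one GPU on the shared-gpu
# partition (Gres type "rtx", as currently advertised by scontrol).
cat > gpu_test.sbatch <<'EOF'
#!/bin/sh
#SBATCH --partition=shared-gpu
#SBATCH --gres=gpu:rtx:1
#SBATCH --time=00:05:00
# Print the GPUs visible to the job.
srun nvidia-smi
EOF
# Submit with: sbatch gpu_test.sbatch
```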
Thx, bye,
Luca
Hi Luca,
I am a bit confused by the specs of this node.
In particular, everything seems to suggest that the node is equipped with 8 RTX 2080Ti cards, but they are actually RTX 3090s.
To my knowledge these cards require CUDA >= 11, so I think it is important to correct the Slurm configuration.
Please let me know if I am missing something here.
Thank you!
Giuseppe
Hi there,
Giuseppe.Chindemi:
In particular, everything seems to suggest that the node is equipped with 8 RTX 2080Ti cards, but they are actually RTX 3090s.
To my knowledge these cards require CUDA >= 11 and so I think it would be important to correct the SLURM configuration.
Thank you for your vigilance!
Indeed, my fault: I misread the shipping notices and was misled by the "RTX" label (cf. Current issues on Baobab and Yggdrasil - #62 by Luca.Capello ), while the cards are actually Ampere-based ones (cf. 3090 & 3090 Ti-Grafikkarten ). The documentation has been fixed (cf. hpc:hpc_clusters [eResearch Doc] ).
I have also fixed the Slurm configuration, adding the new COMPUTE_CAPABILITY_8_6 feature (cf. https://developer.nvidia.com/cuda-gpus#compute ), but for an unknown reason scontrol still does not show the correct Gres:
capello@login2:~$ scontrol show Node=gpu017 | grep -E '(Features|Gres)'
AvailableFeatures=EPYC-7742,V8,COMPUTE_CAPABILITY_8_6,COMPUTE_TYPE_AMPERE
ActiveFeatures=EPYC-7742,V8,COMPUTE_CAPABILITY_8_6,COMPUTE_TYPE_AMPERE
Gres=gpu:rtx:8
capello@login2:~$
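For reference, the Gres type that scontrol reports comes from two places that have to agree. The excerpts below are hypothetical sketches; the actual file contents and paths on Baobab may differ:

```
# slurm.conf, on the controller (read by slurmctld):
NodeName=gpu017 ... Gres=gpu:ampere:8

# gres.conf, on gpu017 itself (read by slurmd):
NodeName=gpu017 Name=gpu Type=ampere File=/dev/nvidia[0-7]
```

If slurmctld has not re-read its configuration, it can keep advertising the old `gpu:rtx:8` value even when the node-side files are already correct.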
More investigation needed, sorry for the inconvenience.
Thx, bye,
Luca
Great, thanks! I found it by chance while debugging a script…
Hi there,
Everything is fine now: slurmctld needs to be restarted after the node's slurmd:
capello@login2:~$ scontrol show Node=gpu017
NodeName=gpu017 Arch=x86_64 CoresPerSocket=64
CPUAlloc=57 CPUTot=128 CPULoad=0.32
AvailableFeatures=EPYC-7742,V8,COMPUTE_CAPABILITY_8_6,COMPUTE_TYPE_AMPERE
ActiveFeatures=EPYC-7742,V8,COMPUTE_CAPABILITY_8_6,COMPUTE_TYPE_AMPERE
Gres=gpu:ampere:8
NodeAddr=gpu017 NodeHostName=gpu017 Version=20.11.7
OS=Linux 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019
RealMemory=512000 AllocMem=128672 FreeMem=488685 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=1500000 Weight=10 Owner=N/A MCS_label=N/A
Partitions=shared-gpu
BootTime=2021-07-01T17:25:34 SlurmdStartTime=2021-07-22T11:34:12
CfgTRES=cpu=128,mem=500G,billing=128
AllocTRES=cpu=57,mem=128672M
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Comment=(null)
capello@login2:~$
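For other admins who hit the same symptom, the restart order described above would look roughly like the following illustrative sequence (assuming systemd units named slurmd and slurmctld, which may differ per site):

```shell
# On the compute node first, so slurmd registers the new Gres:
ssh gpu017 systemctl restart slurmd

# Then on the controller, so slurmctld picks up the node's Gres:
systemctl restart slurmctld
```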
Thx, bye,
Luca