gpu012 down, not responding

Hello,

It looks like gpu012 has been down since April 30th, 2021.

Here’s the output of scontrol show node gpu012:

NodeName=gpu012 Arch=x86_64 CoresPerSocket=6
   CPUAlloc=0 CPUTot=12 CPULoad=8.32
   AvailableFeatures=E5-2643V3,V5,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_RTX
   ActiveFeatures=E5-2643V3,V5,COMPUTE_CAPABILITY_7_5,COMPUTE_TYPE_RTX
   Gres=gpu:rtx:8
   NodeAddr=gpu012 NodeHostName=gpu012 Version=20.11.3
   OS=Linux 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019
   RealMemory=257820 AllocMem=0 FreeMem=238231 Sockets=2 Boards=1
   State=DOWN* ThreadsPerCore=1 TmpDisk=300000 Weight=10 Owner=N/A MCS_label=N/A
   Partitions=shared-gpu,private-dpnc-gpu
   BootTime=2021-04-12T11:48:48 SlurmdStartTime=2021-04-21T12:12:58
   CfgTRES=cpu=12,mem=257820M,billing=12
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Not responding [slurm@2021-04-30T15:37:31]
   Comment=(null)

Regards,
Debajyoti
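
For anyone who wants to check this from a script, the fields that matter in the record above are State and Reason. Below is a minimal Python sketch that pulls them out of such a record. It assumes the text comes from scontrol show node; the helper name node_status and the shortened RECORD excerpt are made up for illustration. The trailing * on DOWN* is how Slurm marks a node that is not responding.

```python
import re

# Shortened excerpt of the node record quoted above
# (assumed to come from `scontrol show node gpu012`).
RECORD = """\
NodeName=gpu012 Arch=x86_64 CoresPerSocket=6
   State=DOWN* ThreadsPerCore=1 TmpDisk=300000 Weight=10 Owner=N/A MCS_label=N/A
   BootTime=2021-04-12T11:48:48 SlurmdStartTime=2021-04-21T12:12:58
   Reason=Not responding [slurm@2021-04-30T15:37:31]
"""


def node_status(record):
    """Return name, state and reason from a Slurm node record (hypothetical helper)."""
    name = re.search(r"NodeName=(\S+)", record)
    state = re.search(r"\bState=(\S+)", record)
    # The reason may contain spaces, so capture up to the end of its line.
    reason = re.search(r"^\s*Reason=(.*)$", record, re.MULTILINE)
    return {
        "name": name.group(1) if name else None,
        "state": state.group(1) if state else None,
        "reason": reason.group(1).strip() if reason else None,
    }


status = node_status(RECORD)
# A trailing "*" in the state means the node is not responding.
not_responding = bool(status["state"]) and status["state"].endswith("*")
print(status["name"], status["state"], status["reason"])
```

In practice the record would be read from the command's output rather than pasted in as a string; the parsing step stays the same.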

Is there an update on this? The node is still down.

Cheers,
Johnny

Hi there,

gpu004.baobab and gpu012.baobab shared the same PDU, but the latest computations are drawing too much power, so we are decommissioning old nodes to recover power.

Sorry for the inconvenience.

Thx, bye,
Luca

Hi there,

node[088-091,136-139] are now decommissioned (cf. "[Baobab] Old nodes decommissioned: node[088-091,136-139]"), so gpu012 is back in business.

Thx, bye,
Luca
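
For readers unfamiliar with the bracket notation, node[088-091,136-139] is a Slurm hostlist expression covering eight nodes. The Python sketch below expands it. It handles a single bracket group only and is not a full hostlist parser; the function name expand_hostlist is made up for illustration. On a cluster, scontrol show hostnames does the same job.

```python
import re


def expand_hostlist(expr):
    """Expand a simple expression such as 'node[088-091,136-139]'.

    Handles one bracket group only (illustrative helper, not Slurm's parser).
    """
    match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]", expr)
    if match is None:
        return [expr]  # plain host name, nothing to expand
    prefix, ranges = match.groups()
    hosts = []
    for part in ranges.split(","):
        start, _, end = part.partition("-")
        end = end or start   # a single index such as '088'
        width = len(start)   # preserve the zero padding
        for i in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{i:0{width}d}")
    return hosts


nodes = expand_hostlist("node[088-091,136-139]")
print(len(nodes), nodes)
```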