_gpu[002,012]_ still down at 16:00 on 2020-06-18

Hi Luca,

Today 11:44 AM - The service is restored.

Is this referring only to the ${HOME} storage? The gpu002 and gpu012 nodes are still down, now due to an unexpected reboot.

Cheers,
Johnny

Hi there,

Yes, the "service is restored" message at 11:44 referred to the ${HOME} storage only (cf. Current issues on Baobab and Yggdrasil - #13 by Luca.Capello).

gpu[002,012] are still DOWN in Slurm because:

  • gpu002: PSU2 failed, and since this node has 6 TITAN X GPUs a single PSU is not enough. A replacement has already been requested.
  • gpu012: the node itself was fine, but during the final check before re-enabling it in Slurm I found another problem in the same rack (leaf7, cf. Current issues on Baobab and Yggdrasil - #15 by Luca.Capello).

Thx, bye,
Luca

Hi there,

With leaf7 back in production, gpu012 is available again and in fact already in use:

capello@login2:~$ scontrol show Node=gpu012
NodeName=gpu012 Arch=x86_64 CoresPerSocket=6 
   CPUAlloc=12 CPUTot=12 CPULoad=2.62
   AvailableFeatures=E5-2643V3,V5,COMPUTE_CAPABILITY_6_1,COMPUTE_TYPE_RTX
   ActiveFeatures=E5-2643V3,V5,COMPUTE_CAPABILITY_6_1,COMPUTE_TYPE_RTX
   Gres=gpu:rtx:8
   NodeAddr=gpu012 NodeHostName=gpu012 Version=19.05.7
   OS=Linux 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019 
   RealMemory=257820 AllocMem=36000 FreeMem=248035 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=300000 Weight=10 Owner=N/A MCS_label=N/A
   Partitions=shared-gpu-EL7,dpnc-gpu-EL7 
   BootTime=2020-06-18T17:37:34 SlurmdStartTime=2020-06-18T17:38:21
   CfgTRES=cpu=12,mem=257820M,billing=12
   AllocTRES=cpu=12,mem=36000M
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   

capello@login2:~$ 
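
If you only want the node state rather than the full record, the State= field can be pulled out of the output above. This is a minimal sketch, not an official tool: node_state is a hypothetical helper name, and it assumes the KEY=VALUE layout shown in the scontrol output above.

```shell
# node_state (hypothetical helper): read `scontrol show node` output on stdin
# and print only the value of the State= field.
# Assumes space-separated KEY=VALUE pairs, as in the output above.
node_state() {
  # Put each KEY=VALUE pair on its own line, then keep only the State value.
  tr ' ' '\n' | sed -n 's/^State=//p'
}
```

For example, `scontrol show Node=gpu012 | node_state` would print ALLOCATED for the output above.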

Thx, bye,
Luca

Hi there,

gpu002 was put back into production yesterday evening at 22:10 (cf. Current issues on Baobab and Yggdrasil - #12 by Luca.Capello).

Thx, bye,
Luca