Hi Luca,
Today 11:44 AM - The service is restored.
Is this referring only to the ${HOME} storage? The gpu002 and gpu012 nodes are still down, now due to an unexpected reboot.
Cheers,
Johnny
Hi there,
The "service is restored" notice at 11:44 did indeed refer to the ${HOME} storage only (cf. Current issues on Baobab and Yggdrasil - #13 by Luca.Capello).
gpu[002,012] are still DOWN in Slurm given that:
Thx, bye,
Luca
Hi there,
With leaf7 back in production, gpu012 is now available again as well, and in fact already in use:
capello@login2:~$ scontrol show Node=gpu012
NodeName=gpu012 Arch=x86_64 CoresPerSocket=6
CPUAlloc=12 CPUTot=12 CPULoad=2.62
AvailableFeatures=E5-2643V3,V5,COMPUTE_CAPABILITY_6_1,COMPUTE_TYPE_RTX
ActiveFeatures=E5-2643V3,V5,COMPUTE_CAPABILITY_6_1,COMPUTE_TYPE_RTX
Gres=gpu:rtx:8
NodeAddr=gpu012 NodeHostName=gpu012 Version=19.05.7
OS=Linux 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019
RealMemory=257820 AllocMem=36000 FreeMem=248035 Sockets=2 Boards=1
State=ALLOCATED ThreadsPerCore=1 TmpDisk=300000 Weight=10 Owner=N/A MCS_label=N/A
Partitions=shared-gpu-EL7,dpnc-gpu-EL7
BootTime=2020-06-18T17:37:34 SlurmdStartTime=2020-06-18T17:38:21
CfgTRES=cpu=12,mem=257820M,billing=12
AllocTRES=cpu=12,mem=36000M
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
capello@login2:~$
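For anyone who wants to check node availability from a script rather than reading the output by eye, here is a minimal sketch in Python. It only parses text in the `Key=Value` layout shown above; the helper names `parse_scontrol_node` and `is_usable` are hypothetical, and treating any state containing DOWN, DRAIN or FAIL as unusable is an assumption made for illustration, not an official rule.

```python
def parse_scontrol_node(text):
    """Parse `scontrol show node`-style text into a dict of fields.

    Assumes whitespace-separated Key=Value tokens, as in the output above.
    """
    fields = {}
    last_key = None
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep and key:
            # Split on the first "=" only, so values such as
            # "cpu=12,mem=257820M,billing=12" stay intact.
            fields[key] = value
            last_key = key
        elif last_key is not None:
            # A token without "=" continues the previous value
            # (e.g. the free-text tail of the OS= field).
            fields[last_key] += " " + token
    return fields


def is_usable(fields):
    """Heuristic (assumption): unusable if State mentions DOWN, DRAIN or FAIL."""
    state = fields.get("State", "").upper()
    return not any(flag in state for flag in ("DOWN", "DRAIN", "FAIL"))


# Excerpt copied from the gpu012 output above.
SAMPLE = """
NodeName=gpu012 Arch=x86_64 CoresPerSocket=6
CPUAlloc=12 CPUTot=12 CPULoad=2.62
Gres=gpu:rtx:8
OS=Linux 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019
State=ALLOCATED ThreadsPerCore=1 TmpDisk=300000 Weight=10 Owner=N/A MCS_label=N/A
CfgTRES=cpu=12,mem=257820M,billing=12
"""

node = parse_scontrol_node(SAMPLE)
print(node["NodeName"], node["State"], is_usable(node))
```

On a login node the same function could be fed the real output, for example via `subprocess.run(["scontrol", "show", "node", "gpu012"], capture_output=True, text=True).stdout`. Free-text fields that contain `=` inside spaces would confuse this simple parser.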
Thx, bye,
Luca
Hi there,
gpu002 was put back into production yesterday evening at 22:10 (cf. Current issues on Baobab and Yggdrasil - #12 by Luca.Capello).
Thx, bye,
Luca