Hi @Maxime.Juventin,
I implemented a local scratch share system about a year ago:
Local Share Directory Between Jobs on Compute
This shared space is cleaned up once the last job running on the node has finished.
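If it helps, here is a minimal sketch of how a job could use such a node-local scratch directory ($SLURM_JOB_ID and $SLURM_SUBMIT_DIR are standard Slurm variables; the scratch path, input.dat and my_program are only placeholders, the actual location and behaviour are described in the post linked above):
#!/bin/sh
#SBATCH --partition=debug-cpu
# Placeholder path for the node-local scratch directory; use the one from the linked post
SCRATCH_DIR=/local/scratch/$SLURM_JOB_ID
mkdir -p "$SCRATCH_DIR"
# Stage input onto the fast local disk, work there, then copy results back
cp input.dat "$SCRATCH_DIR"/
cd "$SCRATCH_DIR"
./my_program input.dat > result.out
cp result.out "$SLURM_SUBMIT_DIR"/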
To find local storage information using Slurm, you can use the scontrol
command:
(bamboo)-[alberta@login1 ~]$ scontrol show node cpu001
NodeName=cpu001 Arch=x86_64 CoresPerSocket=64
CPUAlloc=0 CPUEfctv=126 CPUTot=128 CPULoad=0.08
AvailableFeatures=EPYC-7742,V8,TmpDisk800G
ActiveFeatures=EPYC-7742,V8,TmpDisk800G
Gres=(null)
NodeAddr=cpu001 NodeHostName=cpu001 Version=23.11.7
OS=Linux 4.18.0-513.24.1.el8_9.x86_64 #1 SMP Thu Apr 4 18:13:02 UTC 2024
RealMemory=512000 AllocMem=0 FreeMem=510129 Sockets=2 Boards=1
CoreSpecCount=2 CPUSpecList=63,127
State=IDLE ThreadsPerCore=1 TmpDisk=800000 Weight=10 Owner=N/A MCS_label=N/A
Partitions=debug-cpu
BootTime=2024-08-08T15:13:22 SlurmdStartTime=2024-08-12T15:07:44
LastBusyTime=2024-08-12T15:13:06 ResumeAfterTime=None
CfgTRES=cpu=126,mem=500G,billing=126
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
You can target nodes that advertise a minimum of XXXGB
of storage, but keep in mind that this does not guarantee that the node will actually have that much free space, since the local disk is shared among all jobs running on it.
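If you want to check how much temporary disk each node advertises, or to request a minimum amount when submitting, something like the following should work (%d prints TmpDisk per node in MB, and --tmp expresses a minimum requirement, not a reservation; 200G is just an example value):
(bamboo)-[alberta@login1 ~]$ sinfo -N -o "%N %d"
(bamboo)-[alberta@login1 ~]$ srun --tmp=200G hostname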
On our end, it might be useful to configure a new constraint to specify node types with XXXGB
of storage.
For example (I just tested it), we can define a TmpDiskXXXG
feature, which shows up in the AvailableFeatures and ActiveFeatures lines of the output above.
In this example, the TmpDisk800G
feature listed under ActiveFeatures indicates that the node has at most 800GB of local disk.
To start a job with this constraint:
(bamboo)-[alberta@login1 ~]$ srun --constraint=TmpDisk800G hostname
srun: job 53734 queued and waiting for resources
srun: job 53734 has been allocated resources
cpu001.bamboo
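The same constraint can of course go in a batch script, and it can be combined with --tmp to state how much temporary disk you actually need (a sketch; the values are examples only):
#!/bin/sh
# Land on a node type that advertises 800GB of local disk
#SBATCH --constraint=TmpDisk800G
# Also require at least this much free temporary disk (example value)
#SBATCH --tmp=200G
hostname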
I’ll keep this idea in mind for our next team meeting, where we can explore its potential benefits (or disadvantages).