Dear Silas,
You are right: the metadata are on SSD, but the data are on spinning disks, which are not very good at high-IO workloads.
My suggestion was to use scratch or local scratch for the temporary files you already have to create, not to create extra ones on purpose.
At least if you saturate the scratch storage, the impact on other users will be lower. Local scratch is better still, but the space is limited to ~150 GB per compute node.
If you use /scratch on a compute node, you get a dedicated space on the local SSD (or on a spinning disk if you are on an older node):
[sagon@login2 ~] $ salloc
salloc: Pending job allocation 46491815
salloc: job 46491815 queued and waiting for resources
salloc: job 46491815 has been allocated resources
salloc: Granted job allocation 46491815
[sagon@node002 ~] $
[sagon@node002 /] $ ls -la /scratch
total 8
drwx------ 2 sagon unige 4096 May 4 16:07 .
dr-xr-xr-x. 30 root root 4096 Apr 20 09:01 ..
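As a sketch of how you might use that job-private space: do the heavy temporary IO locally, and copy only the final result back to shared storage. The function below is illustrative, not cluster-specific; the directory names are parameters (on a compute node you would pass /scratch), and the dd write just stands in for your real workload.

```shell
# stage_via_scratch: run heavy temporary IO in a local scratch directory,
# copying only the final result back to shared storage.
# (Sketch; on a compute node you would pass /scratch as the first argument.)
stage_via_scratch() {
    scratch="$1"    # local scratch directory (job-private on the node)
    dest="$2"       # final destination on shared storage
    workfile="$scratch/work.dat"

    # heavy temporary IO stays on the local disk/SSD
    dd if=/dev/zero of="$workfile" bs=1M count=10 status=none

    # move only the result to shared storage, then clean up
    cp "$workfile" "$dest" && rm -f "$workfile"
}
```

This way the shared filesystem sees a single sequential copy at the end instead of all the intermediate writes.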
Be careful: this is not a physical path. If you connect to the node via ssh, the data in scratch are actually located under /tmpslurm/.46491815.0/scratch/
You can also use the in-memory tmpfs /dev/shm on a compute node. But remember that this consumes the memory you requested when submitting your job (3 GB per CPU by default):
[sagon@node002 /] $ dd if=/dev/urandom of=/dev/shm/myfile
Killed
dd was killed when myfile reached ~3 GB.
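To avoid getting killed like that, you can cap the write explicitly so the tmpfs file stays within the memory you requested. A minimal sketch (the size is a parameter; pick it well below your job's memory allocation):

```shell
# write_bounded: write at most max_mib MiB to a tmpfs file so it stays
# within the job's memory allocation (sketch; sizes are illustrative).
write_bounded() {
    out="$1"      # e.g. /dev/shm/myfile on a compute node
    max_mib="$2"  # keep well below the memory requested for the job
    dd if=/dev/urandom of="$out" bs=1M count="$max_mib" status=none
}
```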
Yes, it is better to do that on the compute node. But the issue is that you may still saturate the storage if you do it at a large scale.
I did some testing with gzip (not the best tool, but a widely used one).
If you create an archive the standard way:
[sagon@login2 perf] $ time gzip -c large-file-10gb.txt > large-file-10gb.txt.gz
real 11m59.135s
user 11m20.859s
sys 0m16.219s
We can see that the user “sagon => 240477” is now in the top 5 in terms of IO in our disk-monitoring tool.
I’m now adding an input buffer and an output buffer to lower the number of IO operations:
[sagon@login2 perf] $ time mbuffer -p 10 -m 1G -i large-file-10gb.txt | gzip | mbuffer -P 90 -m 1G > large-file-10gb.txt_buffer.gz
in @ 16.0 MiB/s, out @ 0.0 kiB/s, 9872 MiB total, buffer 35% full, 100% done
summary: 10.0 GiByte in 6min 20.0sec - average of 26.9 MiB/s, 10x full
in @ 0.0 kiB/s, out @ 352 MiB/s, 10.0 GiB total, buffer 0% full
summary: 10.0 GiByte in 6min 20.9sec - average of 26.9 MiB/s, 10x empty
real 6m21.064s
user 6m16.015s
sys 0m41.498s
We can see that the operation was faster, and my username is no longer in the top 5 in the monitoring tool. Instead, this produces “peaks”.
I think this is better, as the storage only has to read sequential 1 GB blocks instead of serving millions of smaller requests.
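If mbuffer is not available, you can approximate the same idea with plain dd on both ends of the pipe, reading and writing in large blocks to cut the number of IO requests. This is only a sketch of the pattern; mbuffer remains the better tool since it gives you a much larger RAM buffer and fill-level control via -p/-P.

```shell
# buffered_gzip: approximate the mbuffer pipeline with dd buffers on both
# ends, so the storage sees large sequential reads/writes instead of many
# small ones. (Sketch; block size is illustrative.)
buffered_gzip() {
    in="$1"; out="$2"
    dd if="$in" bs=4M status=none | gzip | dd of="$out" bs=4M status=none
}
```

Usage would be e.g. `buffered_gzip large-file-10gb.txt large-file-10gb.txt.gz`.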
Other alternatives: