Discussion about compression and disk IO

Dear Yann,

Could you explain how one should use gzip on many or large files?
I noticed that the baobab disk might be good for parallel access, but for simple IO-bound commands, it’s very slow.

You suggest using temporary files, but usually I don’t specify any temp files for the gzip command.
Could one use pigz and specify a temp directory to speed up the (un)compression?
Can we also use the simple /tmp/ directory for temp files?

I assume you still want us to submit (un)compression jobs to the Slurm cluster, albeit not as an array. However, if a job gets killed (e.g. it exceeds the time limit), the (un)compression might be stopped somewhere in the middle and one has to start all over again. Or is there an easy way to keep track of which files have finished processing and which have not?
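
I could imagine something like the following rough sketch (untested, and "data/*.txt" is only a placeholder pattern): compress each file to a temporary name and rename it only on success, so a restarted job simply skips what is already done. But maybe there is a better way?

# rough sketch, untested
for f in data/*.txt; do
    [ -e "$f.gz" ] && continue                                    # already done, skip on restart
    gzip -c "$f" > "$f.gz.part" && mv "$f.gz.part" "$f.gz" && rm "$f"
done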

Dear Silas,

You are right. The metadata are on SSDs and the data are on spinning disks, which are not very good at high-IO workloads.

My suggestion was to use scratch or local scratch for the temporary files you already create, not to create extra ones on purpose.
At least if you saturate the scratch storage, the impact on other users will be lower. Local scratch is better still, but the space is limited to ~150 GB per compute node.

If you use /scratch on a compute node, you have a dedicated space on the local SSD (or on a spinning disk on older nodes):

[sagon@login2 ~] $ salloc
salloc: Pending job allocation 46491815
salloc: job 46491815 queued and waiting for resources
salloc: job 46491815 has been allocated resources
salloc: Granted job allocation 46491815
[sagon@node002 ~] $

[sagon@node002 /] $ ls -la /scratch
total 8
drwx------   2 sagon unige 4096 May  4 16:07 .
dr-xr-xr-x. 30 root  root  4096 Apr 20 09:01 ..

Be careful, this is not a physical path. If you connect to the node using ssh, the data in scratch are located here: /tmpslurm/.46491815.0/scratch/
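
As a minimal sketch of how you could use it in a job (the file name and time limit are only placeholders): copy the input to the node-local scratch, compress there, and copy only the result back:

#!/bin/sh
#SBATCH --time=01:00:00                     # placeholder time limit
# stage the input to node-local scratch, compress there, bring only the result back
cp large-file-10gb.txt /scratch/
gzip /scratch/large-file-10gb.txt
cp /scratch/large-file-10gb.txt.gz .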

You can also use the in-memory tmpfs /dev/shm on a compute node. But remember that this consumes the memory you requested when submitting your job (3 GB per CPU by default):

[sagon@node002 /] $ dd if=/dev/urandom of=/dev/shm/myfile
Killed

dd was killed when myfile reached ~3 GB.
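
If you need more room in /dev/shm, request more memory when you submit the job; for example (the value is only an illustration):

salloc --mem-per-cpu=8G        # roughly 8 GB per CPU instead of the 3 GB default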

Yes, it is better to do that on a compute node. But the issue is that you may saturate the storage if you do it on a large scale.
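
If you do end up submitting many such jobs, one way to keep the load on the storage in check (just a sketch, the numbers are arbitrary) is to throttle how many array tasks run at the same time:

#SBATCH --array=1-500%10       # 500 tasks in total, but at most 10 running concurrently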

I did some testing using gzip (not the best tool, but a widely used one).

If you create an archive the standard way:

[sagon@login2 perf] $ time gzip -c large-file-10gb.txt > large-file-10gb.txt.gz

real    11m59.135s
user    11m20.859s
sys     0m16.219s

We can see that the user “sagon => 240477” is now in the top 5 in terms of IO on our disk monitoring tool.

I’m now adding an input buffer and an output buffer to lower the number of IO operations:

[sagon@login2 perf] $ time mbuffer  -p 10 -m 1G -i large-file-10gb.txt | gzip  | mbuffer -P 90 -m 1G >  large-file-10gb.txt_buffer.gz
in @ 16.0 MiB/s, out @  0.0 kiB/s, 9872 MiB total, buffer  35% full, 100% done
summary: 10.0 GiByte in  6min 20.0sec - average of 26.9 MiB/s, 10x full
in @  0.0 kiB/s, out @  352 MiB/s, 10.0 GiB total, buffer   0% full
summary: 10.0 GiByte in  6min 20.9sec - average of 26.9 MiB/s, 10x empty

real    6m21.064s
user    6m16.015s
sys     0m41.498s

We can see that the operation was faster and my username is no longer in the top 5 in the monitoring tool. Instead, this produces “peaks”.

I think this is better, as the storage only has to read sequential 1 GB blocks instead of serving millions of smaller requests.

Other alternatives:

  • pigz (see the sketch below)
  • pbzip2, pzcat
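
pigz, for instance, is a drop-in parallel replacement for gzip; a quick sketch (match the thread count to the CPUs you actually requested):

pigz -p 8 large-file-10gb.txt          # compress with 8 threads
pigz -d large-file-10gb.txt.gz         # decompress (or use unpigz)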