Hi HPC Team,
I was hoping you might be able to help me find a solution to a problem with slow disk I/O from scratch during jobs.
A lot of the deep learning jobs in our group involve training sets too large to fit into memory, so the solution is to stream the data from disk during training. We use scratch for this as the datasets are particularly large.
This is standard practice, and the functionality is built into many of the machine learning libraries we use, such as PyTorch.
A good example is the PyTorch ImageFolder dataset.
During training, a number of worker processes asynchronously load batches of images from disk and feed them to the GPU as they are requested.
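Concretely, the data-loading side of the training code looks roughly like the sketch below (a minimal example only; the dataset path and the loader settings are placeholders rather than our exact configuration):

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Placeholder location on scratch; the real dataset path differs.
DATA_DIR = "/path/to/scratch/dataset/train"

transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])

# ImageFolder opens and decodes individual image files on every access,
# so each batch translates into many small reads against scratch.
dataset = datasets.ImageFolder(DATA_DIR, transform=transform)

# The DataLoader spawns worker processes that fetch and decode the next
# batches in the background while the GPU works on the current one.
loader = DataLoader(
    dataset,
    batch_size=256,      # placeholder value
    shuffle=True,
    num_workers=16,      # roughly matched to the CPUs requested by the job
    pin_memory=True,     # faster host-to-GPU copies
    prefetch_factor=2,   # batches queued up per worker
)

for images, labels in loader:
    images = images.cuda(non_blocking=True)
    # ... forward/backward pass ...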
When I tried to use this method to train some of our models this morning (just as the cluster came online), throughput was very high and the GPU was fully utilized. However, around 10 minutes later, as others started using the cluster, the streaming became a significant bottleneck, and GPU utilization during the job is now sitting at around 5-10%. This plot shows the rapid decline in GPU usage.
The inconsistent I/O usage can be seen here:
These jobs were launched with the following script:
#!/bin/bash
#SBATCH --cpus-per-task=16
#SBATCH --mem=32GB
#SBATCH --time=7-00:00:00
#SBATCH --job-name=train_diffbeit
#SBATCH --output=/home/users/l/leighm/DiffBEIT/logs/%A_%a.out
#SBATCH --chdir=/home/users/l/leighm/DiffBEIT/scripts
#SBATCH --partition=shared-gpu,private-dpnc-gpu
#SBATCH --gres=gpu:ampere:1,VramPerGpu:20G
#SBATCH -a 0-2
network_name=( DiffBEIT BEIT SimMim )
model_target_=( src.models.diffbeit.DiffBEIT src.models.diffbeit.ClassicBEIT src.models.diffbeit.RegressBEIT )
export XDG_RUNTIME_DIR=""
srun apptainer exec --nv -B /srv,/home \
/home/users/l/leighm/scratch/Images/anomdiff-image_latest.sif \
python train.py \
network_name=${network_name[`expr ${SLURM_ARRAY_TASK_ID} % 3`]} \
model._target_=${model_target_[`expr ${SLURM_ARRAY_TASK_ID} % 3`]}
So right now I am just looking for ways to solve this. Is there an efficient way to move the data off scratch? Copying it directly to the compute node might be an option, but it would add a huge overhead when starting the job. Is there a way to make better use of the scratch I/O? Are there some nodes with faster access to scratch?
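To make the copying idea more concrete, the kind of staging I had in mind is sketched below. This is only a guess at how it could work: the archive path, the local directory, and the assumption that the compute node exposes local storage via $TMPDIR are all placeholders on my part rather than anything I know about the cluster.

import os
import tarfile
import time

# Hypothetical locations: a single tar archive of the dataset on shared
# scratch, and a node-local directory (wherever $TMPDIR points on the node).
ARCHIVE = "/path/to/scratch/dataset.tar"
LOCAL_DIR = os.path.join(os.environ.get("TMPDIR", "/tmp"), "dataset")

def stage_dataset(archive: str, local_dir: str) -> str:
    """Copy the dataset to node-local storage once, before training starts."""
    if not os.path.isdir(local_dir):
        start = time.time()
        os.makedirs(local_dir, exist_ok=True)
        # One large sequential read from scratch, instead of millions of
        # small random reads spread over the whole training run.
        with tarfile.open(archive) as tar:
            tar.extractall(local_dir)
        print(f"Staged dataset in {time.time() - start:.0f}s")
    return local_dir

# The ImageFolder dataset would then point at the local copy, so the
# per-batch reads never touch the shared filesystem.
data_dir = stage_dataset(ARCHIVE, LOCAL_DIR)

The idea is that one big sequential copy should be much friendlier to the shared filesystem than the current access pattern, but I do not know whether the node-local disks are large enough for our datasets, and the extraction itself is the start-up overhead I mentioned above.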
Any help is greatly appreciated.
Matthew Leigh