Storage on Baobab

Hello,

I’m writing this quick post because we are (or were) suffering from poor performance on the scratch space.

The scratch space, like the other storage on Baobab, is a shared resource. This means that performance depends on the workload of all users. This is especially true on a cluster, where a single job may perform I/O from many different threads at the same time, possibly across multiple nodes, which makes such issues hard to debug.

As a reminder, we use a software stack named BeeGFS to provide the storage on Baobab. Two kinds of servers are involved when you access data on a filesystem such as scratch (see the small sketch after the list):

  • metadata servers: they are involved when you run ls, find, etc. without reading file contents. Metadata is stored on fast NVMe disks. We have two metadata servers.
  • storage servers: they are involved when you read or write a file’s contents. The data are stored on rotational disks (120 disks in the case of the scratch space). We have two storage servers.
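
If the BeeGFS client tools are available on the login node (an assumption: their availability and the exact paths depend on the installation), you can see this split for yourself. A minimal sketch, with a hypothetical file path:

    # Show which metadata node and storage targets hold a given file
    beegfs-ctl --getentryinfo /path/on/scratch/some_file

    # stat only touches the metadata servers ...
    stat /path/on/scratch/some_file
    # ... while reading the contents goes to the storage servers
    dd if=/path/on/scratch/some_file of=/dev/null bs=1M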

The bottleneck is usually the metadata servers.
A typical performance killer is reading/writing a lot of very small files, or traversing a directory containing thousands of files. This is usually something users can improve themselves, for example by limiting the number of files per directory (see the sketch below).
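
As an illustration of that last point, here is a minimal sketch (bash; the directory names are hypothetical) that spreads the files of one flat directory into subdirectories based on the first two characters of their names, as a one-off reorganisation:

    # One-off reorganisation: bucket files by the first two characters of their names
    cd flat_dir
    for f in *; do
        [ -f "$f" ] || continue        # skip anything that is not a regular file
        prefix=${f:0:2}                # adjust the prefix length to your naming scheme
        mkdir -p "../bucketed/$prefix"
        mv "$f" "../bucketed/$prefix/"
    done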

Think twice before launching these tools on shared storage (if you really need one of them, see the sketch after this list):

  • updatedb
  • find
  • ncdu
  • du
  • anything else that takes a long time scanning the storage
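
If you really do need one of these tools, restricting its scope reduces the number of metadata operations it triggers. A minimal sketch, with hypothetical paths:

    # Scan a single subdirectory with a limited depth instead of the whole tree
    find /path/on/scratch/my_project -maxdepth 2 -name '*.log'

    # Summarise the usage of one directory only, not the whole scratch space
    du -sh /path/on/scratch/my_project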

Best practices:

  • limit the number of files per directory (for example, 1000)
  • do not test the performance of the storage by launching a “benchmark”, as you’ll waste precious resources
  • work with bigger files (1 MB instead of 1 KB, for example): bigger is better
  • do not read the same file many times; cache it instead (see the sketch after this list)
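
A minimal sketch of the last two points inside a Slurm batch script (the paths, the program name and the use of a node-local $TMPDIR are assumptions; check which node-local directory is available to your jobs):

    #!/bin/sh
    #SBATCH --job-name=packed_input      # hypothetical job

    # Many small input files were packed once into a single archive on the login node:
    #     tar -cf dataset.tar many_small_files/
    # Copy that one big file to node-local storage and unpack it there, so repeated
    # reads hit the local disk instead of the shared scratch space.
    cp /path/on/scratch/dataset.tar "$TMPDIR/"
    tar -xf "$TMPDIR/dataset.tar" -C "$TMPDIR"
    ./my_analysis --input "$TMPDIR/many_small_files"   # hypothetical program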

If you need to share a bunch of data between compute nodes, instead of reading the data from the scratch space on every compute node, you can use BeeOND to create a volatile filesystem that has the same lifetime as your job.
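
For reference, the upstream BeeOND tooling is typically driven as sketched below (command names follow the BeeGFS documentation; the paths and nodefile are placeholders, and how this is integrated with jobs on Baobab is not covered here):

    # Start a BeeOND filesystem across the nodes listed in a nodefile,
    # using a node-local directory for its data and mounting it at /mnt/beeond
    beeond start -n nodefile.txt -d /local/beeond-data -c /mnt/beeond

    # ... run the job, reading and writing under /mnt/beeond ...

    # Tear it down when the job ends; the data is volatile and disappears
    beeond stopall -n nodefile.txt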

This is only an introduction; we’ll try to publish a more detailed post about the storage on Baobab.

Best

Yann


Dear Yann,

I have an archive of data stored in the scratch space (as recommended in the previous communication).
Different jobs access different parts of the archive, so I am not entirely sure BeeOND would give an advantage. What do you think?

Looking forward to the promised post!

Best Regards

Volodymyr

I’m not sure either. In fact, I just noticed we never put BeeOND into production on Baobab, so it is not a usable solution right now. We’ll try to evaluate it soon to see whether it is worth it and whether we can install it on the compute nodes without side effects.


Great, thanks!

I mostly want to be sure that I am not myself creating excessive load on the storage, which could both hurt the performance of my jobs and annoy other users. I know my/our analysis is likely to do this, and I usually keep close track in case of any issues. So far, I have not noticed any slowdown, even with quite a large number of jobs running at once.

A secondary goal was to understand why I was blocked last Friday (when I made another post, linked here). Maybe someone generated a short-term but necessary load, but it’s important for me to know whether this happens often, because it makes my jobs waste their time slots (reading the image in 20 minutes instead of 2 seconds, and being killed before they are done).

In my case, one big generator of many small files in the same folder is Slurm itself with its log files. I know that I can change the output path manually, but I normally just use the default. This is particularly problematic with job arrays, whose logs are named with an added suffix.
Is there a recommended folder where I can put these files, which will almost surely be deleted soon if there is no problem? I only keep the logs until a job finishes, and to investigate crashes. In some cases I had so many Slurm logs that rm slurm* failed with “argument list too long” and I needed to use find -delete to remove them. I try to clean them up when necessary, but they may put an unnecessary load on the system.
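
As a sketch of how the log path can be redirected (the logs/ directory name is arbitrary, but %x, %A and %a are standard Slurm filename placeholders) and of the find-based cleanup mentioned above:

    #!/bin/sh
    #SBATCH --array=1-100
    # Write all array-task logs into a dedicated directory instead of the submission
    # directory; %x = job name, %A = array job ID, %a = array task index.
    # The logs/ directory must exist before submission, as Slurm does not create it.
    #SBATCH --output=logs/%x_%A_%a.out
    #SBATCH --error=logs/%x_%A_%a.err

    srun ./my_analysis "$SLURM_ARRAY_TASK_ID"    # hypothetical program

    # Later, remove old logs without hitting "argument list too long":
    #     find logs/ -name '*.out' -mtime +7 -delete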

I just want to note that the issues I noticed were distinctly not with listing, but with reading files at known locations. Listing/finding remained quite fast. I cannot exclude that these operations were delayed in other cases, but I did not notice it. Also, I do not even collect statistics on this in my jobs, since I very rarely perform such operations (they are indeed unnecessarily costly).

So, given the note in the original post, I suppose the problem was with the storage servers and not the metadata servers. But maybe I misunderstood some subtle interplay between the two.

You could maybe merge the log files into one big file for your job array, with a script you launch once after each array run (see the sketch below)? If you have an issue with too many log files, I guess you are facing the same issue with too many result files? Perhaps you should handle results and logs the same way.
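
A minimal sketch of that merge step (the filenames follow the default slurm-%A_%a.out pattern for job arrays; 12345 is a hypothetical array job ID):

    # Concatenate all logs of one job array into a single file, then delete the originals.
    # find is used instead of a shell glob to avoid "argument list too long".
    ARRAY_ID=12345
    find . -maxdepth 1 -name "slurm-${ARRAY_ID}_*.out" -exec cat {} + > "array_${ARRAY_ID}.log"
    find . -maxdepth 1 -name "slurm-${ARRAY_ID}_*.out" -delete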

@support is there a problem with storage on Baobab? To me it seems extremely slow, making it impossible to run experiments.


Hi @maciej.falkiewicz

We are very sorry about the inconvenience; we are doing our best to resolve the issue.

We have opened an issue with BeeGFS about the performance problem and are waiting for their answer.

Good news: we will soon receive 0.5 PB of additional storage, and the new disks will increase the I/O capacity available.
