Access large files in parallel


As far as I understand, Baobab's filesystem makes I/O a bit slow but allows multiple jobs/threads to access the same data file.

Do you have a suggestion on how to make use of this feature (in Python)? What are good data formats for storing big files, e.g. HDF?

I'm struggling with a big data table (200 million rows in HDF format); even when I read the file in chunks, a large memory overhead is needed.
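For reference, the chunked reading looks roughly like this (a sketch: the path and key are placeholder names, and it assumes the file was written with pandas/PyTables in `format="table"`, which is what makes chunked iteration possible):

```python
import pandas as pd

def count_rows_chunked(path, key="df", chunksize=1_000_000):
    """Stream a table-format HDF file chunk by chunk so that only one
    chunk is held in memory at a time. Counting rows is just a demo;
    real processing (filtering, aggregating) would go in the loop."""
    total = 0
    # passing chunksize makes read_hdf return an iterator over DataFrames
    for chunk in pd.read_hdf(path, key=key, chunksize=chunksize):
        total += len(chunk)
    return total
```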

Kind regards

Is your problem memory usage or file access latency?

One solution to get faster I/O is to copy the files to local storage in the /scratch directory.
If you need even faster I/O you can use a ramdisk. The drawback of both approaches is that you need to copy the file into local storage at the start of the job. However, it can be worthwhile if you expect to query a file a lot. Note also that the filesystem by itself will use RAM to cache recently accessed files.
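A minimal sketch of the copy-at-job-start step, using only the standard library (the `$TMPDIR` fallback is an assumption: many SLURM clusters expose a per-job local directory through that variable, but adjust it to your site's convention):

```python
import os
import shutil

def stage_in(src, local_dir=None):
    """Copy a data file to fast local storage and return the local path.

    local_dir defaults to $TMPDIR (assumed per-job local directory)."""
    local_dir = local_dir or os.environ.get("TMPDIR", "/tmp")
    dst = os.path.join(local_dir, os.path.basename(src))
    shutil.copy(src, dst)  # one sequential copy, fast compared to random I/O
    return dst
```

Subsequent reads then go through the returned local path instead of the shared filesystem.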

How big is the file that you want to read and what kind of access pattern do you expect?

Dear Pablo,

Thank you for your suggestion. In fact, copying the file to the temporary storage greatly accelerated the script.
The goal is to have an HDF file that allows selecting specific rows based on criteria.
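With pandas this kind of on-disk row selection can be sketched as follows (the column name is a placeholder; the table must be written with `format="table"` and the queried columns declared as `data_columns` so the `where` filter can run against the file instead of loading everything):

```python
import pandas as pd

def write_queryable(df, path, key="df"):
    """Store a DataFrame so rows can later be selected by criteria.

    format="table" plus data_columns enables on-disk `where` queries."""
    df.to_hdf(path, key=key, format="table", data_columns=True)

def select_rows(path, key="df", where=None):
    """Read only the rows matching the `where` condition, e.g. "score > 4"."""
    return pd.read_hdf(path, key=key, where=where)
```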

When writing out a large amount of data, I realized that writing first to /scratch also accelerates the process. However, I'm not sure whether copying the file from /scratch back to the filesystem is so slow that the time gained is lost again?

If your writing is mainly random writes and your task is long enough (multiple hours), I think copying to /scratch at the start and back at the end of the task is faster. The reason is that database manipulation is mainly random reads and writes, whereas the initial and final copies are sequential, which is significantly faster. Of course, it is always good to measure in practice.
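The copy-in / work / copy-out structure described above can be sketched like this (a standard-library sketch; the `$TMPDIR` default is an assumption about the cluster's per-job local directory):

```python
import os
import shutil

def run_with_local_copy(src, work, local_dir=None):
    """Stage a file to local storage, run `work` on it, copy it back.

    `work` is any callable taking the local path; the random I/O it does
    hits fast local storage, while the two copies are sequential."""
    local_dir = local_dir or os.environ.get("TMPDIR", "/tmp")
    local = shutil.copy(src, local_dir)  # sequential stage-in
    work(local)                          # random reads/writes on local storage
    shutil.copy(local, src)              # sequential stage-out
```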

Thank you, I will copy the files back and forth.

What do you mean by random access? I read an HDF file chunk by chunk. Do you mean this translates to random access on the filesystem?

Hi @Silas.Kieser

You already have your data in your home directory on Baobab, right? In this case, accessing your data on home or scratch should be similar, unless one of the filesystems is under heavy I/O load, so there is no point in copying your data back and forth to scratch. The main difference between home and scratch is that scratch isn't backed up and is bigger.

I don't know what file size we are talking about. If it's on the order of 100 GB, you can copy your HDF file to the local scratch on the compute node you use. On recent nodes it's SSD storage and may be faster than scratch or home. Keep in mind that the data is erased automatically at the end of the job.