Improving the HPC Workflow

Dear HPC Community,

As 2020 closes, I want to better organize my HPC computations!
So far my workload consists of calculations on large sets (O(10^7)) of parameters, which are processed in multiple stages (e.g. obtaining a wavefunction, using this wavefunction to obtain different observables, using those observables to compute some statistics, …) by various Python scripts. Many of these steps are very small, O(1s), so they need to be packaged with others to form one scheduler job, but they should still be considered logically independent. Depending on intermediate results, we quite often find new interesting sets of parameters, which then need to be processed along the same steps and combined into the larger dataset.
Sometimes the Python scripts are updated to fix bugs or improve computational efficiency. To ensure reproducibility, it is therefore important that every computational result can be associated with all the code and parameters involved in its production (apparently this is called “Data Provenance”).

For now, all of this is done by two scripts and a database: all parameters, code, and high-level results are stored in the database; one script generates, from the database, a directory with everything needed to perform a set of computations, including the sbatch file. Another script collects the results and registers which computations failed, so they can be rerun individually, perhaps with a larger time/memory limit.
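
In spirit, the staging script does something like this (a much-simplified sketch; the table, column, and file names are invented for illustration, and it also shows what I mean by packaging many O(1s) tasks into one scheduler job):

```python
import sqlite3
from pathlib import Path

# Invented names throughout; this only sketches the idea of packaging many
# small computations into a single scheduler job.
SBATCH_TEMPLATE = """#!/bin/bash
#SBATCH --time=02:00:00
#SBATCH --mem=4G
# one scheduler job processes a whole chunk of small, logically independent tasks
for p in params_*.json; do
    python run_step.py "$p" || echo "$p" >> failed.txt
done
"""

def stage_chunk(db_path, chunk_id, chunk_size=1000):
    """Write one run directory containing a chunk of pending computations."""
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT id, params_json FROM computations WHERE status = 'pending' LIMIT ?",
        (chunk_size,),
    ).fetchall()
    run_dir = Path(f"chunk_{chunk_id}")
    run_dir.mkdir(exist_ok=True)
    for comp_id, params_json in rows:
        (run_dir / f"params_{comp_id}.json").write_text(params_json)
    (run_dir / "job.sbatch").write_text(SBATCH_TEMPLATE)
    con.close()
    return run_dir
```
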
This works more or less, but still requires a fair bit of manual intervention and is sometimes buggy, so I am looking for a more mature solution.
From what I have read, AiiDA (http://www.aiida.net/) seems to fit those requirements, but it looks like it has a steep learning curve and takes quite a bit of effort to adapt. At today’s HPC lunch, CWL and Snakemake were recommended as potential solutions to this problem. I am pretty sure that there are more solutions out there :slight_smile:

Before choosing one of those systems and entrusting it with my calculations, I want to ask for your opinion.
Do you use this kind of software for your computations? Can you recommend using it? What are your experiences?

Thank you so much
Michael


Hi

Sorry for the long response; this topic interests me quite a lot, and yet I do not think there is an ultimate solution. Perhaps there should not even be one.

We use CWL and, to a lesser extent, Snakemake, as well as a bunch of custom developments.
Snakemake may cause overhead issues with particularly large workloads (as @Yann.Sagon pointed out); we did not try to use it for those.

I have heard good things about Pegasus WMS, which is used by some major projects in our community.
Luigi is also used by some people and is similar to what we use.

But sometimes it is not especially efficient to learn a generic, complex system to do it for you, since it is not that difficult to implement provenance tracing and provenance-based caching with some Python and bash scripts well adapted to your use case.
Also, many of these systems introduce a per-step overhead that can exceed 1 s.
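
As a very rough sketch of what I mean (not our actual code), provenance-based caching can be as little as keying each step's output on a hash of the step name, code version, and inputs:

```python
import hashlib
import json
from pathlib import Path

CACHE = Path("cache")

def cached_step(step_name, code_version, params, compute):
    """Run `compute(params)` only if no result for this exact step, code version,
    and parameter set exists yet; the cache key doubles as a provenance record."""
    key = hashlib.sha256(
        json.dumps([step_name, code_version, params], sort_keys=True).encode()
    ).hexdigest()
    out = CACHE / f"{step_name}_{key}.json"
    if out.exists():
        return json.loads(out.read_text())["result"]
    result = compute(params)
    CACHE.mkdir(exist_ok=True)
    out.write_text(json.dumps(
        {"step": step_name, "code": code_version, "params": params, "result": result}
    ))
    return result
```
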

As @Pablo.Strasser indeed mentioned, you can use a lot of Slurm features in such a development.

I must say 1e7 x multiple 1s steps seems like quite a lot of individual pipeline nodes; simply tracking the provenance of all of them might become a challenge relative to the actual compute time. Unless there is very substantial overlap between some steps. Or I misunderstood.

We have perhaps a vaguely similar case: something like 3e5 data units, with ~50 steps applied to each (most take tens of seconds, some 1000s, but some are quite short, symbolic). Steps are often reassembled into different workflows, and versions of some of them get updated. We need to keep track of all that, avoid recomputing parts that are already done, and use provenance to understand the origin of results. Data units are then grouped into sets of hundreds to tens of thousands; in some cases the entire set is grouped. Additional steps are applied to the group results.
In this case, it turns out we have to treat provenance in an unconventional way, expressing it with higher-order functions (such as map). Otherwise the size of the final provenance graph far exceeds the size of the resulting product.
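
As a toy illustration of what I mean (not our real representation): the provenance record describes the map itself rather than each of its individual applications.

```python
def run_mapped_step(step_name, step_version, input_set_id, items, fn):
    """Apply `fn` to every item, but record a single provenance node describing
    the whole map, instead of one node per item."""
    outputs = [fn(x) for x in items]
    provenance_node = {
        "operation": "map",
        "step": {"name": step_name, "version": step_version},
        "input_set": input_set_id,
        "n_items": len(items),
    }
    return outputs, provenance_node
```
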

The best common framework for this use case, in my experience, is the CWL ecosystem, which we use where feasible.

But I do not exclude that other people have found better solutions; in fact, I would be quite interested to see them.
So it’s very nice that you asked this question!

By the way, this topic might touch the domain of this resource, https://dataforum.unige.ch/ , which deals with data management, FAIR practices, etc., though perhaps mostly preservation, publishing, and sharing.

But I think data analysis workflow management is integrally related to data management. I do not mean just dealing with the codes as if they were data themselves (which they are, of course), but rather capturing the “live”, computable nature of the workflows, their links to the data through provenance, and their capacity to modify data. That is basically what a workflow management system, with its workflow description capability, presumes.

Best Regards

Volodymyr


My current technique, though for jobs that are a lot longer (approximately 15 minutes to 1 hour), is to use the Slurm dependencies afterok and afternotok to detect success and failure.
In my case, since I base my Slurm work on a Kubernetes custom resource definition, I update the resource with a kubectl command, but you could use other commands.

To create the “dependencies” I have a custom shell script, launched by Kubernetes, that sshes into the login node and launches multiple sbatch commands. The system is not ideal, but it works.

You could also use afterok to rsync your results and even add them to a database.
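
As an illustration, the chaining looks roughly like this (a simplified sketch; the sbatch file names are invented, and in reality I drive this from shell scripts rather than Python):

```python
import subprocess

def submit(script, dependency=None):
    """Submit a job with sbatch and return its job id (--parsable prints only the id)."""
    cmd = ["sbatch", "--parsable"]
    if dependency:
        cmd.append(f"--dependency={dependency}")
    cmd.append(script)
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    return out.strip().split(";")[0]

# the sbatch file names are invented placeholders
job = submit("compute.sbatch")
submit("collect_results.sbatch", dependency=f"afterok:{job}")    # runs only on success
submit("handle_failure.sbatch", dependency=f"afternotok:{job}")  # runs only on failure
```
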


Thanks for the comprehensive answers!
I will definitely check out Pegasus, as it ticks many of the boxes (job clustering, a Python API, automatic data transfers, job monitoring, retries with higher memory, …). I will see whether it scales and whether intermediate, incomplete results can be accessed. Luigi and CWL seem easier to use, but on the surface they do not seem to provide everything I need (maybe it is viable to use them as a basis and build the rest myself). The ideal solution would allow for a Jupyter notebook where plot points magically appear and get updated as more data gets computed on the cluster, but I guess that remains a fantasy :laughing:

With the current approach the overhead for provenance is not a problem, as it only requires storing, for every computation performed, a reference in a database to the parameters, code, etc. involved; I am not sure what you meant. Maybe I didn’t completely understand what provenance entails, but I am at least able to reproduce computation results from scratch :slight_smile: .

I was thinking a bit about the issue of having too many files, but it seems hard to avoid with many independent tasks, as they can potentially run in parallel and thus need to write to separate files to avoid synchronization (which I have no idea how to implement across different jobs, and which would probably introduce a lot of complexity and opportunities for things to break). But at no point do I need to list any large directory, so I am hoping this is not too bad :sweat: . If anyone has suggestions on how to sidestep this issue, I am happy to hear them!

Thank you all!

Actually, I would say this specific behavior can be achieved relatively easily with any of the means listed. We basically do just that: some workflows map over inputs, others aggregate them, and the result is shown in a notebook (well, not entirely magically, on re-running the cell; but one might as well put that in a loop).
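
For illustration, the plain version of that loop could be as simple as this (a sketch, assuming the cluster jobs drop JSON result files into a results/ directory):

```python
import glob
import json
import time

import matplotlib.pyplot as plt
from IPython import display

for _ in range(100):                                   # poll for a while, then stop
    points = [json.load(open(f)) for f in glob.glob("results/*.json")]
    if points:
        plt.clf()
        plt.scatter([p["x"] for p in points], [p["y"] for p in points])
        display.clear_output(wait=True)
        display.display(plt.gcf())
    time.sleep(30)                                     # re-check as new jobs finish
```
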

Oh, and it made me think of another solution: (py)spark. You have probably heard about it; it is very industry-ready. We use it for a subset of tasks where truly interactive processing of stream events is needed.
You would need to write the functions for it in its own style, though. But it certainly scales.
We do not run our Spark cluster on Slurm, but apparently it is possible.
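
For a flavour of what a parameter sweep looks like in pyspark (purely schematic; the step functions here are placeholders, not real physics code):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parameter-sweep").getOrCreate()
sc = spark.sparkContext

# placeholder step functions standing in for the real physics code
def compute_wavefunction(p):
    return {"param": p, "psi": 0.5 * p}

def compute_observable(state):
    return {"param": state["param"], "energy": state["psi"] ** 2}

parameters = range(10_000_000)                 # the O(10^7) parameter set
results = (sc.parallelize(parameters, numSlices=10_000)
             .map(compute_wavefunction)
             .map(compute_observable))
print(results.take(5))                         # results are computed lazily, on demand
```
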

What are the specific features lacking, e.g. in CWL? In its case the functionality is mainly implemented by the runners, since CWL itself is just a description language.
Maybe we could make a check-list.

It depends on how you describe provenance. There is a W3C standard, PROV-O.
Here is an example of some small-ish process provenance visualization:

If you make a combined result from 1e7 different workflows with different parameters (e.g. some statistics plot), the provenance of this one aggregate result will be a graph with some factor x 1e7 nodes (and some bytes per node), i.e. larger than a typical summary plot, I guess. The provenance of each individual workflow (assuming it does not have too many nodes) would not be too large, though.
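
For concreteness, a tiny PROV record built with the `prov` Python package could look like this (just a sketch; the entity and activity names are made up):

```python
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")          # made-up namespace

# one computation step: an activity that used parameters and generated a result
params = doc.entity("ex:parameters_42")
result = doc.entity("ex:observables_42")
step = doc.activity("ex:compute_observables_run_42")
doc.used(step, params)
doc.wasGeneratedBy(result, step)

print(doc.get_provn())                                  # human-readable PROV-N output
```
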

We too have a problem with many small files, and admins frequently complain about it (not here, though, not yet).
But yes, as long as you are careful (do not list directories, do not copy by directory, etc.) it is usually not a big problem, depending on the filesystem.
For the smallest, most numerous data-unit sets it is worth abandoning files and letting a remote database handle them, with its own challenges.
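
Just to illustrate the idea (a sketch; SQLite here only stands in for the database, and once many jobs write concurrently you would want a proper client/server database):

```python
import json
import sqlite3

con = sqlite3.connect("results.sqlite")   # one file instead of millions of tiny ones
con.execute("CREATE TABLE IF NOT EXISTS results (key TEXT PRIMARY KEY, payload TEXT)")

def put(key, obj):
    # fine for a single writer; concurrent writers from many jobs need a server database
    con.execute("INSERT OR REPLACE INTO results VALUES (?, ?)", (key, json.dumps(obj)))
    con.commit()

def get(key):
    row = con.execute("SELECT payload FROM results WHERE key = ?", (key,)).fetchone()
    return json.loads(row[0]) if row else None
```
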

Spark is now also on the list of things to check out in more detail during the winter break! Thanks!
For CWL, I was looking at Toil, since it seems to be an implementation that supports Slurm; can you recommend another one? Maybe Arvados is worth a look?

Is there any benefit to creating a PROV-O file (besides producing a nice W3C-standardized file, of course)? Almost all of the solutions I had a closer look at store the same information in some kind of database or other structured file.

Actually, I tried the database approach once, but it has several drawbacks. First, since the database cannot really be run inside the cluster (at least I wouldn’t know how to do that properly), it creates a lot of network I/O, which is also a shared resource. Second, the jobs then depend on the availability of this external resource and cannot store their results if it is unavailable. This is of course true for BeeGFS as well, but there the chance is much higher that someone else notices and our amazing HPC team fixes it in time before any results are lost :slight_smile: .
In an industry or cloud setting, those trade-offs might be different though.
I spoke today with @Christophe.Berthod about this issue, and he pointed out that Fortran has a feature called direct-access files, which essentially allows reading and writing non-overlapping portions of a file from different threads. Depending on the concurrency guarantees BeeGFS offers, this might also work from different nodes on the cluster. It would, however, still introduce another layer of complexity, which makes it harder to fix problems manually if they appear.
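
For illustration, the same pattern in Python would look roughly like this (a sketch only; each task writes one fixed-size record at its own offset, and whether that is actually safe across nodes depends entirely on what BeeGFS guarantees):

```python
import numpy as np

# each result is one fixed-size record; the file must be pre-allocated to
# n_records * RECORD_DTYPE.itemsize before the tasks start
RECORD_DTYPE = np.dtype([("param_id", "i8"), ("energy", "f8"), ("norm", "f8")])

def write_record(path, index, record):
    """Write one record at a non-overlapping offset determined by its index."""
    with open(path, "r+b") as f:
        f.seek(index * RECORD_DTYPE.itemsize)
        f.write(np.array([record], dtype=RECORD_DTYPE).tobytes())

def read_record(path, index):
    with open(path, "rb") as f:
        f.seek(index * RECORD_DTYPE.itemsize)
        return np.frombuffer(f.read(RECORD_DTYPE.itemsize), dtype=RECORD_DTYPE)[0]
```
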

Perhaps; I do not have practical experience with Arvados, but it seems interesting.

The benefit of using a common standard/language for expressing provenance is that it is easier to explain your provenance to other people, within or even outside of your current project.
You can then also attach it to publications, making it reproducible for everybody.
But of course it always takes some effort to make provenance sharable.
It is just that, if you might want that, you can benefit from an existing standard.

Yes, exactly: you need to maintain your own database outside of the cluster, and you need to make it available.
And of course you need to limit the network I/O, similarly to how you limit disk I/O.

There are specific technologies for maintaining “live” (long-running, communicating) workloads, such as Kubernetes, which we use on a small cluster in our project/department, and which @Pablo.Strasser uses as well, as he mentioned.

You mean to use it to clump small files into larger ones and access them simultaneously? That does seem dangerously dependent on filesystem behavior. I am curious whether it could work here; it could be helpful in some cases.

Actually, I think the specific question of handling many small files is a bit outside the original topic of this thread, and it is the kind of thing the admins here (@Yann.Sagon?) could have especially good ideas about, since it is really system-level.

Hello,

The issue is that you don’t know how the data are stored on BeeGFS, and the risk is that non-overlapping parts of a file may be stored on the same server and/or disk. If you use multi-threading for that operation, it may put a high load on the server. You must be sure of the chunk size. And even then, the underlying RAID layer will stripe the data again.

You can get some insight about a file: Deep Inspection — BeeGFS Documentation 7.4.2

It may also be possible to define a special striping for a directory: Striping — BeeGFS Documentation 7.4.2, but I don’t think you can change this parameter as a user.

Best

Hello,

What about using something like HDF5 instead of trying to handle this issue manually? https://portal.hdfgroup.org/display/knowledge/Parallel+HDF5
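
A rough sketch of the idea with h5py built against MPI (assuming mpi4py is available; the dataset name and sizes are just examples): each rank writes its own disjoint slice of a single shared dataset, so you get one big file instead of many small ones.

```python
# run with e.g.: mpirun -n 4 python write_results.py
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
n_per_rank = 1000

# every rank opens the same file collectively and writes its own disjoint slice
with h5py.File("results.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("energy", (comm.size * n_per_rank,), dtype="f8")
    start = comm.rank * n_per_rank
    dset[start:start + n_per_rank] = np.random.rand(n_per_rank)
```
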

Best

Yann