Dear HPC Community,
As 2020 closes, I want to better organize my HPC computations!
So far my workload consists of calculations on large sets (O(10^7)) of parameters, which are processed in multiple stages (e.g. obtaining a wavefunction, using this wavefunction to obtain different observables, using those observables to compute some statistics, …) by various Python scripts. Many of these steps are very small, O(1 s), so they need to be packaged with others to form one scheduler job but should still be considered logically independent. Depending on intermediate results, we quite often find new interesting sets of parameters, which then need to be processed through the same steps and combined into the larger dataset.
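To make the packaging requirement concrete, here is a minimal sketch in Python of grouping many small tasks into one scheduler job while keeping them logically independent. It is only an illustration, not the actual scripts: `chunk_tasks`, `run_batch` and the `run_one` callback are hypothetical names.

```python
from itertools import islice


def chunk_tasks(task_ids, tasks_per_job):
    """Yield lists of at most `tasks_per_job` task ids, one list per scheduler job."""
    it = iter(task_ids)
    while True:
        batch = list(islice(it, tasks_per_job))
        if not batch:
            return
        yield batch


def run_batch(batch, run_one):
    """Run every task in a batch; a failure in one task does not abort the others."""
    results, failures = {}, {}
    for task_id in batch:
        try:
            results[task_id] = run_one(task_id)
        except Exception as exc:  # record the error and keep going
            failures[task_id] = repr(exc)
    return results, failures
```

Each batch would correspond to one scheduler job, and the `failures` dictionary tells you which individual tasks need to be rerun.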
Sometimes the Python scripts are updated to fix bugs or improve computational efficiency. To ensure reproducibility, it is therefore important that every computational result can be associated with all the code and parameters involved in its production (apparently this is called “Data Provenance”).
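One simple way to picture this association is a key derived from both the code and the parameters, so that a result is tied to exactly the script version and inputs that produced it. The following is a minimal sketch under that assumption; `provenance_key` is a hypothetical helper, and it assumes the parameters are JSON-serializable.

```python
import hashlib
import json


def provenance_key(code_text, params):
    """Return a SHA-256 hex digest identifying a (code, parameters) pair."""
    h = hashlib.sha256()
    h.update(code_text.encode("utf-8"))
    h.update(b"\0")  # separator so code and parameters cannot run together
    # sort_keys makes the digest independent of dictionary ordering
    h.update(json.dumps(params, sort_keys=True).encode("utf-8"))
    return h.hexdigest()
```

Storing this key next to each result means that any change to a script or a parameter yields a new key, so results from different code versions cannot be mixed up silently.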
For now, all of this is done by two scripts and a database. All parameters, code and high-level results are stored in the database. One script writes a directory from the database with everything needed to perform a set of computations, including the sbatch file. Another script collects the results and registers which computations failed so they can be rerun individually, perhaps with a larger time/memory limit.
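The collect-and-rerun step could look roughly like the sketch below. This is not the actual setup: the `runs` table, its columns, and the doubling of the time limit are assumptions made purely for illustration, using Python's built-in `sqlite3` module.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) runs table if needed."""
    con = sqlite3.connect(path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        "task_id INTEGER PRIMARY KEY, "
        "status TEXT NOT NULL, "
        "time_limit_min INTEGER NOT NULL)"
    )
    return con


def register(con, task_id, status, time_limit_min):
    """Record the outcome of one task, overwriting any earlier attempt."""
    con.execute(
        "INSERT OR REPLACE INTO runs VALUES (?, ?, ?)",
        (task_id, status, time_limit_min),
    )


def rerun_plan(con, factor=2):
    """List failed tasks together with an enlarged time limit for the rerun."""
    rows = con.execute(
        "SELECT task_id, time_limit_min FROM runs "
        "WHERE status = 'failed' ORDER BY task_id"
    ).fetchall()
    return [(task_id, limit * factor) for task_id, limit in rows]
```

The output of `rerun_plan` could then feed the script that writes the next job directory and sbatch file.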
This works more or less, but still requires a fair bit of manual intervention and is sometimes buggy, so I am looking for a more mature solution.
From what I have read, AiiDA (http://www.aiida.net/) seems to fit those requirements, but it looks like it has a steep learning curve and takes quite a bit of effort to adopt. At today’s HPC lunch, CWL and Snakemake were recommended as potential solutions to this problem. I am pretty sure that there are more solutions out there.
Before choosing one of those systems and entrusting it with my calculations, I want to ask for your opinion.
Do you use this kind of software for your computations? Can you recommend using it? What are your experiences?
Thank you so much,
Michael