Running a single-threaded job multiple times (here: cfm-predict)

Hi,

I am trying to run cfmid efficiently on the HPC.

What I have done so far:

module load GCC/6.3.0-2.27 Singularity/2.4.2
singularity build --sandbox cfm-4/cfm.simg docker://wishartlab/cfmid

Once I have cfm.simg, I can then run:

singularity run cfm-4/cfm.simg -c "cfm-predict 'CC(C)NCC(O)COC1=CC=C(CCOCC2CC2)C=C1' 0.001 /trained_models_cfmid4.0/[M+H]+/param_output.log /trained_models_cfmid4.0/[M+H]+/param_config.txt 0" | tee scratch/testout.txt

This is only to check everything works fine (it does).

We can then create a small test file:

echo 'HVYWMOMLDIMFJA CC(C)CCCC(C)C1CCC2C3CC=C4CC(O)CCC4(C)C3CCC12C
BWGQUGBECNWWDB CC(C)CC=CC(C)C1CCC2C3CCC4CC(O)CCC4(C)C3CCC12C
NYWZDGGKTLARLX C=C(CCC(C)C1CCC2C3CCC4CC(O)CCC4(C)C3CCC12C)C(C)C
KXWXWGQKFMWWAF CC(C)C(C)CCC(C)C1C=CC2C3CCC4C(CO)CCC4(C)C3CCC12C
UDZFBIDGNMQCJH C=C(CCC(C)C1CCC2C3CCC4C(CO)CCC4(C)C3CCC12C)C(C)C
WVNIISADYSWCOG CCC(CC(C)C)CC(C)C1CCC2C3CC=C4CC(O)CCC4(C)C3CCC12C
ZTFLQBFDIULXLJ CCC=C(CCC(C)C1CCC2C3CCC4C(CO)CCC4(C)C3CCC12C)C(C)C
ZFEMKNUYYBDBGZ CC(C)CC=CC(C)C1CCC2C3=CCC4C(CO)CCC4(C)C3CCC21C
XPRWWANUPMYKMF CC=C(CCC(C)C1CCC2C3=CC=C4CC(O)CCC4(C)C3CCC21C)C(C)C
KZJWDPNRJALLNS CCC(CCC(C)C1CCC2C3CC=C4CC(O)CCC4(C)C3CCC12C)C(C)C'>test.txt

This way, we can run:

singularity run cfm-4/cfm.simg -c "cfm-predict test.txt 0.001 /trained_models_cfmid4.0/[M+H]+/param_output.log /trained_models_cfmid4.0/[M+H]+/param_config.txt 0" | tee scratch/testout2.txt

Without parallelization, this already takes a long time for 10 entries (and I have >100k).

What I would like to do is turn this into a Slurm job array efficiently, but I am not sure how to do it properly.

Any help?

I think this might apply to many other single-threaded programs.

How long does it take to process one line?

It depends on the chemical structure behind the line… I would say 5–50 seconds?
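For example, a rough per-line figure can be obtained by timing the single-SMILES command from the first post:

time singularity run cfm-4/cfm.simg -c "cfm-predict 'CC(C)NCC(O)COC1=CC=C(CCOCC2CC2)C=C1' 0.001 /trained_models_cfmid4.0/[M+H]+/param_output.log /trained_models_cfmid4.0/[M+H]+/param_config.txt 0" > /dev/null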

Let's say:

  • each line takes about one minute,
  • we want to stick to the public-cpu partition, whose time limit is 12 h (12 × 60 = 720 minutes),
  • you have 100k lines to compute,
  • you can then split your 100k-line file into 720 chunks (about 139 lines per chunk, i.e. roughly 140 minutes per job at one minute per line, comfortably within the limit).

Example (adapt to your case; in this example I split my file into two lines per file with --lines=2):

[sagon@node025 ~] $ split --lines=2 --numeric-suffixes=1 --suffix-length=3 your_input_file your_split_file.
[sagon@node025 ~] $ ls -la your_split_file.*
-rw-rw-r-- 1 sagon unige 163 Mar  3 15:20 your_split_file.001
-rw-rw-r-- 1 sagon unige 166 Mar  3 15:20 your_split_file.002
-rw-rw-r-- 1 sagon unige 186 Mar  3 15:20 your_split_file.003
-rw-rw-r-- 1 sagon unige  96 Mar  3 15:20 your_split_file.004
-rw-rw-r-- 1 sagon unige  42 Mar  3 15:20 your_split_file.005
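With the sizing from the list above, the real split would look something like this (139 lines per chunk is just the 100,000 / 720 figure; adjust it to your own per-line timing):

split --lines=139 --numeric-suffixes=1 --suffix-length=3 your_input_file your_split_file.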

Create your sbatch script runcfm.sh:

#!/bin/sh
#SBATCH --partition=public-cpu
#SBATCH --time=12:00:00

ml GCC/9.3.0 Singularity/3.7.3-Go-1.14

# convert the job array index to a three-digit, zero-padded suffix
printf -v FILE_INDEX "%03d" ${SLURM_ARRAY_TASK_ID}

FILE=your_split_file.${FILE_INDEX}

srun singularity run cfm-4/cfm.simg -c "cfm-predict $FILE 0.001 [...]"

Launch your sbatch script:

sbatch --array=1-5 runcfm.sh
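If you prefer not to hard-code the array range, it can be derived from the number of split files (a small sketch reusing the your_split_file. prefix from above; N_CHUNKS is just a local variable name):

N_CHUNKS=$(ls your_split_file.* | wc -l)
sbatch --array=1-${N_CHUNKS} runcfm.sh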

Instead of splitting your file as a pre-processing step, you may as well use sed or similar to extract only the needed lines in each array instance.
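A minimal sketch of that sed variant inside runcfm.sh (LINES_PER_TASK and the chunk file name are placeholders of my own; the cfm-predict arguments are the same as before):

# number of input lines handled by each array task (placeholder value)
LINES_PER_TASK=139
START=$(( (SLURM_ARRAY_TASK_ID - 1) * LINES_PER_TASK + 1 ))
END=$(( SLURM_ARRAY_TASK_ID * LINES_PER_TASK ))
# write this task's slice of the original input to a temporary chunk file
sed -n "${START},${END}p" your_input_file > chunk_${SLURM_ARRAY_TASK_ID}.txt
srun singularity run cfm-4/cfm.simg -c "cfm-predict chunk_${SLURM_ARRAY_TASK_ID}.txt 0.001 [...]"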


Thanks a lot! Exactly what was needed!

Last question:

Is there a rationale for choosing between the following?

  • split 100k into 500 chunks → sbatch --array=1-500 mybatch.sh
  • split 100k into 5000 chunks → sbatch --array=1-5000 mybatch.sh
  • split 100k into 100k chunks → sbatch --array=1-100000 mybatch.sh

Hi,

  • 500 chunks: the risk is that each instance spends more than 12 h to finish. My calculation (720 chunks) was a safe value.
  • 5000 chunks: why not, but I am not sure it will be better than the first option, given the overhead of starting/stopping many small jobs.
  • 100k chunks: that is a lot of jobs. The maximum job array size is 10k, and the global maximum number of jobs on the cluster is 60k!
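A side note: if you ever do run a very large array, Slurm's % suffix on --array also lets you cap how many tasks run at the same time, for example:

sbatch --array=1-5000%200 runcfm.sh   # at most 200 array tasks running concurrently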