Hi,
I am trying to run CFM-ID efficiently on the HPC cluster.
What I have done so far:
module load GCC/6.3.0-2.27 Singularity/2.4.2
singularity build --sandbox cfm-4/cfm.simg docker://wishartlab/cfmid
Having built cfm.simg, I can then run:
singularity run cfm-4/cfm.simg -c "cfm-predict 'CC(C)NCC(O)COC1=CC=C(CCOCC2CC2)C=C1' 0.001 /trained_models_cfmid4.0/[M+H]+/param_output.log /trained_models_cfmid4.0/[M+H]+/param_config.txt 0" | tee scratch/testout.txt
This is only to check that everything works fine (it does).
We can then create a small test file:
echo 'HVYWMOMLDIMFJA CC(C)CCCC(C)C1CCC2C3CC=C4CC(O)CCC4(C)C3CCC12C
BWGQUGBECNWWDB CC(C)CC=CC(C)C1CCC2C3CCC4CC(O)CCC4(C)C3CCC12C
NYWZDGGKTLARLX C=C(CCC(C)C1CCC2C3CCC4CC(O)CCC4(C)C3CCC12C)C(C)C
KXWXWGQKFMWWAF CC(C)C(C)CCC(C)C1C=CC2C3CCC4C(CO)CCC4(C)C3CCC12C
UDZFBIDGNMQCJH C=C(CCC(C)C1CCC2C3CCC4C(CO)CCC4(C)C3CCC12C)C(C)C
WVNIISADYSWCOG CCC(CC(C)C)CC(C)C1CCC2C3CC=C4CC(O)CCC4(C)C3CCC12C
ZTFLQBFDIULXLJ CCC=C(CCC(C)C1CCC2C3CCC4C(CO)CCC4(C)C3CCC12C)C(C)C
ZFEMKNUYYBDBGZ CC(C)CC=CC(C)C1CCC2C3=CCC4C(CO)CCC4(C)C3CCC21C
XPRWWANUPMYKMF CC=C(CCC(C)C1CCC2C3=CC=C4CC(O)CCC4(C)C3CCC21C)C(C)C
KZJWDPNRJALLNS CCC(CCC(C)C1CCC2C3CC=C4CC(O)CCC4(C)C3CCC12C)C(C)C'>test.txt
This way, we can run:
singularity run cfm-4/cfm.simg -c "cfm-predict test.txt 0.001 /trained_models_cfmid4.0/[M+H]+/param_output.log /trained_models_cfmid4.0/[M+H]+/param_config.txt 0" | tee scratch/testout2.txt
Without parallelization, this already takes a long time on 10 entries (and I have >100k).
What I would like to do is turn this efficiently into a Slurm job array, but I am not sure how to do it properly.
Any help?
I think this might apply to many other single-threaded programs.
How long does it take to process one line?
It depends on the chemical structure behind the line… I would say 5-50 seconds?
Let's say:
- each line takes one minute
- we want to stick to the public-cpu partition; the time limit on this partition is 12 h (12 × 60 = 720 minutes)
- you have 100k lines to compute
- you can split your 100k-line file into 720 chunks (≈139 lines per chunk, so ≈139 minutes per job, well under the limit).
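The arithmetic behind the 720-chunk figure can be sketched as a quick shell check (the numbers are the assumed ones from above):

```shell
# Back-of-envelope check for the chunking above (assumed numbers:
# 100k lines, ~1 min per line, 720-minute wall-time limit).
TOTAL=100000     # lines to process
CHUNKS=720       # proposed number of chunks / array tasks
LIMIT=720        # minutes allowed per job on public-cpu

# Ceiling division: how many lines each array task has to handle.
LINES_PER_CHUNK=$(( (TOTAL + CHUNKS - 1) / CHUNKS ))
echo "lines per chunk: ${LINES_PER_CHUNK} (~${LINES_PER_CHUNK} min, limit ${LIMIT} min)"
```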
Example (adapt as needed; here I split my file into chunks of two lines each with --lines=2):
[sagon@node025 ~] $ split --lines=2 --numeric-suffixes=1 --suffix-length=3 your_input_file your_split_file.
[sagon@node025 ~] $ ls -la your_split_file.*
-rw-rw-r-- 1 sagon unige 163 Mar 3 15:20 your_split_file.001
-rw-rw-r-- 1 sagon unige 166 Mar 3 15:20 your_split_file.002
-rw-rw-r-- 1 sagon unige 186 Mar 3 15:20 your_split_file.003
-rw-rw-r-- 1 sagon unige 96 Mar 3 15:20 your_split_file.004
-rw-rw-r-- 1 sagon unige 42 Mar 3 15:20 your_split_file.005
Create your sbatch script runcfm.sh:
#!/bin/sh
#SBATCH --partition=public-cpu
#SBATCH --time=12:00:00   # 12 hours (12:00 would mean 12 minutes)
ml GCC/9.3.0 Singularity/3.7.3-Go-1.14
#convert job array index to three digit padded with zeros
printf -v FILE_INDEX "%03d" ${SLURM_ARRAY_TASK_ID}
FILE=your_split_file.${FILE_INDEX}
srun singularity run cfm-4/cfm.simg -c "cfm-predict $FILE 0.001 [...]"
Launch your sbatch script:
sbatch --array=1-5 runcfm.sh
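Before submitting, you can dry-run the index-to-filename mapping used inside runcfm.sh; printf zero-pads the array index the same way split's --numeric-suffixes names the files:

```shell
# Simulate the mapping from runcfm.sh for the first 5 array indices
# (outside Slurm, so we set SLURM_ARRAY_TASK_ID ourselves).
for SLURM_ARRAY_TASK_ID in 1 2 3 4 5; do
    printf -v FILE_INDEX "%03d" "${SLURM_ARRAY_TASK_ID}"
    echo "task ${SLURM_ARRAY_TASK_ID} -> your_split_file.${FILE_INDEX}"
done
```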
Instead of splitting your file as a pre-processing step, you may as well use sed
or similar to output only the needed lines in each array instance.
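A minimal sketch of that sed approach (the chunk size and file names are illustrative; SLURM_ARRAY_TASK_ID is set by Slurm inside an array job, and we give it a default here so the snippet also runs standalone):

```shell
# Each array task extracts its own slice of lines with sed instead of
# relying on pre-split files. demo_input.txt stands in for the real input.
seq 1 10 > demo_input.txt

CHUNK=3                          # lines per array task (illustrative)
TASK=${SLURM_ARRAY_TASK_ID:-2}   # Slurm sets this; 2 is just a demo value
START=$(( (TASK - 1) * CHUNK + 1 ))
END=$(( TASK * CHUNK ))

# Print only lines START..END; the result can be fed to cfm-predict.
sed -n "${START},${END}p" demo_input.txt > chunk_${TASK}.txt
cat chunk_${TASK}.txt
```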
Thanks a lot! Exactly what was needed!
Last question:
Is there a rationale to choose between:
- split 100k into 500 chunks → sbatch --array=1-500 mybatch.sh
- split 100k into 5000 chunks → sbatch --array=1-5000 mybatch.sh
- split 100k into 100k chunks → sbatch --array=1-100000 mybatch.sh
?
Hi,
- 500 chunks: the risk is that an instance spends more than 12 h to finish. My calculation (720) was a safe value.
- 5000 chunks: why not, though I am not sure it will be better than the first option, due to the overhead of starting/stopping many small jobs.
- 100k chunks: that is too many jobs. The maximum job-array size is 10k, and the global maximum number of jobs on the cluster is 60k!