I found that pretty quickly on the HPC docs, but thanks for sharing it here.
I have started spinning my wheels on submitting a job that restarts until it's completed, and I am getting a little stuck. I hope you, or someone else, can help me with it. Let me know if I should start a new thread for it.
The checkpoint options available for BEAST seem pretty straightforward:
beast -load_state filename -save_state newfilename [other commands] input.xml
But I need to tell my script to look at the old load_state to make a new save_state, because if the log file exists beast won't start (input.xml turns into input.log).
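A minimal sketch of that naming logic, assuming a hypothetical helper called state_args and a made-up <stem>.runN.state naming scheme (neither comes from BEAST itself). It only builds the -load_state / -save_state arguments from whatever state files already exist in the current directory. It does not address the log file collision.

```shell
#!/bin/bash
# Sketch only: pick -load_state / -save_state names for the next BEAST run.
# The <stem>.runN.state naming scheme is an assumption, not a BEAST convention.
state_args() {
    local stem="$1"   # e.g. "input" for input.xml
    local n=1
    # Find the first run number that has no saved state yet.
    while [ -f "${stem}.run${n}.state" ]; do
        n=$((n + 1))
    done
    if [ "$n" -gt 1 ]; then
        # A previous state exists: resume from it and write a new one.
        printf '%s\n' "-load_state ${stem}.run$((n - 1)).state -save_state ${stem}.run${n}.state"
    else
        # First run: nothing to load yet.
        printf '%s\n' "-save_state ${stem}.run${n}.state"
    fi
}
```

It could then be used as beast $(state_args input) [other commands] input.xml, leaving the command substitution unquoted on purpose so the arguments are split into separate words.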
My original Slurm script is a pretty simple array job that submits 16 tasks (8 independent BEAST runs for each of two parallel analyses):
#!/bin/bash
#SBATCH --time 96:00:00 # in hours
#SBATCH --partition public-cpu
#SBATCH --cpus-per-task 16 # number of cores to use
#SBATCH --mem 80000 # in MB
#SBATCH --array 1-16 # run N jobs
#SBATCH --job-name BEAST_rerun
# to run: sbatch submit_jobs.sh
# be sure to adjust the length of your array based on your list
# get yer modules loaded
module load GCC/10.2.0 beagle-lib/3.1.2 Beast/1.10.5pre
# set variables
WORKING_DIR="/home/users/c/cardenac/phylo/2023_adephaga/BEAST_rerun"
ALIGN_LIST="jobs.list"
# your list should look like this:
# run01/core_SORTADATE_beast_exponential_sans-quasicalathus.xml
# run01/core_SORTADATE_beast_lognormal_sans-quasicalathus.xml
# run02/core_SORTADATE_beast_exponential_sans-quasicalathus.xml
# run02/core_SORTADATE_beast_lognormal_sans-quasicalathus.xml
# run03/core_SORTADATE_beast_exponential_sans-quasicalathus.xml
# ...
# use awk to print the line of the list whose number matches ${SLURM_ARRAY_TASK_ID},
# then split it into the run directory and the XML file name
DATASET_DIR=$(awk -v var="${SLURM_ARRAY_TASK_ID}" 'NR==var {print $0}' "${ALIGN_LIST}" | cut -d "/" -f 1)
XML_FILE=$(awk -v var="${SLURM_ARRAY_TASK_ID}" 'NR==var {print $0}' "${ALIGN_LIST}" | cut -d "/" -f 2)
# change directories to where data lives and run script
cd "${WORKING_DIR}/${DATASET_DIR}" || exit 1
beast -threads 16 "${XML_FILE}"
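The list lookup in this script reads the same line once per variable. A rough sketch of doing it in one pass, assuming a hypothetical helper named pick_job that sets DATASET_DIR and XML_FILE and fails when the array index runs past the end of the list:

```shell
#!/bin/bash
# Sketch: read line N of a list file once, then split it into directory and file.
# pick_job is a hypothetical helper name, not part of Slurm or BEAST.
pick_job() {
    local list="$1" n="$2"
    local line
    # sed -n "Np" prints only line N of the file.
    line=$(sed -n "${n}p" "$list")
    # Fail loudly if the array index runs past the end of the list.
    if [ -z "$line" ]; then
        echo "no line ${n} in ${list}" >&2
        return 1
    fi
    DATASET_DIR=$(dirname "$line")   # e.g. run01
    XML_FILE=$(basename "$line")     # e.g. the .xml file name
}
```

In the job script this would be called as pick_job "${ALIGN_LIST}" "${SLURM_ARRAY_TASK_ID}" || exit 1, which also catches an --array range that is longer than the list.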
I thought maybe using some kind of logical statement to write the names for the load_state and save_state inputs might work, and then wrapping my original command in srun… but I’m starting to get a little lost.
#!/bin/bash
#SBATCH --time 04:00:00 # in hours
#SBATCH --partition public-cpu
#SBATCH --cpus-per-task 16 # number of cores to use
#SBATCH --mem 80000 # in MB
#SBATCH --job-name BEAST_rerun
#SBATCH --array 1-16 # run N jobs
# to run: sbatch submit_jobs.sh
# be sure to adjust the length of your array based on the number of lines in your list
module load GCC/10.2.0 beagle-lib/3.1.2 Beast/1.10.5pre
# set variables
WORKING_DIR="/home/users/c/cardenac/phylo/2023_adephaga/BEAST_rerun"
ALIGN_LIST="jobs.list"
# your jobs.list should look like this:
# run01/core_SORTADATE_beast_exponential_sans-quasicalathus.xml
# run01/core_SORTADATE_beast_lognormal_sans-quasicalathus.xml
# run02/core_SORTADATE_beast_exponential_sans-quasicalathus.xml
# run02/core_SORTADATE_beast_lognormal_sans-quasicalathus.xml
# run03/core_SORTADATE_beast_exponential_sans-quasicalathus.xml
# ...
# get the run directory from the list: awk prints the line whose number matches ${SLURM_ARRAY_TASK_ID}
DATASET_DIR=$(awk -v var="${SLURM_ARRAY_TASK_ID}" 'NR==var {print $0}' "${ALIGN_LIST}" | cut -d "/" -f 1)
# get the XML file name from the same line
XML_FILE=$(awk -v var="${SLURM_ARRAY_TASK_ID}" 'NR==var {print $0}' "${ALIGN_LIST}" | cut -d "/" -f 2)
# set your prefix for checkpointing (the XML file name without its extension)
PREFIX=$(awk -v var="${SLURM_ARRAY_TASK_ID}" 'NR==var {print $0}' "${ALIGN_LIST}" | cut -d "/" -f 2 | cut -d "." -f 1)
# move into the run directory so the marker and count files are kept per run
cd "${WORKING_DIR}/${DATASET_DIR}" || exit 1
# need to append the job step to the run, otherwise beast won't restart due to "the log file already exists"
if [ ! -f "finished_${PREFIX}" ] ; then
# create a file to count the number of times beast has run
# first check that the count file has been made
if [ ! -f "${XML_FILE}_jobcount.list" ] ; then
touch "${XML_FILE}_jobcount.list"
fi
# add your XML file to your jobcount.list, one line per run
ls -1 "${XML_FILE}" >> "${XML_FILE}_jobcount.list"
# count the lines in your jobcount.list (plus one) to make a count prefix
COUNT_PREFIX=$(awk 'END {print NR+1}' "${XML_FILE}_jobcount.list")
# okay neat we've got counts for a prefix, figure out how to call the original file
COUNT_RUNS= #hmmm
# maybe use the wrap function with srun?
srun --dependency=afterany:$SLURM_JOBID \
# --wrap ' # silencing this for my text editor!!
cd ${WORKING_DIR}/${DATASET_DIR}
beast -threads 16 -prefix ${XML_FILE}_run${COUNT_RUNS} -load_state filename[????] -save_state ${XML_FILE}_run${COUNT_RUNS}_checkpoint
# '
fi
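One way the pieces could fit together, as a rough sketch rather than a tested recipe: each array task queues its own successor with sbatch --dependency=afterany before starting BEAST, so the successor exists even if the task is killed at the time limit, and a marker file stops the chain. The names run_link, SUBMIT, BEAST_CMD and MAX_RUNS, the <stem>.runcount counter file, and the <stem>.runN.state files are all hypothetical. It also assumes BEAST exits with status 0 only once the chain is complete, and it says nothing about when BEAST actually writes the state file, so both points need checking against the BEAST documentation.

```shell
#!/bin/bash
# Sketch of one link in a self-resubmitting chain. Hypothetical names:
# run_link, SUBMIT, BEAST_CMD, MAX_RUNS, <stem>.runcount, <stem>.runN.state.
SUBMIT="${SUBMIT:-sbatch}"        # overridable so the logic can be dry-run
BEAST_CMD="${BEAST_CMD:-beast}"
MAX_RUNS="${MAX_RUNS:-30}"        # safety cap so a failing job cannot loop forever

run_link() {
    local xml="$1" script="$2"
    local stem="${xml%.xml}"

    # A previous link already completed the analysis: stop the chain here.
    if [ -f "finished_${stem}" ]; then
        return 0
    fi

    # Number this attempt using a simple counter file.
    local n=1
    if [ -f "${stem}.runcount" ]; then
        n=$(( $(cat "${stem}.runcount") + 1 ))
    fi
    if [ "$n" -gt "$MAX_RUNS" ]; then
        echo "giving up after ${MAX_RUNS} runs" >&2
        return 1
    fi
    echo "$n" > "${stem}.runcount"

    # Queue the successor first; it starts only after this task ends, however it ends.
    "$SUBMIT" --dependency="afterany:${SLURM_JOB_ID}" \
              --array="${SLURM_ARRAY_TASK_ID}" "$script"

    # Resume from the highest-numbered state file an earlier run left behind, if any.
    local load=() k
    for (( k = n - 1; k >= 1; k-- )); do
        if [ -f "${stem}.run${k}.state" ]; then
            load=(-load_state "${stem}.run${k}.state")
            break
        fi
    done

    # Assumption: exit status 0 means BEAST finished the whole chain.
    if "$BEAST_CMD" -threads 16 -prefix "run${n}_" "${load[@]}" \
            -save_state "${stem}.run${n}.state" "$xml"; then
        touch "finished_${stem}"
    fi
}
```

After changing into the run directory, the job script would call something like run_link "${XML_FILE}" /path/to/submit_jobs.sh, where the path is a placeholder for wherever the script lives. Passing the path explicitly is deliberate, since $0 inside a batch job may point at Slurm's spooled copy of the script. The per-run -prefix is there to sidestep the "log file already exists" error.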
Am I overthinking it?
admin edit: created as new post