I recently started launching large numbers of CPU jobs using job arrays and was wondering if there is a mechanism to get the status of a job array as a whole.
For example, knowing the exit status of the whole job array. I would consider a job array completed if all of its jobs have completed, and failed if at least one has failed. I would also consider a job array running if no job has failed and at least some of them are still running.
This feature would let me find out that a job has failed before I run into it during post-processing.
Thanks in advance for your answers.
Good question… I don’t know the answer.
According to job_array.html, you can still set a dependent job that is triggered when the job array completes without error, or with an error. For example, you can create a job that does nothing but notify you by email.
Job to run after all the elements of a job array have completed successfully, where 123 is the array's job ID:
sbatch --depend=afterok:123 my.job
The same idea, but this one starts if any of the elements fails:
sbatch --depend=afternotok:123 my.job
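Remember to clean up (for example with scancel) whichever of the two dependent jobs never starts, since its dependency can no longer be satisfied.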
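A minimal sketch of how the array and both notifier jobs might be submitted in one go. The function name submit_with_notifiers, the script name and the mail address are hypothetical. It assumes that --kill-on-invalid-dep=yes is honoured on your cluster, in which case the notifier whose dependency can never be satisfied should be cancelled by Slurm rather than left pending; that is worth checking with squeue the first time.

```shell
#!/bin/bash
# Hypothetical helper: submit an array job plus two notifier jobs.
# Usage: submit_with_notifiers ARRAY_SCRIPT ARRAY_SPEC MAIL_ADDRESS
submit_with_notifiers() {
    local script=$1 spec=$2 mail=$3 array_id

    # --parsable prints "jobid" (or "jobid;cluster"); keep only the job ID.
    array_id=$(sbatch --parsable --array="$spec" "$script") || return 1
    array_id=${array_id%%;*}

    # Starts only if every element of the array completed successfully.
    sbatch --dependency=afterok:"$array_id" --kill-on-invalid-dep=yes \
           --job-name="array_${array_id}_ok" \
           --mail-type=END --mail-user="$mail" --wrap="true" >&2

    # Starts only if at least one element of the array failed.
    sbatch --dependency=afternotok:"$array_id" --kill-on-invalid-dep=yes \
           --job-name="array_${array_id}_failed" \
           --mail-type=END --mail-user="$mail" --wrap="true" >&2

    # Print the array's job ID on stdout so callers can capture it.
    echo "$array_id"
}
```

It could then be called as, for example, submit_with_notifiers my_array.job 0-99 me@example.org, and the name of the notifier job that sends mail tells you which way the array ended.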
Thanks for your answer. I just found out that using the mail feature with a job array sends an email for the start and the completion of the job array as a whole.
I received an email similar to:
Slurm Array Summary Job_id=29801063_* (29801063) Name=fixMissing.sh Ended, COMPLETED, ExitCode [0-0]
I haven’t had a failed job array yet, but I guess that if all the jobs of a job array fail, my mailbox will be spammed.
I just had a job fail and got only one email for the whole job array.
To find out exactly which jobs of the job array failed, we can use:
sacct -j JOBID --state=failed
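For a single status for the whole array, along the lines of the definition in the original question, the per-job states reported by sacct can be reduced to one word. This is only a sketch: the function name array_status is made up, the list of states counted as failures is my own choice, and anything that is neither finished nor failed (including pending jobs) is reported as RUNNING.

```shell
#!/bin/bash
# Hypothetical helper: read one job state per line on stdin and print a
# single status for the whole array: FAILED, RUNNING, COMPLETED or UNKNOWN.
array_status() {
    awk '
        { sub(/ .*/, "") }      # "CANCELLED by 1000" -> "CANCELLED"
        $0 == "" { next }       # skip blank lines
        /^(FAILED|TIMEOUT|NODE_FAIL|OUT_OF_MEMORY|CANCELLED|BOOT_FAIL|DEADLINE)$/ { failed++; next }
        /^COMPLETED$/ { ok++; next }
        { active++ }            # RUNNING, PENDING, COMPLETING, ...
        END {
            if (failed)      print "FAILED"      # at least one job failed
            else if (active) print "RUNNING"     # none failed, some unfinished
            else if (ok)     print "COMPLETED"   # all jobs completed
            else             print "UNKNOWN"     # no states were read
        }'
}
```

With sacct it could be used like this, where -X restricts the output to the jobs themselves (no job steps), -n drops the header and -P avoids truncated state names:

sacct -j JOBID -X -n -P -o State | array_status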