Job Array summary information

I recently started launching a large number of CPU jobs using job arrays and was wondering if there is a mechanism to get the status of a job array as a whole.
For example, knowing the exit status of the whole job array. I would consider a job array as completed if all of its jobs have completed, and as failed if at least one has failed. I would also consider a job array as running if no job has failed and at least some of them are still running.

This feature would help me find out that a job has failed before I run into it during post-processing.

Thanks in advance for your answers.

Dear Pablo,

Good question… I don’t know the answer.

According to job_array.html, you can still submit a dependent job that is triggered when the job array completes, either without errors or with errors. For example, you can create a job that does nothing but notify you by email.

A job to be run after all the elements of a job array have completed successfully, where 123 is the array ID:

sbatch --depend=afterok:123 my.job

The same concept, but the job will start if any of the elements fails:

sbatch --depend=afternotok:123 my.job

Remember to clean up the job that won’t start, since its dependency can never be satisfied.
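
As a rough sketch of how the two dependent submissions could be wrapped together: the function name submit_notifiers and the scripts notify_ok.job and notify_fail.job are hypothetical placeholders, not something from the Slurm documentation. The sketch also assumes that your sbatch supports the --kill-on-invalid-dep option, which asks Slurm to cancel a job whose dependency can never be satisfied, so that the notifier that won't start does not have to be removed by hand.

```shell
# Sketch only: attach two notifier jobs to an already submitted job array.
# notify_ok.job and notify_fail.job are hypothetical batch scripts that
# do nothing but send a notification.
submit_notifiers() {
    array_id="$1"

    # Starts only if every element of the array completed successfully.
    sbatch --dependency=afterok:"$array_id" \
           --kill-on-invalid-dep=yes notify_ok.job

    # Starts only if the array ended with a failure.
    sbatch --dependency=afternotok:"$array_id" \
           --kill-on-invalid-dep=yes notify_fail.job
}

# Usage, where 123 is the array ID:
#   submit_notifiers 123
```

If the option is not available on your cluster, the leftover job can still be removed manually with scancel.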

Thanks for your answer. I just found out that the mail feature, when used with a job array, sends an email for the start and for the completion of the job array as a whole.

#SBATCH --mail-type=ALL

I received an email similar to:

Slurm Array Summary Job_id=29801063_* (29801063) Ended, COMPLETED, ExitCode [0-0]

I haven’t had a failed job array yet, but I guess that if all jobs of a job array fail, my mailbox will be spammed.

I just had a job fail and only got one email for the whole job array.
To find out exactly which jobs of the job array have failed, we can use:

sacct -j JOBID --state=failed
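
Building on sacct, the whole-array status described in the original question (failed if at least one element failed, running if none failed and some are running, completed if all completed) could be derived with a small helper. This is only a sketch: the function name array_status is made up for illustration, it counts only the FAILED state as a failure (states such as TIMEOUT or CANCELLED would need to be added if you want them treated the same way), and anything that fits none of the three cases is reported as UNKNOWN.

```shell
# Sketch only: reads one job state per line on stdin and prints a
# single status for the whole array.
array_status() {
    awk '
        NF == 0           { next }          # ignore blank lines
                          { total++ }       # count every element
        $1 == "FAILED"    { failed++ }
        $1 == "RUNNING"   { running++ }
        $1 == "COMPLETED" { completed++ }
        END {
            if (failed > 0)
                print "FAILED"
            else if (running > 0)
                print "RUNNING"
            else if (total > 0 && completed == total)
                print "COMPLETED"
            else
                print "UNKNOWN"             # e.g. only pending elements
        }
    '
}

# Usage: -X keeps one line per array element (no job steps),
# -n drops the header, -P gives unpadded output.
#   sacct -j JOBID -X -n -P -o State | array_status
```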