Partial cancellation of jobs, others completed fine

Mathieu.Dedenon · June 17, 2024, 7:57pm

Hi HPC team,

Primary information

Username: dedenon
Cluster: Baobab

Description

I ran a CPU job array last Friday on our private partition kruse-cpu. More than half of the jobs in the array were cancelled, while the remaining ones completed normally.

Here is the sbatch file

I don’t think the issue is in my code: I am running a parameter scan of simulations, each with 20 independent realizations of 5 000 000 time steps, and some of them completed (see the Slurm output below)
job 91

but others were cancelled…
job 92

Here are the seff reports for those jobs; this is neither a TIMEOUT nor an OUT-OF-MEMORY issue

Another example, from job 115

Finally, I checked the node status in the partition and got this

Steps to Reproduce

By nature this problem is not easily reproducible: it has already happened several times, and I can’t figure out what triggers it. So far I have only noticed that small job arrays (~100 jobs) run without issue, whereas large ones (~1000 jobs) systematically get cancelled.

One interesting detail: according to the Slurm output files, all the uncompleted jobs seem to have been cancelled at 8 am on Saturday!
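
The cancellation time could also be cross-checked against the Slurm accounting records instead of the output files. Below is a minimal sketch, not something that was run for this report: `summarize_end_times` is a hypothetical helper name, `JOBID` is a placeholder for the array job ID, and the input is assumed to be pipe-separated `JobID|State|End` lines as produced by `sacct --parsable2`. It counts array tasks per final state and end hour, so a common cancellation time would show up as a single large group.

```shell
# Hypothetical helper: count array tasks per (state, end hour).
# Reads pipe-separated "JobID|State|End" lines on stdin.
summarize_end_times() {
    awk -F'|' '
        # Skip job steps such as 123_92.batch; keep one record per array task
        $1 !~ /\./ {
            split($2, s, " ")                 # "CANCELLED by 0" -> "CANCELLED"
            key = s[1] " " substr($3, 1, 13)  # state + "YYYY-MM-DDTHH"
            count[key]++
        }
        END { for (k in count) print count[k], k }
    ' | sort -k2,3
}

# Usage on the cluster (JOBID is a placeholder, not a real job ID):
# sacct -j "$JOBID" --noheader --parsable2 --format=JobID,State,End | summarize_end_times
```

If most cancelled tasks fall into the same hour, that would point to an external event (for example a scheduled action on the cluster) rather than to the simulation code, although this sketch alone cannot say which.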

I will try on another partition to see if it changes anything.

Thanks in advance for your help with this!

Best,
Mathieu D.