Hi HPC team,
Primary information
Username: dedenon
Cluster: Baobab
Description
I ran a CPU job array last Friday on our private partition kruse-cpu. More than half of the array tasks were cancelled, while the remaining ones completed.
Here is the sbatch file
I don’t think the issue is in my code: I am running a parameter scan of simulations, with 20 independent realizations of 5 000 000 time steps each, and some of them completed (see the Slurm output below):
job 91
but others were cancelled:
job 92
Here are the seff reports for those jobs. This is neither a TIMEOUT nor an OUT-OF-MEMORY issue:
Another example, from job 115:
Finally, I checked the status of the nodes in the partition and got this:
Steps to Reproduce
By nature, this problem is not easily reproducible: it has already happened, and I can’t figure out what triggers it. It has happened several times, and I noticed that small job arrays (~100 tasks) run without issue, whereas large ones (~1000 tasks) are systematically cancelled.
One interesting detail: judging from the Slurm output files, all the uncompleted jobs seem to have been cancelled at 8 am on Saturday!
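For reference, here is a minimal sketch of how the cancellation times could be cross-checked against the Slurm accounting records instead of the output files. It assumes that `sacct` is available on the login node and that the array job ID is passed on the command line; the function names are hypothetical and only for illustration.

```python
import subprocess
import sys
from collections import Counter


def tally_cancellations(sacct_text):
    """Count CANCELLED array tasks per end time, truncated to the hour.

    Expects pipe-separated lines of the form "JobID|State|End", as produced
    by `sacct -X -n -P --format=JobID,State,End`.
    """
    counts = Counter()
    for line in sacct_text.splitlines():
        fields = line.strip().split("|")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        _job_id, state, end = fields
        # sacct may report "CANCELLED by <uid>", so match on the prefix only.
        if state.startswith("CANCELLED"):
            counts[end[:13]] += 1  # keep "YYYY-MM-DDTHH"
    return counts


def fetch_sacct(array_job_id):
    """Return one accounting line per array task (requires sacct)."""
    cmd = ["sacct", "-j", str(array_job_id), "-X", "-n", "-P",
           "--format=JobID,State,End"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Only query Slurm when an array job ID is actually given.
    if len(sys.argv) > 1:
        per_hour = tally_cancellations(fetch_sacct(sys.argv[1]))
        for hour, n in sorted(per_hour.items()):
            print(hour, n)
```

If all cancelled tasks fall into a single hour, that would support the idea that one event on Saturday morning cancelled them together, rather than the tasks failing individually.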
I will try on another partition to see if it changes anything.
Thanks in advance for your help with this!
Best,
Mathieu D.