Dear HPC team,
I noticed that, although I requested 7 days on my partition (private-gapnl-cpu), the simulation stops after 1 day 'due to time limit'. Could you please check the partition's settings?
Best wishes
Maura
Dear @Maura.Brunetti,
are you using sbatch or salloc? Could you please share the ID of a job that shows the issue?
Best regards
Yann
Dear Yann,
I have the same problem with the private-gapnl-cpu partition, where the simulations stop after a couple of hours for no apparent reason. When I launch the same simulation on the public-cpu partition, it runs for 4 days.
Do you have an explanation for this behavior?
Best,
Laure
Dear @Laure.Moinat, I have the same question for you as for Maura :)
Dear Yann,
This is how we launch our simulations: 'srun --ntasks=25 --partition=public-cpu --time=4-00:00:00 --multi-prog P280.conf > std_outp 2>&1'. I currently have no running job that shows this issue, but I can launch one if necessary.
Best,
Laure
Thanks for the feedback. Unless you have a good reason, do not use srun directly to launch long-running tasks. If you disconnect from the login node, the job is killed. If the login node has an issue and we restart it (this has happened several times since the beginning of this year), the job is also killed.
Create an sbatch script and launch it using sbatch <your script>.
Example <your script>:
#!/bin/sh
#SBATCH --ntasks=25
#SBATCH --partition=public-cpu
#SBATCH --time=4-00:00:00
# --multi-prog is an srun option, so it goes on the srun line
srun --multi-prog P280.conf > std_outp 2>&1
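As a rough sketch, the example above could be saved to a file and submitted as shown here. The file name run_p280.sh is only a placeholder, and --multi-prog is given to srun because it is an srun option. The squeue call is one way to check which time limit the job actually received; adjust the partition and time to your case.

```shell
# Write the batch script to a file (run_p280.sh is a hypothetical name).
cat > run_p280.sh <<'EOF'
#!/bin/sh
#SBATCH --ntasks=25
#SBATCH --partition=public-cpu
#SBATCH --time=4-00:00:00
srun --multi-prog P280.conf > std_outp 2>&1
EOF

# Submit only where Slurm is available; sbatch prints the job ID.
if command -v sbatch >/dev/null 2>&1; then
    sbatch run_p280.sh
    # List your jobs: job ID, partition, time limit (%l), time used (%M), state.
    squeue -u "$USER" -o "%.10i %.20P %.12l %.12M %.8T"
fi
```

If the time limit column does not match what you requested, that job ID is the one to share here.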
Thanks for the clarification and for the example, I will change my scripts!
Best,
Laure