Good usage of resources on HPC cluster

In general, you should stick with a safe margin. You have another option if your software supports a checkpointing mechanism. In that case, SLURM can notify the job a few minutes before it is killed. When the job receives this notification, it should write a checkpoint. You can then relaunch the job, which resumes from that checkpoint. This has an advantage: you can request even less time, for example one hour per job, and your job will have many more opportunities to be picked by the scheduler.
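To make this concrete, here is a minimal Python sketch of the job side. It assumes the notification arrives as SIGUSR1, which you can request with sbatch's `--signal` option (for example `#SBATCH --signal=USR1@300` to be warned roughly 300 seconds before the time limit). The file name, the function names (`run`, `load_state`, `save_state`) and the dummy workload are all hypothetical; your software's own checkpoint format would replace them.

```python
import json
import os
import signal

stop_requested = False


def request_stop(signum, frame):
    # Only set a flag here; the main loop saves the checkpoint at a safe point.
    global stop_requested
    stop_requested = True


def load_state(path):
    # Resume from the previous checkpoint if one exists, otherwise start fresh.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_state(path, state):
    # Write to a temporary file first so a kill mid-write cannot corrupt
    # the existing checkpoint, then rename it into place.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(path, n_steps, on_step=None):
    # Returns True if all steps are done, False if the job was asked to stop
    # early. on_step is an optional hook called after each step.
    global stop_requested
    stop_requested = False
    # Assumption: SLURM was asked to send SIGUSR1 before the time limit.
    signal.signal(signal.SIGUSR1, request_stop)
    state = load_state(path)
    while state["step"] < n_steps and not stop_requested:
        state["total"] += state["step"]  # placeholder for the real work
        state["step"] += 1
        if on_step is not None:
            on_step(state)
    save_state(path, state)
    return state["step"] >= n_steps
```

In a batch script you would call something like `run("ckpt.json", 1_000_000)` from a program launched with `srun`, and resubmit the same script until it reports that it finished. By default sbatch's `--signal` is delivered to the job steps rather than to the batch shell, so how you launch the program matters; check your cluster's documentation for the recommended setup.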
