I have been running some jobs on Baobab/Yggdrasil, and sometimes, after I run many at once, I cannot run anything for a day or more because of low priority. Which is understandable.
But ideally, I would want to ensure I am able to run all the time, even if only a few jobs at a time, and to expand to as many resources as possible when they are available.
How do I optimize my usage to achieve this? It seems it should be possible to reduce usage when some limit is approached, but that probably requires some possibly complex customization.
Also, would a private project partition result in the desired behavior? Is this the best way?
In my experience, one way (not always possible) to get quick availability is to minimize the resource usage of a single job in time, memory and number of cores. A task that can be cut into small computational steps, or that can be checkpointed and resumed, is ideal. The --time-min option allows slurm to reduce the time limit as much as needed to start the job sooner. This flag is ideal for checkpointed jobs that can be interrupted and rerun.
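For illustration, a checkpoint-and-resume job script using --time-min could look like this (the program name, checkpoint flags and resource limits are hypothetical, not a real recipe):

```shell
#!/bin/sh
#SBATCH --job-name=chunked-run
#SBATCH --time=01:00:00      # hard upper limit on the time slot
#SBATCH --time-min=00:10:00  # allow slurm to shrink the limit to start sooner
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G

# Resume from the last checkpoint if one exists, otherwise start fresh;
# if the shortened time slot runs out, simply resubmit the same script.
if [ -f checkpoint.dat ]; then
    ./my_program --resume checkpoint.dat
else
    ./my_program --checkpoint-file checkpoint.dat
fi
```

Because the job can restart from checkpoint.dat, losing a shortened time slot costs almost nothing.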
The reason why small jobs have very short queue times is the backfill algorithm, which tries to fit jobs into holes in the schedule. If your job fits where no other job fits, it will be launched first even if its priority is worse. In practice, I have previously launched a lot of 15-minute single-core jobs, each doing a single operation (as a job array).
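Such a batch of short single-core tasks can be expressed as a job array, for example (the task script, input naming and limits are illustrative):

```shell
#!/bin/sh
#SBATCH --array=0-999        # 1000 independent short tasks
#SBATCH --time=00:15:00
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G

# Each array element does a single operation on its own input file;
# short, small jobs like these are good candidates for backfill.
./process_one "input_${SLURM_ARRAY_TASK_ID}.dat"
```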
Almost all my jobs are very short (down to 5-10 minutes), which does improve availability.
My problem, as I point out in the post, happens after many tens of thousands of jobs, which at some point makes my priority too low to run even a small 10-minute job, apparently. I assume I am interpreting this correctly.
To put it simply, I want to keep my resource consumption from ever reducing my priority below this limit.
I guess it is impossible to guarantee, since the ability to run a job depends on what other users are doing.
But I suspect it should be possible to approximate this requirement, maybe by adjusting QOS, or by custom inspection of current/past usage/priority before submitting (which is more complex).
Since you @Pablo.Strasser run small jobs, and probably many of them, did you face a similar issue? Or are your jobs never blocked by priority?
The last time I ran these jobs was something like 6 months ago, and my priority problem was mainly my own jobs blocking each other's priority. You can see the score you have by using sprio. The scheduler should normally select the highest-priority job that can run in a hole. After that, it depends on what the others are doing and on the number of cores left free by scheduling leftovers (e.g. someone requesting 19 cores who is scheduled on a 20-core node, leaving 1 core available). If your jobs are very short, a --time-min of a very small value, like 2 to 3 minutes, should make them start immediately if there is a node with a core available. A last trick I use is to be as broad as possible with the partitions you choose, even indicating multiple partitions if necessary, so as to have the largest possible pool of candidate nodes. The --start option of squeue gives a generally pessimistic estimate of the wait time.
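Putting these tips together, the relevant commands look roughly like this (the partition names are hypothetical, and the exact output columns are site-dependent):

```shell
# Inspect the priority components of your pending jobs:
sprio -u "$USER"

# Get a (generally pessimistic) estimate of when pending jobs will start:
squeue -u "$USER" --start

# Submit a short, interruptible job to several candidate partitions at once,
# letting slurm shrink the time limit down to 3 minutes to start it sooner:
sbatch --partition=shared-cpu,private-xyz \
       --time=00:15:00 --time-min=00:03:00 job.sh
```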
Thanks! I understand the principles, and I know how to optimize usage, and I appreciate further suggestions.
What I need is a well-defined way to assess, before submitting a job, whether I should avoid submission - or, even better, a way to set the job's QoS or other properties to make sure that if I want to submit a job ASAP in the future, I will be able to.
If there is no definitive way to achieve this, I will combine assessments from sprio, start-time estimates, etc., to make such a decision.
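As a sketch of such an assessment, here is a small shell gate that parses sprio-style output and holds off submission when the fair-share component drops too low. The threshold, the column index, and the sample output are all assumptions; in real use the captured sample would be replaced by a call to `sprio -u "$USER"`.

```shell
# Hypothetical threshold below which we hold off submitting more jobs.
THRESHOLD=1000

# Captured sample in sprio's default-style layout (real use: sprio -u "$USER").
sample='  JOBID PARTITION   PRIORITY  FAIRSHARE
  12345 shared-cpu      1523        523
  12346 shared-cpu       812        112'

# Lowest fair-share component among pending jobs (field 4 is an assumption
# about the column layout, which varies between sites).
min_fs=$(printf '%s\n' "$sample" | tail -n +2 | awk '{print $4}' | sort -n | head -n 1)

if [ "$min_fs" -lt "$THRESHOLD" ]; then
    decision="hold"
else
    decision="submit"
fi
echo "$decision (lowest fair-share: $min_fs)"
```

A fuller version would also weigh in the start-time estimates from `squeue --start` before deciding.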
I appreciate that there are many factors, some of them outside of my control, but I have a simple requirement, and I am sure there could be a solution - possibly complex, but minimally disruptive to my workflow.
Hi @Pablo.Strasser, @Yann.Sagon, all,
I asked this question at today's HPC lunch, and got some interesting points:
as @Yann.Sagon stresses, it is not advised to specify multiple partitions at once, except when combining a private and a public one. The HPC team runs a script that kills a job if it sits in a partition where it can never start. This script will kill the job if at least one of its listed partitions can never accept it - while in fact slurm would eventually have scheduled it in the other partition.
So, specifying multiple partitions with at least one partition that will never accept the job exposes the job to possible termination by this script at some (regular?) moment.
I realized it might be possible to let users always run at least some jobs if priority depended on the number of currently running jobs: if a user has no jobs running, their priority could be very high, irrespective of the amount of resources consumed in the past. I do not know if slurm supports this; a quick look does not show it.
we can organize private nodes with @Carlo.Ferrigno and @Pierre.Dubath, and/or share nodes with priority at the department level.
even if we have private nodes, they might be blocked for up to 12 h by shared-partition users. This is not acceptable if we want near-interactive responsiveness - in our domain, time-to-compute is sometimes critical.
We might think about always keeping some spots free, or something like that, even if it reduces the total usage fraction.