It happens often, and this week in particular, that the cluster becomes unusable because of a high volume of jobs submitted by a single user. This is especially frustrating because the GPU utilisation of these jobs is often negligible, and they could likely run on CPU instead. This situation significantly slows down the research of many users. There must be some way of preventing this from occurring as frequently as it does now.
I agree with Sam here: getting wall time can be really difficult sometimes, even on partitions where we should have higher priority. Is there a solution where all parties are happy with their wall time allocation?
Slurm manages the allocation priority; everything is described here:
But if you think something is wrong, send us an e-mail telling us which user is blocking the submission of work, and I will check.
PS: Do not forget to use the template. Thank you!
I will follow up with an email, but I don’t think this is necessarily restricted to a single user. For the time being it is, but in the future another user could monopolize the cluster. I understand that Slurm manages the priority allocation, but this system is at least partially configurable, and the current configuration seems suboptimal for the needs of most users.
Thanks for the response!
The private partitions have a wall time of 7 days, whereas the shared partitions have a wall time of 12 hours.
As a member of a private group, you have higher priority on that private partition, but there are other priority policies that factor into the final priority.
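To give an intuition for how those policies interact, here is a rough sketch of Slurm's multifactor priority scheme: each factor is normalized to [0, 1] and multiplied by a site-configured integer weight, then summed. The weight and factor values below are hypothetical, not our cluster's actual configuration.

```python
# Sketch of Slurm-style multifactor priority: weighted sum of
# normalized factors. Weights here are illustrative only.

def job_priority(factors: dict, weights: dict) -> int:
    """Combine normalized factors (0.0-1.0) with integer weights."""
    return int(sum(weights[name] * factors.get(name, 0.0) for name in weights))

weights = {"age": 1000, "fairshare": 10000, "partition": 5000, "qos": 2000}

# A private-partition member with little recent usage...
member = {"age": 0.2, "fairshare": 0.9, "partition": 1.0, "qos": 0.5}
# ...versus a heavy user on the same partition: the fairshare factor
# drags the total down even though the partition factor is identical.
heavy_user = {"age": 0.2, "fairshare": 0.05, "partition": 1.0, "qos": 0.5}

print(job_priority(member, weights))      # 15200
print(job_priority(heavy_user, weights))  # 6700
```

The point is that partition membership is only one term in the sum; with a large fairshare weight, recent heavy usage can outweigh it.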
Keep in mind that if you need certain resources for a very important project/publication, we can create a reservation for your needs. We are not monsters.
In conclusion, YOU are the user of the cluster and sometimes our system administration policy may not fit your needs. If you have any suggestions, feel free to make a post on HPC-community to share your idea.
Thank you !
We have noticed there is something strange in the calculation of the fairshare.
From the Slurm documentation:
As of the 19.05 release, the Fair Tree algorithm is now the default,
and the classic fair share algorithm is only available if
*PriorityFlags=NO_FAIR_TREE* has been explicitly configured.
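For reference, the relevant lines in slurm.conf would look something like this (a sketch only; the actual file on the cluster may set additional flags):

```
# slurm.conf - revert from Fair Tree to the classic fair-share algorithm
PriorityType=priority/multifactor
PriorityFlags=NO_FAIR_TREE
```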
This morning we changed this Slurm configuration; you may encounter unexpected behavior in resource allocation. We are in a testing phase.
Thanks for looking into this and making the change; I think it has at least removed part of the issue!
Unfortunately, it looks like since the change, fairshare now remains fixed at 0 for all users. Is this something that will start to add up over time (e.g. two weeks), or is there another option that needs to be enabled to use the per-user raw usage?
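For context, here is a rough sketch (under textbook assumptions, not our actual slurm.conf values) of how the classic fair-share factor and usage decay are commonly described: a user with no recorded usage should get a factor near 1.0, which is part of why a flat 0 for everyone looks off.

```python
# Sketch of the classic fair-share factor, F = 2**(-U/S), where U is
# normalized usage and S the normalized share, plus half-life decay of
# raw usage. Values are illustrative, not our cluster's configuration.
def fairshare_factor(norm_usage: float, norm_shares: float) -> float:
    if norm_shares <= 0:
        return 0.0
    return 2.0 ** (-norm_usage / norm_shares)

def decayed_usage(raw_usage: float, elapsed_s: float, half_life_s: float) -> float:
    """Raw usage halves every half_life_s seconds (cf. PriorityDecayHalfLife)."""
    return raw_usage * 0.5 ** (elapsed_s / half_life_s)

# No recorded usage -> full factor of 1.0...
print(fairshare_factor(0.0, 0.25))                   # 1.0
# ...usage equal to one's share -> 0.5; heavy usage approaches 0.
print(fairshare_factor(0.25, 0.25))                  # 0.5
# After two half-lives, recorded usage drops to a quarter.
print(decayed_usage(1000.0, 14 * 86400, 7 * 86400))  # 250.0
```

So under this model usage does accumulate and decay over time, but the factor itself should never be pinned at 0 for idle users.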
We are aware of the fairshare being set to zero. A bug was opened with SchedMD to understand what is going wrong.
For now we have an answer, but it involves a change to each account's settings, and we must discuss it with the whole HPC team before applying it. Also, if it does not work, it will be complicated to recover the original settings unless we are prepared.
We expect the problem to be resolved by the end of next week.
Thank you for your understanding.