We have been notified that some users have been contacted by other users with messages such as:
I also have to use resources, and I can't launch my jobs.
How long will the work take?
How/why are you using these resources? I need them too.
Put yourself in the shoes of …
For you it may be nothing, but for the recipient it is very intrusive. It means you are spying on their work, analysing their resource usage, and judging that your work is more important than theirs.
Would you accept having to justify your use of the cluster resources to another user, whom you probably don’t know?
This is inappropriate and not approved by the HPC team.
Yes, but I noticed something is going wrong!
If you don’t know why or who, you can post on hpc-community.
HOWEVER, if you suspect another user of doing something wrong, you can notify us by email. hpc-community is not a wall of shame.
The HPC team will take care of it as soon as possible.
This kind of behavior is approved by the HPC team.
Thank you for your understanding.
The behavior of some users on the cluster is completely unacceptable and plain disrespectful toward other users.
Hundreds of jobs launched, 24/7, with super low GPU/resource utilization.
Some people on my team have already contacted you, and you have refused to take any action.
I have spoken with tens of other researchers who are exasperated by this.
This is definitely more intrusive than what you are referring to, and people who work in this way are the ones “judging that their work is more important than others’”.
Thank you for your understanding,
To the list of unacceptable behavior, I would add running computations on login nodes.
If I were to notify the HPC team every time I see it, it would be a 24/7 job.
It seems you have been misinformed: we spent several hours in October with someone from your group, and we were open-minded about their proposals.
- We talked about the fairshare issue that is being resolved.
- We found different solutions to meet their needs (such as reservations on your private partition).
- We have also contacted all reported or identified users who are not using best practices.
- We deployed new sysadmin tools to be more reactive. Storage performance issues and login node load issues have been greatly reduced.
- Last week, 22 GPU cards were installed on Baobab, increasing availability: New computer installed gpu[032-035].baobab
- GPUs are now allocated based on their compute capacity (low-end models are allocated first): Baobab scheduled maintenance: 28-29 September 2022 - #4 by Yann.Sagon
- Recently, we informed some users by email of a way to restrict their jobs to use specific gpu nodes according to their needs. (More information here)
- Find a way to limit the number of GPU cards used at the same time per user.
- Installation of newly ordered GPUs on Baobab
- New cluster installation: Bamboo
This is your interpretation, but not the reality. You are working on an academic cluster; some users do not have HPC knowledge and learn by trying, so they don’t have any bad intentions. For the past 2 months, new HPC training workshops have been created and provided by the SciCoS team to educate new users (Teaching & workshops - SciCoS - Scientific Computing Support - UNIGE). If I remember correctly, this was discussed with people from your group.
With all these elements, I do not consider that we refused to take any action. We always emphasize fair use of the cluster, taking into account the private nodes and their priorities.
For any questions/suggestions/issues, feel free to join the hpc-lunch meeting held every first Thursday of the month (more information here).
Your lovely HPC team
Dear HPC team,
we appreciate all the actions you’ve already taken or are planning to take, but at the end of the day what matters to us users is having a convenient way of running jobs on the cluster. Despite your efforts, some users are, knowingly or unknowingly, misusing it. Very often this means that the cluster becomes quite an inconvenient place to run experiments.
I’m worried that if nothing happens that actually discourages this behaviour, then the users willing to misbehave will have an advantage over the ones who are not, and the Nash equilibrium is that more and more users will stop playing fair, turning the cluster into complete chaos.