8 posts were split to a new topic: Issue with quota and x2go
It is down yet again. I get “Timeout connecting to yggdrasil” when I try to connect via X2Go. SSH never connects either.
I can connect to bamboo without issue, so the problem is not something on my end.
May I ask: why does this happen so often? Scrolling back through the history of this thread, I can see that many times it is due to a user abusing the resources on the login node.
Why are users allowed to reach such a point with the resources on the login node in the first place, where they can end up bringing down access to the entire cluster? Why not restrict the amount of resources available to users on the login node, so this does not happen?
This is particularly problematic when it happens on a weekend, since the issue will not be fixed until the beginning of the week. For me, that currently means 2 days of work lost with major deadlines approaching. For my part, I should definitely become more flexible by duplicating my work onto other clusters. However, it is still frustrating that this happens so often and that a long-term solution has still not been implemented.
Another possible solution: given that this happens so often, could an automatic reboot of the login node be implemented, so that it is rebooted when this occurs without the need for staff intervention?
We apologize for the inconvenience caused. The login node crashed without generating any relevant logs to help us identify the root cause. This time, it does not appear to be due to a user overusing the resources.
Currently, we don’t have a mechanism in place to limit resource usage on the login node. It is on our to-do list, but we have many other ongoing tasks (maintenance, Bamboo installation, daily support, etc.). Unfortunately, our HPC team consists of only three members and we’re managing three clusters. We’re doing our best, but some technical solutions may take time to implement, depending on priorities.
Implementing an automatic reboot during off-hours could be a good workaround until a more permanent solution is in place. We’ll discuss this option further.
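To make the idea concrete, here is a minimal sketch of the detection half of such a watchdog. It is not what is deployed on the cluster, only an illustration under assumptions: the function names, the 5-second timeout, and the threshold of three consecutive failures are all hypothetical. It treats the login node as down when its SSH port stops accepting TCP connections, which will not catch every kind of hang. The reboot action itself is deliberately left out, since it depends on the site's out-of-band management tooling.

```python
import socket


def ssh_port_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and name-resolution failures.
        return False


def should_reboot(recent_checks: list[bool], threshold: int = 3) -> bool:
    """Return True only if the last `threshold` checks all failed.

    Requiring several consecutive failures (threshold is a hypothetical
    value) avoids rebooting on a single transient network glitch.
    """
    if len(recent_checks) < threshold:
        return False
    return not any(recent_checks[-threshold:])
```

A scheduler such as cron on a separate machine could call `ssh_port_reachable` with the login node's hostname every few minutes, append the result to a history list, and trigger the site-specific reboot procedure once `should_reboot` returns `True`.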
For now, the login node has been rebooted.
Turning it off and on again right now, for whatever reason?
I detected it 3 minutes ago and restarted it.
edit: I have locked the post to prevent people from “reopening” this thread. Even if you have the same issue again, please open a new post; it helps us sort the issues.
It appears to be down again. I assume that you are aware, and that this is part of what I am guessing is ongoing troubleshooting of the down/draining nodes and/or the electrical issues.
Hello @William.Ceva
Login1.yggdrasil has been restarted.