@support It looks like the Yggdrasil login node has become unusable. I can see dozens of CPU-heavy processes, including some running R code.
EDIT: RAM consumption is also worrying. There are hundreds of “dead” processes (not using CPU) that hold RAM. These include MATLAB, Python, Jupyter, MPI, Firefox (!!!), and other strange things. Isn’t this a serious misuse of the login node?
Yes, you are right, this goes against our best practices. Everyone should reread them… 10 times!
As the best practices aren’t always followed, from time to time we write to users who are clearly abusing the login node, or we directly kill their jobs if they use too many resources.
PS: always start a new thread when posting a new topic on the forum. I’ve moved your post.
If you are talking about these processes:
Do not worry, they aren’t dead processes, and not every process is using 24G of RAM. They are using shared memory, which is only allocated once. And if MATLAB is doing nothing, this is perfectly normal.
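To illustrate the shared memory point, here is a minimal sketch, assuming a Linux kernel that exposes /proc/<pid>/smaps_rollup (4.14 or later) and that you have permission to read the file for the process in question. It compares Rss (which counts shared pages in full for every process) with Pss (which splits shared pages among the processes mapping them). The function names are made up for this example.

```python
from pathlib import Path


def parse_kb_fields(smaps_text: str) -> dict:
    """Parse 'Name:   123 kB' lines from smaps_rollup-style text into {name: kB}."""
    fields = {}
    for line in smaps_text.splitlines():
        parts = line.split()
        # A data line looks like ['Rss:', '24000000', 'kB']; the header line is skipped.
        if (
            len(parts) == 3
            and parts[0].endswith(":")
            and parts[1].isdigit()
            and parts[2] == "kB"
        ):
            fields[parts[0][:-1]] = int(parts[1])
    return fields


def memory_summary(pid: int) -> dict:
    """Return Rss and Pss in kB for one process (hypothetical helper, Linux only)."""
    text = Path(f"/proc/{pid}/smaps_rollup").read_text()
    fields = parse_kb_fields(text)
    return {"rss_kb": fields.get("Rss", 0), "pss_kb": fields.get("Pss", 0)}
```

Calling memory_summary(pid) on several of the listed processes and summing pss_kb should give a more realistic total than summing rss_kb, since a large shared region is counted once overall rather than once per process.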
Anyway, to avoid cluttering the login node, we suggest using the public-interactive-cpu partition as explained here.
Do you know the reasons for such a bad user experience on the Yggdrasil login node? Is it because too many users are accessing it at the same time? Is the main issue disk access, or maybe the internet connection? I have reported problems with running simple git commands (Git pull problems on Yggdrasil). Now I have a problem with pulling data from the S3 service that you provide. Things that should normally take seconds take tens of minutes.
For comparison, I have:
- experience on my local machine - if something takes ages here, I expect it to take ages elsewhere, but this is not the case
- experience on Yggdrasil several weeks ago - everything was perfect, and then suddenly Yggdrasil became unusable
- experience on Baobab - in the old days, when Yggdrasil was running well, Baobab was unusable. Now it is the other way around. The same git/S3 pulls that never finish on Yggdrasil take seconds on Baobab.
I know you are trying to do your best and I am not complaining. But I would like to know if there is hope. Otherwise, I need to reorganize my research and prepare resources for my own infrastructure.
I did some cleanup: I contacted several users to ask them to stop using the login node, and I killed some processes. The situation is better.
Anyway, the storage is still suffering a little, but for my tests it is quite usable. I could do a git pull in a reasonable time.
Which S3 do we provide, please? Can you share a URL so we can try it ourselves?
If you want to evaluate whether the issue is related to the storage, you can do a git clone in /tmp, which is a local filesystem. If the speed is normal there, then the issue is indeed with the storage. In that case, not only login1 is affected, as all the compute nodes use the storage too.
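The same local versus shared comparison can be done without git. Below is a minimal sketch that times the creation, read-back, and deletion of many small files in two directories. The function names, file count, and file size are arbitrary choices for this example, and which directory sits on the shared storage is something you need to supply yourself.

```python
import os
import tempfile
import time


def time_small_files(directory: str, count: int = 200, size: int = 4096) -> float:
    """Write, fsync, read back, and delete `count` small files under `directory`.

    Returns the elapsed wall-clock time in seconds, cleanup included.
    """
    payload = b"x" * size
    start = time.perf_counter()
    # The scratch subdirectory and its files are removed when the block exits.
    with tempfile.TemporaryDirectory(dir=directory) as workdir:
        for i in range(count):
            path = os.path.join(workdir, f"f{i:05d}")
            with open(path, "wb") as fh:
                fh.write(payload)
                fh.flush()
                os.fsync(fh.fileno())  # push the data to the filesystem
            with open(path, "rb") as fh:
                fh.read()
    return time.perf_counter() - start


def compare(local_dir: str, shared_dir: str, count: int = 200) -> dict:
    """Time the same small-file workload in both directories (hypothetical helper)."""
    return {
        "local_s": time_small_files(local_dir, count),
        "shared_s": time_small_files(shared_dir, count),
    }
```

For example, compare("/tmp", os.path.expanduser("~")) assumes your home directory is on the shared storage. A much larger shared_s than local_s would point at the storage rather than at the login node itself.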
Sorry for the late response, I was away for a while.
Right now everything works smoothly both in
/tmp. If I encounter such problems in the future, I will re-evaluate.
The address of S3 is https://kalousis.hcpdufour.unige.ch. But as said above, it works well now. No need to evaluate it on your side.
Just a quick note: this is Hitachi HCP, not HPC! This service is provided by another team. But anyway, if it is slow, the reason can be the access or the storage itself. Feel free to contact us again if the problem arises.
Is there anything wrong with Yggdrasil right now? Loading bash on the login node already takes time, not to mention any interaction with scratch. However, it looks like there is plenty of storage available. Also, the CPU of the login node is not under load. However, a lot of RAM is in use.
EDIT: I am trying to download data from the S3 mentioned above, and the ETA for around 42MB of files is more than 4 hours. These are many small files, so I accept that it takes a while, but 4h is a bit too much. At the same time, downloading from the same S3 to my local PC, which does not have the best internet connection, runs fast.