Home very slow on baobab

Hi,

Baobab is almost unusable for me at the moment. The problem seems to be with the home disk, I guess because it is nearly full.

Also would it be possible to introduce an automatic sweep to delete old files? It seems like we’ll never get below 90%!
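To make the idea of a sweep concrete, here is a minimal dry-run sketch. It only lists files that have not been modified for a given number of days and deletes nothing. The function name `find_old_files` and the 180-day threshold are hypothetical, not an existing Baobab tool or policy.

```python
import os
import stat
import time


def find_old_files(root, max_age_days, now=None):
    """Yield (path, age_in_days) for regular files under root whose
    modification time is older than max_age_days. Nothing is deleted."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode) and st.st_mtime < cutoff:
                yield path, (now - st.st_mtime) / 86400


# Hypothetical usage (the 180-day threshold is an assumption):
# for path, age in find_old_files(os.path.expanduser("~"), 180):
#     print(f"{age:7.0f} days  {path}")
```

Walking a large directory tree is itself a lot of metadata I/O, so a scan like this would be better run rarely and outside busy hours.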

Cheers,
Johnny

Everything is extremely slow for me as well.
Cheers,
Halit

Hello all,
Looks like home is slow again.

Home is indeed very slow. When I try to clean up old data, SSH connections tend to be killed (“broken pipe”) in the middle of an “rm”.

Same here, extremely slow…

Yes it is slow. It is slow because many users are doing too much IO on the storage.

We have already contacted several users and asked them to change how they use the cluster, but with so many users running jobs we cannot watch what each one is doing; we would lose the whole day doing so. Instead, we rely on users following the best practice guide we wrote.

Whenever you run a lot of simultaneous jobs on the cluster, it is always worth making sure they are not having a high impact on it. You can check the BeeGFS health on our monitoring server and see whether the load increases as soon as your jobs start.

Will jump in and say that the extremely slow I/O on the disk is actually stopping my workflow entirely. I use dask, which spawns workers that must report back to the scheduling task.

The problem is that the workers take so long to spin up (due to extremely slow reading of python files from disk) that the scheduler abandons them before they are active thinking they have timed out.

If this is caused by some users abusing the system, maybe they should be restricted in how many jobs they can run etc. somehow?

Is there anything we can do to improve the speed on baobab? Opening a file takes more than 1 minute for me today… I already cleaned up my home and scratch.

Is there a way of knowing if we are “bad users”, i.e. if we are doing “too much IO”? I have a bunch of simulations running, which write periodically to scratch, but I don’t know if I’m abusing the system.
Thanks!
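One way to get a rough first idea, as a sketch only: on Linux, each process exposes its own I/O counters in `/proc/<pid>/io`. The helper names below are hypothetical, and nothing here says what counts as “too much” on Baobab; that threshold is for the admins to define. `rchar`/`wchar` count bytes passed through read and write system calls, and `syscr`/`syscw` count the calls themselves, so a very high call count with few bytes suggests many small operations.

```python
def parse_proc_io(text):
    """Parse the contents of /proc/<pid>/io into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[key.strip()] = int(value.strip())
    return counters


def read_proc_io(pid="self"):
    """Read the I/O counters of one process (Linux only, your own processes)."""
    with open(f"/proc/{pid}/io") as fh:
        return parse_proc_io(fh.read())


def io_delta(before, after):
    """Difference between two counter snapshots of the same process."""
    return {key: after[key] - before[key] for key in after if key in before}


# Hypothetical usage inside a simulation script:
# start = read_proc_io()
# ... run one output step ...
# print(io_delta(start, read_proc_io()))
```

This only shows what a single process does; the BeeGFS monitoring mentioned above remains the place to see the combined effect on the storage.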

Hi,

Just checking the processes running under your username:

[...]
bolmonte  81525  0.2  0.0   3584    44 ?        D    17:47   0:00 sh -c cut -d " " -f 1-3 /proc/loadavg
bolmonte  81616  0.3  0.0   5648    44 ?        D    17:47   0:00 sh -c cut -d " " -f 1-3 /proc/loadavg
bolmonte  81645  0.2  0.0   5648    44 ?        D    17:47   0:00 sh -c cut -d " " -f 1-3 /proc/loadavg
bolmonte  81680  0.1  0.0   3584    44 ?        D    17:47   0:00 sh -c cut -d " " -f 1-3 /proc/loadavg
[...]
[root@login2.baobab ~]# ps aux | grep bolmonte | wc -l
411

So it seems you have 411 processes open on login2, and they seem to be stuck.

And it seems they are relaunched all the time. You probably submitted them in a loop or something like that.
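For anyone who wants to run the same check on their own account, here is a small sketch that counts a user's processes in `ps aux`-style output and how many of them are in state `D` (uninterruptible sleep, usually waiting on I/O), as in the listing above. The function names are hypothetical; the column layout assumed is the standard `ps aux` one, with the state in the eighth column.

```python
import subprocess


def count_stuck(ps_output, user):
    """Return (total, stuck) process counts for `user` from `ps aux`-style text.

    A process is counted as stuck when its STAT column starts with "D".
    """
    total = stuck = 0
    for line in ps_output.splitlines():
        # USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11 or fields[0] != user:
            continue
        total += 1
        if fields[7].startswith("D"):
            stuck += 1
    return total, stuck


def count_stuck_live(user):
    """Run `ps aux` on the current node and count the given user's processes."""
    result = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    )
    return count_stuck(result.stdout, user)
```

Matching on the user column rather than grepping for the name avoids counting unrelated lines that merely contain it, such as the `grep` command itself.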

And a reminder for the users:

  • don’t try to benchmark the cluster, this just adds unnecessary overhead.
  • don’t use the login node instead of compute nodes. If you have small tasks to run, you may use the debug-cpu compute nodes for that purpose.

I’d like to know what those processes are. It’s not quite clear to me. I know I have jobs running on the nodes (which I launched using slurm and all).
Are those the processes you are referring to?

→ If so, those are long simulations, which relaunch automatically every 7 days (for a max of 4 or 5 times). They are not supposed to be stuck though, it’s normal they are very long. Also, I launched them on private nodes, so that it’s not a bother for other users.

→ If not, I’m not aware of these processes… :sweat_smile: and I should probably stop them… I was kicked off baobab. I’m trying to reconnect but cannot…

I cannot connect to baobab either. I get this error message:
ssh: connect to host baobab2.hpc.unige.ch port 22: Connection refused

Answer for all the users:

Yes, this was my workaround to speed up Baobab :smirk:.

I’m restarting login2, so you’ll start the weekend with a healthy login node. Please be kind and keep it in good shape.

Same problem:
ssh: connect to host baobab2.hpc.unige.ch port 22: Connection refused

Many thanks Yann!
And yes, maybe a training session could be nice!
Have a great weekend!