Your post gave me the opportunity to write a quick post about storage on Baobab that may be of interest to other users.
As you may understand from my post, it’s a bad idea to perform benchmarks yourself.
According to your post, you are reading the content of 178121 files to determine the performance of the storage. Please don’t.
If you really want to check the performance, there is a tool provided by BeeGFS, beegfs-ctl, but it's up to you to work out how to use it, as it's not intended for end users.
Thanks a lot for the post! I am not sure it answered my question, though.
I am most certainly not reading all of those files: there is no need to wait for the command to complete, since you can interrupt it after a few reads. It's meant as an effectively open-ended command; I never let it run past the first several files, as that is enough to assess the performance.
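For concreteness, here is a minimal sketch of the kind of interruptible read I mean; the path is a placeholder, not the exact invocation from my earlier post:

```bash
# Read the first ~20 files found on scratch and discard the data; dd reports
# the achieved read throughput. This stops by itself after a handful of files
# rather than touching all 178121 of them.
# The path is a placeholder for my scratch directory, not the exact one I used.
find "/srv/beegfs/scratch/$USER" -type f -print0 \
  | head -z -n 20 \
  | xargs -0 cat \
  | dd of=/dev/null bs=1M status=progress
```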
Sorry, I see now that this was unclear.
So my question was actually whether there is currently any monitoring a user can access to see if the storage is under load, and whether my own jobs happen to be contributing to that load.
At the point when I wrote the post, it was clear that the problem was very acute. It effectively blocked my work (and I myself had only a few jobs running at that time).
How often does this happen? Has it been reflected in some user-accessible log?
Such a report would seem preferable to fiddling with beegfs-ctl myself, since I would rather not set up my own monitoring, or anything else that runs permanently on the HPC cluster.
I can look into adding something with beegfs-ctl, though, if that's the best option.
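If it helps, something like the following is what I had in mind; I have not verified the exact mode and flag names against Baobab's BeeGFS version, so treat this as an assumption to be checked against `beegfs-ctl --help`:

```bash
# Assumed usage: show live I/O statistics for the storage servers, refreshing
# every 5 seconds. Mode and flag names may differ between BeeGFS versions.
beegfs-ctl --serverstats --nodetype=storage --interval=5
```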
By the way, in this case my "find" was not the issue (though I do understand that in some situations it can be), since the problem was just as clear when simply reading a Singularity image at a known location.
Just to be clear: I'm pretty sure you weren't the cause of the slowness of the scratch space either, no worries. I understand it's an issue when it's that slow. We had two internal issues that may have caused the poor performance. It could indeed be interesting for users to have access to a performance graph for the storage. We'll think about it; we need to figure out the right metrics and graph them.
Indeed, this only happened over about one day, so it's not really a big concern if it was a rare internal issue.
It's just that I cannot be entirely sure there were no other, possibly short-lived, episodes, hence the question.
Good to know it might become available!
I also added a small test (a ~10 MB read with dd) at the beginning of every job. It does not increase the load substantially, since it's a few percent of what each job reads anyway.
I understand and fully appreciate your point about users not creating unnecessary load with independent monitoring, but this is basically just collecting stats on my own jobs, so I suppose that's OK.
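For reference, here is a sketch of what that probe could look like; the probe file path, log location, and use of SLURM_JOB_ID are my assumptions for illustration, not the exact script:

```bash
#!/bin/bash
# Small read probe run at the start of each job (illustrative details:
# the probe file path, log location, and SLURM job variable are assumptions).
PROBE_FILE="/srv/beegfs/scratch/$USER/probe_10mb.bin"   # any ~10 MB file on scratch
LOG_FILE="$HOME/scratch_read_times.log"

start=$(date +%s.%N)
dd if="$PROBE_FILE" of=/dev/null bs=1M count=10 status=none
end=$(date +%s.%N)

# One line per job: timestamp, job id, elapsed seconds for the ~10 MB read.
printf '%s job=%s elapsed=%s\n' "$(date -Is)" "${SLURM_JOB_ID:-interactive}" \
  "$(echo "$end - $start" | bc)" >> "$LOG_FILE"
```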
Here is how it evolved over yesterday: min, max, and average in each time bin. I am not sure about one upward outlier; maybe that read got cached somehow.
Edit: the size of each dot corresponds to the number of reads (i.e. the number of my jobs in that time bin).