Since the beginning of this week, I have been experiencing very slow access to storage on Baobab’s scratch. Could this be caused by some hardware/software issue that can be solved? Or have people returned to Baobab after a break, and we should get used to it?
Thank you in advance!
If you are asking for help, try to provide information that can help us solve your issue, such as:
what did you try:
Read data stored in scratch
what didn’t work:
Quickly reading data stored in scratch
what was the expected result:
A quick read of data stored in scratch
what was the error message:
No error message, but wasted resources while waiting for data reading to complete.
path to the relevant files (logs, sbatch script, etc):
I checked all BeeGFS indicators and everything is green: the storage is not overloaded and the I/O queue is not full. (Actually, we check whether the number of items written/read is higher than the number of write/read requests.)
So could you please give me more details on the “slow access”? Is there a folder where I can start investigating?
How do you perceive the latency? Do you get an error message? Do your jobs take longer than usual?
We need a minimum of information to pinpoint the issue and resolve it.
For example, I had to wait for the output of ls /home/users/f/falkiewi/scratch/calibrated-posterior/workflows/calnpe/*/output/estimator/*/128/0.001/ (watching it list the directories progressively), while in the previous week the output appeared immediately.
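To make the comparison less subjective, the listing can be timed directly (a minimal sketch; DIR is a placeholder for the scratch directory being tested, defaulting to the current directory):

```shell
#!/bin/sh
# Time a directory listing; on a healthy BeeGFS metadata server
# this should return almost instantly even for many entries.
DIR=${DIR:-.}                      # substitute the scratch path to test
start=$(date +%s)
count=$(ls "$DIR" | wc -l)
end=$(date +%s)
echo "listed $count entries in $((end - start))s"
```

Running this at different times of day would show whether the latency correlates with overall cluster load.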
Also, some of my jobs got TIMEOUT due to problems with reading/writing data.
Today, listing seems much faster (although there is still a gap compared to the previous week’s performance). Let’s keep our fingers crossed for today’s jobs!
If your monitoring doesn’t show anything, then there is nothing that can be done. Probably there are factors not captured by the monitoring.
Currently, I have multiple jobs that have been in the D state (as displayed by htop) for several hours without even starting any computation, while in a healthy situation they would complete in ~4 hours.
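Processes in uninterruptible sleep show a D in the STAT column, which typically means they are blocked on disk or network I/O. A minimal sketch using standard procps ps to list them (on the cluster this would be run on the affected compute node, e.g. via srun):

```shell
# Print the header plus every process whose state starts with "D"
# (uninterruptible sleep). The WCHAN column hints at the kernel
# function the process is sleeping in, e.g. a filesystem wait.
ps -eo pid,stat,wchan:20,comm | awk 'NR == 1 || $2 ~ /^D/'
```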
You can have a look at 66431425_1, 66431425_2, 66431425_3, 66431425_0. For comparison, 66431425_4 ran the same computation (but with a different random initialization), with the same start time but on a different node, and completed in 03:20:53.
@Gael.Rossignol do you have any idea why this is happening? Unfortunately, there are tens of other jobs in this state; hundreds of compute hours are being wasted this way.
I notice that to access scratch you use the following path:
But this is not the most efficient way: in this case you create a first read on the BeeGFS home and then a second on the BeeGFS scratch. Keep in mind that direct access to scratch is available at the following path, which is more efficient:
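The extra hop can be illustrated with a throwaway symlink (a hypothetical demo using temporary directories, not the actual cluster paths): resolving a symlinked path costs an additional metadata lookup before the real target is reached.

```shell
#!/bin/sh
# Demo: a symlink in a "home" location pointing at a "scratch" target.
tmp=$(mktemp -d)
mkdir -p "$tmp/srv/scratch"
ln -s "$tmp/srv/scratch" "$tmp/home_scratch"

direct=$(readlink -f "$tmp/srv/scratch")    # the real path, no link hop
via_link=$(readlink -f "$tmp/home_scratch") # must first resolve the link
echo "symlink resolves to: $via_link"

rm -rf "$tmp"
```

`readlink -f` (or `realpath`) can likewise show where a cluster path such as a scratch symlink in $HOME actually points.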
The jobs timed out and I rescheduled them. Probably some of them will end up in a similar state again, and I will reschedule them once more. After a number of iterations, I hope to see all the experiments completed.
There is no solution that I can find on my side.
In my scripts, I use /srv/beegfs/scratch/users/f/falkiewi/.