Jobs with lots of I/O very slow

Hi everyone,

I need to run a number of jobs that require a lot of input/output against a large calibration database. I have been running several of them on Yggdrasil lately, with the data and the calibration database located in my home directory. These jobs are extremely slow, taking >10 h per instance compared to ~30 minutes on the Lesta server at the Observatory. Looking at their progress, I see a lot of latency, which I suspect is mostly driven by accessing the calibration database from the node. One possibility would be to copy the database locally onto the node where the job runs, but that is impractical because it is fairly large (~10 GB). On the other hand, putting all of this into a Singularity image would require a lot of memory.

Can you think of a better solution than what I have listed here?

Cheers
Dominique

Hi,

Can you give us more details, such as:

  • the number of jobs you run in parallel
  • the type of DB
  • your sbatch script
  • the path to your home directory on Yggdrasil

Sure.

  • My home directory on Yggdrasil is /home/users/e/eckertd
  • The database is located in /srv/beegfs/scratch/shares/astro/xmm/ccf
  • I was sometimes running up to 20 jobs in parallel, but the issue occurs even when I am running a single job
  • An example sbatch script:

#!/bin/sh
#SBATCH --job-name ID9178_spec
#SBATCH --error ID9178_spec_error.e%j
#SBATCH --output ID9178-spec_out.o%j
#SBATCH --partition private-astro-cpu
#SBATCH --time 2-00:00:00
#SBATCH --mem=8000
./launch_spec.csh ID9178 params.txt SDSSTG9178W_reg2.reg

Many thanks
Dominique

Hello @Yann.Sagon

Now it looks like I am not able to launch jobs anymore; I get the following message:

12062540 private-a ID9178_s eckertd PD 0:00 1 (ReqNodeNotAvail,UnavailableNodes:cpu[123-124,135-150])

Cheers
Dominique


I see the same message

Hi, this is fixed; it was a mistake on our part.

Hi,

If you see the same issue while running a single job, we should first try to improve the performance for that case.

Accessing the DB on scratch from the node should not be an issue; on Lesta I guess the infrastructure is the same. The problem may be that the storage is heavily used, and therefore slow, because of other jobs already running.

I checked your scripts a little: another possible issue is that you seem to use a local Python + NumPy installation in your home directory instead of the one provided via modules. That means the bottleneck could also be the home directory or the Python installation.

Right now, both storage systems are pretty fast, so this shouldn’t be an issue.

You can see that copying the whole DB from scratch to the node takes only about 32 seconds:

(yggdrasil)-[root@cpu001 ~]$ time /bin/cp -af /srv/beegfs/scratch/shares/astro/xmm/ccf/* /tmp/test/

real    0m31.985s
user    0m0.057s
sys     0m6.241s

So if the bottleneck really is file-access latency, it may be worth pre-copying the DB to the node at the start of the job (see the sketch below).
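Something like this in the sbatch script would do the staging (a rough sketch: the paths are the ones from this thread, $SLURM_JOB_ID is set by Slurm, and SAS_CCFPATH is only an assumption about how launch_spec.csh locates the CCF, so adjust that part to your pipeline):

#!/bin/sh
#SBATCH --job-name ID9178_spec
#SBATCH --partition private-astro-cpu
#SBATCH --time 2-00:00:00
#SBATCH --mem=8000

# Stage the calibration DB on the node-local disk (~30 s according to the test above).
CCF_LOCAL=/tmp/ccf_${SLURM_JOB_ID}
mkdir -p "$CCF_LOCAL"
/bin/cp -af /srv/beegfs/scratch/shares/astro/xmm/ccf/* "$CCF_LOCAL"/

# Point the pipeline at the local copy (SAS_CCFPATH is an assumption, see above).
SAS_CCFPATH="$CCF_LOCAL"
export SAS_CCFPATH

./launch_spec.csh ID9178 params.txt SDSSTG9178W_reg2.reg

# Remove the local copy when the job is done.
rm -rf "$CCF_LOCAL"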

Can you try another run and measure the time? Please send us an email when you start your jobs so we can monitor them as well.

Best

Thanks a lot for your reply @Yann.Sagon and for fixing the issue with launching jobs.

I have just launched a new job and I’ll let you know whether it is faster. I am now using the standard Python version:

(base) (yggdrasil)-[eckertd@login1 ID9178]$ which python
/opt/ebsofts/Python/3.8.2-GCCcore-9.3.0/bin/python

Hi @Yann.Sagon, things have not changed: my latest job still shows huge latency. For instance, it has now done nothing for the past ~40 minutes.

Cheers
Dominique

Hi, I’m checking job 12066105 on node cpu145.yggdrasil.

Tracing your main job produces no output.

strace -p 121140
strace: Process 121140 attached

And tracing the parent (perl) just produces a “wait”.
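If you want to double-check this yourself, the kernel’s per-process I/O counters are an easy way to see whether a process reads or writes anything (using the PID from the trace above):

# Sample the I/O accounting twice; if read_bytes/write_bytes do not grow
# between the two samples, the process is doing no disk I/O.
cat /proc/121140/io
sleep 30
cat /proc/121140/io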

It seems that the backscale program is doing nothing, no I/O, and is perhaps stuck in an infinite loop. Are you the author of this script?

In the error log you have this line:

(yggdrasil)-[root@cpu145 ~]$ tail  /home/users/e/eckertd/XMM/Groups/ID9178_spec_error.e12066105
rm: cannot remove ‘/tmp/spe_18399’: No such file or directory

According to the htop output, you aren’t using the Python provided by modules. You should add this line to your sbatch script, as shown in the sketch below:

ml GCC/11.2.0  OpenMPI/4.1.1 Python/3.9.6 SciPy-bundle/2021.10
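For example, in the sbatch script you posted earlier, the line goes after the #SBATCH directives and before the pipeline is launched (a sketch; keep your other directives as they are):

#SBATCH --mem=8000

# Load the centrally provided Python stack instead of a local one in the home directory.
ml GCC/11.2.0  OpenMPI/4.1.1 Python/3.9.6 SciPy-bundle/2021.10

./launch_spec.csh ID9178 params.txt SDSSTG9178W_reg2.reg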

Thanks @Yann.Sagon. I didn’t write this script myself; it is part of a package distributed by the European Space Agency.

I am trying to update the code to the latest version now and will see whether the issue could be due to some bug in this specific executable.

Cheers
Dominique

Hi @Yann.Sagon, upgrading the corresponding package to the newest version has worked: the jobs now take less than an hour to complete. So it was an issue with that particular version of the executable rather than with Yggdrasil. Apologies for the noise, and thanks for your help.

Dominique