I need to run a number of jobs which require a lot of input-output with a large calibration database. I have been running a number of them on Yggdrasil lately with the data and the calibration database located in my home directory. These jobs are extremely slow, taking >10h per instance compared to ~30 minutes on the Lesta server at the Observatory. Looking at their progress, I see that there is a lot of latency, which I guess is mostly driven by accessing the calibration database from the node. One possibility would be to copy the database locally to the node where the job is being run, but that is impractical because it is fairly large (~10 GB). On the other hand, putting all of this into a Singularity image would require a lot of memory.
Can you think of a better solution than what I have listed here?
Can you give us more details, such as:
- number of jobs in parallel
- type of DB
- share your sbatch script
- your home directory on Yggdrasil? Can you share the path?
Now it looks like I am not able to launch jobs anymore, I get the following message:
12062540 private-a ID9178_s eckertd PD 0:00 1 (ReqNodeNotAvail,UnavailableNodes:cpu[123-124,135-150])
Hi, this is fixed; it was a mistake on our part.
If you have the same issue while running a single job, we should first try to improve the performance for this use case.
Accessing the DB on scratch from the node should not be an issue. I guess the infrastructure on Lesta is the same. The issue may be that the storage is heavily used and slowed down by other jobs that are already running.
I had a quick look at your scripts. Another issue may be that you seem to use a local Python + numpy from your home directory instead of the one provided through module. This means the bottleneck could also be the home directory or the Python installation.
Right now, both storage systems are pretty fast, so this shouldn’t be an issue.
You can see that the time to copy the whole DB from scratch to the node is only 31s:
(yggdrasil)-[root@cpu001 ~]$ time /bin/cp -af /srv/beegfs/scratch/shares/astro/xmm/ccf/* /tmp/test/
So if the bottleneck really is the file access latency, it may be worth pre-copying the DB to the node.
Can you try another run and measure the time? Please send us an email when you start your jobs so we can monitor them as well.
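For reference, the pre-copy idea could look roughly like the sketch below. This is only an illustration under assumptions: `stage_db` is a hypothetical helper name, the destination under `/tmp` is a guess at where node-local space is available, and pointing the software at the copy through `SAS_CCFPATH` assumes the package reads its calibration path from that variable.

```shell
#!/bin/bash
# Sketch: copy a read-heavy database to node-local storage before the run.
# stage_db is a hypothetical helper, not part of any cluster tooling.
stage_db() {
    local src="$1"     # shared location of the database
    local dest="$2"    # node-local directory, e.g. somewhere under /tmp
    local start=$SECONDS
    mkdir -p "$dest" || return 1
    # "$src"/. copies the contents of src (dotfiles included) into dest;
    # -a recurses and preserves timestamps and permissions.
    cp -a "$src"/. "$dest"/ || return 1
    # Report the copy time on stderr so stdout only carries the path.
    echo "staged $src in $((SECONDS - start))s" >&2
    echo "$dest"
}

# Possible use inside a job script (paths and variable are assumptions):
# CCF_LOCAL=$(stage_db /srv/beegfs/scratch/shares/astro/xmm/ccf "/tmp/ccf_${SLURM_JOB_ID}")
# export SAS_CCFPATH="$CCF_LOCAL"
# trap 'rm -rf "$CCF_LOCAL"' EXIT    # clean up the node when the job ends
```

The copy time printed on stderr can be compared directly with the total runtime to see whether staging pays off.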
Thanks a lot for your reply @Yann.Sagon and for fixing the issue with launching jobs.
I have just launched a new job, I’ll let you know whether it is now faster. I am now using the standard Python version:
(base) (yggdrasil)-[eckertd@login1 ID9178]$ which python
Hi @Yann.Sagon, things have not changed: my latest job still has huge latency. For instance, it has now done nothing for the past ~40 minutes.
Hi, I’m checking job 12066105 on node
Tracing your main job produces no output.
strace -p 121140
strace: Process 121140 attached
And tracing the parent (perl) just produces a “wait”.
It seems that the backscale program is doing nothing, no IO, and may be stuck in an infinite loop. Are you the author of this script?
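As a complement to strace, one way to double-check the "no IO" observation is to sample the kernel's per-process IO counters twice and compare them. The sketch below assumes a Linux node where `/proc/<pid>/io` is available; `io_delta` is a hypothetical helper name, and reading the counters of another user's process normally requires root.

```shell
#!/bin/bash
# Sketch: sample /proc/<pid>/io twice to see whether a process moved any data.
# io_delta is a hypothetical helper; usage: io_delta PID [INTERVAL_SECONDS]
io_delta() {
    local pid="$1" interval="${2:-5}"
    local r1 w1 r2 w2 state
    # rchar/wchar count bytes passed through read- and write-like syscalls.
    r1=$(awk '/^rchar:/ {print $2}' "/proc/$pid/io") || return 1
    w1=$(awk '/^wchar:/ {print $2}' "/proc/$pid/io") || return 1
    sleep "$interval"
    r2=$(awk '/^rchar:/ {print $2}' "/proc/$pid/io") || return 1
    w2=$(awk '/^wchar:/ {print $2}' "/proc/$pid/io") || return 1
    # One-letter scheduler state: R running, S sleeping, D uninterruptible wait.
    state=$(awk '/^State:/ {print $2}' "/proc/$pid/status") || return 1
    echo "read=$((r2 - r1)) written=$((w2 - w1)) state=$state"
}

# Example: io_delta 121140 10
```

Zero deltas together with state `R` would be consistent with a process spinning on the CPU, whereas `S` or `D` would suggest it is waiting on something.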
In the error log you have this line:
(yggdrasil)-[root@cpu145 ~]$ tail /home/users/e/eckertd/XMM/Groups/ID9178_spec_error.e12066105
rm: cannot remove ‘/tmp/spe_18399’: No such file or directory
According to the htop output, you aren’t using the Python from module. You should add this line to your sbatch script:
ml GCC/11.2.0 OpenMPI/4.1.1 Python/3.9.6 SciPy-bundle/2021.10
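To make it easy to confirm which interpreter a job actually picks up, the module line can be followed by a small check near the top of the script. This is a sketch only: `python_origin` is a hypothetical helper, and the guard around `ml` is there just so the snippet also runs on machines without the module system.

```shell
#!/bin/bash
# Load the suggested toolchain when the module command is available.
if command -v ml >/dev/null 2>&1; then
    ml GCC/11.2.0 OpenMPI/4.1.1 Python/3.9.6 SciPy-bundle/2021.10
fi

# Hypothetical helper: report whether the interpreter on PATH lives under $HOME.
python_origin() {
    local py
    py=$(command -v "${1:-python}") || { echo "missing"; return 1; }
    case "$py" in
        "$HOME"/*) echo "home: $py" ;;     # private install in the home directory
        *)         echo "system: $py" ;;   # module or system-wide install
    esac
}

# Log the result to the job's error file without aborting the job.
python_origin python >&2 || true
```

A `home:` line in the error log would mean the job is still using a private installation rather than the module one.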
Thanks @Yann.Sagon. I didn’t write this script myself; it is part of a package distributed by the European Space Agency.
I am trying to update the code to the latest version now and will see whether the issue could be due to some bug in this specific executable.
Hi @Yann.Sagon, upgrading the corresponding package to the newest version has worked; the jobs now take less than an hour to complete. This means it was an issue with this particular version of the executable rather than with Yggdrasil. Apologies for that and thanks for your help.