Job manager Slurm failing to submit jobs (Baobab)

Hi,

I have an issue when submitting jobs with the srun command, either manually from the command line or through an sbatch script.

The program we are using is installed at:
/sst1m/sw/prod5/sim_telarray/bin//sim_telarray

The srun command used to submit the job manually is the following:

user$ srun /sst1m/sw/prod5/sim_telarray/bin//sim_telarray -I/sst1m/data/prod5/simtel/mono-lst-sipm-borofloat-3ns/rate_scan_nsbx2/cfg/ -c /sst1m/data/prod5/simtel/mono-lst-sipm-borofloat-3ns/rate_scan_nsbx2/cfg//CTA-PROD5-LaPalma-baseline_4LSTs_MAGIC.cfg -DNUM_TELESCOPES=1 -DNO_STEREO_TRIGGER=1 -C min_photons=0 -C min_photoelectrons=0 -C save_photons=3 -C only_triggered_telescopes=1 -C only_triggered_arrays=1 -C random_state=auto -C show=all -C maximum_events=100000 -C maximum_telescopes=1 -C telescope_phi=180 -C telescope_zenith_angle=20 -C asum_threshold=300 -C trigger_current_limit=2000.0 -C nightsky_background=all:0.1076 -C nsb_scaling_factor=2 -C dark_events=0 -C pedestal_events=0 -h /sst1m/data/prod5/simtel/mono-lst-sipm-borofloat-3ns/rate_scan_nsbx2/output//hist/dummy100000_asum_threshold_300.hdata -o /sst1m/data/prod5/simtel/mono-lst-sipm-borofloat-3ns/rate_scan_nsbx2/output/dummy100000_asum_threshold_300.simtel.gz /sst1m/data/prod5/corsika/dummy//dummy100000.corsika.gz > /sst1m/data/prod5/simtel/mono-lst-sipm-borofloat-3ns/rate_scan_nsbx2/output//log/dummy100000_asum_threshold_300.log

The error message obtained when running the command above was:

srun: job 42071227 queued and waiting for resources
srun: job 42071227 has been allocated resources
slurmstepd: error: couldn't chdir to `/sst1m/data/prod5/simtel/mono-lst-sipm-borofloat-3ns/rate_scan_nsbx2': No such file or directory: going to /tmp instead
slurmstepd: error: execve(): /sst1m/sw/prod5/sim_telarray/bin//sim_telarray: No such file or directory
srun: error: node001: task 0: Exited with exit code 2

However, if we avoid srun (launching the entire command line manually, without the job manager), the program runs:

user$ /sst1m/sw/prod5/sim_telarray/bin//sim_telarray -I/sst1m/data/prod5/simtel/mono-lst-sipm-borofloat-3ns/rate_scan_nsbx2/cfg/ -c /sst1m/data/prod5/simtel/mono-lst-sipm-borofloat-3ns/rate_scan_nsbx2/cfg//CTA-PROD5-LaPalma-baseline_4LSTs_MAGIC.cfg -DNUM_TELESCOPES=1 -DNO_STEREO_TRIGGER=1 -C min_photons=0 -C min_photoelectrons=0 -C save_photons=3 -C only_triggered_telescopes=1 -C only_triggered_arrays=1 -C random_state=auto -C show=all -C maximum_events=100000 -C maximum_telescopes=1 -C telescope_phi=180 -C telescope_zenith_angle=20 -C asum_threshold=300 -C trigger_current_limit=2000.0 -C nightsky_background=all:0.1076 -C nsb_scaling_factor=2 -C dark_events=0 -C pedestal_events=0 -h /sst1m/data/prod5/simtel/mono-lst-sipm-borofloat-3ns/rate_scan_nsbx2/output//hist/dummy100000_asum_threshold_300.hdata -o /sst1m/data/prod5/simtel/mono-lst-sipm-borofloat-3ns/rate_scan_nsbx2/output/dummy100000_asum_threshold_300.simtel.gz /sst1m/data/prod5/corsika/dummy//dummy100000.corsika.gz > /sst1m/data/prod5/simtel/mono-lst-sipm-borofloat-3ns/rate_scan_nsbx2/output//log/dummy100000_asum_threshold_300.log

yielding the correct output:

Configuration file is '/sst1m/data/prod5/simtel/mono-lst-sipm-borofloat-3ns/rate_scan_nsbx2/cfg//CTA-PROD5-LaPalma-baseline_4LSTs_MAGIC.cfg'.
Preprocessor is '/sst1m/sw/prod5/sim_telarray/bin//pfp -v -I. -DNUM_TELESCOPES=1 -DNO_STEREO_TRIGGER=1 -DWITH_LOW_GAIN_CHANNEL -DMAX_GAINS=2 -DSIMTEL_VERSION=1593356843 -DSIMTEL_RELEASE=20200628 -I/sst1m/data/prod5/simtel/mono-lst-sipm-borofloat-3ns/rate_scan_nsbx2/cfg/ -I. -I/sst1m/sw/prod5/sim_telarray/cfg -I/sst1m/sw/prod5/sim_telarray/cfg/common -I/sst1m/sw/prod5/sim_telarray/cfg/hess -I/sst1m/sw/prod5/sim_telarray/cfg/hess2 -I/sst1m/sw/prod5/sim_telarray/cfg/hess3 -I/sst1m/sw/prod5/sim_telarray/cfg/hess5000 -I/sst1m/sw/prod5/sim_telarray/cfg/CTA'.
Read atmospheric transmission data from file atm_trans_2158_1_3_2_0_0_0.1_0.1.dat
Got 800 wavelength intervals for 41 heights starting at 2.158 km
Preprocessor command: /sst1m/sw/prod5/sim_telarray/bin//pfp -v -I. -DNUM_TELESCOPES=1 -DNO_STEREO_TRIGGER=1 -DWITH_LOW_GAIN_CHANNEL -DMAX_GAINS=2 -DSIMTEL_VERSION=1593356843 -DSIMTEL_RELEASE=20200628 -I/sst1m/data/prod5/simtel/mono-lst-sipm-borofloat-3ns/rate_scan_nsbx2/cfg/ -I. -I/sst1m/sw/prod5/sim_telarray/cfg -I/sst1m/sw/prod5/sim_telarray/cfg/common -I/sst1m/sw/prod5/sim_telarray/cfg/hess -I/sst1m/sw/prod5/sim_telarray/cfg/hess2 -I/sst1m/sw/prod5/sim_telarray/cfg/hess3 -I/sst1m/sw/prod5/sim_telarray/cfg/hess5000 -I/sst1m/sw/prod5/sim_telarray/cfg/CTA -DMAX_GAINS=2 -DTELESCOPE=1 - < /sst1m/data/prod5/simtel/mono-lst-sipm-borofloat-3ns/rate_scan_nsbx2/cfg//CTA-PROD5-LaPalma-baseline_4LSTs_MAGIC.cfg
Table with 53 rows has been read from file CTA-LST_lightguide_eff_SST1M.dat

Warning: CORSIKA producing only photons in the range 200 to 700 nm
but telescope 1 has sensitivity from 300 to 790 nm.
Extending the range to 200 to 790 nm would imply 1.0191 times bigger bunches.
No such correction is implemented (but could be done unless CEFFIC or CERWLEN are used).
The impact on the signal though is expected to be negligible. No problem.

The task should also be launchable through an sbatch script containing:

#!/bin/bash
#SBATCH --partition=debug-EL7
#SBATCH --time=00:03:00
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=2200 # in MB
#SBATCH --output=/sst1m/data/prod5/simtel/mono-lst-sipm-borofloat-3ns/rate_scan_nsbx2/run///log/job_sim_telarray_parameter_scan_12.log
#SBATCH --error=/sst1m/data/prod5/simtel/mono-lst-sipm-borofloat-3ns/rate_scan_nsbx2/run///error/job_sim_telarray_parameter_scan_12.err

srun /sst1m/sw/prod5/sim_telarray/bin//sim_telarray -I/sst1m/data/prod5/simtel/mono-lst-sipm-borofloat-3ns/rate_scan_nsbx2/cfg/ -c /sst1m/data/prod5/simtel/mono-lst-sipm-borofloat-3ns/rate_scan_nsbx2/cfg//CTA-PROD5-LaPalma-baseline_4LSTs_MAGIC.cfg -DNUM_TELESCOPES=1 -DNO_STEREO_TRIGGER=1 -C min_photons=0 -C min_photoelectrons=0 -C save_photons=3 -C only_triggered_telescopes=1 -C only_triggered_arrays=1 -C random_state=auto -C show=all -C maximum_events=100000 -C maximum_telescopes=1 -C telescope_phi=180 -C telescope_zenith_angle=20 -C asum_threshold=300 -C trigger_current_limit=2000.0 -C nightsky_background=all:0.1076 -C nsb_scaling_factor=2 -C dark_events=0 -C pedestal_events=0 -h /sst1m/data/prod5/simtel/mono-lst-sipm-borofloat-3ns/rate_scan_nsbx2/output//hist/dummy100000_asum_threshold_300.hdata -o /sst1m/data/prod5/simtel/mono-lst-sipm-borofloat-3ns/rate_scan_nsbx2/output/dummy100000_asum_threshold_300.simtel.gz /sst1m/data/prod5/corsika/dummy//dummy100000.corsika.gz > /sst1m/data/prod5/simtel/mono-lst-sipm-borofloat-3ns/rate_scan_nsbx2/output//log/dummy100000_asum_threshold_300.log

which also fails, we believe because of the srun command.
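
To rule out sim_telarray itself, the same behaviour can be checked with a trivial command on the same path (a minimal sketch; output not shown here):

user$ ls -d /sst1m/sw/prod5/sim_telarray/bin          # fine when run directly on the login node
user$ srun ls -d /sst1m/sw/prod5/sim_telarray/bin     # should fail the same way under the job manager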

We would like to know how to solve this issue so we can resume our jobs.
Thanks in advance

Hello David,

We had to dig a little to find the problem, and I believe your issue is the following.

Your code works when you don’t use srun because the commands are executed on login2 (Baobab).
On login2, the mount point for /sst1m is set up when the server boots, from:
nasac-evs1.unige.ch:/s-astro/archive/walter/sst1m

When you use srun, your code is executed on one or more compute nodes of the cluster. On the nodes, you access /sst1m through autofs.

autofs is a program for automatically mounting directories on an as-needed basis: auto-mounted directories are mounted only when they are accessed.
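
You can see the difference yourself with a quick check (a sketch only; the partition is the one from your script):

user$ df -h /sst1m                              # on login2: shows the export mounted at boot
user$ srun --partition=debug-EL7 df -h /sst1m   # on a compute node: triggers the autofs lookup, which currently fails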

Now, the issue is that someone renamed the folder you are trying to access (note the “_old” after walter):
nasac-evs1.unige.ch:/s-astro/archive/walter_old/sst1m

Your configuration changed, but we were not notified of this change, so our configuration files still point to the old path:
nasac-evs1.unige.ch:/s-astro/archive/walter/sst1m

Now, login2 hasn’t been rebooted in some time, and at boot time the mount point was still valid: the name changed from walter to walter_old, but the inode didn’t, so the existing mount kept working.
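
You can reproduce this effect on any local filesystem (an illustration with made-up paths under /tmp, nothing to do with the real mounts):

user$ mkdir -p /tmp/walter/sst1m && cd /tmp/walter/sst1m
user$ mv /tmp/walter /tmp/walter_old    # rename the parent while our shell is inside it
user$ ls .                              # still works: the inode is unchanged
user$ pwd                               # still prints /tmp/walter/sst1m (the stale logical path)
user$ pwd -P                            # the physical path: /tmp/walter_old/sst1m
user$ ls /tmp/walter/sst1m              # a fresh lookup by the old name fails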

You will have to decide internally, within the astro department, which path we should use since, as far as I know, the data you are trying to access has been (or is being) migrated to a new storage system.

We have contacted Roland Walter on this matter.

Let us know as soon as a decision has been made and we will update the autofs configuration accordingly.

In the meantime, if your computations are really urgent, I suggest the following workarounds:
a) Howto access external storage from Baobab - #8 by Silas.Kieser
b) Copy the files your code needs into your $HOME/scratch folder until you can access the shared space again (see the sketch below).
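
For (b), something along these lines from login2, where /sst1m is still mounted (the subfolder is just an example, adjust it to the files you actually need):

user$ mkdir -p $HOME/scratch/data/prod5/corsika
user$ rsync -avP /sst1m/data/prod5/corsika/dummy/ $HOME/scratch/data/prod5/corsika/dummy/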

I hope this helps!

All the best,

Massimo Brero

Hi Massimo,

Thank you for your answer.
Indeed, it was brought to my attention that the migration to Yggdrasil has already been done. However, the files under /sst1m on Baobab were left untouched. Here is what I was told:

The SST-1M data that were previously located on the academic NAS have been moved to a new ISILON disk storage, to be mounted on the computing nodes of Yggdrasil. “To be mounted” means that the data are there but not yet formally available from the compute nodes, because a decision was needed on how to implement these mount points (and the related unix groups) for various projects university-wide. I was told a few days ago that the decision had been taken and is being implemented, but the white smoke is not there yet. Apparently the data are already available here:
/srv/verso/projects/cta/sst1m
but this will probably still change.

So on Yggdrasil, under the path /srv/verso/projects/cta/sst1m, I found the data that was migrated from Baobab; its mount path might still change in the coming days.

Therefore I decided to keep working on Baobab. However, the data set under /sst1m/data/prod5/corsika is incomplete, so I resumed a download from an external server; that download is still ongoing and should finish during the coming weekend.

What I understand from your reply is that the data currently being downloaded lands under the path:
nasac-evs1.unige.ch:/s-astro/archive/walter_old/sst1m
Is this correct?

I can’t confirm this myself: when I run the pwd command I only get the following output:
/sst1m/data/prod5/corsika
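
If it helps, I believe a command like findmnt (or df) would show the actual export behind the mount point, rather than the logical path that pwd prints:

user$ findmnt -T /sst1m/data/prod5/corsika   # the SOURCE column should show server:/export
user$ df -h /sst1m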

However, I would think the best solution is to avoid migrating more data to the scratch folder and instead update the configuration to the new path:
nasac-evs1.unige.ch:/s-astro/archive/walter_old/sst1m
As far as I understand, this is only a Baobab issue and will not affect anything under Yggdrasil, correct?
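
I imagine the change on your side would be something like the following autofs direct-map entry (a sketch only; I don’t know how the maps are actually organised on Baobab):

# in /etc/auto.master, a direct map:
/-      /etc/auto.direct
# in /etc/auto.direct, the updated export path:
/sst1m  -fstype=nfs,rw  nasac-evs1.unige.ch:/s-astro/archive/walter_old/sst1m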

Also, I guess other users of /sst1m may have the same issue, not only me.

Let me know if this change is possible, or whether we need authorization from Roland Walter.

Best regards,

David

After some discussion over private email, I understood that working on Baobab isn’t the best option: it will eventually be unmounted and later deleted. So I started migrating my data to the provisional path /srv/verso/projects/cta/sst1m on Yggdrasil and implemented the workaround suggested by Massimo.

It seems something changed during this afternoon, which was expected (perhaps the disk was unmounted without the node being properly reset), because now I am able to reproduce on Yggdrasil the issue reported above.

When launching with sbatch or with the srun command, I am not able to run my simulations:

(py37) [medinami@login1 run]$ srun /home/users/m/medinami/software/corsika_simtel_2020-06-29/sim_telarray/bin//sim_telarray -I/home/users/m/medinami/scratch/data/prod5/simtel/mono-lst-sipm-borofloat-3ns/rate_scan_proton_nsbx0/cfg/ -c /home/users/m/medinami/scratch/data/prod5/simtel/mono-lst-sipm-borofloat-3ns/rate_scan_proton_nsbx0/cfg//CTA-PROD5-LaPalma-baseline_4LSTs_MAGIC.cfg -DNUM_TELESCOPES=1 -DNO_STEREO_TRIGGER=1 -C min_photons=0 -C min_photoelectrons=0 -C save_photons=3 -C only_triggered_telescopes=1 -C only_triggered_arrays=1 -C random_state=auto -C show=all -C maximum_events=100000000 -C maximum_telescopes=1 -C telescope_phi=180 -C telescope_zenith_angle=20 -C asum_threshold=50 -C trigger_current_limit=2000.0 -C nightsky_background=all:0.1076 -C nsb_scaling_factor=0 -C dark_events=0 -C pedestal_events=0 -h /home/users/m/medinami/scratch/data/prod5/simtel/mono-lst-sipm-borofloat-3ns/rate_scan_proton_nsbx0/output//hist/corsika_run3479_asum_threshold_50.hdata -o /home/users/m/medinami/scratch/data/prod5/simtel/mono-lst-sipm-borofloat-3ns/rate_scan_proton_nsbx0/output/corsika_run3479_asum_threshold_50.simtel.gz /srv/verso/projects/sst1m/data/prod5/corsika/proton/rate_scan//corsika_run3479.corsika.gz > /home/users/m/medinami/scratch/data/prod5/simtel/mono-lst-sipm-borofloat-3ns/rate_scan_proton_nsbx0/output//log/corsika_run3479_asum_threshold_50.log
srun: job 1116382 queued and waiting for resources
srun: job 1116382 has been allocated resources
Configuration file is '/home/users/m/medinami/scratch/data/prod5/simtel/mono-lst-sipm-borofloat-3ns/rate_scan_proton_nsbx0/cfg//CTA-PROD5-LaPalma-baseline_4LSTs_MAGIC.cfg'.
Preprocessor is '/home/users/m/medinami/software/corsika_simtel_2020-06-29/sim_telarray/bin//pfp -v -I. -DNUM_TELESCOPES=1 -DNO_STEREO_TRIGGER=1 -DWITH_LOW_GAIN_CHANNEL -DMAX_GAINS=2 -DSIMTEL_VERSION=1593356843 -DSIMTEL_RELEASE=20200628 -I/home/users/m/medinami/scratch/data/prod5/simtel/mono-lst-sipm-borofloat-3ns/rate_scan_proton_nsbx0/cfg/ -I. -I/home/users/m/medinami/software/corsika_simtel_2020-06-29/sim_telarray/cfg -I/home/users/m/medinami/software/corsika_simtel_2020-06-29/sim_telarray/cfg/common -I/home/users/m/medinami/software/corsika_simtel_2020-06-29/sim_telarray/cfg/hess -I/home/users/m/medinami/software/corsika_simtel_2020-06-29/sim_telarray/cfg/hess2 -I/home/users/m/medinami/software/corsika_simtel_2020-06-29/sim_telarray/cfg/hess3 -I/home/users/m/medinami/software/corsika_simtel_2020-06-29/sim_telarray/cfg/hess5000 -I/home/users/m/medinami/software/corsika_simtel_2020-06-29/sim_telarray/cfg/CTA'.
Read atmospheric transmission data from file atm_trans_2158_1_3_2_0_0_0.1_0.1.dat
Got 800 wavelength intervals for 41 heights starting at 2.158 km
Preprocessor command: /home/users/m/medinami/software/corsika_simtel_2020-06-29/sim_telarray/bin//pfp -v -I. -DNUM_TELESCOPES=1 -DNO_STEREO_TRIGGER=1 -DWITH_LOW_GAIN_CHANNEL -DMAX_GAINS=2 -DSIMTEL_VERSION=1593356843 -DSIMTEL_RELEASE=20200628 -I/home/users/m/medinami/scratch/data/prod5/simtel/mono-lst-sipm-borofloat-3ns/rate_scan_proton_nsbx0/cfg/ -I. -I/home/users/m/medinami/software/corsika_simtel_2020-06-29/sim_telarray/cfg -I/home/users/m/medinami/software/corsika_simtel_2020-06-29/sim_telarray/cfg/common -I/home/users/m/medinami/software/corsika_simtel_2020-06-29/sim_telarray/cfg/hess -I/home/users/m/medinami/software/corsika_simtel_2020-06-29/sim_telarray/cfg/hess2 -I/home/users/m/medinami/software/corsika_simtel_2020-06-29/sim_telarray/cfg/hess3 -I/home/users/m/medinami/software/corsika_simtel_2020-06-29/sim_telarray/cfg/hess5000 -I/home/users/m/medinami/software/corsika_simtel_2020-06-29/sim_telarray/cfg/CTA -DMAX_GAINS=2 -DTELESCOPE=1 - < /home/users/m/medinami/scratch/data/prod5/simtel/mono-lst-sipm-borofloat-3ns/rate_scan_proton_nsbx0/cfg//CTA-PROD5-LaPalma-baseline_4LSTs_MAGIC.cfg
Table with 53 rows has been read from file CTA-LST_lightguide_eff_SST1M.dat
/srv/verso/projects/sst1m/data/prod5/corsika/proton/rate_scan//corsika_run3479.corsika.gz: No such file or directory

It says the file is not found, yet the file actually exists, as you can see:

(py37) [medinami@login1 error]$ ls /srv/verso/projects/sst1m/data/prod5/corsika/proton/rate_scan/
corsika_run101.corsika.gz  corsika_run1768.corsika.gz  corsika_run2142.corsika.gz  corsika_run3479.corsika.gz  corsika_run388.corsika.gz
corsika_run134.corsika.gz  corsika_run1796.corsika.gz  corsika_run2696.corsika.gz  corsika_run3530.corsika.gz  corsika_run4196.corsika.gz
corsika_run155.corsika.gz  corsika_run182.corsika.gz   corsika_run3442.corsika.gz  corsika_run3779.corsika.gz  corsika_run522.corsika.gz
(py37) [medinami@login1 error]$ ls /srv/verso/projects/sst1m/data/prod5/corsika/proton/rate_scan//corsika_run3479.corsika.gz
/srv/verso/projects/sst1m/data/prod5/corsika/proton/rate_scan//corsika_run3479.corsika.gz
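
The same kind of check as on Baobab should tell whether the compute nodes see the path at all (a sketch; srun options omitted):

# if this also reports 'No such file or directory', the nodes are missing the mount:
user$ srun ls /srv/verso/projects/sst1m/data/prod5/corsika/proton/rate_scan/corsika_run3479.corsika.gz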

So basically I believe the same mistake that was made with the Baobab nodes has now been reproduced on Yggdrasil.

At this point I am stuck, since I can’t use the data on either path (Baobab or Yggdrasil) to launch my simulations.
Furthermore, no one in our group is able to use the university’s resources or the data stored on them.
Would you mind giving us a deadline for this to be properly fixed, so I can report it to my colleagues?

Best regards,

David

Hi David,

You can now reach your data from Yggdrasil or Baobab under: /srv/verso/projects/sst1m/

It would be better to access it from Yggdrasil, since the storage is located in Sauverny, as Yggdrasil is.

Best regards,

Remy

Hi Remy,

It is working correctly now.
I’ll let you know if I encounter any issues, but I have already tried it out with some test simulations.

I also started migrating the remaining data from Baobab to the location you pointed to on Yggdrasil (/srv/verso/projects/sst1m/); it should be done in a couple of days.
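
In case it is useful to others, the migration is essentially an rsync run where both trees are visible (the subpath below is illustrative):

user$ rsync -avP /sst1m/data/prod5/ /srv/verso/projects/sst1m/data/prod5/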

Thank you very much for the help.

Best regards,

David