Strange module load and I/O errors

Hi all,

I have a strange code error and cannot figure out what is wrong.

The issue
Generally, I am running the same script many times with different parameters using Slurm job arrays. These are very small simulations, each with 4GB of memory and about 6min of runtime. The first 2’000 runs completed without any issues. The next 4’000 runs with exactly the same code failed with some sort of module load issue. I don’t understand why the first batch ran and the second batch didn’t. If I now resubmit the first batch, it fails as well, so at this point nothing works anymore.
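For context, each array task simply picks its parameters from SLURM_ARRAY_TASK_ID, roughly like this (the parameter file and names here are placeholders, not my actual setup):

import csv
import os

# Slurm exports the array index for every task in the job array
task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])

# placeholder: params.csv holds one row of parameters per array task
with open("params.csv") as f:
    params = list(csv.reader(f))[task_id]

print(f"array task {task_id} -> parameters {params}")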

Example error message 1 (in .out file):

Fatal Python error: initsite: Failed to import the site module
OSError: [Errno 121] Remote I/O error

Example error message 2:

ERROR: Solver (gurobi) returned non-zero return code (1)
ERROR: Solver log: Traceback (most recent call last):
File "", line 5, in
NameError: name 'gurobi_run' is not defined
Traceback (most recent call last):
File "run_NoTargets_MinCost.py", line 37, in
model,results = solve_model_mincost(model)
File "/home/sasse/FOURTH_STUDY/EXPANSE/solve_model.py", line 44, in solve_model_mincost
options=opts['solv_opts'], tee=bool(opts['write_log_files']))
File "/home/sasse/baobab_python_env/lib/python3.7/site-packages/pyomo/opt/base/solvers.py", line 600, in solve
"Solver (%s) did not exit normally" % self.name)
pyutilib.common._exceptions.ApplicationError: Solver (gurobi) did not exit normally

What I tried

  • Increased the requested memory → still didn’t work
  • Chose a different partition → still didn’t work
  • Checked whether the Gurobi license is valid → license is ok
  • Looked for errors in my code → unlikely to be the cause, because everything worked until yesterday and the very same model runs now fail

Temporary solution
These issues were on Yggdrasil. I copied the code to Baobab and ran it there with zero issues. So I will work with the other cluster for now.

It would still be interesting to find out what the underlying issue on Yggdrasil is.

Jan

Hi there,

Can you please send us (here or in private) the Slurm JobID(s) that failed?

We were experiencing storage issues on the Yggdrasil ${HOME} last weekend (cf. Current issues on Baobab and Yggdrasil - #60 by Luca.Capello), so your I/O errors may be related to that.

Thx, bye,
Luca

Hi Luca,

One example run that failed specifically with error 1 is job ID 6041309, array task ID 1961, which ran on cpu044.yggdrasil.

Jan

Hi there,

Thanks:

[root@admin1 ~]# sacct -j 6041309 --format=JobID,Partition,AllocCPUS,State,ExitCode,NodeList,Start,End | grep 1961
6041309_1961 shared-cpu          5     FAILED      1:0          cpu044 2021-07-22T20:48:24 2021-07-22T20:48:26
[root@admin1 ~]# 

I could not find any error for that time frame on cpu044.yggdrasil, nor on the BeeGFS servers for connections from that node, sorry.

You could add an extra "check" to your Python script, something like ls -l ${PWD}, to be sure the underlying storage (BeeGFS ${HOME} or ${SCRATCH}) is still available; see the sketch below.
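Something along these lines at the top of the script (just a sketch; the ${SCRATCH} variable and the paths are placeholders to be adapted):

import os
import subprocess
import sys

# abort early if the BeeGFS mounts are not reachable
for path in (os.environ.get("HOME"), os.environ.get("SCRATCH"), os.getcwd()):
    if not path:
        continue
    try:
        subprocess.run(["ls", "-l", path], check=True)
    except (OSError, subprocess.CalledProcessError) as exc:
        sys.exit(f"storage check failed for {path}: {exc}")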

For error 2, while I do not know Gurobi at all, nor its Python bindings, I would say that the error is "the same" as the first one, since from what I have found on the net it should load gurobi_run.py (cf. python - pyinstaller ModuleNotFoundError: No module named 'GUROBI_RUN' - Stack Overflow).

It would be interesting to know if other users had similar I/O problems on Yggdrasil during the same time period.

Thx, bye,
Luca

Hi Luca,

Thanks a lot.

I am running the code on Baobab now, since it still does not work on Yggdrasil. I always get error 2:

ERROR: Solver (gurobi) returned non-zero return code (1)

The underlying storage looks available:

[sasse@login1 ~]$ ls -l ${PWD}
total 3
drwxr-xr-x 6 sasse unige 5 Dec 3 2020 baobab_python_env
drwxr-xr-x 10 sasse unige 13 Aug 10 15:47 CH_MUN_EXPANSE_BAU
drwxr-xr-x 10 sasse unige 13 May 18 20:16 CH_MUN_EXPANSE_ZERO
drwxr-xr-x 7 sasse unige 12 Jan 12 2021 expanse-eur
drwxr-xr-x 13 sasse unige 16 Jul 21 11:59 FOURTH_STUDY
lrwxr-xr-x 1 root root 34 Nov 13 2020 scratch -> /srv/beegfs/scratch/users/s/sasse/