BlockingIOError on Yggdrasil

Hi!

When I try to run a job on Yggdrasil I get a BlockingIOError which I can't sort out. I've pasted the full error message below. The strange thing is that this error is not raised when I don't go via the Slurm queue but just use 'salloc' to run it on a compute node. Does anyone know what is going on?

Felix

Traceback (most recent call last):
  File "/srv/beegfs/scratch/users/f/fvecchi/Run/shear_fit_zs_LastShell_noise.py", line 204, in <module>
    result = search.fit(model=model, analysis=analysis)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/f/fvecchi/cosma_home/Code/AllProjects/PyAutoFit/autofit/non_linear/search/abstract_search.py", line 594, in fit
    result = self.start_resume_fit(
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/f/fvecchi/cosma_home/Code/AllProjects/PyAutoFit/autofit/non_linear/search/abstract_search.py", line 115, in decorated
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/f/fvecchi/cosma_home/Code/AllProjects/PyAutoFit/autofit/non_linear/search/abstract_search.py", line 721, in start_resume_fit
    search_internal = self._fit(
                      ^^^^^^^^^^
  File "/home/users/f/fvecchi/cosma_home/Code/AllProjects/PyAutoFit/autofit/non_linear/search/nest/nautilus/search.py", line 149, in _fit
    search_internal = self.fit_multiprocessing(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/f/fvecchi/cosma_home/Code/AllProjects/PyAutoFit/autofit/non_linear/search/nest/nautilus/search.py", line 265, in fit_multiprocessing
    return self.call_search(search_internal=search_internal, model=model, analysis=analysis)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/f/fvecchi/cosma_home/Code/AllProjects/PyAutoFit/autofit/non_linear/search/nest/nautilus/search.py", line 322, in call_search
    search_internal.run(
  File "/home/users/f/fvecchi/cosma_home/env3.11/lib/python3.11/site-packages/nautilus/sampler.py", line 448, in run
    self.write_shell_update(self.filepath, -1)
  File "/home/users/f/fvecchi/cosma_home/env3.11/lib/python3.11/site-packages/nautilus/sampler.py", line 1319, in write_shell_update
    fstream = h5py.File(Path(filepath), 'r+')
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/f/fvecchi/cosma_home/env3.11/lib/python3.11/site-packages/h5py/_hl/files.py", line 561, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/f/fvecchi/cosma_home/env3.11/lib/python3.11/site-packages/h5py/_hl/files.py", line 237, in make_fid
    fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 102, in h5py.h5f.open
BlockingIOError: [Errno 11] Unable to synchronously open file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')
srun: error: cpu073: tasks 0-1: Exited with exit code 1
srun: error: cpu073: task 2: Exited with exit code 1
srun: First task exited 30s ago
srun: StepId=35948812.0 task 3: running
srun: StepId=35948812.0 tasks 0-2: exited abnormally
srun: Terminating StepId=35948812.0
slurmstepd: error: *** STEP 35948812.0 ON cpu073 CANCELLED AT 2024-10-18T14:23:47 ***
srun: Job step aborted: Waiting up to 92 seconds for job step to finish.

Hi,

Please share your sbatch script with us.

Best

Yann

Sure!

#!/bin/bash -l

#SBATCH --ntasks 4
#SBATCH -J Nzs1.6
#SBATCH -o standard_output_file.%J.out
#SBATCH -e standard_error_file.%J.err
#SBATCH -p public-cpu
#SBATCH -t 24:00:00
#SBATCH --mail-type=END                          # notifications for job done & fail
#SBATCH --mail-user=<felix.vecchi@epfl.ch>

srun python shear_fit_zs_LastShell_noise.py 1.6 1

And before I sbatch this file I load Python 3.11 and activate my virtual environment.
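Concretely, it's something like this (the module name is from memory, so it may not be exact, and 'submit.sh' is just a placeholder for the sbatch file above):

module load Python/3.11                      # or whatever the Python 3.11 module is called on Yggdrasil
source ~/cosma_home/env3.11/bin/activate     # activate the virtual environment
sbatch submit.sh                             # submit the sbatch file shown above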

I think I posted my answer at the thread level and not as a direct reply to you. Not sure if that matters, haha.

Hi,

You are requesting 4 tasks in your sbatch, and your Python code is probably not aware of how to handle this, so it gets executed 4 times in parallel; all 4 copies then try to open the same output file, which explains the blocking I/O error.

You should check our documentation to see the different job types: hpc:slurm [eResearch Doc]

In your case it is probably either single-core or multi-core, but not distributed.
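For example, something along these lines (an untested sketch, keeping your partition, time limit and job name) would launch the script only once and reserve 4 CPUs on one node for it:

#!/bin/bash -l

#SBATCH --ntasks 1                # launch the Python script a single time
#SBATCH --cpus-per-task 4         # give that one task 4 CPUs for in-process parallelism
#SBATCH -J Nzs1.6
#SBATCH -o standard_output_file.%J.out
#SBATCH -e standard_error_file.%J.err
#SBATCH -p public-cpu
#SBATCH -t 24:00:00
#SBATCH --mail-type=END
#SBATCH --mail-user=<felix.vecchi@epfl.ch>

# No srun needed for a single task; the script parallelises internally
python shear_fit_zs_LastShell_noise.py 1.6 1

The parallelism then has to come from inside the Python code itself (for example the multiprocessing pool your sampler already uses), not from Slurm starting several copies of the script.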

Best

I think the result is the same. Anyway, remember that you can still edit your post if needed, for example to add that detail to your previous post where you sent your sbatch.