When I run Python scripts on baobab that access /scratch/, I get the following error:
OSError: [Errno 121] Remote I/O error:
when performing writes to /scratch/ (mkdir etc.). It seems like the connection drops in and out: my scripts run fine for a while, but at some point a write to /scratch/ fails with the error above.
It happens when running sbatch jobs on gpu002.
I have run multiple checks on the BeeGFS scratch from gpu002 and everything seemed to be OK: I wrote a file and removed it in your home folder. When you say you write to /scratch/, do you mean "/srv/beegfs/scratch/users/a/algren"?
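That kind of health check (write a test file, then remove it) can be scripted so it is easy to repeat on each node. A minimal sketch, assuming the checking user has write access to the directory; the path passed in is just an example:

```python
import os


def check_write(directory: str) -> bool:
    """Try to create and then remove a test file in `directory`.

    Returns True if both operations succeed, False otherwise
    (an OSError here is the same failure mode the jobs are hitting).
    """
    test_path = os.path.join(directory, f".write_check_{os.getpid()}")
    try:
        with open(test_path, "w") as f:
            f.write("ok")
        os.remove(test_path)
        return True
    except OSError as exc:
        print(f"FAIL: cannot write to {directory}: {exc}")
        return False


# e.g. check_write("/srv/beegfs/scratch/users/a/algren")
```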
Monitoring reported errors this morning, but now everything is OK. Did you run the job before 2:00 am this morning?
Yes, it's the connection to my own /srv/beegfs/scratch/users/a/algren, and the issue doesn't seem to be fixed. I ran multiple jobs on multiple GPUs today; some of them hit the issue and others are still running.
This does not happen when the jobs start. Our jobs run for minutes or hours, but at some point, when writing to /scratch/ (saving a figure or a model), we get this I/O error, as if Python cannot reach the scratch.
I also saw the same issue last week.
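While the root cause is being investigated, a defensive workaround on the job side is to retry writes that fail with a transient OSError (such as Errno 121), instead of letting the whole job die. This is only a sketch; the retry count and delay are arbitrary assumptions to tune for your own jobs:

```python
import time


def retry_on_ioerror(fn, retries=3, delay=5.0):
    """Call fn(), retrying on OSError (e.g. [Errno 121] Remote I/O error).

    `retries` and `delay` are arbitrary defaults, not recommended values.
    Re-raises the last error if all attempts fail.
    """
    for attempt in range(retries):
        try:
            return fn()
        except OSError as exc:
            if attempt == retries - 1:
                raise
            print(f"write failed ({exc}), retrying in {delay}s")
            time.sleep(delay)


# e.g. wrap a save that touches /scratch/:
# retry_on_ioerror(lambda: fig.savefig("/srv/beegfs/scratch/users/a/algren/plot.png"))
```

This does not fix anything on the filesystem side, but it lets a long job survive a short outage instead of losing hours of work.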
mkdir: cannot create directory ‘/home/users/e/ehrke/scratch/ttcharm/GraphCombinatorics/logs/topotransformer/2023-04-11--09-32_24240_100’: Remote I/O error
/var/spool/slurmd/job24240/slurm_script: line 21: /home/users/e/ehrke/scratch/ttcharm/GraphCombinatorics/logs/topotransformer/2023-04-11--09-32_24240_100/script.out: No such file or directory
The issue in your case is that you hit the maximum number of files (the per-user file-count quota). Once the limit is reached, every create fails, but as soon as a file is erased you can create a new one again; that explains why it sometimes worked and sometimes not.
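To see how close a directory tree is to such a limit, you can count the files under it (the actual quota value has to come from the admins; on BeeGFS the quota tooling, if enabled, can also report it). A simple portable sketch using only the standard library; the path in the comment is just an example:

```python
import os


def count_files(root: str) -> int:
    """Count regular files under `root`, to compare against a
    per-user file-count quota (ask the admins for the actual limit)."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total


# e.g. count_files("/srv/beegfs/scratch/users/a/algren")
```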