I/O problems on scratch (baobab)

Hi,

When I run Python scripts on baobab that access /scratch/, I get the following error:

OSError: [Errno 121] Remote I/O error:

when writing to /scratch/ (e.g. mkdir and similar operations). The connection seems to come and go: my scripts run fine for a while, but at some point a write to /scratch/ fails with the error above.
It happens when running sbatch jobs on gpu002.
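
A rough probe I could run to illustrate the intermittent behaviour (the scratch path is my own; this is just a sketch, not part of the failing jobs):

#!/bin/sh
# Periodically write a small file to scratch and log every failure with a timestamp.
SCRATCH_DIR=/srv/beegfs/scratch/users/a/algren
while true; do
    if ! touch "$SCRATCH_DIR/io_probe_$$.tmp" 2>> io_probe_errors.log; then
        echo "$(date): write to $SCRATCH_DIR failed" >> io_probe_errors.log
    fi
    rm -f "$SCRATCH_DIR/io_probe_$$.tmp"
    sleep 60
done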

Malte

Hello,

I have run multiple checks on the BeeGFS scratch on gpu002 and everything seemed to be OK: I wrote a file in your home folder and removed it again. When you write to /scratch/, do you mean “/srv/beegfs/scratch/users/a/algren”?

Monitoring sent some errors this morning, but everything is OK now. Did you run the job before 2:00 am this morning?

Thanks,
Best,

Hi again,
Yes, it’s the connection to my own /srv/beegfs/scratch/users/a/algren, and the issue doesn’t seem to be fixed. I ran multiple jobs on multiple GPUs today; some of them hit the issue and others are still running.

Here are a few examples:

@Gael.Rossignol

I am still seeing this issue on the GPUs of the cluster. It happens when I try to write to /scratch/.

Dear @Malte.Algren

I tried the following without error:

(baobab)-[root@gpu002 dl1r]$ pwd
/srv/beegfs/scratch/users/a/algren/trained_networks/ftag_calib/sample_template/2023-04-17_09-28-05-745001_Flavor_Response_up/ftag_h5_sig_04_17_2023_09_28_05_816515/figures/dl1r
(baobab)-[root@gpu002 dl1r]$ touch dl1r_epoch_nr_xx.png

Please show us your sbatch script.

Best

Yann

I am getting the same error; it only happens sometimes when I try to save, and it appears to happen randomly.

Hi,

This does not happen when the jobs start. Our jobs run for minutes or hours, but at some point, when writing to /scratch/ (saving a figure or a model), we get this I/O error, as if Python cannot reach the scratch filesystem.

Here is my sbatch:

#!/bin/sh
#SBATCH --job-name=OT_hadronic_gridsearch
#SBATCH --time=24:00:00
#SBATCH --partition=private-dpnc-gpu,shared-gpu
#SBATCH --chdir=/home/users/a/algren/work/hadronic_recoil
#SBATCH --mem=24GB
#SBATCH --gres=gpu:1
#SBATCH --output=logs/slurm-%A-%x_%a.out
#SBATCH --cpus-per-task=3
#SBATCH -a 0-16


##### Adding options for grid search ######
train_argsloss_wasser_ratio=(1)
train_argslr_f=(0.0001)
train_argslr_g=(0.0001)
train_argsf_per_g=(4)
train_argsg_per_f=(8 16)
train_argsnepochs=(300)
train_argsepoch_size=(512)
train_argsdatatype=()
train_argsbatch_size=(1024)
train_argslearning_rate_scheduler=(True)
model_argsn_layers=(4 8)
model_argsnonconvex_layersizes=(4 8)
model_argsconvex_layersizes=(32 64)
model_argsnonconvex_activation=(softplus)
model_argsconvex_activation=(softplus)
model_argsnoncvx_norm=(standard_first)
model_argscvx_norm=(standard_first)
model_argscorrection_trainable=(True)
cvx_dim=(1)
noncvx_dim=(1)
path=(/home/users/a/algren/scratch/trained_networks/hadronic_recoil/calibration/gridsearch/)
device=(cuda)

##### Job script ######
export XDG_RUNTIME_DIR=""
module load GCCcore/8.2.0 Singularity/3.4.0-Go-1.12
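
# Each multi-valued option below is picked from its array with
#   expr $SLURM_ARRAY_TASK_ID / <stride> % <array length>
# so the array task ID enumerates every combination of the grid values
# (strides 1, 2, 4 and 8 step through the parameters that have two values;
#  single-valued parameters always resolve to index 0).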



srun singularity exec --nv -B /home/users/a/algren/scratch:/srv/beegfs/scratch/users/a/algren/,/srv/beegfs/scratch/groups/rodem/ /home/users/a/algren/singularity_images/ftag-otcalib-pytorch-new-2.sif\
	python3 run_calib.py \
		train_args.loss_wasser_ratio=${train_argsloss_wasser_ratio[`expr ${SLURM_ARRAY_TASK_ID} / 1 % 1`]}\
		train_args.lr_f=${train_argslr_f[`expr ${SLURM_ARRAY_TASK_ID} / 1 % 1`]}\
		train_args.lr_g=${train_argslr_g[`expr ${SLURM_ARRAY_TASK_ID} / 1 % 1`]}\
		train_args.f_per_g=${train_argsf_per_g[`expr ${SLURM_ARRAY_TASK_ID} / 1 % 1`]}\
		train_args.g_per_f=${train_argsg_per_f[`expr ${SLURM_ARRAY_TASK_ID} / 1 % 2`]}\
		train_args.nepochs=${train_argsnepochs[`expr ${SLURM_ARRAY_TASK_ID} / 2 % 1`]}\
		train_args.epoch_size=${train_argsepoch_size[`expr ${SLURM_ARRAY_TASK_ID} / 2 % 1`]}\
		train_args.datatype=${train_argsdatatype[`expr ${SLURM_ARRAY_TASK_ID} / 2 % 1`]}\
		train_args.batch_size=${train_argsbatch_size[`expr ${SLURM_ARRAY_TASK_ID} / 2 % 1`]}\
		train_args.learning_rate_scheduler=${train_argslearning_rate_scheduler[`expr ${SLURM_ARRAY_TASK_ID} / 2 % 1`]}\
		model_args.n_layers=${model_argsn_layers[`expr ${SLURM_ARRAY_TASK_ID} / 2 % 2`]}\
		model_args.nonconvex_layersizes=${model_argsnonconvex_layersizes[`expr ${SLURM_ARRAY_TASK_ID} / 4 % 2`]}\
		model_args.convex_layersizes=${model_argsconvex_layersizes[`expr ${SLURM_ARRAY_TASK_ID} / 8 % 2`]}\
		model_args.nonconvex_activation=${model_argsnonconvex_activation[`expr ${SLURM_ARRAY_TASK_ID} / 16 % 1`]}\
		model_args.convex_activation=${model_argsconvex_activation[`expr ${SLURM_ARRAY_TASK_ID} / 16 % 1`]}\
		model_args.noncvx_norm=${model_argsnoncvx_norm[`expr ${SLURM_ARRAY_TASK_ID} / 16 % 1`]}\
		model_args.cvx_norm=${model_argscvx_norm[`expr ${SLURM_ARRAY_TASK_ID} / 16 % 1`]}\
		model_args.correction_trainable=${model_argscorrection_trainable[`expr ${SLURM_ARRAY_TASK_ID} / 16 % 1`]}\
		cvx_dim=${cvx_dim[`expr ${SLURM_ARRAY_TASK_ID} / 16 % 1`]}\
		noncvx_dim=${noncvx_dim[`expr ${SLURM_ARRAY_TASK_ID} / 16 % 1`]}\
		path=${path[`expr ${SLURM_ARRAY_TASK_ID} / 16 % 1`]}\
		device=${device[`expr ${SLURM_ARRAY_TASK_ID} / 16 % 1`]}\

Hi,

I also saw the same issue several times during the last week.
Error message

mkdir: cannot create directory ‘/home/users/e/ehrke/scratch/ttcharm/GraphCombinatorics/logs/topotransformer/2023-04-11--09-32_24240_100’: Remote I/O error
/var/spool/slurmd/job24240/slurm_script: line 21: /home/users/e/ehrke/scratch/ttcharm/GraphCombinatorics/logs/topotransformer/2023-04-11--09-32_24240_100/script.out: No such file or directory

sbatch script

#!/bin/sh

#SBATCH -t 12:00:00
#SBATCH -n 1
#SBATCH -c 1
#SBATCH --mem=16000
#SBATCH -p private-dpnc-gpu,shared-gpu
#SBATCH --gres=gpu:1
#SBATCH -o ./slurmOutput/output.%A_%a.out
#SBATCH -a 0-593%50

module load GCCcore/12.2.0
module load Python/3.10.8
. /home/users/e/ehrke/topotransformer_env/bin/activate


now=$(date +"%Y-%m-%d--%H-%M")
logdir="/home/users/e/ehrke/scratch/ttcharm/GraphCombinatorics/logs/topotransformer/$now"_"$SLURM_JOB_ID"_"$SLURM_ARRAY_TASK_ID"
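# Create the per-job log directory on scratch. The "Remote I/O error" above comes
# from this mkdir failing; the redirect to script.out below then fails with
# "No such file or directory" because the directory was never created.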
mkdir -p $logdir

/home/users/e/ehrke/topotransformer_env/bin/python train.py configs/config_$SLURM_ARRAY_TASK_ID.yaml $logdir -a > $logdir/script.out

@Malte.Algren @Lukas.Ehrke @Samuel.Klein

There is an issue with the scratch storage: [2023] Current issues on HPC Cluster - #5 by Yann.Sagon

The issue in your case is that when the maximum number of files is reached, writes fail with an error, but as soon as a file is erased you can create a new file again. This is why it was working sometimes and not at other times.
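
If you want to check how close your scratch is to the file-count limit, a rough sketch (your own scratch path is used as an example; whether the BeeGFS client tools are available on your node is an assumption):

# Rough count of files under your scratch space (may take a while on large trees):
find /srv/beegfs/scratch/users/a/algren -type f | wc -l

# If beegfs-ctl is available, the quota report also lists per-user file usage:
beegfs-ctl --getquota --uid $USER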