Intermittent Remote I/O error of scratch files on Yggdrasil

Dear HPC team:

We run calculations mainly with Q-Chem, which uses a scratch folder to store the temporary files it needs. These files (the whole folder, in fact) are usually deleted once the calculation is done. On a couple of occasions now, the files could not be found while the calculation was still running, giving the following error:

/home/share/wesolowski/qchem_trunk/exe/qcprog.exe .hf.in.230284.qcin.1 
/home/ricardi/scratch/qchem230284/
FileMan error: Could not open file FILE_SOL_ENERGY
 Path: /home/ricardi/scratch/qchem230284/686.0: Remote I/O error
FileMan error: Could not open file UNKNOWN FILE
 Path: /home/ricardi/scratch/qchem230284/10.0: Remote I/O error
rm: No match.
Error: in the serial run
srun: error: cpu088: task 0: Exited with exit code 1

Do you think there was a failure on the scratch space? Or is the problem in the flow of information? I suggest this second possible cause because we also just had another problem where the program used the input from a file with the same name but in a different folder [i.e. path1/hf.in and path2/hf.in], probably running at the same time but on different nodes. [I may create a separate ticket for this if the error persists.]

Thanks a lot for your time,

Cristina and Nico

Hi,

The latest error was on the 2nd of June, and it was on another compute node.

Please share your sbatch and any relevant information.

What is the scratch size needed for one job? You may want to use /scratch, which is local to every compute node, instead.
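
One way to answer the size question could be to measure a scratch folder while a job is running and compare it with the free space on the target filesystem. The following is only a sketch: the function names `scratch_usage_mb` and `free_space_mb` are made up for illustration, and it relies on nothing beyond the standard `du`, `df`, `cut` and `awk` tools.

```shell
#!/bin/bash
# Print the disk usage of a directory tree in megabytes, rounded up.
scratch_usage_mb() {
    local dir="$1" kb
    kb=$(du -sk -- "$dir" | cut -f1)
    echo $(( (kb + 1023) / 1024 ))
}

# Print the free space, in megabytes (rounded down), on the filesystem
# that holds the given path. "df -P" keeps one output line per filesystem.
free_space_mb() {
    local path="$1" kb
    kb=$(df -Pk -- "$path" | awk 'NR == 2 { print $4 }')
    echo $(( kb / 1024 ))
}

# Demonstration on the current directory; on the cluster the arguments
# would be the job's scratch folder and /scratch.
echo "Current directory uses $(scratch_usage_mb .) MB"
echo "Free space here: $(free_space_mb .) MB"
```

Calling `scratch_usage_mb` on the Q-Chem scratch folder of a running job, and `free_space_mb /scratch` on the compute node, would give a rough idea of whether the local space is large enough.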

Best

Yann

Hi Yann,

The job was run by Nico (Ricardi), with id: 4672821.
The file used to submit the calculation reads:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
####### SYNTAX #######
# sbatch [--mem=<MB>, etc.] QCsub [$1 = input file]
# uncomment in case of the libirng.so problem:
module load intel/2020a

####### Q-Chem env. variables #######
export QC=/home/share/wesolowski/qchem_trunk
export QCAUX=/home/share/wesolowski/qcaux
export QCSCRATCH=$HOME/scratch
export PATH=$PATH:$QC/bin:$QC/bin/perl

####### File variables #######
InFile=$1
Extension="${InFile##*.}"
Filename="${InFile%.*}"   
######## Job settings ###########
nthreads=$SLURM_CPUS_PER_TASK   

######## Run ###########
echo "Q-Chem 5.1 compiled with [gcc openmp release --with-libintracule]"
echo "-- compiled February 2020"
echo "-- GNU compiler: GCC/5.4.0"
echo "-- Enabled Q-Chem features: openmp, cosmo, intracule"
srun qchem -nt $nthreads $Filename.$Extension ${Filename}.out

So, I am not sure how much scratch space we need. Small calculations like this usually take maybe hundreds of MB, so very little.
Yes, we could change the scratch path to use /scratch instead, which I guess would only affect the line:

export QCSCRATCH=/scratch

correct?

Thanks,

Cristina

Hi Cristina,

Your sbatch script is probably incomplete, or you override some parameters on the command line: the number of cores, partition, time limit, etc. are missing.

Did you launch more than one qchem job at a time?

I'm asking because you are specifying a non-dedicated QCSCRATCH directory, which is probably shared with other qchem instances, and according to the documentation, qchem cleans this directory at the end of a successful job.

By the way, if you want to use local scratch, it is suggested to set another variable, QCLOCALSCR (see the documentation).

Hi Yann,

Yes, the rest of the parameters are given on the command line.

And yes, we run many qchem jobs at the same time.

As you can see in my first message, I was aware of this. So you are suggesting that we use a different scratch folder for each calculation, and that the best option is QCLOCALSCR?
OK, yes, I believe that could solve the issue, we will try.

Thanks,

Cristina

Hi,

I suggest using a dedicated scratch directory in the BeeGFS scratch space, like this:

mkdir $HOME/scratch/$SLURM_JOB_ID
export QCSCRATCH=$HOME/scratch/$SLURM_JOB_ID

or, better, use the compute nodes' local scratch:

export QCLOCALSCR=/scratch

In this case, there is no need to create a dedicated directory: /scratch is isolated from the other jobs.
The only issue may be that this space is too small for your needs.
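
For the first option, the per-job directory could be set up in the submit script roughly as follows. This is a minimal sketch, not a tested recipe for Yggdrasil: the helper name `setup_job_scratch`, the `SCRATCH_ROOT` override and the `manual_$$` fallback are hypothetical and only there so the snippet also runs outside Slurm. Inside a job, `SLURM_JOB_ID` is set by the scheduler.

```shell
#!/bin/bash
# Create a scratch directory dedicated to one job and point Q-Chem at it.
# Arguments: $1 = scratch root, $2 = job identifier.
setup_job_scratch() {
    local root="$1" job_id="$2"
    local dir="$root/$job_id"
    mkdir -p "$dir" || return 1
    export QCSCRATCH="$dir"
}

# SCRATCH_ROOT and the manual_$$ fallback are assumptions for illustration.
setup_job_scratch "${SCRATCH_ROOT:-$HOME/scratch}" "${SLURM_JOB_ID:-manual_$$}" || exit 1

# Remember this job's directory and remove only that directory on exit,
# so other jobs' scratch folders are left alone.
job_scratch="$QCSCRATCH"
trap 'rm -rf -- "$job_scratch"' EXIT

echo "Scratch for this job: $QCSCRATCH"
```

The `srun qchem` line from the submit script would follow the `trap` line unchanged. Because the directory name contains the job ID, two jobs running at the same time no longer share one QCSCRATCH.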

Best
