Jobs sporadically fail on BAOBAB

Hi,

I am facing an issue with BAOBAB: my jobs fail sporadically and I do not know why. I submit about 80 jobs at the same time and only a few of them fail. If I resubmit the same jobs, either I am lucky and none fail, or a few fail again, but not necessarily the same ones that failed in the first submission.

Here is the content of the elog file of a failed job:

/var/spool/slurmd/job52403524/slurm_script: line 39: root: command not found
cp: cannot stat ‘/tmp/tmp.tNgQovXxpr/*_*_*.root’: No such file or directory

Edit:
Here is my sbatch script:

#!/bin/bash
#----CINT .C file name(mycc+cpcfile) + .C dir(cdir)
#$1: dst version
#$2: ISS (0) or MC (1)
#$3: for ISS ($2=0): trigger period
#    for MC ($2=1): MC focus
#$4: version to analysis
#$5: MC charge
#$6&7: first and last list to run
#$8: process how many events in one list, -1 for all

wdir="/home/users/e/erobyn" #baobab
#--
iFAiter=0

hversion=14

version=analysis_Erwan
mycc=run_analysis6

cpcfile="${mycc}.C ${version}.C readfile5.C SelEvent${hversion}.h QSplineFit.C ChargeCalN9.C TrRes5.C RigEstiCal3.h Efficiency11.h RigReso2.h HistoMan.C ReadRigCalib.h B800MCCor.C EffVal3.h EstiBins.h weight.h file_io.h splineFit2.h BreakUpProb.C BelowL1Background.C ChargeCalibration.C EvtCount.C DetChargeCalibration.C TkLayChargeCalibration.C ToFQCal.C L1Eff.C DAQTrigEff.C ToFEff.C TkEff.C L9Eff.C"

nrunlist=100
[[ $1 -gt 54 ]] && nrunlist=20
if [[ $2 -eq 1 ]]; then
        [[ $3 -eq 0 ]] && nrunlist=1
        [[ $3 -eq 1 ]] && nrunlist=2
        [[ $1 -eq 61 ]] && nrunlist=2
        [[ $1 -ge 62 ]] && nrunlist=20 #2019.12.28
fi

[[ $3 -eq 3 ]] && nrunlist=100

#------
queue=shared-cpu,private-dpnc-cpu #0-12:00:00 #2019.12.10

#----dst version
dstver=$1
amssoft_ver="B1130"
prod_ver=amsd${dstver}n

dst=${prod_ver}

MCfocus=( "l1" "l19" )

if [[ $2 -eq 0 ]]; then
        if [[ $3 -gt 0 ]] && [[ $3 -lt 3 ]]; then
                dst=${dst}_trig${3}
        elif [[ $3 -eq 3 ]]; then
                dst=${dst}_EcalCheck
        fi
elif [[ $2 -eq 1 ]]; then
        if [[ $5 -eq 8 ]]; then
                sQ=O16
        elif [[ $5 -eq 10 ]]; then
                sQ=Ne20
        elif [[ $5 -eq 12 ]]; then
        fi

        if [[ $1 -lt 58 ]]; then
                dst=${dst}MCO16${MCfocus[$3]}
        else
                dst=${dst}_MC${sQ}${MCfocus[$3]}
        fi
fi

echo "${sQ}"

AnalVer=$4
if [[ "$AnalVer" == "0" ]]; then
        aver=ChargeCal
elif [[ "$AnalVer" == "1" ]]; then
        #--Rigidity estimator
        aver=RigEstiCal_85years
elif [[ "$AnalVer" == "2" ]]; then
        aver=Efficiency #2020.02.06: all efficiencies with cutoff
else
        echo "Please choose the corrected version to analysis. Exit"
        exit
fi
if [[ $iFAiter -gt 0 ]] && [[ $AnalVer -eq 2 ]] && [[ $2 -eq 1 ]]; then
        aver=${aver}_Unfold${iFAiter}
fi

aver=${aver}Q$5


#----time limits
timeLimit="0-12:00:00"
#if [[ $4 -eq 2 ]] && [[ $2 -eq 1 ]]; then
#       timeLimit="0-5:00:00"
#fi
if [[ $4 -eq 0 ]] || [[ $4 -eq 2 ]] || [[ $4 -eq 7 ]] || [[ $4 -eq 10 ]]; then
        if [[ $2 -eq 0 ]]; then
                timeLimit="0-4:00:00"
        elif [[ $2 -eq 1 ]] && [[ $3 -eq 0 ]]; then
                timeLimit="0-8:00:00"
        else
                timeLimit="0-5:00:00"
        fi
elif [[ $4 -eq 11 ]] || [[ $4 -eq 9 ]]; then #|| [[ $4 -eq 1 ]]
        timeLimit="0-4:00:00"
fi
echo "time limit: ${timeLimit}"

#----process root-file or list dir  ######3
if [[ $2 -eq 0 ]]; then
        indir=${wdir}/runlist/ISS/${amssoft_ver}/${dst}_${nrunlist}
elif [[ $2 -eq 1 ]]; then
        #indir=${wdir}/runlist/MC/${amssoft_ver}/${dst}_${nrunlist}
        indir=${wdir}/runlist/MC/${amssoft_ver}/${dst}_${nrunlist}_LBLTuning
fi
oname=${dst}_${amssoft_ver}_${aver}

odir=${wdir}/result/${version}/${oname}

ldir=${odir}/log

subname=protest_analysis6_baobab.sh

ffile=0

lfile=`ls ${indir} | wc -l`
if [[ $# -ge 7 ]] && [[ $6 -ne -1 ]]; then
        ffile=$6
        lfile=$7
fi

nevent=-1
if [[ $# -ge 8 ]] && [[ $8 -ne -1 ]]; then
        nevent=$8
fi
echo "process ${nevent} events in one dst (-1 for all)"
echo "input dir is ${indir}"
echo "output dir is ${odir}"
echo "queue to process is ${queue}"

read -p "Continue? (Y)" yn
if [[ "${yn}" != "Y" ]] && [[ "${yn}" != "y" ]]; then
        echo "quit"
        exit 0
fi

#--create directory if not exist
[[ ! -e  ${odir}/${oname} ]] && mkdir -p ${odir}
[[ ! -e  ${ldir} ]] && mkdir -p ${ldir}
[[ ! -e  ${odir}/code ]] && mkdir -p ${odir}/code/
cdir=${odir}/code
for ifile in $cpcfile ; do
        cp -v ${wdir}/analisi/script/$ifile ${cdir}
done

for ((i = ffile; i < lfile ; i++))
do
        ifile1=${indir}/${i}
        echo ${ifile1}

        jobname="${i}_${aver}_${dst}"
        echo "jobname = ${jobname}"
        sbatch -p ${queue} -o ${ldir}/${jobname}.log -e ${ldir}/${jobname}.elog --time=${timeLimit} --job-name=${jobname} ${subname} "${cdir}" "${mycc}" "${cpcfile}" "${ifile1}" "${odir}" "${nevent}" "$version" #2020.02.06
done

date

And the content of the protest_analysis6_baobab.sh script called in the previous one:

#!/bin/bash
#source root_setting_cvmfs.sh #use root in cvmfs instead
echo "HOME=$HOME"

cdir=$1
mycc=$2
cpcfile=$3
ifile=$4
odir=$5
nevent=$6
version=$7

echo "1=$1"
echo "2=$2"
echo "3=$3"
echo "4=$4"
echo "5=$5"
echo "6=$6"
echo "7=$7"

#----cp file
tmpDir=`mktemp -d`
cd ${tmpDir}
echo 'we are in '${PWD}
echo 'tmpDir='${tmpDir}

#------

echo "odir=${odir}"
for cfile in $cpcfile ; do
        cp ${cdir}/$cfile ${tmpDir}
done

#---process
cd ${tmpDir}

date

root -l -b -q ${mycc}.C++'("'${ifile}'", "'${tmpDir}'", '${nevent}')'

echo "copy to ${odir}"
cp ${tmpDir}/*_*_*.root ${odir}

rm -rf ${tmpDir}
echo 'Cleanup done!'
echo "***Job Done***"
date

Let me know if you need additional information; I have no idea what could be relevant.
Thanks in advance,
Erwan

Hi, please share your sbatch script with us.

Hi, I have added my sbatch script in my first post.

This doesn’t look like an sbatch script. From this script it seems you are calling sbatch in a for loop.

So what we call your sbatch script is in fact protest_analysis6_baobab.sh in your case.

According to the error message you have, the issue is in line 39 of this script:

root -l -b -q ${mycc}.C++'("'${ifile}'", "'${tmpDir}'", '${nevent}')'

So it seems that “root” is sometimes not found. Where is root expected to be found?

The location is probably defined in this script? Did you launch it?

I suggest that you echo the relevant variables that are set by root_setting_cvmfs.sh to debug this issue.
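A minimal sketch of such a check, to be adapted rather than taken as is. The helper name check_cmd is hypothetical, and which variables root_setting_cvmfs.sh actually sets is an assumption (PATH, LD_LIBRARY_PATH and ROOTSYS are the usual ones for a ROOT setup). It prints the node name and the environment, then reports whether a command can be resolved, so that the log of a failed job shows on which node root was missing:

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm or ROOT): print the environment
# that matters for locating a command, then report whether it resolves.
check_cmd() {
    local cmd="$1"
    # Node name and job id, to see whether failures cluster on some nodes.
    echo "node=$(uname -n) job=${SLURM_JOB_ID:-unset}"
    # Assumption: these are the variables the ROOT setup script touches.
    echo "PATH=${PATH}"
    echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-unset}"
    echo "ROOTSYS=${ROOTSYS:-unset}"
    if command -v "${cmd}" >/dev/null 2>&1; then
        echo "${cmd} found at: $(command -v "${cmd}")"
        return 0
    fi
    echo "${cmd} NOT found on $(uname -n)" >&2
    return 1
}
```

In protest_analysis6_baobab.sh this could be called just before the root line, for example with `check_cmd root || exit 1`, so that a job stops with a clear message instead of continuing to the cp step when root is missing.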