Hi,
I am facing an issue with BAOBAB: my jobs fail sporadically and I do not know why. I submit about 80 jobs at the same time and only a few of them fail. If I resubmit the same jobs, either I am lucky and none fail, or a few fail again, but not necessarily the same ones that failed in the first submission.
Here is the content of the .elog file of a failed job:
/var/spool/slurmd/job52403524/slurm_script: line 39: root: command not found
cp: cannot stat ‘/tmp/tmp.tNgQovXxpr/*_*_*.root’: No such file or directory
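To see whether the failures cluster on particular compute nodes, the job IDs of the failed jobs can be pulled out of the .elog files. The following is only a rough sketch: the function name `failed_job_ids` is made up for illustration, and it assumes that every failing job logs the same `slurm_script: line N: root: command not found` line shown above.

```shell
# Hypothetical helper (not part of the original scripts): print the Slurm job
# ID of every .elog file in a directory that contains "root: command not found".
failed_job_ids() {
    # $1 is the log directory (ldir in the submission script).
    # -h drops the file names, so only the matching lines reach sed.
    grep -h 'root: command not found' "$1"/*.elog 2>/dev/null |
        sed -n 's|.*/job\([0-9][0-9]*\)/slurm_script.*|\1|p' |
        sort -u
}
```

Each ID can then be passed to `sacct -j <jobid> --format=JobID,NodeList,State,ExitCode` to check whether the failed jobs ran on the same nodes.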
Edit:
Here is my sbatch script:
#!/bin/bash
#----CINT .C file name(mycc+cpcfile) + .C dir(cdir)
#$1: dst version
#$2: ISS (0) or MC (1)
#$3: for ISS ($2=0): trigger period
# for MC ($2=1): MC focus
#$4: version to analysis
#$5: MC charge
#$6&7: first and last list to run
#$8: process how many events in one list, -1 for all
wdir="/home/users/e/erobyn" #baobab
#--
iFAiter=0
hversion=14
version=analysis_Erwan
mycc=run_analysis6
cpcfile="${mycc}.C ${version}.C readfile5.C SelEvent${hversion}.h QSplineFit.C ChargeCalN9.C TrRes5.C RigEstiCal3.h Efficiency11.h RigReso2.h HistoMan.C ReadRigCalib.h B800MCCor.C EffVal3.h EstiBins.h weight.h file_io.h splineFit2.h BreakUpProb.C BelowL1Background.C ChargeCalibration.C EvtCount.C DetChargeCalibration.C TkLayChargeCalibration.C ToFQCal.C L1Eff.C DAQTrigEff.C ToFEff.C TkEff.C L9Eff.C"
nrunlist=100
[[ $1 -gt 54 ]] && nrunlist=20
if [[ $2 -eq 1 ]]; then
[[ $3 -eq 0 ]] && nrunlist=1
[[ $3 -eq 1 ]] && nrunlist=2
[[ $1 -eq 61 ]] && nrunlist=2
[[ $1 -ge 62 ]] && nrunlist=20 #2019.12.28
fi
[[ $3 -eq 3 ]] && nrunlist=100
#------
queue=shared-cpu,private-dpnc-cpu #0-12:00:00 #2019.12.10
#----dst version
dstver=$1
amssoft_ver="B1130"
prod_ver=amsd${dstver}n
dst=${prod_ver}
MCfocus=( "l1" "l19" )
if [[ $2 -eq 0 ]]; then
if [[ $3 -gt 0 ]] && [[ $3 -lt 3 ]]; then
dst=${dst}_trig${3}
elif [[ $3 -eq 3 ]]; then
dst=${dst}_EcalCheck
fi
elif [[ $2 -eq 1 ]]; then
if [[ $5 -eq 8 ]]; then
sQ=O16
elif [[ $5 -eq 10 ]]; then
sQ=Ne20
elif [[ $5 -eq 12 ]]; then
: # branch body missing in the posted script; the no-op keeps the syntax valid
fi
if [[ $1 -lt 58 ]]; then
dst=${dst}MCO16${MCfocus[$3]}
else
dst=${dst}_MC${sQ}${MCfocus[$3]}
fi
fi
echo "${sQ}"
AnalVer=$4
if [[ "$AnalVer" == "0" ]]; then
aver=ChargeCal
elif [[ "$AnalVer" == "1" ]]; then
#--Rigidity estimator
aver=RigEstiCal_85years
elif [[ "$AnalVer" == "2" ]]; then
aver=Efficiency #2020.02.06: all efficiencies with cutoff
else
echo "Please choose a valid analysis version. Exit"
exit 1
fi
if [[ $iFAiter -gt 0 ]] && [[ $AnalVer -eq 2 ]] && [[ $2 -eq 1 ]]; then
aver=${aver}_Unfold${iFAiter}
fi
aver=${aver}Q$5
#----time limits
timeLimit="0-12:00:00"
#if [[ $4 -eq 2 ]] && [[ $2 -eq 1 ]]; then
# timeLimit="0-5:00:00"
#fi
if [[ $4 -eq 0 ]] || [[ $4 -eq 2 ]] || [[ $4 -eq 7 ]] || [[ $4 -eq 10 ]]; then
if [[ $2 -eq 0 ]]; then
timeLimit="0-4:00:00"
elif [[ $2 -eq 1 ]] && [[ $3 -eq 0 ]]; then
timeLimit="0-8:00:00"
else
timeLimit="0-5:00:00"
fi
elif [[ $4 -eq 11 ]] || [[ $4 -eq 9 ]]; then #|| [[ $4 -eq 1 ]]
timeLimit="0-4:00:00"
fi
echo "time limit: ${timeLimit}"
#----process root-file or list dir ######3
if [[ $2 -eq 0 ]]; then
indir=${wdir}/runlist/ISS/${amssoft_ver}/${dst}_${nrunlist}
elif [[ $2 -eq 1 ]]; then
#indir=${wdir}/runlist/MC/${amssoft_ver}/${dst}_${nrunlist}
indir=${wdir}/runlist/MC/${amssoft_ver}/${dst}_${nrunlist}_LBLTuning
fi
oname=${dst}_${amssoft_ver}_${aver}
odir=${wdir}/result/${version}/${oname}
ldir=${odir}/log
subname=protest_analysis6_baobab.sh
ffile=0
lfile=`ls ${indir} | wc -l`
if [[ $# -ge 7 ]] && [[ $6 -ne -1 ]]; then
ffile=$6
lfile=$7
fi
nevent=-1
if [[ $# -ge 8 ]] && [[ $8 -ne -1 ]]; then
nevent=$8
fi
echo "process ${nevent} events in one dst (-1 for all)"
echo "input dir is ${indir}"
echo "output dir is ${odir}"
echo "queue to process is ${queue}"
read -p "Continue? (Y)" yn
if [[ "${yn}" != "Y" ]] && [[ "${yn}" != "y" ]]; then
echo "quit"
exit 0
fi
#--create directory if not exist
[[ ! -e ${odir}/${oname} ]] && mkdir -p ${odir}
[[ ! -e ${ldir} ]] && mkdir -p ${ldir}
[[ ! -e ${odir}/code ]] && mkdir -p ${odir}/code/
cdir=${odir}/code
for ifile in $cpcfile ; do
cp -v ${wdir}/analisi/script/$ifile ${cdir}
done
for ((i = ffile; i < lfile ; i++))
do
ifile1=${indir}/${i}
echo ${ifile1}
jobname="${i}_${aver}_${dst}"
echo "jobname = ${jobname}"
sbatch -p ${queue} -o ${ldir}/${jobname}.log -e ${ldir}/${jobname}.elog --time=${timeLimit} --job-name=${jobname} ${subname} "${cdir}" "${mycc}" "${cpcfile}" "${ifile1}" "${odir}" "${nevent}" "$version" #2020.02.06
done
date
And here is the content of protest_analysis6_baobab.sh, the job script submitted by the previous one:
#!/bin/bash
#source root_setting_cvmfs.sh #use root in cvmfs instead
echo "HOME=$HOME"
cdir=$1
mycc=$2
cpcfile=$3
ifile=$4
odir=$5
nevent=$6
version=$7
echo "1=$1"
echo "2=$2"
echo "3=$3"
echo "4=$4"
echo "5=$5"
echo "6=$6"
echo "7=$7"
#----cp file
tmpDir=$(mktemp -d)
cd "${tmpDir}" || exit 1
echo 'we are in '${PWD}
echo 'tmpDir='${tmpDir}
#------
echo "odir=${odir}"
for cfile in $cpcfile ; do
cp ${cdir}/$cfile ${tmpDir}
done
#---process
cd ${tmpDir}
date
root -l -b -q ${mycc}.C++'("'${ifile}'", "'${tmpDir}'", '${nevent}')'
echo "copy to ${odir}"
cp ${tmpDir}/*_*_*.root ${odir}
rm -rf ${tmpDir}
echo 'Cleanup done!'
echo "***Job Done***"
date
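Since the error is `root: command not found`, one defensive option is to make the job script check for the `root` executable before doing any work, and to report which node it is running on. This is a minimal sketch, not a diagnosis: `require_cmd` is a hypothetical helper, and it assumes the failing jobs land on nodes where `root` is not on the `PATH`.

```shell
# Hypothetical guard (not part of the original scripts): succeed only if the
# given command can be found on PATH; otherwise report the node on stderr.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    # uname -n prints the node name, so the .elog file shows where it failed.
    echo "ERROR: '$1' not found on node $(uname -n); PATH=${PATH}" >&2
    return 1
}

# Possible use near the top of the job script, before any work starts:
#   require_cmd root || exit 1
```

Exiting with a non-zero status makes Slurm mark the job as FAILED, and the node name written to the .elog file shows where the problem occurred, instead of the job carrying on to the `cp` step with no output files to copy.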
Let me know if you need any additional information; I have no idea what could be relevant.
Thanks in advance,
Erwan