Srun step creation temporarily disabled, retrying

Dear team,

What did you try:
We are using a Python script (subprocess and/or os.system) to submit jobs; the command uses srun. Let me give you an example:
arg_str = " srun --mem=32000 --cpus-per-task=8 --time=60 --partition=private-wesolowski-cpu --job-name=FT-0 --nodes=1 --ntasks-per-node=1 --mail-user=Mingxue.Fu@unige.ch --mail-type=END run.sh fnt_emb.in"
os.system(arg_str)

While this was working before, now it doesn't work anymore. The weird thing is that a job starts, the memory is allocated, and one can see the job running with squeue, but the actual script (run.sh) that calls Q-Chem never runs; instead it prints:
srun: Job 44117583 step creation temporarily disabled, retrying (Requested nodes are busy)

So, I would expect that if the nodes are busy, nothing gets allocated at all. Is that correct? Are we doing something wrong? Could you help us?

Thanks!

Cristina and Mingxue

Hi,

srun is a blocking command: it won't return until your job finishes, whether successfully or not.

It also means that if you are submitting your jobs from the login node and your session is closed, your running and pending jobs are lost. It's better to use sbatch and job arrays. You can use the --wrap flag if you don't want to write an sbatch script, as in the sketch below.
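For example, something along these lines (a sketch that reuses the resources from your srun command; adapt the wrapped command as needed):

sbatch --mem=32000 --cpus-per-task=8 --time=60 \
       --partition=private-wesolowski-cpu --job-name=FT-0 \
       --mail-user=Mingxue.Fu@unige.ch --mail-type=END \
       --wrap="./run.sh fnt_emb.in"

Unlike srun, sbatch returns immediately after the job is queued, so closing your session doesn't kill it.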

Slurm enforces a limit on the maximum number of pending and running jobs; maybe you've hit this limit.

The job number you are talking about was launched by user ‘fum’.

Do you have a lot of jobs pending or running at any given time?
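A quick way to check is to count the jobs currently in your queue, for example (a one-liner sketch; -h suppresses the squeue header so only job lines are counted):

squeue -u $USER -h | wc -l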

Feel free to give us more details.


Hi Yann,

No, it's a problem that happened to Mingxue; her username is indeed fum.
OK, the fact that srun is a blocking command is fine; we have used it for exactly that reason: the series of jobs we launch needs to run in order, waiting for the previous one to finish, rearranging files, and submitting the next one.

As I said, this worked fine before. The current issue is that now, when the srun command is used:

srun --mem=32000 --cpus-per-task=8 --time=60 --partition=private-wesolowski-cpu --job-name=FT-0 --nodes=1 --ntasks-per-node=1 --mail-user=Mingxue.Fu@unige.ch --mail-type=END run.sh fnt_emb.in

it seems to be running something (the job is indeed running, as you could see), but in fact it does not run the entire script: it does the printing, but it never reaches the last line. Here is the script:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mail-user=Mingxue.Fu@unige.ch
#SBATCH --mail-type=END

####### SYNTAX #######
# sbatch [--mem=<MB>, etc.] QCsub [$1 = input file]

###### MISC ######
module purge
module load intel/2018a

####### Q-Chem env. variables #######
export QC=/home/share/wesolowski/qc5.july2020.FDE
export QCAUX=/home/share/wesolowski/qcaux_new
export QCSCRATCH=/scratch/$USER

####### File variables #######
InFile=$1
Extension="${InFile##*.}"
Filename="${InFile%.*}"

######## Job settings ###########
nthreads=$SLURM_CPUS_PER_TASK

# print job-specific settings
JOB_TIMELIMIT="$(sacct -j $SLURM_JOB_ID --format=timelimit | sed -n 3p)"
echo "SLURM job settings:"
echo "-- job ID         : $SLURM_JOB_ID"
echo "-- job name       : $SLURM_JOB_NAME" 
echo "-- cpus per task  : $SLURM_CPUS_PER_TASK"
echo "-- time limit     : $JOB_TIMELIMIT" 
echo "-- memory (/node) : $SLURM_MEM_PER_NODE"
echo "-- submitted from : $SLURM_SUBMIT_DIR"
echo "-- running node   : $SLURMD_NODENAME"
echo "---------------------------------------------------"

######## Run ###########
echo "Q-Chem 5.1 compiled with [intel mkl openmp release]"
echo "-- compiled March 2019"
echo "-- Intel compiler: intel/2016b"
echo "-- Enabled Q-Chem features: openmp, cosmo, intracule"

srun qchem -nt $nthreads $Filename.$Extension ${Filename}.out

So, is the problem calling srun from within an srun command? The weird part is that it worked before, many times, for a long time.

Thanks for the help.

Cristina

Hello,

Maybe this would fit your needs: SLURM chained jobs
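The idea is to submit each step with sbatch and chain them with --dependency, so nothing has to stay blocked in your terminal. A minimal sketch (the step*.in file names are placeholders for your actual inputs):

#!/bin/bash
# submit the first step and capture its job ID
jid1=$(sbatch --parsable run.sh step1.in)

# the second step starts only once the first has finished successfully
jid2=$(sbatch --parsable --dependency=afterok:$jid1 run.sh step2.in)

# and so on for the remaining steps
sbatch --dependency=afterok:$jid2 run.sh step3.in

Any file rearranging between steps can be done at the top of run.sh itself.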

If I understand correctly: your script run.sh is an sbatch script (the one you show in your message), and you are running this script using srun?

The srun command doesn't interpret the #SBATCH pragmas in your script; only sbatch does.
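In other words, these two submissions are not equivalent:

sbatch run.sh fnt_emb.in   # reads the #SBATCH lines inside run.sh
srun run.sh fnt_emb.in     # ignores them; resources must be given as srun options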

Yes indeed, launching srun from inside srun is not good practice.
You have probably been seeing this issue since we upgraded Slurm, as they changed some default behavior.

If you want to stick with srun, you need to remove all the #SBATCH pragmas from your script (they are not used) and remove the srun in front of qchem, roughly as in the sketch below.
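For reference, a minimal sketch of run.sh after those two changes (keeping your module loads and environment setup as they are):

#!/bin/bash
# no #SBATCH lines: srun ignores them, resources come from the srun command line

module purge
module load intel/2018a

export QC=/home/share/wesolowski/qc5.july2020.FDE
export QCAUX=/home/share/wesolowski/qcaux_new
export QCSCRATCH=/scratch/$USER

InFile=$1
Extension="${InFile##*.}"
Filename="${InFile%.*}"

# call qchem directly: the script already runs inside the job step,
# so a second srun would try to create a nested step and block
qchem -nt $SLURM_CPUS_PER_TASK $Filename.$Extension ${Filename}.out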

Best
