What did you try:
We are using a Python script to submit jobs (via subprocess and/or os.system); the command uses srun. Here is an example:
arg_str = " srun --mem=32000 --cpus-per-task=8 --time=60 --partition=private-wesolowski-cpu --job-name=FT-0 --nodes=1 --ntasks-per-node=1 --mail-user=Mingxue.Fu@unige.ch --mail-type=END run.sh fnt_emb.in"
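A minimal sketch of such a submission wrapper, assuming the flag values from the example above; `build_srun_cmd` is a hypothetical helper, not part of the original script. Passing an argument list to `subprocess.run` avoids the shell-quoting pitfalls of `os.system(arg_str)`:

```python
import shlex
import subprocess

def build_srun_cmd(job_name, script, infile, mem_mb=32000, cpus=8,
                   time_min=60, partition="private-wesolowski-cpu"):
    """Build the srun argument list for one job (hypothetical helper)."""
    return [
        "srun",
        f"--mem={mem_mb}",
        f"--cpus-per-task={cpus}",
        f"--time={time_min}",
        f"--partition={partition}",
        f"--job-name={job_name}",
        "--nodes=1",
        "--ntasks-per-node=1",
        script,
        infile,
    ]

cmd = build_srun_cmd("FT-0", "run.sh", "fnt_emb.in")
print(shlex.join(cmd))
# On the cluster one would run it with:
# subprocess.run(cmd, check=True)  # blocks until srun returns
```

The commented-out `subprocess.run` call is where the script actually blocks until the job finishes.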
While this was working before, it no longer works. The strange thing is that a job starts, memory is allocated, and the job shows as running in squeue, but the actual script (run.sh) that calls Q-Chem never runs; instead srun prints:
srun: Job 44117583 step creation temporarily disabled, retrying (Requested nodes are busy)
So, I would expect that if the nodes are busy, nothing gets allocated at all. Is that correct? Are we doing something wrong? Could you help us?
srun is a blocking command: it does not return until your job finishes, successfully or not.
It also means that if you submit your jobs from a login node and your session is closed, your running and pending jobs are lost. It is better to use sbatch and job arrays. You can use the --wrap flag if you don't want to write an sbatch script.
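A hedged sketch of that sbatch alternative, reusing the flag values from the srun example above (`build_sbatch_wrap_cmd` is a hypothetical helper): sbatch returns immediately after queueing the job, so the job survives a closed login session.

```python
import subprocess

def build_sbatch_wrap_cmd(job_name, command, mem_mb=32000, cpus=8,
                          time_min=60, partition="private-wesolowski-cpu"):
    """Build an sbatch --wrap command list (hypothetical helper).

    --parsable makes sbatch print only the job id, which is convenient
    for scripting; --wrap wraps a shell command so no batch script file
    is needed.
    """
    return [
        "sbatch", "--parsable",
        f"--mem={mem_mb}",
        f"--cpus-per-task={cpus}",
        f"--time={time_min}",
        f"--partition={partition}",
        f"--job-name={job_name}",
        f"--wrap={command}",
    ]

cmd = build_sbatch_wrap_cmd("FT-0", "bash run.sh fnt_emb.in")
# On the cluster:
# job_id = subprocess.run(cmd, capture_output=True, text=True,
#                         check=True).stdout.strip()
```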
Slurm enforces a limit on the maximum number of pending and running jobs per user; maybe you have hit this limit.
The example job number you are talking about was launched by user ‘fum’.
Do you have a lot of jobs pending or running at a given time?
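One way to check this from the same Python tooling, as a sketch (the wrapper around squeue only works on the cluster; the line-counting helper is pure and shown separately):

```python
import subprocess

def count_job_lines(squeue_output):
    """Count non-empty lines, i.e. one job per line of squeue output."""
    return len([line for line in squeue_output.splitlines() if line.strip()])

def count_user_jobs(user):
    """Count a user's jobs currently known to Slurm (pending + running)."""
    out = subprocess.run(
        ["squeue", "-u", user, "-h", "-o", "%i"],  # -h: no header, one id per line
        capture_output=True, text=True, check=True,
    )
    return count_job_lines(out.stdout)

# On the cluster:
# n = count_user_jobs("fum")
```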
No, it’s a problem that happened to Mingxue, her user name is fum indeed.
OK, the fact that srun is a blocking command is fine; we chose it for exactly that reason. The series of jobs we launch needs to run in order: wait for the previous one to finish, rearrange files, then submit the next one.
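As an aside: if ordering is the only reason for blocking srun, sbatch dependency chains give the same guarantee without keeping a process alive on the login node. A sketch, where `build_dependent_sbatch` is a hypothetical helper and the extra input names are invented for illustration:

```python
import subprocess

def build_dependent_sbatch(script, infile, after_job_id=None):
    """Build an sbatch command that starts only after a previous job ends OK."""
    cmd = ["sbatch", "--parsable"]  # --parsable: sbatch prints only the job id
    if after_job_id is not None:
        # afterok: run only if the named job completed with exit code 0
        cmd.append(f"--dependency=afterok:{after_job_id}")
    cmd += [script, infile]
    return cmd

def submit(cmd):
    """Run sbatch and return the new job id (requires a Slurm cluster)."""
    return subprocess.run(cmd, capture_output=True, text=True,
                          check=True).stdout.strip()

# Chain: each step waits for the previous one to finish successfully.
# prev = None
# for infile in ["fnt_emb.in", "step2.in"]:  # second input is hypothetical
#     prev = submit(build_dependent_sbatch("run.sh", prev and infile or infile, prev))
```

Note that any file rearrangement between steps would itself have to move into the batch scripts, since nothing blocks between submissions.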
As I said, this worked fine before. The current issue is that now, when the srun command is used, it seems to be running something (the job is indeed running, as you could see), but it does not run the entire script: it does the printing, yet never reaches the last line. Here is the script: