SLURM not accepting more jobs

Dear HPC community,

I am getting the following error:
sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying

Do you know what might be wrong?


Hi there,

I tried at 09:37 and I was able to submit a job as a normal user:

[09:37:28] capello@login2:~$ cat test-slurm-chdir.sbatch 
#!/bin/sh

srun pwd
srun echo ${SLURM_WORKING_DIR}
[09:37:32] capello@login2:~$ sbatch test-slurm-chdir.sbatch 
Submitted batch job 30153612
[09:37:36] capello@login2:~$ sacct -j 30153612
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
30153612     test-slur+  debug-EL7   rossigno          1  COMPLETED      0:0 
30153612.ba+      batch              rossigno          1  COMPLETED      0:0 
30153612.0          pwd              rossigno          1  COMPLETED      0:0 
30153612.1         echo              rossigno          1  COMPLETED      0:0 
[11:16:17] capello@login2:~$ 

We had another report yesterday for the same error message, but after a quick analysis nothing unusual showed up in the Slurm controller and DB logs.

I saw you have RUNNING jobs now; does this mean you were able to submit jobs after you reported the above error?

Thx, bye,
Luca


Hi,

I had the same issue yesterday, although the “sleeping and retrying” eventually resulted in the jobs being submitted and executed.

However, this would seem to indicate that Slurm was overloaded and unresponsive, which delays user workflows.
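For what it is worth, one way to keep a pipeline from hanging indefinitely while sbatch sleeps and retries is to bound the submission time with GNU coreutils timeout. This is just a sketch; the 60-second limit and the script name test-job.sbatch are placeholders, not something from our cluster setup:

# Give sbatch at most 60 seconds to get the job accepted, then bail out
# so the calling workflow can decide to retry later instead of blocking.
if timeout 60 sbatch test-job.sbatch; then
    echo "job submitted"
else
    echo "submission timed out or failed; will retry later" >&2
fi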

Cheers

Volodymyr

I had exactly the same experience yesterday.

Hi there,

Thanks to another notification from @Pablo.Strasser, I could find the exact time in the logs, and indeed we were hitting the Slurm MaxJobCount limit (60’000 in our case).
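For anyone who wants to check the configured limit themselves, it is part of the controller configuration and can be read from any node with the Slurm client tools installed (nothing cluster-specific here, just the standard scontrol query):

# Print the controller's configured MaxJobCount
scontrol show config | grep -i MaxJobCount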

We are now back to fewer than 20’000 jobs in the RUNNING or PENDING state, so the problem should be solved.
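If you want to verify the current load yourself, a quick way to count the jobs in those two states (assuming the usual squeue client is available on the login nodes) is:

# Count all jobs currently RUNNING or PENDING, across all users and partitions
squeue --all --noheader --states=RUNNING,PENDING | wc -l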

Thx, bye,
Luca

Hi Luca,

Thanks! This error is fixed. However, I am now running into a module error; I have opened a different topic on that here in the forum.

Best, Jan