Batch jobs stuck in pending for too long

Zhen.Liu · February 5, 2020, 12:51pm

Hello,
I submitted ~300 jobs to mono-shared-EL7, mono-EL7 and dpnc-EL7. Each job needs around 5h of run time; apart from the job names, the sbatch script options are all left at their defaults. The jobs have been stuck in a pending state for more than a day.
Is this a problem on my side or a common issue?

Thanks in advance.

Yann.Sagon · February 5, 2020, 5:35pm

Dear Zhen,

I see that you have 346 jobs in pending state:

[root@master ~]# squeue -u liuzhen7 | wc -l
346

135 in partition mono-EL7
105 in dpnc-EL7
105 in mono-shared-EL7
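
A per-partition breakdown like the one above can be produced directly from squeue. This is only a sketch: the helper name count_by_partition is made up for illustration, and it relies on the standard squeue options -h (no header line, so the count is not off by one), -t PENDING (pending jobs only) and -o "%P" (print only the partition).

```shell
# Hypothetical helper: count how often each line (here, a partition name)
# occurs on standard input.
count_by_partition() {
    sort | uniq -c
}

# Only query Slurm when squeue is actually available on this machine.
if command -v squeue >/dev/null 2>&1; then
    # -h drops the header line, -t PENDING keeps pending jobs only,
    # -o "%P" prints just the partition of each job.
    squeue -u "$USER" -h -t PENDING -o "%P" | count_by_partition
fi
```

A job submitted to several partitions at once is listed with its comma-separated partition string, so it is counted under that combined name.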

Example of a job in mono-EL7: job id: 29517960
This job has a time limit of 04:00:00 (4 hours). To reduce the wait time, you should submit this job to the partition mono-shared-EL7, as this partition is far bigger. The time limit of this partition is 12h00, which suits your needs.

Example of a job in partition dpnc-EL7: job id: 29517960
This job has a time limit of 05:00:00 (5 hours). To reduce the wait time, you should submit this job to the partition mono-shared-EL7, as this partition is far bigger. As you belong to the dpnc group, you can even specify both partitions (comma-separated).
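
As an illustration of the comma-separated form, a job script could look like the sketch below. The job name and the echo line are placeholders, not taken from the actual jobs; the partition names and the 5 hour limit come from this thread. When several partitions are listed, Slurm starts the job in whichever one can run it first.

```shell
#!/bin/bash
#SBATCH --job-name=example-job                 # placeholder name
#SBATCH --partition=dpnc-EL7,mono-shared-EL7   # both partitions, comma-separated
#SBATCH --time=05:00:00                        # fits within the 12h limit of mono-shared-EL7

# SLURM_JOB_PARTITION is set by Slurm inside a running job; the fallback
# text is only used when the script is run outside of Slurm.
partition="${SLURM_JOB_PARTITION:-not running under Slurm}"
echo "Job started in partition: ${partition}"
```

Submitted with sbatch, the output line shows which of the two partitions the job actually ran in.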

Example of a job in partition mono-shared-EL7: job id 29517730
This partition is a good choice, as you have a time limit of 08:00:00.
This job has a priority of 5069:

[root@master ~]# sprio  -j 29517730
          JOBID PARTITION   PRIORITY       SITE        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS
       29517730 mono-shar       5069          0         28       1289          2       3750          0
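
As a quick sanity check, the PRIORITY column in this sprio output is the sum of the individual factor columns. The arithmetic below only re-adds the numbers shown above; the variable names are chosen for illustration.

```shell
# Weighted priority factors copied from the sprio output for job 29517730.
site=0
age=28
fairshare=1289
jobsize=2
partition=3750
qos=0

# The job priority is the sum of the weighted factors.
total=$((site + age + fairshare + jobsize + partition + qos))
echo "priority = ${total}"   # prints "priority = 5069"
```

Most of this priority comes from the partition factor (3750) and the fairshare factor (1289); the age factor (28) contributes very little so far.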

The issue in this case is that there are many jobs with a higher priority, each requesting 16 cores.

Slurm isn’t able to look at all the jobs in the pending queue. Right now, Slurm only takes into account jobs in the queue that would start in less than 28 days. This may seem like a lot, but when a huge number of jobs are in the queue, the expected wait may be longer than that. I’ve just doubled this value (to 56 days) and I see that a lot of your jobs are no longer in the queue. Did they start? I’ll keep this parameter at 56 days for now to see if it helps to schedule the jobs better. I also advise you to submit your jobs to mono-shared-EL7.

Best

Zhen.Liu · February 5, 2020, 6:02pm

Dear Yann,
Thank you very much for your help.

All my jobs are still pending, so they haven’t started yet.
I will follow your advice and use mono-shared-EL7 for short jobs.

As for my jobs in the mono-shared-EL7 partition, since they are stuck in pending, I should just wait.