Dear all, I have a job array (35226904) that has been stuck halfway for two days and does not seem to be resuming.

Is something special going on on the cluster, or is it just busy?

Actually, slurmctld considers it regularly, but it still has a lower priority (cf. Job priority explanation):

[root@login2 ~]# scontrol show Job=35226904 | \
 grep -e 'Reason' \
      -e 'LastSchedEval'
   JobState=PENDING Reason=Priority Dependency=(null)
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-06-18T14:02:26
[root@login2 ~]# sprio -u cucci
       35226904 mono-shar    cucci       4647          0         36        859          2       3750          0
[root@login2 ~]# 
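
As a rough sanity check on the sprio row above, the sketch below adds up the numeric columns. It assumes that the fourth column is the total priority and that the columns after it are the weighted factors. The post does not include sprio's header row, so the sketch makes no claim about which factor each column represents. `sum_factors` is a hypothetical helper, not a Slurm command.

```shell
# One sprio row as posted above (the header row was not included in the post).
sprio_line='35226904 mono-shar    cucci       4647          0         36        859          2       3750          0'

# Hypothetical helper: add up every column after the fourth one.
sum_factors() {
    awk '{ total = 0; for (i = 5; i <= NF; i++) total += $i; print total }'
}

# Assumption: column 4 is the total priority reported by sprio.
reported=$(printf '%s\n' "$sprio_line" | awk '{ print $4 }')
computed=$(printf '%s\n' "$sprio_line" | sum_factors)
echo "reported=$reported computed=$computed"
```

Under that assumption the two numbers agree (4647), and a single factor (3750) accounts for most of the total.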

However, your job should not have been accepted on the mono-shared-EL7 partition, given that you asked for more per-CPU memory than permitted (10 GB, cf. ):

[root@login2 ~]# scontrol show Job=35226904 | \
 grep -e Partition \
      -e TRES
   Partition=mono-shared-EL7 AllocNode:Sid=login2:138783
[root@login2 ~]# 

Should I change to another partition, shared-EL7 for example? Or would you prefer to reduce the requested memory?

Oh, I made a mistake in the max memory! It is fine with 10 GB, I guess. Should I stop the job and run it again, or can you change it for me?

I thought that, since I am running single-CPU jobs, mono-shared would be the optimal choice. What does “Allowable per core” mean on the partitions-and-limits page?

I think I managed to change the MinMemoryCPU myself. Will that be taken into account in the priority somehow?
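
For readers wondering how such a change is made: `scontrol update` accepts a `MinMemoryCPU` value in megabytes for a job. The sketch below only prints the command instead of running it. `print_mem_update` is a hypothetical helper, and 10000 is an assumed way of writing 10 GB in megabytes (a site may count 10240 instead).

```shell
# Hypothetical helper: print (not run) the scontrol command that would lower
# the per-CPU memory request of a pending job.
print_mem_update() {
    job_id=$1
    mem_mb=$2   # value in megabytes
    echo "scontrol update JobId=${job_id} MinMemoryCPU=${mem_mb}"
}

# Assumption: 10 GB expressed as 10000 MB.
cmd=$(print_mem_update 35226904 10000)
echo "$cmd"
```

Printing the command first makes it easy to double-check the job id and the value before actually running it.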

Dear @Luca.Capello,

I have been running very similar tasks once in a while for the past 2-3 months, and normally they start immediately or very soon.

I submitted another task this morning and (as far as I understand) it is still not even scheduled to run (StartTime=Unknown); see below.

Has there been any change in the scheduling priority?

JobId=35497883 ArrayJobId=35497883 ArrayTaskId=13-500 JobName=navigation
   UserId=cucci(372700) GroupId=hpc_users(5000) MCS_label=N/A
   Priority=9903 Nice=0 Account=guerries QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=12:00:00 TimeMin=N/A
   SubmitTime=2020-06-22T11:31:13 EligibleTime=2020-06-22T11:31:15
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-06-22T17:01:17
   Partition=mono-shared-EL7 AllocNode:Sid=login2:132579
   ReqNodeList=(null) ExcNodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=16000M MinTmpDiskNode=0
   Features=V3|V4|V6 DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
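
The fields that matter in this thread (Reason, MinMemoryCPU) can be pulled out of such output with a few lines of shell. The sketch below works on two lines copied from the output above, so it runs without a Slurm installation. `job_field` is a hypothetical helper, and the 10000 MB threshold is an assumption standing in for the 10 GB per-CPU limit mentioned earlier in the thread.

```shell
# Two lines copied from the scontrol output above.
job_info='   JobState=PENDING Reason=Priority Dependency=(null)
   MinCPUsNode=1 MinMemoryCPU=16000M MinTmpDiskNode=0'

# Hypothetical helper: print the value of one Key=Value field read from stdin.
# It splits on spaces, so it does not handle values that contain spaces.
job_field() {
    tr -s ' ' '\n' | awk -F= -v key="$1" '$1 == key { print substr($0, length(key) + 2); exit }'
}

# Assumption: 10 GB is counted as 10000 MB; a site may use 10240 instead.
limit_mb=10000

reason=$(printf '%s\n' "$job_info" | job_field Reason)
mem=$(printf '%s\n' "$job_info" | job_field MinMemoryCPU)
mem_mb=${mem%M}   # strip the trailing unit, assuming the value is in megabytes

if [ "$mem_mb" -gt "$limit_mb" ]; then
    echo "Reason=$reason, MinMemoryCPU=${mem_mb}M exceeds the assumed ${limit_mb}M limit"
else
    echo "Reason=$reason, MinMemoryCPU=${mem_mb}M is within the assumed ${limit_mb}M limit"
fi
```

On a cluster, the same helper could be fed from `scontrol show job <jobid>` instead of the pasted text.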

Hi there,

Please note that, to allow for development of the cluster and not only bug fixes, only one member of the HPC team at a time is on “ticket” duty each day.

No, we did not change anything.

Your job array tasks 13-500 finally started less than an hour after your post.

From the scontrol output you posted, you can see that in this case the job priority was even higher than for the first job (9903 vs. 4647, cf. Job id 35226904 stuck - #2 by Luca.Capello). There is nothing more I can do here.

