Dear all, I have a job array (35226904) that has been stuck halfway for two days and does not seem to resume.
Is there something special going on on the cluster, or is it just busy?
Thanks a lot
Hi there,
Actually, the job is regularly considered by slurmctld,
but it still has a lower priority (cf. Job priority explanation):
[root@login2 ~]# scontrol show Job=35226904 | \
grep -e 'Reason' \
-e 'LastSchedEval'
JobState=PENDING Reason=Priority Dependency=(null)
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-06-18T14:02:26
[root@login2 ~]# sprio -u cucci
JOBID PARTITION USER PRIORITY SITE AGE FAIRSHARE JOBSIZE PARTITION QOS
35226904 mono-shar cucci 4647 0 36 859 2 3750 0
[root@login2 ~]#
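For completeness, you can query the same information yourself, without root access; something along these lines should work (job ID as above, format flags taken from the standard Slurm man pages):

$ sprio -j 35226904                                     # same per-factor priority breakdown, selected by job ID
$ squeue -j 35226904 -o "%.12i %.10P %.8T %.10r %.10Q"  # %Q = priority, %r = pending reason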
However, your job should not have been accepted on the mono-shared-EL7
partition, given that you asked for more per-CPU memory than the 10 GB allowed (cf. https://baobab.unige.ch/enduser/src/enduser/enduser.html#partitions-and-limits ):
[root@login2 ~]# scontrol show Job=35226904 | \
grep -e Partition \
-e TRES
Partition=mono-shared-EL7 AllocNode:Sid=login2:138783
TRES=cpu=1,mem=16000M,node=1,billing=1
[root@login2 ~]#
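For future submissions, a request that stays within the per-CPU limit would look roughly like this in the batch script (only a sketch: the partition and memory values are the ones discussed above, and srun hostname is just a stand-in for your real program):

#!/bin/sh
#SBATCH --partition=mono-shared-EL7
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=10000      # 10 GB per CPU, within the mono-shared limit
srun hostname                    # stand-in for the actual program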
Should I move the job to another partition (shared-EL7,
for example), or do you prefer to reduce the requested memory?
Thx, bye,
Luca
Oh, I made a mistake in the maximum memory! 10 GB should be fine, I guess. Should I stop the job and run it again, or can you change it for me?
I thought that since I am running single-CPU jobs, mono-shared would be the optimal choice. What does “Allowable per core” mean on the partitions-and-limits page?
Thanks a lot!
I think I managed to change the MinMemoryCPU myself. Would that be taken into account in the priority somehow?
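In case it is useful, I believe the command I used to lower the memory of the pending job was something like this (quoting from memory, so the exact value may differ):

$ scontrol update JobId=35226904 MinMemoryCPU=10000   # 10 GB per CPU instead of 16000M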
Dear @Luca.Capello,
I have been running very similar tasks once in a while for the past two or three months, and normally they start immediately or very soon.
I ran another task this morning and, as far as I understand, it is still not even scheduled to run (StartTime=Unknown); see below.
Has there been any change in the scheduling priority?
JobId=35497883 ArrayJobId=35497883 ArrayTaskId=13-500 JobName=navigation
UserId=cucci(372700) GroupId=hpc_users(5000) MCS_label=N/A
Priority=9903 Nice=0 Account=guerries QOS=normal
JobState=PENDING Reason=Priority Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=12:00:00 TimeMin=N/A
SubmitTime=2020-06-22T11:31:13 EligibleTime=2020-06-22T11:31:15
AccrueTime=2020-06-22T11:31:15
StartTime=Unknown EndTime=Unknown Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-06-22T17:01:17
Partition=mono-shared-EL7 AllocNode:Sid=login2:132579
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=16000M,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=16000M MinTmpDiskNode=0
Features=V3|V4|V6 DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/users/c/cucci/navigation.sh
WorkDir=/home/users/c/cucci
StdErr=/home/users/c/cucci/results/navigation-35497883_4294967294.out
StdIn=/dev/null
StdOut=/home/users/c/cucci/results/navigation-35497883_4294967294.out
Power=
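For reference, I believe I obtained the output above with something like:

$ scontrol show Job=35497883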
Hi there,
please note that, to allow development of the cluster and not only bug fixes, only one member of the HPC team at a time is on “ticket” duty each day.
No, we did not change anything.
Your job array tasks 13-500 finally started less than an hour after your post:
[root@login2 ~]# sacct -j 35497883 --format JobID%20,Start,NodeList
JobID Start NodeList
-------------------- ------------------- ---------------
35497883_1 2020-06-22T11:58:30 node222
35497883_1.batch 2020-06-22T11:58:30 node222
35497883_1.0 2020-06-22T11:58:31 node222
35497883_2 2020-06-22T11:58:30 node222
35497883_2.batch 2020-06-22T11:58:30 node222
35497883_2.0 2020-06-22T11:58:31 node222
35497883_3 2020-06-22T11:58:30 node222
35497883_3.batch 2020-06-22T11:58:30 node222
35497883_3.0 2020-06-22T11:58:31 node222
35497883_4 2020-06-22T11:58:30 node252
35497883_4.batch 2020-06-22T11:58:30 node252
35497883_4.0 2020-06-22T11:58:31 node252
35497883_5 2020-06-22T11:58:30 node252
35497883_5.batch 2020-06-22T11:58:30 node252
35497883_5.0 2020-06-22T11:58:31 node252
35497883_6 2020-06-22T11:58:30 node252
35497883_6.batch 2020-06-22T11:58:30 node252
35497883_6.0 2020-06-22T11:58:31 node252
35497883_7 2020-06-22T14:13:02 node222
35497883_7.batch 2020-06-22T14:13:02 node222
35497883_7.0 2020-06-22T14:13:04 node222
35497883_8 2020-06-22T14:13:02 node222
35497883_8.batch 2020-06-22T14:13:02 node222
35497883_8.0 2020-06-22T14:13:04 node222
35497883_9 2020-06-22T14:13:02 node222
35497883_9.batch 2020-06-22T14:13:02 node222
35497883_9.0 2020-06-22T14:13:04 node222
35497883_10 2020-06-22T14:13:02 node252
35497883_10.batch 2020-06-22T14:13:02 node252
35497883_10.0 2020-06-22T14:13:04 node252
35497883_11 2020-06-22T14:13:02 node252
35497883_11.batch 2020-06-22T14:13:02 node252
35497883_11.0 2020-06-22T14:13:04 node252
35497883_12 2020-06-22T14:13:02 node252
35497883_12.batch 2020-06-22T14:13:02 node252
35497883_12.0 2020-06-22T14:13:04 node252
35497883_13 2020-06-22T17:42:01 node222
35497883_13.batch 2020-06-22T17:42:01 node222
35497883_13.0 2020-06-22T17:42:02 node222
35497883_14 2020-06-22T17:42:01 node222
35497883_14.batch 2020-06-22T17:42:01 node222
35497883_14.0 2020-06-22T17:42:02 node222
35497883_15 2020-06-22T17:42:01 node222
35497883_15.batch 2020-06-22T17:42:01 node222
35497883_15.0 2020-06-22T17:42:02 node222
[...]
[root@login2 ~]#
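As a side note, the .batch and .0 step lines above can be hidden by asking sacct for job allocations only, e.g.:

$ sacct -j 35497883 -X --format JobID%20,Start,NodeList   # -X / --allocations omits the job steps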
From the scontrol
output you posted, you can see that in this case the job priority was even higher than for the first job (9903 vs. 4647, cf. Job id 35226904 stuck - #2 by Luca.Capello); there is nothing more I can do here.
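If you want Slurm's own estimate of when a pending job could start (when the backfill scheduler has computed one), something like this should do:

$ squeue -j 35497883 --start     # shows the expected start time for pending jobs, when available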
Thx, bye,
Luca