Dear all, I have a job array (35226904) that has been stuck halfway for two days and does not seem to resume.
Is there something special going on on the cluster, or is it just busy?
Thanks a lot
Hi there,
Actually, the job is regularly considered by slurmctld,
but it still has a lower priority (cf. Job priority explanation):
[root@login2 ~]# scontrol show Job=35226904 | \
grep -e 'Reason' \
-e 'LastSchedEval'
JobState=PENDING Reason=Priority Dependency=(null)
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-06-18T14:02:26
[root@login2 ~]# sprio -u cucci
JOBID PARTITION USER PRIORITY SITE AGE FAIRSHARE JOBSIZE PARTITION QOS
35226904 mono-shar cucci 4647 0 36 859 2 3750 0
[root@login2 ~]#
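For completeness, you can query the same information yourself, without root access; something along these lines should work (job ID as above, format flags taken from the standard Slurm man pages):

$ sprio -j 35226904                                     # same per-factor priority breakdown, selected by job ID
$ squeue -j 35226904 -o "%.12i %.10P %.8T %.10r %.10Q"  # %Q = priority, %r = pending reason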
However, your job should not have been accepted on the mono-shared-EL7
partition, given that you asked for more per-CPU memory than the 10 GB allowed (cf. https://baobab.unige.ch/enduser/src/enduser/enduser.html#partitions-and-limits ):
[root@login2 ~]# scontrol show Job=35226904 | \
grep -e Partition \
-e TRES
Partition=mono-shared-EL7 AllocNode:Sid=login2:138783
TRES=cpu=1,mem=16000M,node=1,billing=1
[root@login2 ~]#
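For future submissions, a request that stays within the per-CPU limit would look roughly like this in the batch script (only a sketch: the partition and memory values are the ones discussed above, and srun hostname is just a stand-in for your real program):

#!/bin/sh
#SBATCH --partition=mono-shared-EL7
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=10000      # 10 GB per CPU, within the mono-shared limit
srun hostname                    # stand-in for the actual program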
Should I move the job to another partition (shared-EL7,
for example), or do you prefer to reduce the requested memory?
Thx, bye,
Luca
Oh, I made a mistake in the maximum memory! 10 GB should be fine, I guess. Should I stop the job and run it again, or can you change it for me?
I thought that since I am running single-CPU jobs, mono-shared would be the optimal choice. What does “Allowable per core” mean on the partitions-and-limits page?
Thanks a lot!
I think I managed to change the MinMemoryCPU myself. Would that be taken into account in the priority somehow?
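In case it is useful, I believe the command I used to lower the memory of the pending job was something like this (quoting from memory, so the exact value may differ):

$ scontrol update JobId=35226904 MinMemoryCPU=10000   # 10 GB per CPU instead of 16000M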
Dear @Luca.Capello,
I have been running very similar tasks once in a while for the past two or three months, and normally they start immediately or very soon.
I ran another task this morning and, as far as I understand, it is still not even scheduled to run (StartTime=Unknown); see below.
Has there been any change in the scheduling priority?
JobId=35497883 ArrayJobId=35497883 ArrayTaskId=13-500 JobName=navigation
UserId=cucci(372700) GroupId=hpc_users(5000) MCS_label=N/A
Priority=9903 Nice=0 Account=guerries QOS=normal
JobState=PENDING Reason=Priority Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=12:00:00 TimeMin=N/A
SubmitTime=2020-06-22T11:31:13 EligibleTime=2020-06-22T11:31:15
AccrueTime=2020-06-22T11:31:15
StartTime=Unknown EndTime=Unknown Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-06-22T17:01:17
Partition=mono-shared-EL7 AllocNode:Sid=login2:132579
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=16000M,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=16000M MinTmpDiskNode=0
Features=V3|V4|V6 DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/users/c/cucci/navigation.sh
WorkDir=/home/users/c/cucci
StdErr=/home/users/c/cucci/results/navigation-35497883_4294967294.out
StdIn=/dev/null
StdOut=/home/users/c/cucci/results/navigation-35497883_4294967294.out
Power=
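For reference, I believe I obtained the output above with something like:

$ scontrol show Job=35497883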
Hi there,
please note that, to allow development of the cluster and not only bug fixes, only one member of the HPC team at a time is on “ticket” duty each day.
No, we did not change anything.
Your job array tasks 13-500 finally started less than an hour after your post:
[root@login2 ~]# sacct -j 35497883 --format JobID%20,Start,NodeList
JobID Start NodeList
-------------------- ------------------- ---------------
35497883_1 2020-06-22T11:58:30 node222
35497883_1.batch 2020-06-22T11:58:30 node222
35497883_1.0 2020-06-22T11:58:31 node222
35497883_2 2020-06-22T11:58:30 node222
35497883_2.batch 2020-06-22T11:58:30 node222
35497883_2.0 2020-06-22T11:58:31 node222
35497883_3 2020-06-22T11:58:30 node222
35497883_3.batch 2020-06-22T11:58:30 node222
35497883_3.0 2020-06-22T11:58:31 node222
35497883_4 2020-06-22T11:58:30 node252
35497883_4.batch 2020-06-22T11:58:30 node252
35497883_4.0 2020-06-22T11:58:31 node252
35497883_5 2020-06-22T11:58:30 node252
35497883_5.batch 2020-06-22T11:58:30 node252
35497883_5.0 2020-06-22T11:58:31 node252
35497883_6 2020-06-22T11:58:30 node252
35497883_6.batch 2020-06-22T11:58:30 node252
35497883_6.0 2020-06-22T11:58:31 node252
35497883_7 2020-06-22T14:13:02 node222
35497883_7.batch 2020-06-22T14:13:02 node222
35497883_7.0 2020-06-22T14:13:04 node222
35497883_8 2020-06-22T14:13:02 node222
35497883_8.batch 2020-06-22T14:13:02 node222
35497883_8.0 2020-06-22T14:13:04 node222
35497883_9 2020-06-22T14:13:02 node222
35497883_9.batch 2020-06-22T14:13:02 node222
35497883_9.0 2020-06-22T14:13:04 node222
35497883_10 2020-06-22T14:13:02 node252
35497883_10.batch 2020-06-22T14:13:02 node252
35497883_10.0 2020-06-22T14:13:04 node252
35497883_11 2020-06-22T14:13:02 node252
35497883_11.batch 2020-06-22T14:13:02 node252
35497883_11.0 2020-06-22T14:13:04 node252
35497883_12 2020-06-22T14:13:02 node252
35497883_12.batch 2020-06-22T14:13:02 node252
35497883_12.0 2020-06-22T14:13:04 node252
35497883_13 2020-06-22T17:42:01 node222
35497883_13.batch 2020-06-22T17:42:01 node222
35497883_13.0 2020-06-22T17:42:02 node222
35497883_14 2020-06-22T17:42:01 node222
35497883_14.batch 2020-06-22T17:42:01 node222
35497883_14.0 2020-06-22T17:42:02 node222
35497883_15 2020-06-22T17:42:01 node222
35497883_15.batch 2020-06-22T17:42:01 node222
35497883_15.0 2020-06-22T17:42:02 node222
[...]
[root@login2 ~]#
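As a side note, the .batch and .0 step lines above can be hidden by asking sacct for job allocations only, e.g.:

$ sacct -j 35497883 -X --format JobID%20,Start,NodeList   # -X / --allocations omits the job steps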
From the scontrol
output you posted, you can see that in this case the job priority was even higher than for the first job (9903 vs. 4647, cf. Job id 35226904 stuck - #2 by Luca.Capello); there is nothing more I can do here.
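If you want Slurm's own estimate of when a pending job could start (when the backfill scheduler has computed one), something like this should do:

$ squeue -j 35497883 --start     # shows the expected start time for pending jobs, when available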
Thx, bye,
Luca