Job timeout despite not hitting the timelimit

Primary information

Username: sonnerm
Cluster: yggdrasil

Description

Recently some of my jobs get cancelled with state “TIMEOUT” despite being days away from the time limit. This happened just now to all my running jobs at once, for example this job with jobid 27370833:

sonnerm@login1 ~ [1]> sacct -j 27370833 --format=JobID,JobName,Partition,State,Elapsed,Timelimit,Start
JobID           JobName  Partition      State    Elapsed  Timelimit               Start
------------ ---------- ---------- ---------- ---------- ---------- -------------------
27370833     interacti+ public-cpu    TIMEOUT 1-03:01:18 4-00:00:00 2023-09-14T13:28:23
27370833.in+ interacti+             CANCELLED 1-03:02:21            2023-09-14T13:28:23
27370833.ex+     extern             COMPLETED 1-03:01:18            2023-09-14T13:28:23

These jobs were all interactive; here is what was printed to the output of salloc (I adjusted the line breaks and indents to make it readable):

salloc: error: slurm_send_node_msg: [(null)] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket error
salloc: error: slurm_send_node_msg: [(null)] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket error
salloc: error: slurm_send_node_msg: [(null)] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket error
slurmstepd: error: *** STEP 27762329.interactive ON cpu112 CANCELLED AT 2023-09-15T16:29:41 DUE TO TIME LIMIT ***
salloc: error: If munged is up, restart with --num-threads=10
salloc: error: Munge decode failed: Failed to receive message header: Timed-out
salloc: auth/munge: _print_cred: ENCODED: Thu Jan 01 00:59:59 1970
salloc: auth/munge: _print_cred: DECODED: Thu Jan 01 00:59:59 1970
salloc: error: slurm_unpack_received_msg: [[slurm1.yggdrasil]:35098] auth_g_verify: SRUN_PING has authentication error: Unspecified error
salloc: error: slurm_unpack_received_msg: [[slurm1.yggdrasil]:35098] Protocol authentication error
salloc: error: eio_message_socket_accept: slurm_receive_msg[192.168.212.20:35098]: Protocol authentication error
salloc: error: Munge decode failed: Expired credential
salloc: auth/munge: _print_cred: ENCODED: Fri Sep 15 16:29:41 2023
salloc: auth/munge: _print_cred: DECODED: Fri Sep 15 16:42:45 2023
salloc: error: slurm_unpack_received_msg: [[slurm1.yggdrasil]:38160] auth_g_verify: SRUN_TIMEOUT has authentication error: Unspecified error
salloc: error: slurm_unpack_received_msg: [[slurm1.yggdrasil]:38160] Protocol authentication error
salloc: error: eio_message_socket_accept: slurm_receive_msg[192.168.212.20:38160]: Protocol authentication error
salloc: error: Munge decode failed: Expired credential
salloc: auth/munge: _print_cred: ENCODED: Fri Sep 15 16:29:41 2023
salloc: auth/munge: _print_cred: DECODED: Fri Sep 15 16:42:45 2023
salloc: error: slurm_unpack_received_msg: [[slurm1.yggdrasil]:38168] auth_g_verify: RUN_JOB_COMPLETE has authentication error: Unspecified error
salloc: error: slurm_unpack_received_msg: [[slurm1.yggdrasil]:38168] Protocol authentication error
salloc: error: eio_message_socket_accept: slurm_receive_msg[192.168.212.20:38168]: Protocol authentication error
srun: error: Munge decode failed: Expired credential

For now I just restart the affected jobs and cross my fingers.

Steps to Reproduce

This problem is sporadic, I don’t know how to reproduce it.

Expected Result

That jobs only get cancelled after actually hitting the time limit.

Actual Result

The jobs got cancelled early.

Cheers,
Michael

This problem keeps happening (see job ids 27817432 and 27817372) :slightly_frowning_face:

Dear Michael,

These jobs have reached the wall time, so they have been killed by the scheduler. Which partition did you use, and how much time did you reserve for your jobs?

(yggdrasil)-[root@slurm1 ~]$ seff 27817432
Job ID: 27817432
Cluster: yggdrasil
User/Group: sonnerm/unige
State: TIMEOUT (exit code 0)
Nodes: 1
Cores per node: 14
CPU Utilized: 00:00:01
CPU Efficiency: 0.00% of 22-12:55:58 core-walltime
Job Wall-clock time: 1-14:38:17
Memory Utilized: 23.43 GB
Memory Efficiency: 15.62% of 150.00 GB

Best regards,

Hi Gael,
The time limit was set to 4 days and the partition was public-cpu. So it should not have been killed after just one day, 14 hours and 38 minutes. Here is the output from sacct:

sonnerm@login1 ~> sacct -j 27817432 --format=JobID,JobName,Partition,State,Elapsed,Timelimit,Start
JobID           JobName  Partition      State    Elapsed  Timelimit               Start
------------ ---------- ---------- ---------- ---------- ---------- -------------------
27817432     interacti+ public-cpu    TIMEOUT 1-14:38:17 4-00:00:00 2023-09-16T01:36:44
27817432.in+ interacti+             CANCELLED 1-14:39:20            2023-09-16T01:36:44
27817432.ex+     extern             COMPLETED 1-14:38:17            2023-09-16T01:36:44

Note that this happened to different jobs at the same instant in time, but at different elapsed times:

sonnerm@login1 ~> sacct -j 27762329 --format=JobID,JobName,Partition,State,Elapsed,Timelimit,Start,End
JobID           JobName  Partition      State    Elapsed  Timelimit               Start                 End
------------ ---------- ---------- ---------- ---------- ---------- ------------------- -------------------
27762329     interacti+ public-bi+    TIMEOUT   03:29:21 4-00:00:00 2023-09-15T13:00:20 2023-09-15T16:29:41
27762329.in+ interacti+             CANCELLED   03:30:23            2023-09-15T13:00:20 2023-09-15T16:30:43
27762329.ex+     extern             COMPLETED   03:29:21            2023-09-15T13:00:20 2023-09-15T16:29:41
sonnerm@login1 ~> sacct -j 27370833 --format=JobID,JobName,Partition,State,Elapsed,Timelimit,Start,End
JobID           JobName  Partition      State    Elapsed  Timelimit               Start                 End
------------ ---------- ---------- ---------- ---------- ---------- ------------------- -------------------
27370833     interacti+ public-cpu    TIMEOUT 1-03:01:18 4-00:00:00 2023-09-14T13:28:23 2023-09-15T16:29:41
27370833.in+ interacti+             CANCELLED 1-03:02:21            2023-09-14T13:28:23 2023-09-15T16:30:44
27370833.ex+     extern             COMPLETED 1-03:01:18            2023-09-14T13:28:23 2023-09-15T16:29:41
sonnerm@login1 ~>

My guess is that there was some issue affecting the cluster at that point in time (that would be, for example, 2023-09-15T16:29:41 or somewhat earlier, since the three slurm_send_node_msg errors appeared before the cancellation).
Given that nobody else has complained (so far?), this problem might only affect interactive jobs; maybe it was a communication problem with the login node?
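In case it is useful for debugging: as far as I understand, a munge credential round-trip can be checked locally on the login node with something like (just a sanity check, assuming the munge client tools are available there):

munge -n | unmunge

which should print STATUS: Success together with encode/decode timestamps, similar to the ENCODED/DECODED lines in the salloc output above.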

Cheers,
Michael

PS: seff seems like a really cool tool for checking memory requirements; I did not know about it!

You’re right: a timeout at 1 day, 14 hours and 38 minutes does not correspond to any wall time defined for the partition. Could you please send your sbatch script to hpc@unige.ch so we can check whether something is wrong?

Best regards,

Hi,
I did not use sbatch for this job since it was an interactive job; instead I ran salloc, in this case:

salloc --time 4-00:00:00 --partition public-cpu --cpus-per-task 14 --mem 150G

Edit: Most of the time this works fine; however, there have been a couple of instances, especially recently, where my jobs got cancelled unexpectedly as described above.
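For completeness: while one of these allocations is still running, the limit Slurm actually applied can be double-checked with something like (where <jobid> is the allocation’s job id):

scontrol show job <jobid> | grep TimeLimit

which prints the line containing the job’s RunTime and TimeLimit fields.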

Cheers
Michael

Dear @Michael.Sonner, we have asked the network team to monitor the outgoing network on login1.yggdrasil to see if the network link is saturated from time to time. If your interactive job is blocked for too many seconds by “something”, it is killed by Slurm due to “inactivity”.
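For reference, the inactivity threshold that would apply here is presumably the InactiveLimit parameter from slurm.conf; its current value can be checked from any node with:

scontrol show config | grep -i InactiveLimit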

We’ll close the issue for now, as we know “why” the job is killed, even if not the exact reason that triggers the timeout. We’ll monitor login1 more closely.


Hi hpc-team,
Earlier today this happened again (jobid 28026950), and when I logged in around that time (2023-09-26T15:23:59) Slurm was not reachable; squeue gave the error
slurm_load_jobs error: Unable to contact slurm controller (connect failure).
This issue seems specific to yggdrasil, as I have never experienced it on baobab.
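Next time this happens I can also try checking the controller directly, e.g. with:

scontrol ping

which should report whether the primary (and any backup) slurmctld is UP or DOWN.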

Cheers,
Michael

Dear Michael,

We have found the issue: it was related to the process for adding new users on the cluster. We were restarting slurmctld to update the user list, but the restart took too long, which is why you could see “connect failure” errors.

We have updated our procedure to avoid restarting slurmctld, so everything should now be safe.

Sorry for the inconvenience,
