Primary information
Username: sonnerm
Cluster: yggdrasil
Description
Recently some of my jobs get cancelled with state “TIMEOUT” despite being days away from the time limit. This happened just now to all my running jobs at once, for example this job with jobid 27370833:
sonnerm@login1 ~ [1]> sacct -j 27370833 --format=JobID,JobName,Partition,State,Elapsed,Timelimit,Start
JobID JobName Partition State Elapsed Timelimit Start
------------ ---------- ---------- ---------- ---------- ---------- -------------------
27370833 interacti+ public-cpu TIMEOUT 1-03:01:18 4-00:00:00 2023-09-14T13:28:23
27370833.in+ interacti+ CANCELLED 1-03:02:21 2023-09-14T13:28:23
27370833.ex+ extern COMPLETED 1-03:01:18 2023-09-14T13:28:23
These jobs were all interactive; here is what was printed to the output of salloc (I adjusted the line breaks and indents for readability):
salloc: error: slurm_send_node_msg: [(null)] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket error
salloc: error: slurm_send_node_msg: [(null)] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket error
salloc: error: slurm_send_node_msg: [(null)] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket error
slurmstepd: error: *** STEP 27762329.interactive ON cpu112 CANCELLED AT 2023-09-15T16:29:41 DUE TO TIME LIMIT ***
salloc: error: If munged is up, restart with --num-threads=10
salloc: error: Munge decode failed: Failed to receive message header: Timed-out
salloc: auth/munge: _print_cred: ENCODED: Thu Jan 01 00:59:59 1970
salloc: auth/munge: _print_cred: DECODED: Thu Jan 01 00:59:59 1970
salloc: error: slurm_unpack_received_msg: [[slurm1.yggdrasil]:35098] auth_g_verify: SRUN_PING has authentication error: Unspecified error
salloc: error: slurm_unpack_received_msg: [[slurm1.yggdrasil]:35098] Protocol authentication error
salloc: error: eio_message_socket_accept: slurm_receive_msg[192.168.212.20:35098]: Protocol authentication error
salloc: error: Munge decode failed: Expired credential
salloc: auth/munge: _print_cred: ENCODED: Fri Sep 15 16:29:41 2023
salloc: auth/munge: _print_cred: DECODED: Fri Sep 15 16:42:45 2023
salloc: error: slurm_unpack_received_msg: [[slurm1.yggdrasil]:38160] auth_g_verify: SRUN_TIMEOUT has authentication error: Unspecified error
salloc: error: slurm_unpack_received_msg: [[slurm1.yggdrasil]:38160] Protocol authentication error
salloc: error: eio_message_socket_accept: slurm_receive_msg[192.168.212.20:38160]: Protocol authentication error
salloc: error: Munge decode failed: Expired credential
salloc: auth/munge: _print_cred: ENCODED: Fri Sep 15 16:29:41 2023
salloc: auth/munge: _print_cred: DECODED: Fri Sep 15 16:42:45 2023
salloc: error: slurm_unpack_received_msg: [[slurm1.yggdrasil]:38168] auth_g_verify: RUN_JOB_COMPLETE has authentication error: Unspecified error
salloc: error: slurm_unpack_received_msg: [[slurm1.yggdrasil]:38168] Protocol authentication error
salloc: error: eio_message_socket_accept: slurm_receive_msg[192.168.212.20:38168]: Protocol authentication error
srun: error: Munge decode failed: Expired credential
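One detail I noticed in the log above: the ENCODED and DECODED timestamps on the expired credentials differ by about 13 minutes, which looks like clock skew between the controller and the node (as far as I know, MUNGE rejects credentials older than its TTL, which is by default on the order of minutes). A quick sanity check of the gap, using the two timestamps copied from the output above:

```python
from datetime import datetime

# Timestamps copied verbatim from the salloc error output above
fmt = "%a %b %d %H:%M:%S %Y"
encoded = datetime.strptime("Fri Sep 15 16:29:41 2023", fmt)
decoded = datetime.strptime("Fri Sep 15 16:42:45 2023", fmt)

# Gap between when the credential was encoded and when it was decoded
skew = (decoded - encoded).total_seconds()
print(f"apparent clock skew: {skew:.0f} s")  # prints "apparent clock skew: 784 s"
```

If the controller and the compute/login nodes are supposed to be NTP-synced, a 784-second gap would suggest something is off there, but I can only see it from the client side.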
For now I just restart the affected jobs and cross my fingers.
Steps to Reproduce
The problem is sporadic; I don’t know how to reproduce it.
Expected Result
Jobs should only be cancelled after actually hitting their time limit.
Actual Result
The jobs were cancelled after about 27 hours, almost three days short of their 4-day time limit.
Cheers,
Michael