Ever since the Baobab maintenance finished last week, my interactive jobs have started getting killed before their time limit expires.
For example, today I submitted a job with an 8-hour time limit: salloc -n1 -c16 --partition=public-cpu,private-dpnc-cpu,shared-cpu --time=08:00:00 --mem=20G
But each time the job got killed after about 1-2 hours:
➜ weakly-supervised-search (cathode-improvements) ✗ srun: Job step aborted: Waiting up to 92 seconds for job step to finish.
salloc: Job allocation 6129035 has been revoked.
[2025-12-08T15:44:50.841] error: *** STEP 6129035.interactive ON cpu089 CANCELLED AT 2025-12-08T15:44:50 DUE TO TIME LIMIT ***
srun: error: cpu089: task 0: Killed
This happened at least twice on different nodes. Could you please help me with this?
[2025-12-08T14:56:12.460] sched: _slurm_rpc_allocate_resources JobId=6129035 NodeList=(null) usec=5978
[2025-12-08T14:56:13.005] sched: Allocate JobId=6129035 NodeList=cpu089 #CPUs=16 Partition=private-dpnc-cpu
[2025-12-08T15:41:21.003] job_time_limit: inactivity time limit reached for JobId=6129035
[2025-12-08T15:46:20.880] cleanup_completing: JobId=6129035 completion process took 299 seconds
To clarify, was the job terminated while a process was still running under salloc?
We are also considering that this may be related to network issues, which could cause similar behaviour; we are currently investigating this as part of the ongoing cluster incident.
Thank you very much for your response and for looking into this.
I was running a VS Code server and Snakemake on the node in question, but perhaps the scheduler thought the job was idle. Also, salloc was running inside tmux when it got killed.
I need my jobs to run uninterrupted, as Snakemake runs inside the allocation to manage my workflow, which can take up to a day to finish. Do you have any suggestions?
Best regards,
Vilius
salloc: Job allocation 6547165 has been revoked.
srun: Job step aborted: Waiting up to 92 seconds for job step to finish.
[2026-01-14T17:37:06.137] error: *** STEP 6547165.interactive ON cpu239 CANCELLED AT 2026-01-14T17:37:06 DUE TO TIME LIMIT ***
srun: error: cpu239: task 0: Killed
[2026-01-14T17:30:45.011] job_time_limit: inactivity time limit reached for JobId=6547165
[2026-01-14T17:38:35.992] cleanup_completing: JobId=6547165 completion process took 470 seconds
Your job has been killed for ‘inactivity’.
From the Slurm FAQ:
Why is my job killed prematurely?
Slurm has a job purging mechanism to remove inactive jobs (resource allocations) before reaching its time limit, which could be infinite. This inactivity time limit is configurable by the system administrator. You can check its value with the command
scontrol show config | grep InactiveLimit
The value of InactiveLimit is in seconds. A zero value indicates that job purging is disabled. A job is considered inactive if it has no active job steps or if the srun command creating the job is not responding. In the case of a batch job, the srun command terminates after the job script is submitted. Therefore batch job pre- and post-processing is limited to the InactiveLimit. Contact your system administrator if you believe the InactiveLimit value should be changed.
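As a concrete illustration of reading that value (the output line below is hypothetical; run the scontrol command above on the cluster to see the real limit):

```shell
# Hypothetical output of `scontrol show config | grep InactiveLimit`:
line='InactiveLimit           = 3600 sec'
# The third whitespace-separated field is the limit in seconds (0 = purging disabled):
secs=$(echo "$line" | awk '{print $3}')
echo "$secs"   # prints 3600
```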
@maciej.falkiewicz are you using tmux or another such tool instead of a plain shell?
It could be a network issue between your machine and the cluster impacting the SSH process.
Yes @Adrien.Albert, I do use tmux. There could be a network issue between the login node and the compute node, but then the termination wouldn’t be reported as TIMEOUT, would it?
I can try the same with sbatch if you reserve the node for me.
My two cents to optimize your job submission: the shared-cpu partition’s maximum time is 12 h and you are requesting 24 h. Either remove this partition from the list of requested partitions or use 12:00:00 as the time limit.
This won’t fix your issue but it is better for the scheduling.
Yes, the job status shows TIMEOUT, but the Slurm logs also indicate that the cause is inactivity, meaning Slurm considered the interactive session (salloc/srun) unresponsive.
We also know this does not happen with sbatch, because batch jobs do not depend on the stability of your SSH connection.
Another user recently had exactly the same issue, and migrating to sbatch fully solved his problem.
For this reason, we very strongly recommend running jobs via sbatch whenever possible.
With salloc and srun, any network disruption can cause Slurm to kill the job.
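For a workflow like the Snakemake setup described earlier, a batch submission could look like the sketch below (the job name and script are hypothetical; resource values are copied from the salloc command at the top of the thread, and shared-cpu is dropped per the 12 h partition-limit advice — adjust everything to your workflow):

```shell
#!/bin/sh
#SBATCH --job-name=snakemake-driver
#SBATCH --partition=public-cpu,private-dpnc-cpu   # no shared-cpu: its 12 h limit is below our time request
#SBATCH --time=24:00:00                           # the workflow can take up to a day
#SBATCH --cpus-per-task=16
#SBATCH --mem=20G

# The Snakemake driver runs on the compute node itself, so it survives
# any SSH/network drops between your laptop and the cluster.
snakemake --cores "$SLURM_CPUS_PER_TASK"
```

Submit it with something like `sbatch run_workflow.sh`; the workflow then no longer depends on your laptop staying connected.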
Have you observed the same behavior when using OpenOnDemand?
OOD keeps the session on the server side, so it is more tolerant of network instability.
Also keep in mind: a laptop that goes into sleep mode or closes its lid often loses network connectivity, which can trigger this situation.
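If you do keep using salloc over SSH, OpenSSH client keep-alives can ride out short network blips that would otherwise make Slurm consider the session unresponsive (a sketch for ~/.ssh/config; the Host alias is hypothetical):

```
# ~/.ssh/config -- probe the server every 60 s and tolerate up to 10 missed
# replies (~10 minutes of outage) before dropping the connection.
Host baobab
    ServerAliveInterval 60
    ServerAliveCountMax 10
```

Note that this will not survive a laptop suspend, which drops the TCP connection outright.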
My test:
To confirm the root cause, I performed the following test as a regular user:
SSH connection to login node
Start a screen session: screen -S test
Run salloc inside this screen session
Detach from the screen (Ctrl-A D)
Disconnect from the login node entirely
The salloc is doing “nothing” (just an idle shell waiting for input).
Result:
The salloc allocation continued running for more than 30 minutes without interruption, even though the SSH connection was closed.
Example (sacct output):
(baobab)-[root@login1 ~]$ sacct -u alberta
JobID JobName Account User NodeList ReqCPUS ReqMem NTasks Start End Elapsed State
--------------- ---------- ---------- --------- --------------- -------- ---------- -------- ------------------- ------------------- ---------- ----------
6562597 interacti+ burgi alberta cpu349 1 3000M 2026-01-16T11:02:56 Unknown 00:34:13 RUNNING
This test points my suspicion toward network interruptions between your laptop and the login node.
Actually, this sounds like a possible explanation. My job was scheduled for 7 days, but I had shared-gpu in the partition list. However, the TIMEOUT came after ~24 hours, not 12.
After the TIMEOUT, I scheduled the same job with private partitions only, and it has now been running for over a day.
Thank you very much for your investigations and suggestions. I will also switch to using sbatch.
Perhaps as a useful data point: my setup was also connecting to the login node first, starting a tmux session there, and running salloc from inside tmux when the timeouts happened.
Same as Vilius here.
I’m using tmux and salloc, and ever since December I have also experienced this premature timeout.
I understand that, yes, we could use sbatch for this, but a lot of people have been using tmux + salloc, and it has suddenly become a problem without us modifying our workflows.
This happened to me today (and yes it is a tmux session)
(baobab)-[algren@gpu023 resources]$ ctsalloc: Job allocation 6561472 has been revoked.
srun: Job step aborted: Waiting up to 92 seconds for job step to finish.
[2026-01-16T14:02:17.857] error: *** STEP 6561472.interactive ON gpu023 CANCELLED AT 2026-01-16T14:02:17 DUE TO TIME LIMIT ***
srun: error: gpu023: task 0: Killed
(baobab)-[algren@login1 ~]$
…
(baobab)-[algren@login1 ~]$ salloc -c4 --partition=private-dpnc-gpu,shared-gpu, --time=00-12:00:00 --mem=32GB --gres=gpu:1,VramPerGpu:2G
salloc: Pending job allocation 6570106
salloc: job 6570106 queued and waiting for resources
salloc: job 6570106 has been allocated resources
salloc: Granted job allocation 6570106
salloc: Waiting for resource configuration
salloc: Nodes gpu044 are ready for job
(baobab)-[algren@gpu044 ~]$ srun: Job step aborted: Waiting up to 92 seconds for job step to finish.
salloc: Job allocation 6570106 has been revoked.
[2026-01-16T14:19:31.959] error: *** STEP 6570106.interactive ON gpu044 CANCELLED AT 2026-01-16T14:19:31 DUE TO TIME LIMIT ***
srun: error: gpu044: task 0: Killed
(baobab)-[algren@login1 ~]$ salloc -c4 --partition=private-dpnc-gpu,shared-gpu, --time=01-12:00:00 --mem=32GB --gres=gpu:1,VramPerGpu:2G
(baobab)-[algren@gpu023 ~]$ diffgae