Ever since the Baobab maintenance finished last week, my interactive jobs have been getting killed before their time limit expires.
For example, today I submitted a job with an 8-hour time limit:
salloc -n1 -c16 --partition=public-cpu,private-dpnc-cpu,shared-cpu --time=08:00:00 --mem=20G
But each time, the job was killed after about 1-2 hours:
srun: Job step aborted: Waiting up to 92 seconds for job step to finish.
salloc: Job allocation 6129035 has been revoked.
[2025-12-08T15:44:50.841] error: *** STEP 6129035.interactive ON cpu089 CANCELLED AT 2025-12-08T15:44:50 DUE TO TIME LIMIT ***
srun: error: cpu089: task 0: Killed
This has happened at least twice, on different nodes. Could you please help me with this?
[2025-12-08T14:56:12.460] sched: _slurm_rpc_allocate_resources JobId=6129035 NodeList=(null) usec=5978
[2025-12-08T14:56:13.005] sched: Allocate JobId=6129035 NodeList=cpu089 #CPUs=16 Partition=private-dpnc-cpu
[2025-12-08T15:41:21.003] job_time_limit: inactivity time limit reached for JobId=6129035
[2025-12-08T15:46:20.880] cleanup_completing: JobId=6129035 completion process took 299 seconds
To clarify, was the job terminated while a process was still running under salloc?
We are also considering that this may be related to network issues, which could cause similar behaviour; we are currently investigating this as part of the ongoing cluster incident.
Thank you very much for your response and for looking into this.
I was running a VS Code server and Snakemake on the node in question, but perhaps the scheduler thought the job was idle. The salloc session was also running inside tmux when it was killed.
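For what it's worth, the "inactivity time limit reached" log line matches Slurm's InactiveLimit behaviour: an allocation that has no active job step for longer than that interval can be terminated, regardless of its --time limit. Processes started directly in the salloc shell (like a VS Code server or Snakemake) do not count as job steps. A minimal sketch of how to check this and work around it, assuming the inactivity limit is indeed what fired here (partition names taken from the original command; the admins would need to confirm the actual cause):

```shell
# Show the cluster's inactivity threshold in seconds (0 means disabled).
scontrol show config | grep -i InactiveLimit

# Launch the interactive shell itself as a job step via srun, so the
# allocation always has an active step and is not counted as idle.
salloc -n1 -c16 --partition=public-cpu,private-dpnc-cpu,shared-cpu \
       --time=08:00:00 --mem=20G \
       srun --pty bash -l
```

Any long-running work started from that srun shell keeps the step alive for the duration of the allocation, which should avoid the inactivity kill if that is the mechanism at play.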