Strange oom-kill event

I am running into a problem I do not understand.
I am trying to run a job array of R scripts.
When I test a single job on debug-EL7 with the following script, it runs without problems:

#!/bin/bash
#SBATCH --job-name=simu_carrac_test
#SBATCH --cpus-per-task=2
#SBATCH --mail-user=denis.mongin@unige.ch
#SBATCH --mail-type=ALL
#SBATCH --time=00:15:00
#SBATCH --partition=debug-EL7
#SBATCH --array=1
#SBATCH --output=slurm-%A_%a.out
#SBATCH --mem=8000

module load foss/2018b R/3.5.1


srun Rscript --verbose simuRRwithAttrition_multijob_test.R ${SLURM_ARRAY_TASK_ID} > simuRRwithAttrition_multijob_test.Rout


When I run the same job on shared-EL7 with this script:

#!/bin/bash
#SBATCH --job-name=simu_carrac
#SBATCH --cpus-per-task=2
#SBATCH --mail-user=denis.mongin@unige.ch
#SBATCH --mail-type=ALL
#SBATCH --time=7:00:00
#SBATCH --partition=shared-EL7
#SBATCH --array=1
#SBATCH --output=slurm-%A_%a.out
#SBATCH --mem=8000

module load foss/2018b R/3.5.1

srun Rscript --verbose simuRRwithAttrition_multijob_test.R ${SLURM_ARRAY_TASK_ID} > simuRRwithAttrition_multijob_test.Rout

The job stops after a few tens of seconds and I get:

slurmstepd: error: Detected 30 oom-kill event(s) in step 40561239.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: node007: task 1: Out Of Memory
srun: First task exited 30s ago
srun: step:40561239.0 tasks 0,2,4: running
srun: step:40561239.0 tasks 1,3,5-7: exited abnormally
srun: Terminating job step 40561239.0
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 40561239.0 ON node007 CANCELLED AT 2020-11-21T23:27:34 ***

When I ssh to the node executing this job (node007), I see multiple instances of my R script running, whereas on debug I only have one R script running.

I guess this is what causes the memory problem: when my script starts to use a bit more memory, having multiple instances active at the same time within the same 8000 MB allocation consumes too much memory.

The problem is that I do not launch multiple instances myself, and this does not occur on debug. What is happening here?
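One way to see how many tasks srun actually starts would be to print the Slurm task ID from inside the step. A minimal sketch (untested on my side, using only standard Slurm environment variables):

#!/bin/bash
#SBATCH --partition=shared-EL7
#SBATCH --time=00:05:00
#SBATCH --mem=1000

# Each task launched by srun prints its own task ID and the node it runs on;
# the number of output lines is the number of instances actually started.
srun bash -c 'echo "task ${SLURM_PROCID} on $(hostname)"'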

Dear @Denis.Mongin

According to the job number you showed us, your job is running 8 tasks:

[root@login2 ~]#  sacct -j 40561239.0 --format="AllocCPUS,NTasks"
 AllocCPUS   NTasks
---------- --------
        16        8

Are you sure you showed us the correct sbatch script?

Yes, I am sure. And to verify, I just re-executed it in /carrac/:

sbatch simu_carrac_baobab.sh

copied from the console with nano simu_carrac_baobab.sh:


#!/bin/bash
#SBATCH --job-name=simu_carrac
#SBATCH --cpus-per-task=2
#SBATCH --mail-user=denis.mongin@unige.ch
#SBATCH --mail-type=ALL
#SBATCH --time=7:00:00
#SBATCH --partition=shared-EL7
#SBATCH --array=1
#SBATCH --output=slurm-%A_%a.out
#SBATCH --mem=8000

module load foss/2018b R/3.5.1

srun Rscript --verbose simuRRwithAttrition_multijob_test.R ${SLURM_ARRAY_TASK_I$




which is the sbatch script I showed. I obtain:


slurmstepd: error: Detected 1 oom-kill event(s) in step 40617586.0 cgroup. Some$
srun: error: node013: task 7: Out Of Memory
srun: First task exited 30s ago
srun: step:40617586.0 tasks 2,4,6: running
srun: step:40617586.0 tasks 0-1,3,5,7: exited abnormally
srun: Terminating job step 40617586.0
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 40617586.0 ON node013 CANCELLED AT 2020-11-23T17:42$
slurmstepd: error: Detected 27 oom-kill event(s) in step 40617586.batch cgroup.$


whereas the same R script, launched on debug with

sbatch simu_carrac_baobab_test.sh

works perfectly.
I do not understand why it launches several tasks on shared (and not on debug), and so runs out of memory.

Hello,

Indeed, this seems to be a bug or a feature :) I could reproduce the issue myself with a simple script.
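For reference, a sketch of the kind of minimal script involved (shown only as an illustration, not necessarily the exact one I used):

#!/bin/bash
#SBATCH --partition=shared-EL7
#SBATCH --cpus-per-task=2
#SBATCH --time=00:05:00
#SBATCH --mem=1000

# With a single (implicit) task this should print one hostname;
# when the problem occurs, several lines are printed, one per extra task.
srun hostname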

I’ve opened a ticket at SchedMD.

Thanks for the feedback!

Yann

Do you have any idea how long it will take, or any way to avoid it?
The simulation is the final run for a paper that needs to be finished soon.

Thank you for your help

Maybe try without the --cpus-per-task option, and only request more CPUs if you need more than the default of 1.

Yes, please use the partition mono-shared-EL7 instead of shared-EL7. This should work, as the parameters of this partition are the same as those of debug-EL7.
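Putting the two suggestions together, a sketch of what the modified submission script could look like (the rest of your script stays unchanged):

#!/bin/bash
#SBATCH --job-name=simu_carrac
#SBATCH --mail-user=denis.mongin@unige.ch
#SBATCH --mail-type=ALL
#SBATCH --time=7:00:00
# mono-shared-EL7 instead of shared-EL7, and no --cpus-per-task line
# (request more CPUs explicitly only if you need more than the default of 1)
#SBATCH --partition=mono-shared-EL7
#SBATCH --array=1
#SBATCH --output=slurm-%A_%a.out
#SBATCH --mem=8000

module load foss/2018b R/3.5.1

srun Rscript --verbose simuRRwithAttrition_multijob_test.R ${SLURM_ARRAY_TASK_ID} > simuRRwithAttrition_multijob_test.Rout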

OK, I'll try that. I actually only need one CPU; I had set it to 2, together with more memory, because of the memory problem I was facing.

Perfect! I'll do that, thanks a bunch.

It worked like a charm.
Thanks to the team, great service and responsiveness, as usual.