Strange oom-kill event

I am running into a problem I do not understand.
I am trying to run a job array of R scripts.
When I test a single job on debug-EL7 with the following script, it runs without problems:

#!/bin/bash
#SBATCH --job-name=simu_carrac_test
#SBATCH --cpus-per-task=2
#SBATCH --mail-user=denis.mongin@unige.ch
#SBATCH --mail-type=ALL
#SBATCH --time=00:15:00
#SBATCH --partition=debug-EL7
#SBATCH --array=1
#SBATCH --output=slurm-%A_%a.out
#SBATCH --mem=8000

module load foss/2018b R/3.5.1


srun Rscript --verbose simuRRwithAttrition_multijob_test.R ${SLURM_ARRAY_TASK_ID} > simuRRwithAttrition_multijob_test.Rout


When I run the same job on shared-EL7 with this script:

#!/bin/bash
#SBATCH --job-name=simu_carrac
#SBATCH --cpus-per-task=2
#SBATCH --mail-user=denis.mongin@unige.ch
#SBATCH --mail-type=ALL
#SBATCH --time=7:00:00
#SBATCH --partition=shared-EL7
#SBATCH --array=1
#SBATCH --output=slurm-%A_%a.out
#SBATCH --mem=8000

module load foss/2018b R/3.5.1

srun Rscript --verbose simuRRwithAttrition_multijob_test.R ${SLURM_ARRAY_TASK_ID} > simuRRwithAttrition_multijob_test.Rout

The job stops after a few tens of seconds and I get:

slurmstepd: error: Detected 30 oom-kill event(s) in step 40561239.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: node007: task 1: Out Of Memory
srun: First task exited 30s ago
srun: step:40561239.0 tasks 0,2,4: running
srun: step:40561239.0 tasks 1,3,5-7: exited abnormally
srun: Terminating job step 40561239.0
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 40561239.0 ON node007 CANCELLED AT 2020-11-21T23:27:34 ***

When I ssh to the node executing this job (node007), I see multiple instances of my R script running, whereas on debug I only have one R script running.

I guess this is what causes the memory problem: when my script starts to use a bit more memory, having multiple instances active at the same time within the same 8000 MB allocation consumes too much memory.

The problem is that I do not launch multiple instances myself, and this does not occur on debug. What is happening here?
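One way to see how many tasks srun actually starts would be to print the Slurm task ID from inside the step. A minimal sketch (untested on my side, using only standard Slurm environment variables):

#!/bin/bash
#SBATCH --partition=shared-EL7
#SBATCH --time=00:05:00
#SBATCH --mem=1000

# Each task launched by srun prints its own task ID and the node it runs on;
# the number of output lines is the number of instances actually started.
srun bash -c 'echo "task ${SLURM_PROCID} on $(hostname)"'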

Dear @Denis.Mongin

According to the job number you showed us, your job is running 8 tasks:

[root@login2 ~]#  sacct -j 40561239.0 --format="AllocCPUS,NTasks"
 AllocCPUS   NTasks
---------- --------
        16        8

Are you sure you showed us the correct sbatch script?

Yes, I am sure. And to verify, I just re-executed it in /carrac/:

sbatch simu_carrac_baobab.sh

copied from the console with nano simu_carrac_baobab.sh:


#!/bin/bash
#SBATCH --job-name=simu_carrac
#SBATCH --cpus-per-task=2
#SBATCH --mail-user=denis.mongin@unige.ch
#SBATCH --mail-type=ALL
#SBATCH --time=7:00:00
#SBATCH --partition=shared-EL7
#SBATCH --array=1
#SBATCH --output=slurm-%A_%a.out
#SBATCH --mem=8000

module load foss/2018b R/3.5.1

srun Rscript --verbose simuRRwithAttrition_multijob_test.R ${SLURM_ARRAY_TASK_I$




which is the sbatch script I showed. I obtain:


slurmstepd: error: Detected 1 oom-kill event(s) in step 40617586.0 cgroup. Some$
srun: error: node013: task 7: Out Of Memory
srun: First task exited 30s ago
srun: step:40617586.0 tasks 2,4,6: running
srun: step:40617586.0 tasks 0-1,3,5,7: exited abnormally
srun: Terminating job step 40617586.0
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 40617586.0 ON node013 CANCELLED AT 2020-11-23T17:42$
slurmstepd: error: Detected 27 oom-kill event(s) in step 40617586.batch cgroup.$


whereas the same R script, launched on debug with

sbatch simu_carrac_baobab_test.sh

works perfectly.
I do not understand why it launches several tasks on shared (and not on debug), and so runs out of memory.

Hello,

Indeed, this seems to be a bug or a feature :) I could reproduce the issue myself with a simple script.
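For reference, a sketch of the kind of minimal script involved (shown only as an illustration, not necessarily the exact one I used):

#!/bin/bash
#SBATCH --partition=shared-EL7
#SBATCH --cpus-per-task=2
#SBATCH --time=00:05:00
#SBATCH --mem=1000

# With a single (implicit) task this should print one hostname;
# when the problem occurs, several lines are printed, one per extra task.
srun hostname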

I’ve opened a ticket at SchedMD.

Thanks for the feedback!

Yann

Do you have any idea how long it will take, or any way to avoid it?
The simulation is the final run for a paper that needs to be finished soon.

Thank you for your help

Maybe try without the --cpus-per-task option, and only request more CPUs if you need more than the default of 1.

Yes, please use the partition mono-shared-EL7 instead of shared-EL7. This should work, as the parameters of this partition are the same as those of debug-EL7.
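Putting the two suggestions together, a sketch of what the modified submission script could look like (the rest of your script stays unchanged):

#!/bin/bash
#SBATCH --job-name=simu_carrac
#SBATCH --mail-user=denis.mongin@unige.ch
#SBATCH --mail-type=ALL
#SBATCH --time=7:00:00
# mono-shared-EL7 instead of shared-EL7, and no --cpus-per-task line
# (request more CPUs explicitly only if you need more than the default of 1)
#SBATCH --partition=mono-shared-EL7
#SBATCH --array=1
#SBATCH --output=slurm-%A_%a.out
#SBATCH --mem=8000

module load foss/2018b R/3.5.1

srun Rscript --verbose simuRRwithAttrition_multijob_test.R ${SLURM_ARRAY_TASK_ID} > simuRRwithAttrition_multijob_test.Rout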

OK, I'll try that. I actually only need one CPU; I had set it to 2, together with more memory, because of the memory problem I was facing.

Perfect! I'll do that, thanks a bunch.

It worked like a charm.
Thanks to the team, great service and responsiveness, as usual.