Job behaviour difference between interactive session and queue submission

I am trying to run a processing script over some AOD files which are on my scratch. To test whether it works, I first run it in an interactive session, and things run fine. I usually request a 10 GB, 1-hour session. I have checked the memory profile of the script with htop and it never crosses 3 GB during the run.

However, when I repeat the same thing via a submission script, the job easily crosses 20 GB. These are the steps I follow to run the jobs.

Interactive session:

setupATLAS, source, compile package, run the script (for 1 AOD file)
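
Concretely, that is roughly the following (the AOD file and output CSV name here are just placeholders; the compile step is only needed the first time):

setupATLAS
cd build/
asetup --restore
source x86_64-centos7-gcc8-opt/setup.sh
athena.py --filesInput <one AOD file> InDetPhysValMonitoring/InDetPhysValMonitoring_topOptions.py - --setCSVName test.csv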

Submission script:

#!/bin/bash
#SBATCH --job-name=None
#SBATCH --cpus-per-task=1
#SBATCH --time=50:00
#SBATCH --mail-user=debajyoti.sengupta@unige.ch
#SBATCH --mail-type=FAIL
#SBATCH --partition=private-dpnc-cpu,shared-cpu
#SBATCH --output=/home/users/s/senguptd/atlas/idvpm/joblogs/slurm-%A-%x_%a.out
#SBATCH --chdir=/home/users/s/senguptd/atlas/idvpm/
#SBATCH --mem=20GB
#SBATCH -a 0-1
export XDG_RUNTIME_DIR=""
_task_number=${SLURM_ARRAY_TASK_ID:-"taskid"}   # array task index, used below to pick one AOD file

echo "On slurm node : $SLURMD_NODENAME"
export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase
PKG_SETUP=${ATLAS_LOCAL_ROOT_BASE}/packageSetups
. ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh
cd build/
asetup --restore                      # restore the ATLAS release recorded by the original asetup
. x86_64-centos7-gcc8-opt/setup.sh    # set up the environment of the compiled package

aodfiles=(/srv/beegfs/scratch/users/s/senguptd/atlas/data/mega/JZ2W/wPU/user.pagacova.JZ2W_wPU_300122_EXT0/user.pagacova.27982040.EXT0._000001.AOD.pool.root /srv/beegfs/scratch/users/s/senguptd/atlas/data/mega/JZ2W/wPU/user.pagacova.JZ2W_wPU_300122_EXT0/user.pagacova.27982040.EXT0._000002.AOD.pool.root)
opdir=/srv/beegfs/scratch/users/s/senguptd/atlas/data/mega/JZ2W/wPU

athena.py --filesInput ${aodfiles[$_task_number]} InDetPhysValMonitoring/InDetPhysValMonitoring_topOptions.py - --setCSVName ${_task_number}.csv
mv ${_task_number}.csv ${opdir}

There are usually O(100) AOD files; I am pasting only 2 here for brevity. When I test the script I run it with the --array flag anyway.
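
For the full set I build the array from a file list rather than hard-coding it, something along these lines (the list file path and script name are placeholders):

# one AOD path per line in the list file, no spaces in paths
mapfile -t aodfiles < /path/to/aod_filelist.txt
# submitted with: sbatch -a 0-$(( ${#aodfiles[@]} - 1 )) submit.sh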

Any idea what’s going wrong? Please also let me know what other information you would require to help debug this.

Regards,
Deb

Hi,

what do you mean by Interactive session:

  • you run your job directly on the login node (bad)
  • you request resources and run on them, as explained in our documentation? hpc:slurm [eResearch Doc]

Did you check the memory usage of one of your jobs? hpc:slurm [eResearch Doc]

Exactly, I request a CPU node and run the job.

This is for the job that fails:

[senguptd@node277.baobab run]$ sacct --format=Start,AveCPU,State,MaxRSS,JobID,NodeList,ReqMem --units=G -j 55994891_0
              Start     AveCPU      State     MaxRSS JobID               NodeList     ReqMem 
------------------- ---------- ---------- ---------- ------------ --------------- ---------- 
2022-03-04T18:08:41            OUT_OF_ME+            55994891_0           node277        20G 
2022-03-04T18:08:41   00:00:14 OUT_OF_ME+      7.40G 55994891_0.+         node277            
2022-03-04T18:08:41   00:00:00  COMPLETED      0.00G 55994891_0.+         node277   

Hi Yann, any ideas regarding what might be going wrong here?

Hello again,

I know things are a bit busy with getting baobab back online properly for everyone, but I was wondering if there were any updates on this one. I can’t seem to figure out what the difference is between the two setups. This, at the moment, is a big bottleneck in my workflow.
Any help is appreciated.

Hi @Debajyoti.Sengupta we have had an issue with memory enforcement for some weeks. We have corrected the Slurm config.

Do you still have the issue?

Best

Yann

Yes, I just checked and it still gets killed due to out-of-memory. This was on CPU228.

Hi,
I ran into the same issue with a GPU job. I requested 9 GB but ran into OOM at just above 4 GB:

sacct --format=Start,AveCPU,State,MaxRSS,JobID,NodeList,ReqMem --units=G -j 56404445
              Start     AveCPU      State     MaxRSS JobID            NodeList     ReqMem 
------------------- ---------- ---------- ---------- ------------ ------------- ---------- 
2022-04-01T15:45:59            OUT_OF_ME+            56404445           gpu009      8.79G 
2022-04-01T15:45:59   00:00:00 OUT_OF_ME+      0.01G 56404445.ba+       gpu009            
2022-04-01T15:45:59   00:00:00 OUT_OF_ME+      0.00G 56404445.ex+       gpu009            
2022-04-01T15:46:04   00:00:51 OUT_OF_ME+      4.19G 56404445.0         gpu009     

Hi again,

Looks like multiple users have a similar problem. It could still be that sacct is slow to update and simply doesn't capture the point where the job actually exceeds its memory (not sure, guessing here). But regardless, it's bizarre that jobs run fine interactively (on a node requested via salloc) without blowing up in memory, yet fail in batch mode.

Are there differences between the two?
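
If it helps for debugging, I can dump what each mode actually grants, from inside the salloc shell and at the top of the sbatch script; a minimal sketch of what I have in mind:

# what Slurm granted this job
scontrol show job $SLURM_JOB_ID | grep -Eo 'mem=[^,]+'
# what the shell itself is limited to
ulimit -v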

Hi, quick answer: there shouldn’t be any difference. We’ll do some tests on our side to see if we can reproduce the issue.

Hi Yann,

I was wondering if you had a chance to look into this matter since the last update?

Hi,

I tried with stressapptest.

I submitted an sbatch script, requesting the default memory (3G - 5%).

First run: I requested 5G with stressapptest:

[sagon@login2.baobab stress]$ sacct --format=Start,AveCPU,State,MaxRSS,JobID,NodeList,ReqMem --units=G -j 56799935
              Start     AveCPU      State     MaxRSS JobID               NodeList     ReqMem
------------------- ---------- ---------- ---------- ------------ --------------- ----------
2022-04-13T11:51:56            OUT_OF_ME+            56799935              cpu001      2.93G
2022-04-13T11:51:56   00:00:00 OUT_OF_ME+      0.00G 56799935.ba+          cpu001
2022-04-13T11:51:56   00:00:00 OUT_OF_ME+      0.00G 56799935.ex+          cpu001
2022-04-13T11:51:57   00:00:00 OUT_OF_ME+      0.00G 56799935.0            cpu001

As you can see, the job is killed before Slurm notices any memory consumption. Slurm polls the memory consumption for accounting, but the mechanism that actually kills the job is the cgroup limit.
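
If you want to see the enforced limit and the peak the kernel recorded, you can read the cgroup files directly from inside a job step, for example (cgroup v1 layout assumed, the exact path depends on the node configuration):

cat /sys/fs/cgroup/memory/slurm/uid_$(id -u)/job_${SLURM_JOB_ID}/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/slurm/uid_$(id -u)/job_${SLURM_JOB_ID}/memory.max_usage_in_bytes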

I then asked stressapptest for 2G and the job finished correctly.

[sagon@login2.baobab stress]$ sacct --format=Start,AveCPU,State,MaxRSS,JobID,NodeList,ReqMem --units=G -j 56800280
              Start     AveCPU      State     MaxRSS JobID               NodeList     ReqMem
------------------- ---------- ---------- ---------- ------------ --------------- ----------
2022-04-13T12:01:35             COMPLETED            56800280              cpu001      2.93G
2022-04-13T12:01:35   00:00:00  COMPLETED      0.00G 56800280.ba+          cpu001
2022-04-13T12:01:35   00:00:00  COMPLETED      0.00G 56800280.ex+          cpu001
2022-04-13T12:01:35   00:00:00  COMPLETED      0.00G 56800280.0            cpu001

As the job runtime is short, the memory usage isn’t captured by the polling either.

For me, it works as expected.
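
For reference, the test script was roughly of this shape (the wall time and stressapptest arguments shown are indicative, not the exact values):

#!/bin/bash
#SBATCH --time=05:00
# no --mem line, so the default memory per core applies
# run 1: ~5 GB from stressapptest -> killed by the cgroup OOM killer
# run 2: ~2 GB -> completes within the default allocation
stressapptest -M 5000 -s 60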

Is the suggestion to lower the memory and wall time requested?

Hi,

no, the point of my post was to show that Slurm and the memory control are working as expected.

If your job is killed because it runs out of memory, then request more memory.
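
For example, you can raise the limit at submission time without editing the script (the value and script name are only examples):

sbatch --mem=32G submit.sh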

Hi, can you please show us a complete terminal capture of how you are running your job using salloc?

Sorry it took a while to get to this; here is the pastebin link to the terminal log: pastebin (I couldn’t upload a .log file).

Useful info: I use a shell function called getcpu which is nothing but:

getcpu () {
    if [ $# -eq 0 ]; then
        echo "No time or memory requirements passed. Starting interactive CPU session with 1 hour and 10G of memory."
        salloc -n1 --partition=private-dpnc-cpu,shared-cpu --time=1:00:00 --mem=10G srun -n1 -N1 --pty $SHELL
    else
        echo "Requesting CPU session with $1 wall time and $2 memory."
        salloc -n1 --partition=private-dpnc-cpu,shared-cpu --time=$1 --mem=$2 srun -n1 -N1 --pty $SHELL
    fi
}
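
A typical test session is then just, for example:

getcpu              # defaults: 1 hour, 10G
getcpu 2:00:00 20G  # explicit wall time and memory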

The whole script then runs fine with max memory usage ~ 2-3 GB.
Let me know if you need more info.

Hi Yann,
Do you see anything obvious that might be causing the difference?