I am trying to run a processing script over some AOD files that are on my scratch. To test whether it works, I first run it in an interactive session, and there things run okay. I usually request a session with 1 hour of wall time and 10 GB of memory. I have checked the script's memory profile with htop, and it never crosses 3 GB during the run.
However, when I repeat the same steps via a submission script, I see the job easily crossing 20 GB. These are the steps I follow to run the jobs.
Interactive session:
setupATLAS, source, compile package, run the script (for 1 AOD file)
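Concretely, this is roughly what the two setups look like (the release number, build paths, and script/file names below are placeholders, not my actual ones):

# interactive, inside the getcpu/salloc session:
setupATLAS
asetup AnalysisBase,24.2.35           # example release
source build/x86_64-*/setup.sh        # depends on the build layout
./runProcessing input.AOD.pool.root   # placeholder script, one AOD file

# batch, submit.sh (submitted with: sbatch submit.sh):
#!/bin/bash
#SBATCH --partition=private-dpnc-cpu,shared-cpu
#SBATCH --time=1:00:00
#SBATCH --mem=10G
#SBATCH --ntasks=1
# setupATLAS is an interactive-shell alias, so the batch script sources the setup directly:
export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase
source "${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh"
asetup AnalysisBase,24.2.35
source build/x86_64-*/setup.sh
./runProcessing input.AOD.pool.root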
I know things are a bit busy with getting baobab back online properly for everyone, but I was wondering if there were any updates on this one. I can't seem to figure out what the difference between the two setups is. At the moment this is a big bottleneck in my workflow.
Any help is appreciated.
Looks like multiple users have a similar problem. It could still be that sacct is slow to update and doesn't show that the job does in fact go out of memory (not sure, guessing here). But regardless, it's bizarre that jobs would run fine interactively (on a node requested via salloc) without blowing up in memory, yet fail in batch mode.
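For what it's worth, sacct only samples memory periodically, so a job that allocates quickly can be killed before any large MaxRSS value is ever recorded. You can cross-check what Slurm logged for a finished job with standard sacct format fields (replace <jobid> with the actual job ID):

sacct -j <jobid> --format=JobID,JobName,State,ExitCode,ReqMem,MaxRSS,MaxVMSize,Elapsed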
As you can see, the job is killed before Slurm notices any memory consumption. Slurm does poll memory usage, but only periodically; the mechanism that actually kills the job is the cgroup limit.
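If you want to see the limit the cgroup actually enforces, you can inspect it from inside a running job. The exact paths depend on whether the node runs cgroup v1 or v2 and on the Slurm cgroup configuration, so take this as a sketch:

# cgroup v1 layout (typical for Slurm's cgroup plugin):
cat /sys/fs/cgroup/memory/slurm/uid_${UID}/job_${SLURM_JOB_ID}/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/slurm/uid_${UID}/job_${SLURM_JOB_ID}/memory.max_usage_in_bytes

# cgroup v2 layout:
cat /sys/fs/cgroup$(awk -F: 'NR==1 {print $3}' /proc/self/cgroup)/memory.max

When the job's usage hits that limit, the kernel's OOM killer terminates it immediately, independently of Slurm's accounting poll.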
I then asked stressapptest for 2G and the job finished correctly.
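Something along these lines; stressapptest's -M flag is the amount of memory to allocate in MB, and -s is the test duration in seconds (the --mem value here is just an example):

sbatch --mem=4G --time=0:10:00 --wrap="stressapptest -M 2048 -s 60"

With -M pushed above the job's --mem, the job should get killed by the cgroup OOM mechanism instead of finishing.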
Sorry it took a while to get to this; the following is the pastebin link to the terminal log: pastebin (I couldn't upload a .log file).
Useful info: I use a shell function called getcpu, which is nothing but:
getcpu () {
    if [ $# -lt 2 ]; then
        # No (or incomplete) requirements passed: fall back to the defaults.
        echo "No time or memory requirements passed. Starting interactive CPU session with 1 hour and 10G of memory."
        salloc -n1 --partition=private-dpnc-cpu,shared-cpu --time=1:00:00 --mem=10G srun -n1 -N1 --pty "$SHELL"
    else
        echo "Requesting CPU session with $1 wall time and $2 memory."
        salloc -n1 --partition=private-dpnc-cpu,shared-cpu --time="$1" --mem="$2" srun -n1 -N1 --pty "$SHELL"
    fi
}
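So, for example:

getcpu                # defaults: 1 hour and 10G
getcpu 4:00:00 20G    # 4 hours of wall time and 20 GB of memory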
The whole script then runs fine, with a maximum memory usage of ~2-3 GB.
Let me know if you need more info.