Dear all,
I’m trying to run a high-resolution climate simulation (it worked with a lower spatial resolution).
I get this error:
slurmstepd: error: Detected 1 oom-kill event(s) in step 24388146.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
I tried:
#SBATCH --mem-per-cpu=3000
or
#SBATCH --mem=3000
But nothing works…
Do you have any pointers?
Thank you!
You can try putting a higher number in --mem-per-cpu or --mem. Increase it as much as needed.
You can also request a memory size that is deliberately too big, then ssh to the node where the job is running and check the actual memory usage with htop. Note that a higher memory requirement will limit which nodes can be used, as the job will only run once the memory requirement is fulfilled.
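For example, the request could look like the lines below in the submit script (the 16000 MB value is only an illustrative starting point, not a recommendation for this particular model):

```shell
#SBATCH --mem=16000           # total memory for the job, in MB
# or, alternatively, per allocated CPU (don't use both at once):
##SBATCH --mem-per-cpu=16000
```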
Hello, please show your full sbatch script.
You can check here how to determine how much memory was used by your job:
https://baobab.unige.ch/enduser/src/enduser/submit.html#memory-and-cpu-usage
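Once the job has finished, the accounting data can also be queried with sacct (the job ID below is the one from your error message; adjust as needed):

```shell
# Show the peak resident memory (MaxRSS) and requested memory of a finished job
sacct -j 24388146 --format=JobID,MaxRSS,ReqMem,Elapsed,State
```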
That’s right, by default each job has 3GB per core, the same as Emeline was requesting.
Thanks for your quick response!
Here’s my submit.sh script, which I launch as “sbatch submit.sh”
#!/bin/bash
#
#SBATCH --job-name=Aquaplanet
#SBATCH --output=aqua.txt
#
#SBATCH --partition=debug-EL7
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:14:00
#SBATCH --mem-per-cpu=3000
env | grep SLURM
data_dir=/home/bolmonte/scratch/20191209_Formation_LMDZ
ulimit -Ss unlimited
./gcm_96x95x39_phylmd_seq_orch.e
I tried to re-launch it to run sstat (it runs for a very short time):
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
24412661 debug-EL7 Aquaplan bolmonte R 0:02 1 node001
$ sstat --format=AveCPU,MaxRSS,JobID,NodeList -j 24412661
AveCPU MaxRSS JobID Nodelist
---------- ---------- ------------ --------------------
00:01.000 498412K 24412661.ba+ node001
I also executed sreport:
$ sreport job sizesbyaccount user=bolmonte PrintJobCount start=2019-01-01 end=2019-12-31
--------------------------------------------------------------------------------
Job Sizes 2019-01-01T00:00:00 - 2019-12-10T16:59:59 (29696400 secs)
Units are in number of jobs ran
--------------------------------------------------------------------------------
Cluster Account 0-49 CPUs 50-249 CPUs 250-499 CPUs 500-999 CPUs >= 1000 CPUs % of cluster
--------- --------- ------------- ------------- ------------- ------------- ------------- ------------
baobab root 533 0 0 0 0 100.00%
The error I get is still:
/var/spool/slurmd/job24412661/slurm_script: line 16: 119022 Killed ./gcm_96x95x39_phylmd_seq_orch.e
slurmstepd: error: Detected 1 oom-kill event(s) in step 24412661.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
Thanks again for your help!
Emeline
edit: I updated the post with code formatting for readability.
Try increasing --mem-per-cpu to something like 16000, and if that’s not enough, increase it to 32000 or even 64000. When the job no longer crashes, check the real amount of memory used in the report.
You could also get this message if there are errors in your code, for example a division error.
Hello
To request memory you can indeed use --mem-per-cpu, but that is more useful when you use more than one CPU, which is not your case. In your case, it’s better to use --mem=XG, which doesn’t depend on the number of allocated CPUs. As I said, the default value of X is 3. As Pablo suggested, try increasing this value (for example 6G) and relaunch the job. If it doesn’t crash, check the real memory consumption and adapt this value to the maximum RSS you reached, with a good margin (1G for example).
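Putting this together, the relevant lines of the submit script would look like this (6G is just an illustrative starting value to be adjusted after checking the real usage):

```shell
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=6G          # total memory for the job, independent of the CPU count
```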
You should also (unless you have a good reason not to) prefix this command with srun:
srun ./gcm_96x95x39_phylmd_seq_orch.e
Thank you for your help!
Increasing --mem did the trick:
bolmonte@login2:~/LMDZ_Formation/LMDZ2019/modipsl/modeles/LMDZ/AQUAPLANET_highres$ squ
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
24445021 debug-EL7 Aquaplan bolmonte R 0:17 1 node001
bolmonte@login2:~/LMDZ_Formation/LMDZ2019/modipsl/modeles/LMDZ/AQUAPLANET_highres$ sstat --format=AveCPU,MaxRSS,JobID,NodeList -j 24445021
AveCPU MaxRSS JobID Nodelist
---------- ---------- ------------ --------------------
00:15.000 4227064K 24445021.ba+ node001
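(For reference, sstat reports MaxRSS in KiB; a quick one-liner converts the value above to GiB:)

```shell
# Convert the MaxRSS value reported by sstat (KiB) to GiB
echo "4227064K" | awk '{ sub(/K$/, "", $1); printf "%.1f GiB\n", $1 / 1024 / 1024 }'
# prints "4.0 GiB"
```

So the run actually peaks above the 3 GB default, which explains the earlier oom-kill.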
I also added srun before ./gcm, and somehow it greatly decreased the reported memory usage…:
bolmonte@login2:~/LMDZ_Formation/LMDZ2019/modipsl/modeles/LMDZ/AQUAPLANET_highres$ squ
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
24445171 debug-EL7 Aquaplan bolmonte R 3:24 1 node001
bolmonte@login2:~/LMDZ_Formation/LMDZ2019/modipsl/modeles/LMDZ/AQUAPLANET_highres$ sstat --format=AveCPU,MaxRSS,JobID,NodeList -j 24445171
AveCPU MaxRSS JobID Nodelist
---------- ---------- ------------ --------------------
00:00.000 8160K 24445171.ba+ node001
So in the end, I guess I don’t need to increase the memory allocation anymore… Is that normal?
Thank you for your patience!