Slurmstepd: error: Detected 1 oom-kill event(s) in StepId=51911687.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler

what did you try:
I was trying to run a GPU job that performs predictions (inference) with a trained neural network model.

what didn’t work:
I tried the private-schaer-gpu and shared-gpu partitions with the maximum memory per CPU that was available (16000). However, I still get the Out of Memory error. Training runs fine, but the testing step fails to finish. I also tried reducing the testing dataset for my current case, but I get the same OOM error.

what was the error message:
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=51911687.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: gpu007: task 0: Out Of Memory
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=51911687.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

path to the relevant files (logs, sbatch script, etc):

  • logs: /home/share/schaer2/neural-network-f/neural_network_test/5_sec_NN/slurm-51911687.out
  • sbatch script: /home/share/schaer2/neural-network-f/neural_network_test/5_sec_NN/run_predict.sh

Hi,

I checked your job 51911687:

[root@login2.cluster ~]# sacct --format=Start,AveCPU,State,MaxRSS,JobID,NodeList,ReqMem --units=G -j 51911687
              Start     AveCPU      State     MaxRSS        JobID        NodeList     ReqMem
------------------- ---------- ---------- ---------- ------------ --------------- ----------
2021-11-09T00:38:47            OUT_OF_ME+            51911687              gpu007     2.93Gc
2021-11-09T00:38:47   00:00:00 OUT_OF_ME+      0.01G 51911687.ba+          gpu007     2.93Gc
2021-11-09T00:38:47   00:00:00 OUT_OF_ME+      0.00G 51911687.ex+          gpu007     2.93Gc
2021-11-09T00:38:47   02:04:35 OUT_OF_ME+     44.73G 51911687.0            gpu007     2.93Gc

It seems you asked for only about 3 GB per core, while the job step actually used 44.73 GB at its peak (MaxRSS). You have probably changed your sbatch script since your post, as you are now specifying another partition and asking for 10 GB. To solve your issue, I suggest requesting more memory.
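
For example, here is a minimal sketch of what the memory-related directives in the sbatch script could look like. The partition, GPU request, and command name below are assumptions; only the memory figures follow from the sacct output above, where the step peaked at 44.73 GB while roughly 3 GB per core was requested:

#!/bin/bash
#SBATCH --partition=shared-gpu       # or private-schaer-gpu (assumption)
#SBATCH --gres=gpu:1                 # one GPU, as in the original job (assumption)
#SBATCH --cpus-per-task=3            # 3 cores x 16000 MB/core = 48 GB, above the 44.73 GB peak
#SBATCH --mem-per-cpu=16000          # stays within the per-CPU limit you mentioned

srun python predict.py               # placeholder for your actual prediction command

Alternatively, if the partition allows it, a single --mem=48G directive (total memory for the job) raises the limit without changing the core count.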