Srun error gpu006 task 6 Out Of Memory

If you are asking for help, try to provide information that can help us solve your issue, such as:

what did you try:
My job stopped after running for only about 1 hour on Baobab.
JobID: 6793261

what was the error message (from the .out file):
error: Detected 1 oom_kill event in StepId=6793261.0. Some of the step tasks have been OOM Killed.
srun: error: gpu006: task 6: Out Of Memory

the batchfile:

#!/bin/bash
#SBATCH --job-name="*"
#SBATCH --mail-type=ALL
#SBATCH --mail-user=*
#SBATCH --time=12:00:00
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=1
#SBATCH --partition=shared-gpu
#SBATCH --gpus=1
#========================================
module load GCC/11.3.0
module load OpenMPI/4.1.4
module load GROMACS/2023.1-CUDA-11.7.0
export OMP_NUM_THREADS=8
srun gmx_mpi mdrun -deffnm push -s push.tpr -v -pin on -nb gpu -cpi push.cpt -noappend

@Jingze.Duan

It seems you forgot to specify the memory you need for your job. By default, Slurm will assign 3 GB of memory.

Thank you for your reply, but my job produced no more than 200 MB of output data in the hour it ran.
By the way, how can I specify the memory?

Hi,

Is that figure based on real-time resource tracking? The size of the files your job writes is not the same as the memory (RAM) it uses while running.
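As a sketch (assuming job accounting is enabled on the cluster), you can check what the job actually consumed with sacct; MaxRSS shows the peak resident memory recorded per step:

sacct -j 6793261 --format=JobID,State,ReqMem,MaxRSS,Elapsed

If MaxRSS approaches or exceeds ReqMem, the step will be OOM-killed, regardless of how small the output files are.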

From the Slurm documentation: Slurm Workload Manager - sbatch

--mem=<size>[units]
Specify the real memory required per node. Default units are megabytes. Different units can be specified using the suffix [K|M|G|T]. Default value is DefMemPerNode and the maximum value is MaxMemPerNode. If configured, both parameters can be seen using the scontrol show config command. This parameter would generally be used if whole nodes are allocated to jobs (SelectType=select/linear). Also see --mem-per-cpu and --mem-per-gpu. The --mem, --mem-per-cpu and --mem-per-gpu options are mutually exclusive. If --mem, --mem-per-cpu or --mem-per-gpu are specified as command line arguments, then they will take precedence over the environment.

NOTE: A memory size specification of zero is treated as a special case and grants the job access to all of the memory on each node.
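As an illustration only (the 16G and 4G values below are placeholders, not recommendations for your GROMACS run), you would add one of the following lines to the batch script above:

#SBATCH --mem=16G            # real memory per node
# or, alternatively (mutually exclusive with --mem):
#SBATCH --mem-per-cpu=4G     # memory per allocated CPU

Pick one of the two forms and size it to what your simulation actually needs; sacct output from a previous run is a good starting point.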

Hello,

I have had a similar issue since the update from 1.8.x to 1.9.3.
It seems to be coming from Julia's garbage collector (GC). The problem is fixed in the new version 1.10 (from December 2023).
I'll ask for Julia to be updated to 1.10.

Thank you,
Best regards

EDIT: you can manually trigger the GC by calling GC.gc()

Hi, please try the new version we just installed: New software installed: Julia version 1.10.0