Preventing the OOM killer

Dear hpc-community,
Recently I have been running a lot of interactive computations with numpy on the cluster and I frequently encounter out-of-memory errors. If I allocate an array which is vastly larger than the available RAM, numpy raises a MemoryError:

In [3]: while 1:
   ...:     ar=np.ones((2**36))
   ...:     tlist.append(ar)
   ...:
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
Input In [3], in <cell line: 1>()
      1 while 1:
----> 2     ar=np.ones((2**36))
      3     tlist.append(ar)

File ~/env/lib/python3.10/site-packages/numpy/core/numeric.py:205, in ones(shape, dtype, order, like)
    202 if like is not None:
    203     return _ones_with_like(shape, dtype=dtype, order=order, like=like)
--> 205 a = empty(shape, dtype, order)
    206 multiarray.copyto(a, 1, casting='unsafe')
    207 return a

MemoryError: Unable to allocate 512. GiB for an array with shape (68719476736,) and data type float64

In [4]:

This behavior is perfect, since it allows me to go back and change some parameters to make the allocation smaller, clean up some unused arrays, and/or figure out what went wrong without restarting the session.

However, if the allocation is only, say, 1 GB per array, the process gets killed by the OOM killer instead:

In [5]: while 1:
   ...:     ar=np.ones((2**27))
   ...:     tlist.append(ar)
   ...:
fish: Job 1, 'ipython' terminated by signal SIGKILL (Forced quit)

This is very bad, because now everything in the current session is lost, open HDF5 files are potentially corrupted, and it also doesn't say where the problematic large allocation happened, which makes debugging harder.

As far as I am aware, this happens because Linux overcommits memory by default, i.e. it allows processes to allocate more memory than is actually available instead of failing the allocation.
Is there a way to disable this “feature” for just a single process?
If not, say I am willing to patch numpy or write a low-level library, is there a way to tell Linux to be "honest" about whether there is enough memory available? It would make sense to me if the MAP_POPULATE flag did this, but I wasn't able to find any documentation that it actually does.
Am I overlooking some other simple solution?

Cheers and Thanks a lot,
Michael

Hi @Michael.Sonner

Could you share your sbatch file?

Hi,
For interactive work I usually use srun; in this case:

srun --time 4-00:00:0 --partition private-dpt-cpu,public-cpu --cpus-per-task 8 --mem 75G --pty bash

Hi @Michael.Sonner

When you request resources with Slurm, it sets up a cgroup which limits the memory and CPU resources for your job. I don't know if numpy is aware of the cgroup mechanism or if it just looks at the memory available on the system. If the latter, numpy will try to allocate more than what is permitted and the OOM killer will kill the process. I tried to google about that with no luck. Maybe you should check on the numpy forum?

Hi Yann,
It turns out that the solution is to use the setrlimit syscall to reduce the limit on the address space to the memory requested from Slurm. This can be done, for example, via ulimit -v in the shell or resource.setrlimit(...) in Python. Then numpy raises a MemoryError even for smaller allocations once they violate these resource limits, instead of the process getting killed by the OOM killer. For now I am just missing a way of automatically setting the rlimits to the cgroup limits, but that doesn't seem too hard to do (rough sketch below).
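
In case it is useful to others, here is a rough sketch of that missing piece: read the memory limit from the job's cgroup and clamp RLIMIT_AS to it at the start of the session. This is untested and the paths are assumptions on my part, since the exact cgroup layout depends on whether the cluster uses cgroup v1 or v2 and on how Slurm is configured (the limit may also live on a parent cgroup, e.g. the job rather than the step, in which case one would have to walk up the hierarchy):

import resource

def limit_address_space_to_cgroup():
    """Clamp RLIMIT_AS to the memory limit of this process's cgroup, if any."""
    limit = None
    with open("/proc/self/cgroup") as f:
        for line in f:
            hier_id, controllers, path = line.strip().split(":", 2)
            if hier_id == "0":
                # cgroup v2 unified hierarchy
                candidate = f"/sys/fs/cgroup{path}/memory.max"
            elif "memory" in controllers.split(","):
                # cgroup v1 memory controller
                candidate = f"/sys/fs/cgroup/memory{path}/memory.limit_in_bytes"
            else:
                continue
            try:
                raw = open(candidate).read().strip()
            except OSError:
                continue
            if raw != "max":          # "max" means unlimited in cgroup v2
                value = int(raw)
                if value < 2**60:     # cgroup v1 reports "unlimited" as a huge number
                    limit = value
            break
    if limit is not None:
        # RLIMIT_AS bounds the whole virtual address space, not just resident
        # memory, so this is somewhat stricter than the cgroup limit.
        soft, hard = resource.getrlimit(resource.RLIMIT_AS)
        if hard != resource.RLIM_INFINITY:
            limit = min(limit, hard)
        resource.setrlimit(resource.RLIMIT_AS, (limit, hard))

limit_address_space_to_cgroup()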

For future reference, here is a more complete explanation:
Generally [1], Linux does not actually reserve memory, check the cgroup limits, or check whether there is still memory available when asked to allocate memory via the brk, mmap or mremap syscalls. Instead the allocation succeeds in most cases; memory is only really reserved once it is actually touched by the process. Only when the actually used memory exceeds the cgroup limit does the OOM killer shoot down one of the processes, without any possibility of recovery. However, limits set by setrlimit do get checked on the allocation syscalls, and ENOMEM is returned if an allocation would violate the rlimits. In numpy's case this gets correctly converted into a MemoryError, allowing me to continue the session :slightly_smiling_face:

[1] Depending on the vm.overcommit_memory setting (though it apparently always ignores cgroups), the flags used in the syscall, and probably a lot of other things.
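
A minimal demonstration of the difference (the 2 GiB cap is arbitrary, just small enough that the allocation below exceeds it; in a real job you would use the cgroup-derived limit instead):

import resource
import numpy as np

# Cap the address space at ~2 GiB, keeping the existing hard limit.
two_gib = 2 * 1024**3
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
resource.setrlimit(resource.RLIMIT_AS, (two_gib, hard))

try:
    # 2**29 float64 values are ~4 GiB: the underlying mmap fails with ENOMEM
    # and numpy raises MemoryError, instead of the process being OOM-killed
    # later when the pages are actually touched.
    ar = np.ones(2**29)
except MemoryError:
    print("caught MemoryError, the session keeps running")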

Cheers,
Michael

Excellent,

That’s right, but in our case Slurm ensures that the reservation is effective (i.e. we don’t allow memory oversubscription, see https://slurm.schedmd.com/cons_res_share.html#memory_management).

I’ve opened an issue with SchedMD; maybe they have a native solution.

@Michael.Sonner

Please check the answer from SchedMD.

Best

Yann

You should use salloc instead of srun --pty bash. See our doc