Dear hpc-community,
I have recently been running a lot of interactive computations with numpy on the cluster, and I frequently run into out-of-memory errors. If I allocate an array that is vastly larger than the available RAM, numpy raises a MemoryError:
In [3]: while 1:
   ...:     ar=np.ones((2**36))
   ...:     tlist.append(ar)
   ...:
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
Input In [3], in <cell line: 1>()
1 while 1:
----> 2 ar=np.ones((2**36))
3 tlist.append(ar)
File ~/env/lib/python3.10/site-packages/numpy/core/numeric.py:205, in ones(shape, dtype, order, like)
202 if like is not None:
203 return _ones_with_like(shape, dtype=dtype, order=order, like=like)
--> 205 a = empty(shape, dtype, order)
206 multiarray.copyto(a, 1, casting='unsafe')
207 return a
MemoryError: Unable to allocate 512. GiB for an array with shape (68719476736,) and data type float64
In [4]:
This behavior is perfect, since it allows me to go back and change some parameters to make the array smaller, clean up some unused arrays, and/or figure out what went wrong without restarting the session.
However, if each allocation is only, say, 1 GiB per array, the process gets killed by the OOM killer instead:
In [5]: while 1:
   ...:     ar=np.ones((2**27))
   ...:     tlist.append(ar)
   ...:
fish: Job 1, 'ipython' terminated by signal SIGKILL (Forced quit)
This is very bad: everything in the current session is lost, open HDF5 files are potentially corrupted, and nothing indicates where the problematic allocation happened, which makes debugging harder.
As far as I am aware, this happens because Linux overcommits memory by default, i.e. it allows processes to allocate more memory than is actually available instead of failing the allocation.
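To check which overcommit policy a node actually uses, the following is a minimal sketch, assuming a Linux system that exposes /proc/sys/vm/overcommit_memory and /proc/meminfo. The helper names (describe_overcommit, read_overcommit_mode, read_meminfo_kib) are made up for illustration. In the default heuristic mode only obviously excessive requests are refused, which would be consistent with the 512 GiB request failing cleanly while the repeated 1 GiB requests succeed until the OOM killer steps in.

```python
from pathlib import Path

# Meaning of the vm.overcommit_memory values, per the Linux kernel documentation.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (default): only obviously excessive requests are refused",
    1: "always overcommit: allocation requests are never refused",
    2: "strict accounting: requests beyond CommitLimit are refused",
}


def describe_overcommit(mode):
    """Return a short description of an overcommit mode number."""
    return OVERCOMMIT_MODES.get(mode, f"unknown mode {mode}")


def read_overcommit_mode(path="/proc/sys/vm/overcommit_memory"):
    """Return the current mode as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def read_meminfo_kib(fields=("MemAvailable", "CommitLimit", "Committed_AS"),
                     path="/proc/meminfo"):
    """Return the selected /proc/meminfo fields in KiB (empty dict on failure)."""
    result = {}
    try:
        text = Path(path).read_text()
    except OSError:
        return result
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name in fields and rest.split():
            result[name] = int(rest.split()[0])  # values are reported in kB
    return result
```

Calling describe_overcommit(read_overcommit_mode()) together with read_meminfo_kib() in the IPython session should show the policy and how much memory is already committed.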
Is there a way to disable this “feature” for just a single process?
If not, and assuming I am willing to patch numpy or write a low-level library, is there a way to make Linux be “honest” about whether enough memory is available? It would make sense to me if the MAP_POPULATE flag did this, but I was not able to find any documentation saying that it actually does.
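To make the MAP_POPULATE idea concrete, here is a minimal sketch using only Python's standard mmap module, assuming Linux and Python 3.10 or newer (where mmap.MAP_POPULATE is exposed). The helper name alloc_prefaulted is hypothetical. One caveat: the mmap(2) man page states that the call does not fail if the mapping cannot be populated, so prefaulting may only move the OOM kill to allocation time rather than turn it into a clean error.

```python
import mmap

# mmap.MAP_POPULATE exists on Linux with Python 3.10+. Falling back to 0 is an
# assumption made here so the sketch still runs elsewhere (without prefaulting).
MAP_POPULATE = getattr(mmap, "MAP_POPULATE", 0)


def alloc_prefaulted(nbytes):
    """Create an anonymous private mapping and ask the kernel to prefault it."""
    flags = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS | MAP_POPULATE
    # fileno=-1 requests memory that is not backed by any file.
    return mmap.mmap(-1, nbytes, flags=flags)
```

A buffer obtained this way could presumably be wrapped with np.frombuffer(buf, dtype=np.float64) to use it as an array, though whether that fits into numpy's own allocation path is a separate question.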
Am I overlooking some other simple solution?
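One candidate for a per-process limit is RLIMIT_AS. The following is a minimal sketch, assuming Linux and Python's standard resource module; the helper name limit_address_space is hypothetical. It caps the virtual address space of the current process, so allocations beyond the cap should fail with a MemoryError instead of being overcommitted. Note that it counts virtual rather than physical memory (including memory-mapped files and reserved but untouched pages), so it is not the same as asking the kernel how much RAM is really free.

```python
import resource


def limit_address_space(max_bytes):
    """Cap this process's virtual address space; return the soft limit applied."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    if hard != resource.RLIM_INFINITY:
        # An unprivileged process cannot set the soft limit above the hard limit.
        max_bytes = min(max_bytes, hard)
    # Only the soft limit is changed, so it can be raised again later.
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, hard))
    return max_bytes
```

For example, calling limit_address_space(64 * 2**30) at the start of the IPython session would cap the session at 64 GiB of address space; the right value depends on the node and on what else is mapped into the process.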
Cheers and thanks a lot,
Michael