Using Memory in Python

June 04, 2024

When you think about memory usage, you're usually trying to reduce it. CPython, malloc, and your kernel all try to help you use less memory. However, I found myself fighting all three of them because I was trying to use more memory.

The context manager

At work, we have a context manager that measures how much memory was used via psutil.Process.memory_info(). It looks roughly like this:

import contextlib
import psutil
@contextlib.contextmanager
def measure_memory():
    proc = psutil.Process()
    start_rss = proc.memory_info().rss
    m = {'rss_mib': None}
    yield m
    end_rss = proc.memory_info().rss
    m['rss_mib'] = (end_rss - start_rss) / 1024 / 1024

This is more performant than tracemalloc and works with C extensions.

The troubles

But then we wrote a test:

def test_measure_memory():
    with measure_memory() as m:
        bytearray(1024 * 1024)
    assert m['rss_mib'] >= 1.0

This should allocate a 1 MiB bytearray and grow RSS by at least that much, but it usually doesn't. Adding a print(m) to the above code, I get

$ python3 -V
Python 3.11.8
$ ./test.py
{'rss_mib': 0.1484375}

$ podman run -it --rm -v `pwd`:/usr/src -w /usr/src python:3.10-slim \
    sh -c 'pip install psutil && ./test.py'
{'rss_mib': 0.296875}

$ podman run -it --rm -v `pwd`:/usr/src -w /usr/src python:3.9-slim \
    sh -c 'pip install psutil && ./test.py'
{'rss_mib': 0.09375}

So we allocated some memory, but a lot less than 1 MiB.

tracemalloc

We can ask Python what's going on by starting tracemalloc inside the context manager:

import tracemalloc
@contextlib.contextmanager
def measure_memory():
    proc = psutil.Process()
    start_rss = proc.memory_info().rss
    m = {'rss_mib': None}
    tracemalloc.start() # start right before yielding
    yield m
    end_rss = proc.memory_info().rss
    for t in tracemalloc.take_snapshot().traces: # show allocations
        print(t)
    m['rss_mib'] = (end_rss - start_rss) / 1024 / 1024

This prints

[...]/site-packages/psutil/_pslinux.py:1969: 32 B

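which means nothing allocated by test_measure_memory was still alive when we took the snapshot! The bytearray was never bound to a name, so CPython freed it right away. The only surviving allocation came from psutil.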

Adjusting our code to assign the bytearray to a variable changes the behavior.

def test_measure_memory():
    with measure_memory() as m:
        _ = bytearray(1024 * 1024)
    assert m['rss_mib'] >= 1.0

Now, this prints

[...]/test.py:22: 1024 KiB
[...]/site-packages/psutil/_pslinux.py:1969: 32 B

so we're at least in business.
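The difference between the two versions comes down to CPython's reference counting: an object that is never bound to a name is freed as soon as the expression finishes, so it counts toward tracemalloc's peak but not toward its live total. Here is a minimal sketch using only the standard library. traced_growth is a hypothetical helper for illustration, not part of our test suite, and the behavior is specific to CPython.

```python
import tracemalloc

SIZE = 1024 * 1024  # 1 MiB, same as the test


def traced_growth(keep_reference):
    """Return (bytes still live, peak bytes) around one bytearray allocation."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    if keep_reference:
        buf = bytearray(SIZE)  # bound to a name: stays alive until we return
    else:
        bytearray(SIZE)        # never bound: CPython frees it immediately
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return current - before, peak


print(traced_growth(keep_reference=False))
print(traced_growth(keep_reference=True))
```

In both cases the peak is at least 1 MiB, but only the version that keeps a reference still has the 1 MiB live when we look.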

Unfortunately, this test is flaky. It sometimes doesn't allocate 1 MiB. We tried to change the test to allocate 100× more memory:

def test_measure_memory():
    with measure_memory() as m:
        _ = bytearray(100 * 1024 * 1024) # 100×
    assert m['rss_mib'] > 1.0

but that reduced the flakiness rather than getting rid of it.

Holes

The problem is that the allocator may re-use previously freed memory. We can observe this in action if we do another big allocation beforehand to simulate some other tests running before ours:

def test_measure_memory():
    _ = bytearray(1024 * 1024) # make a hole
    with measure_memory() as m:
        _ = bytearray(1024 * 1024) # line 23
    print(m)

This prints

[...]/test.py:23: 1024 KiB
[...]/site-packages/psutil/_pslinux.py:1969: 32 B
[...]/test.py:23: 56 B
{'rss_mib': 0.37890625}

which shows my second bytearray on line 23 triggering a 1 MiB allocation in CPython but only 388 KiB of RSS usage inside the context manager. We can make this problem worse by making 2 objects and then triggering garbage collection:

def test_measure_memory():
    _ = bytearray(1024 * 1024)
    del _
    _ = bytearray(1024 * 1024)
    del _
    import gc; gc.collect()
    with measure_memory() as m:
        _ = bytearray(1024 * 1024) # line 27
    print(m)

Now, we get

[...]/psutil/_pslinux.py:1969: 32 B
[...]/test.py:27: 56 B
[...]/psutil/_common.py:788: 120 B
[...]/test.py:27: 1024 KiB
{'rss_mib': 0.0}

So tracemalloc still reports a 1 MiB allocation, but RSS increased by precisely 0. Freeing the first two bytearrays left a hole that the allocator reused for the last bytearray instead of requesting new memory. The flakiness of this test is especially bad because it's more likely to fail when more tests have run before it (more 1-MiB-sized holes).
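One rough way to watch a hole being reused is to compare buffer addresses before and after a free. This is only a sketch: buffer_address and reuses_freed_block are hypothetical helpers, and whether the address actually repeats depends on the allocator's state, so it can print either True or False from one run to the next.

```python
import ctypes


def buffer_address(buf):
    # from_buffer exposes the bytearray's storage without copying it;
    # the temporary ctypes object is released as soon as we have the address
    return ctypes.addressof(ctypes.c_char.from_buffer(buf))


def reuses_freed_block(size=1024 * 1024):
    first = bytearray(size)
    first_addr = buffer_address(first)
    del first                    # free the block, leaving a hole
    second = bytearray(size)     # same size, so it may land in that hole
    return buffer_address(second) == first_addr


print(reuses_freed_block())
```

When this prints True, the second bytearray was placed at the same address as the first one, which is the situation where RSS has no reason to grow.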

mmap

It seems like our fundamental problem is that bytearray is allocated by the CPython object allocator. Since it's giving us so much trouble, can we bypass it by allocating memory some other way? We could certainly write/find a C extension that allocates memory directly. But Python has an mmap module which sounds very promising.

import mmap
def test_measure_memory():
    with measure_memory() as mem:
        mm = mmap.mmap(-1, 1024 * 1024,
            mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
        # copy the first half to the second half,
        # writing to every page in the second half
        mm.move(512 * 1024, 0, 512 * 1024)
        # copy the second half to the first half
        mm.move(0, 512 * 1024, 512 * 1024)
    mm.close()
    print(mem)

Does it work?

{'rss_mib': 1.0}

Cool, we used exactly 1 MiB of RAM! We bypassed the Python object allocator and were able to directly allocate memory ourselves!

Let's try making a hole in memory again:

def test_measure_memory():
    _ = bytearray(10 * 1024 * 1024) # hole
    del _
    with measure_memory() as mem:
        mm = mmap.mmap(-1, 1024 * 1024,
            mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
        mm.move(512 * 1024, 0, 512 * 1024)
        mm.move(0, 512 * 1024, 512 * 1024)
    mm.close()
    print(mem)

Now it outputs

{'rss_mib': 0.875}

which doesn't look good. It turns out this version of the test is still flaky.

Is it holes again?

My original theory for why this was happening was that we were still filling holes in memory. If we inspect what happened in our program with strace, we see

$ strace -e mmap,munmap python3 test.py
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4b7aa75000
mmap(NULL, 87539, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f4b7aa5f000
mmap(NULL, 921624, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f4b7a97d000
mmap(0x7f4b7a98d000, 483328, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x10000) = 0x7f4b7a98d000
mmap(0x7f4b7aa03000, 368640, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x86000) = 0x7f4b7aa03000
mmap(0x7f4b7aa5d000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xdf000) = 0x7f4b7aa5d000

This continues for a few screens. So there were apparently a ton of mmaps that happened, even though we only called it once. Why? By default, CPython uses the pymalloc allocator.

The arena allocator uses the following functions: [...] mmap() and munmap() if available

So other objects are allocated via mmap! Also,

It uses memory mappings called “arenas” with a fixed size [...]. It falls back to PyMem_RawMalloc() [...] for allocations larger than 512 bytes.

PyMem_RawMalloc() is a wrapper for malloc which is itself an arena allocator:

the GNU C Library malloc implementation maintains multiple such areas [...]. Each such area is internally referred to as an arena.

It's arena allocators all the way down. But surely, our 1 MiB allocation and the pymalloc arenas are too large to fit in a glibc malloc arena?

The other way of memory allocation is for very large blocks, i.e. much larger than a page. These requests are allocated with mmap

So, to summarize:

pymalloc allocates arenas with mmap and large objects with glibc malloc
glibc malloc allocates large objects with mmap
oh, by the way, glibc malloc extends the main arena with sbrk, but it allocates new arenas with... mmap

The only heap allocations happening in CPython that aren't mmap-ed are small, non-Python objects.

My holes theory was that previous allocations were mmaping over the same address that we were mmaping. When the original allocations were freed, the kernel didn't actually bother decreasing the RSS. So when we came along, we didn't use up the full 1 MiB of RSS.

Under this theory, I came up with a neat trick for using 1 MiB of memory. Reserve a chunk in the address space, then un-use memory without releasing the reserved chunk in the address space via MADV_DONTNEED:

The madvise() system call is used to give advice or directions to the kernel about the address range beginning at address addr and with size length.

I love giving advice or directions to the kernel!

The advice values listed below allow an application to tell the kernel how it expects to use some mapped or shared memory areas

Perfect, I just mapped a memory area.

MADV_DONTNEED
Do not expect access in the near future. After a successful MADV_DONTNEED operation, the semantics of memory access in the specified region are changed: subsequent accesses of pages in the range will succeed, but will result in [...] zero-fill-on-demand pages for anonymous private mappings.

The kernel is free to delay freeing the pages until an appropriate moment. The resident set size (RSS) of the calling process will be immediately reduced however.

Wow, what a coincidence! We just happen to be measuring the RSS! This is a pretty crazy feature but, in conjunction with the previous trick, appears to reliably solve our problem.

def test_measure_memory():
    # 1. allocate 1 MiB of address space somewhere
    with mmap.mmap(-1, 1024 * 1024,
            mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS) as mm:
        # 2. free the underlying pages but hold onto the address space
        mm.madvise(mmap.MADV_DONTNEED)
        # 3. snapshot our memory usage (measure_memory.__enter__)
        with measure_memory() as mem:
            # 4. use memory
            mm.move(512 * 1024, 0, 512 * 1024)
            mm.move(0, 512 * 1024, 512 * 1024)
    assert mem['rss_mib'] > 0.0

By forcing an allocation at a specific address in step 4, I thought I was avoiding interference from previous allocations.

Unfortunately, it still increases the memory by less than 1 MiB, so we had to write the check as > 0.0. But the good news is this version of the test doesn't flake, at least!

Someone asked me if this test was worth writing... I replied it might be better to delete the test, but it'd be more fun to keep it.

The plot thickens

However, it turns out this theory is incorrect. mmaping and writing to memory immediately allocates the memory; unmapping immediately frees the memory. According to my friend Ricky, what we were seeing was actually a cached RSS value.

psutil.Process.memory_info() reads /proc/pid/statm. statm is known to be inaccurate:

(2) resident set size
(inaccurate; same as VmRSS in /proc/pid/status)

OK, let's go read about VmRSS.

Resident set size. Note that the value here is the sum of RssAnon, RssFile, and RssShmem. This value is inaccurate; see /proc/pid/statm above.

Uh, that's not very helpful. Let's go read some code. Here's a discussion thread about adding "inaccurate" to the docs.

Since 34e55232e59f7b19050267a05ff1226e5cd122a5 (introduced back in v2.6.34), Linux uses per-thread RSS counters to reduce cache contention on the per-mm counters. With a 4K page size, that means that you can end up with the counters off by up to 252KiB per thread.

It then talks about SPLIT_RSS_COUNTING, but that actually has since been removed:

This patch converts the rss_stats into percpu_counter to convert the error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2).

drgn

In order to see if this is what's happening, we need to get the per-CPU RSS counters. As far as I know, this isn't exposed in some handy /proc file, so we'll just have to ask the kernel directly. Conveniently, drgn helps us do just that!

First, let's tweak our code to slow down.

import os
def test_measure_memory():
    print('pid', os.getpid())
    _ = bytearray(10 * 1024 * 1024)
    del _
    with mmap.mmap(-1, 1024 * 1024, mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS) as mm:
        #mm.madvise(mmap.MADV_DONTNEED)
        with measure_memory() as mem:
            input('press enter to use memory')
            mm.move(512 * 1024, 0, 512 * 1024)
            mm.move(0, 512 * 1024, 512 * 1024)
        input('memory used! press enter to continue')
    print(mem)

Now, we fire up drgn.

$ pip3 install drgn
$ sudo apt install linux-image-amd64-dbg
$ sudo venv/bin/drgn
>>> counters = find_task(610602).mm.rss_stat
>>> counters
(struct percpu_counter [4]){
[...]

OK, we have an array of 4 percpu_counters, just like the last patch said. The array is indexed by page type. We care about MM_ANONPAGES because we're using an anonymous mapping, so we want index 1.

>>> counters[1]
(struct percpu_counter){
    .lock = [...]
    .count = (s64)1709,
    .list = (struct list_head){
        .next = (struct list_head *)0xffff9d720bb35048,
        .prev = (struct list_head *)0xffff9d720bb35098,
    },
    .counters = (s32 *)0x3da7f0c20ad0,
}

We already see the cached value that statm/psutil report:

>>> counters[1].count
(s64)1709

How do we get the per-CPU values, though? Conveniently, when we loaded up drgn, it automatically imported a bunch of helpers, including percpu_counter_sum.

>>> percpu_counter_sum(counters[1])
1709

Cool, these are the same number! Now, I hit enter to use memory in my python program and check the counters again.

>>> counters[1].count
(s64)1933
>>> percpu_counter_sum(counters[1])
1965

Great, the numbers are different! These are measured in pages. To convert to bytes, we multiply by my page size of 4096. So the cached/statm number is 1933 × 4096 B ≈ 7.551 MiB and the sum of the per-CPU counters gives us 1965 × 4096 B ≈ 7.676 MiB.

Perhaps more importantly, the difference is (1965 - 1933) × 4096 = 0.125 MiB, which is exactly the difference between what we were allocating above (1.0 MiB) and what we were seeing in statm (0.875 MiB)! The per-CPU counters also explain why the test flaked on our 128-core CI instances way more than on our laptops.
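The arithmetic is easy to replay. The sketch below uses the two counter values from my drgn session and the "approximately (nr_cpus ^ 2)" pages figure from the patch description. The 8-CPU case is a made-up stand-in for a laptop, and the margin is an order-of-magnitude estimate, not a kernel guarantee.

```python
PAGE_SIZE = 4096      # bytes; the page size on my machine
MIB = 1024 * 1024


def pages_to_mib(pages, page_size=PAGE_SIZE):
    return pages * page_size / MIB


cached = 1933    # counters[1].count after touching the mapping
accurate = 1965  # percpu_counter_sum(counters[1])

print(round(pages_to_mib(cached), 3))    # 7.551
print(round(pages_to_mib(accurate), 3))  # 7.676
print(pages_to_mib(accurate - cached))   # 0.125


def approx_margin_mib(nr_cpus, page_size=PAGE_SIZE):
    # "approximately (nr_cpus ^ 2)" pages, taken from the patch description
    return pages_to_mib(nr_cpus ** 2, page_size)


print(approx_margin_mib(8))    # 0.25 (hypothetical 8-CPU laptop)
print(approx_margin_mib(128))  # 64.0
```

Under that estimate, a 1 MiB allocation is easily lost in the noise on a 128-CPU machine, while on a small laptop the cached value is usually close enough.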

Accurate RSS

You may be wondering how to get accurate, uncached memory usage. Going back to the proc(5) manpage,

Some of these values are inaccurate because of a kernel-internal scalability optimization. If accurate values are required, use /proc/pid/smaps or /proc/pid/smaps_rollup instead, which are much slower but provide accurate, detailed information.

psutil parses those files for you inside Process.memory_full_info().
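If you'd rather not depend on psutil, the files are simple enough to read by hand. This is a minimal, Linux-only sketch: the function names are my own, and it assumes smaps_rollup may be missing on older kernels, in which case it falls back to summing the per-mapping entries in smaps.

```python
import os


def parse_statm_rss_pages(text):
    """Second field of /proc/<pid>/statm: resident pages (fast, may lag)."""
    return int(text.split()[1])


def parse_smaps_rss_kib(text):
    """Sum every 'Rss:' line; works for both smaps and smaps_rollup."""
    total = 0
    for line in text.splitlines():
        if line.startswith("Rss:"):
            total += int(line.split()[1])  # the value is reported in kB
    return total


def read_rss_bytes(pid="self"):
    """Return (statm RSS, smaps RSS) in bytes for the given pid."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    with open(f"/proc/{pid}/statm") as f:
        statm = parse_statm_rss_pages(f.read()) * page_size
    try:
        f = open(f"/proc/{pid}/smaps_rollup")
    except FileNotFoundError:
        f = open(f"/proc/{pid}/smaps")  # slower, one entry per mapping
    with f:
        smaps = parse_smaps_rss_kib(f.read()) * 1024
    return statm, smaps
```

Calling read_rss_bytes() before and after touching the mapping lets you compare the cached statm number with the slower smaps one in the same process.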

Bonus

If you want to learn actually useful things about Linux memory management like how to save memory by not garbage collecting, read the bonus section at the end of my fork(2) blog post.