Using Memory in Python
June 04, 2024
When you think about memory usage, usually you're trying to reduce it. CPython, malloc, and your kernel all try to help you use less memory. However, I found myself fighting against all three of them because I was trying to use more memory.
The context manager
At work, we have a context manager that measures how much memory was used via
psutil.Process.memory_info().
It looks roughly like this:
import contextlib
import psutil
@contextlib.contextmanager
def measure_memory():
    proc = psutil.Process()
    start_rss = proc.memory_info().rss
    m = {'rss_mib': None}
    yield m
    # runs when the with block exits
    end_rss = proc.memory_info().rss
    m['rss_mib'] = (end_rss - start_rss) / 1024 / 1024
This is more performant than tracemalloc and works with C extensions.
The troubles
But then we wrote a test:
def test_measure_memory():
    with measure_memory() as m:
        bytearray(1024 * 1024)
    assert m['rss_mib'] >= 1.0
This should allocate a 1 MiB bytearray,
but it usually doesn't.
Adding a print(m) to the above code, I get
$ python3 -V
Python 3.11.8
$ ./test.py
{'rss_mib': 0.1484375}
$ podman run -it --rm -v `pwd`:/usr/src -w /usr/src python:3.10-slim \
sh -c 'pip install psutil && ./test.py'
{'rss_mib': 0.296875}
$ podman run -it --rm -v `pwd`:/usr/src -w /usr/src python:3.9-slim \
sh -c 'pip install psutil && ./test.py'
{'rss_mib': 0.09375}
So we allocated some memory, but a lot less than 1 MiB.
tracemalloc
We can ask python what's going on. In the context manager,
import tracemalloc
@contextlib.contextmanager
def measure_memory():
    proc = psutil.Process()
    start_rss = proc.memory_info().rss
    m = {'rss_mib': None}
    tracemalloc.start() # start right before yielding
    yield m
    end_rss = proc.memory_info().rss
    for t in tracemalloc.take_snapshot().traces: # show allocations
        print(t)
    m['rss_mib'] = (end_rss - start_rss) / 1024 / 1024
This prints
[...]/site-packages/psutil/_pslinux.py:1969: 32 B
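which means nothing that test_measure_memory allocated was still alive when we took the snapshot!
The bytearray was never bound to a name, so CPython freed it right away.
The only surviving allocation came from psutil.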
Adjusting our code to assign the bytearray to a variable changes the behavior.
def test_measure_memory():
    with measure_memory() as m:
        _ = bytearray(1024 * 1024)
    assert m['rss_mib'] >= 1.0
Now, this prints
[...]/test.py:22: 1024 KiB
[...]/site-packages/psutil/_pslinux.py:1969: 32 B
so we're at least in business.
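Here's a small sketch of why assigning to a variable matters, using only tracemalloc. The helper name traced_growth is made up for this post. It compares how many traced bytes are still live right after creating the bytearray with the peak reached along the way. I'm assuming CPython's reference counting here, which frees an unreferenced object immediately.

```python
import tracemalloc

MIB = 1024 * 1024

def traced_growth(keep_reference):
    """Return (live_bytes, peak_bytes) gained while creating a 1 MiB bytearray."""
    tracemalloc.start()
    try:
        before, _ = tracemalloc.get_traced_memory()
        if keep_reference:
            buf = bytearray(MIB)  # bound to a name, so it stays alive
        else:
            bytearray(MIB)        # no reference is kept, so it is freed at once
        current, peak = tracemalloc.get_traced_memory()
        return current - before, peak - before
    finally:
        tracemalloc.stop()

for keep in (True, False):
    live, peak = traced_growth(keep)
    print(f"keep_reference={keep}: live={live} B, peak={peak} B")
```

In both cases the peak grows by at least 1 MiB, but only the version that keeps a reference still has that memory live afterwards, which is what a snapshot reports.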
Unfortunately, this test is flaky. It sometimes doesn't allocate 1 MiB. We tried to change the test to allocate 100× more memory
def test_measure_memory():
    with measure_memory() as m:
        _ = bytearray(100 * 1024 * 1024) # 100×
    assert m['rss_mib'] > 1.0
but that reduced the flakiness rather than getting rid of it.
Holes
The problem is that the allocator may re-use previously freed memory. We can observe this in action if we do another big allocation beforehand to simulate some other tests running before ours:
def test_measure_memory():
    _ = bytearray(1024 * 1024) # make a hole
    with measure_memory() as m:
        _ = bytearray(1024 * 1024) # line 23
    print(m)
This prints
[...]/test.py:23: 1024 KiB
[...]/site-packages/psutil/_pslinux.py:1969: 32 B
[...]/test.py:23: 56 B
{'rss_mib': 0.37890625}
which shows my second bytearray on line 23 triggering a 1 MiB allocation in CPython
but only 388 KiB of RSS usage inside the context manager.
We can make this problem worse by making 2 objects and then triggering garbage collection:
def test_measure_memory():
    _ = bytearray(1024 * 1024)
    del _
    _ = bytearray(1024 * 1024)
    del _
    import gc; gc.collect()
    with measure_memory() as m:
        _ = bytearray(1024 * 1024) # line 27
    print(m)
Now, we get
[...]/psutil/_pslinux.py:1969: 32 B
[...]/test.py:27: 56 B
[...]/psutil/_common.py:788: 120 B
[...]/test.py:27: 1024 KiB
{'rss_mib': 0.0}
So we're still doing 1 MiB according to tracemalloc but we increased RSS by precisely 0.
Garbage collecting the first two bytearrays created a hole that allowed the allocator
to not allocate new memory for the last bytearray.
The flakiness of this test is especially bad because it's more likely to fail
when more tests have run before it (more 1-MiB-sized holes).
mmap
It seems like our fundamental problem is that bytearray is allocated by the CPython object allocator.
Since it's giving us so much trouble, can we bypass it by allocating memory some other way?
We could certainly write/find a C extension that allocates memory directly.
But python has an mmap module which sounds very promising.
import mmap
def test_measure_memory():
    with measure_memory() as mem:
        mm = mmap.mmap(-1, 1024 * 1024,
                       mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
        # copy the first half to the second half,
        # writing to every page in the second half
        mm.move(512 * 1024, 0, 512 * 1024)
        # copy the second half to the first half
        mm.move(0, 512 * 1024, 512 * 1024)
    mm.close()
    print(mem)
Does it work?
{'rss_mib': 1.0}
Cool, we used exactly 1 MiB of RAM! We bypassed the python object allocator and were able to directly allocate memory ourselves!
Let's try making a hole in memory again:
def test_measure_memory():
    _ = bytearray(10 * 1024 * 1024) # hole
    del _
    with measure_memory() as mem:
        mm = mmap.mmap(-1, 1024 * 1024,
                       mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
        mm.move(512 * 1024, 0, 512 * 1024)
        mm.move(0, 512 * 1024, 512 * 1024)
    mm.close()
    print(mem)
Now it outputs
{'rss_mib': 0.875}
which doesn't look good. It turns out this version of the test is still flaky.
Is it holes again?
My original theory for why this was happening was that we were still filling holes in memory.
If we inspect what happened in our program with strace, we see
$ strace -e mmap,munmap python3 test.py
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4b7aa75000
mmap(NULL, 87539, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f4b7aa5f000
mmap(NULL, 921624, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f4b7a97d000
mmap(0x7f4b7a98d000, 483328, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x10000) = 0x7f4b7a98d000
mmap(0x7f4b7aa03000, 368640, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x86000) = 0x7f4b7aa03000
mmap(0x7f4b7aa5d000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xdf000) = 0x7f4b7aa5d000
This continues for a few screens. So there were apparently a ton of mmaps that happened,
even though we only called it once. Why?
By default,
CPython uses the pymalloc allocator.
The arena allocator uses the following functions: [...]
mmap() and munmap() if available
So other objects are allocated via mmap! Also,
It uses memory mappings called “arenas” with a fixed size [...]. It falls back to
PyMem_RawMalloc() [...] for allocations larger than 512 bytes.
PyMem_RawMalloc() is a wrapper for malloc
which is itself an arena allocator:
the GNU C Library
malloc implementation maintains multiple such areas [...]. Each such area is internally referred to as an arena.
It's arena allocators all the way down. But surely, our 1 MiB allocation and the pymalloc arenas are too large to fit in a glibc malloc arena?
The other way of memory allocation is for very large blocks, i.e. much larger than a page. These requests are allocated with
mmap
So, to summarize
- pymalloc allocates arenas with mmap and large objects with glibc malloc
- glibc malloc allocates large objects with mmap
- oh, by the way, glibc malloc extends the main arena with sbrk, but it allocates new arenas with... mmap
The only heap allocations happening in CPython that aren't mmap-ed are small, non-Python objects.
My holes theory was that previous allocations were mmaping over the same address that we were mmaping.
When the original allocations were freed, the kernel didn't actually bother decreasing the RSS.
So when we came along, we didn't use up the full 1 MiB of RSS.
Under this theory, I came up with a neat trick for using 1 MiB of memory.
Reserve a chunk in the address space,
then un-use memory without releasing the reserved chunk in the address space via
MADV_DONTNEED:
The
madvise() system call is used to give advice or directions to the kernel about the address range beginning at address addr and with size length.
I love giving advice or directions to the kernel!
The
advice values listed below allow an application to tell the kernel how it expects to use some mapped or shared memory areas
Perfect, I just mapped a memory area.
MADV_DONTNEED
Do not expect access in the near future. After a successful MADV_DONTNEED operation, the semantics of memory access in the specified region are changed: subsequent accesses of pages in the range will succeed, but will result in [...] zero-fill-on-demand pages for anonymous private mappings. The kernel is free to delay freeing the pages until an appropriate moment. The resident set size (RSS) of the calling process will be immediately reduced however.
Wow, what a coincidence! We just happen to be measuring the RSS! This is a pretty crazy feature but, in conjunction with the previous trick, appears to reliably solve our problem.
def test_measure_memory():
    # 1. allocate 1 MiB of address space somewhere
    with mmap.mmap(-1, 1024 * 1024,
                   mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS) as mm:
        # 2. free the underlying pages but hold onto the address space
        mm.madvise(mmap.MADV_DONTNEED)
        # 3. snapshot our memory usage (measure_memory.__enter__)
        with measure_memory() as mem:
            # 4. use memory
            mm.move(512 * 1024, 0, 512 * 1024)
            mm.move(0, 512 * 1024, 512 * 1024)
        assert mem['rss_mib'] > 0.0
By forcing an allocation at a specific address in step 4, I thought I was avoiding interference from previous allocations.
Unfortunately, it still increases the memory by less than 1 MiB,
so we had to write the check as > 0.0.
But the good news is this version of the test doesn't flake, at least!
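If you want to see the zero-fill-on-demand behavior from the manpage for yourself, here's a minimal sketch. The function name dontneed_zeroes_pages is made up, and I'm assuming Linux with a Python new enough to have mmap.madvise (3.8+). Anywhere else it just returns None.

```python
import mmap
import sys

def dontneed_zeroes_pages(size=1024 * 1024):
    """Fill an anonymous private mapping, drop its pages with MADV_DONTNEED,
    and report whether it reads back as zeros. Returns None off Linux."""
    if not sys.platform.startswith("linux") or not hasattr(mmap, "MADV_DONTNEED"):
        return None
    with mmap.mmap(-1, size, mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS) as mm:
        mm[:] = b"\xff" * size          # touch every page so it becomes resident
        if mm[0] != 0xFF:
            return False
        mm.madvise(mmap.MADV_DONTNEED)  # drop the pages, keep the address range
        return mm[:] == bytes(size)     # reads fault in fresh zero-filled pages

print(dontneed_zeroes_pages())
```

The mapping stays valid after the madvise call. Only its contents are gone, which is why the test above can write to it again in step 4.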
Someone asked me if this test was worth writing... I replied it might be better to delete the test, but it'd be more fun to keep it.
The plot thickens
However, it turns out this theory is incorrect.
mmaping and writing to memory immediately allocates the memory;
unmapping immediately frees the memory.
According to my friend Ricky, what we were seeing was actually a cached RSS value.
psutil.Process.memory_info()
reads /proc/pid/statm.
statm is
known to be inaccurate:
(2) resident set size
(inaccurate; same as VmRSS in /proc/pid/status)
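For reference, here's roughly what reading that field looks like without psutil. This is a sketch, not psutil's actual code: statm_rss_bytes is a name I made up, and the sample line is invented for illustration. The second field is the resident set size in pages, so it has to be multiplied by the page size.

```python
import mmap
import os

def statm_rss_bytes(statm_text, page_size):
    """Convert field (2) of a statm line (resident pages) to bytes."""
    fields = statm_text.split()
    return int(fields[1]) * page_size

# Made-up statm line: size resident shared text lib data dt
sample = "5120 1709 800 1 0 900 0"
print(statm_rss_bytes(sample, 4096))  # 1709 pages * 4096 bytes = 7000064

# On Linux, do the same for the current process.
if os.path.exists("/proc/self/statm"):
    with open("/proc/self/statm") as f:
        print(statm_rss_bytes(f.read(), mmap.PAGESIZE))
```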
OK, let's go read about VmRSS.
Resident set size. Note that the value here is the sum of
RssAnon, RssFile, and RssShmem. This value is inaccurate; see /proc/pid/statm above.
Uh, that's not very helpful. Let's go read some code. Here's a discussion thread about adding "inaccurate" to the docs.
Since
34e55232e59f7b19050267a05ff1226e5cd122a5 (introduced back in v2.6.34), Linux uses per-thread RSS counters to reduce cache contention on the per-mm counters. With a 4K page size, that means that you can end up with the counters off by up to 252KiB per thread.
It then talks about SPLIT_RSS_COUNTING, but that actually has since been
removed:
This patch converts the
rss_stats into percpu_counter to convert the error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2).
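To get a feel for those two formulas, here's a back-of-the-envelope sketch. I'm assuming both margins are counted in pages and that pages are 4 KiB. The patch description doesn't spell out the units, so treat the numbers as rough, and the function names are mine.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages
MIB = 1024 * 1024

def old_margin_mib(nr_threads):
    """Quoted old error margin (nr_threads * 64), read as pages, in MiB."""
    return nr_threads * 64 * PAGE_SIZE / MIB

def new_margin_mib(nr_cpus):
    """Quoted new error margin (about nr_cpus ^ 2), read as pages, in MiB."""
    return nr_cpus ** 2 * PAGE_SIZE / MIB

print(old_margin_mib(1))  # 0.25
for cpus in (4, 8, 128):
    print(cpus, new_margin_mib(cpus))  # 0.0625, 0.25, 64.0
```

Under these assumptions the margin no longer depends on how many threads you have, but it grows quickly with the number of CPUs.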
drgn
In order to see if this is what's happening, we need to get the per-CPU RSS counters.
As far as I know, this isn't exposed in some handy /proc file,
so we'll just have to ask the kernel directly.
Conveniently, drgn helps us do just that!
First, let's tweak our code to slow down.
import os
def test_measure_memory():
    print('pid', os.getpid())
    _ = bytearray(10 * 1024 * 1024)
    del _
    with mmap.mmap(-1, 1024 * 1024, mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS) as mm:
        #mm.madvise(mmap.MADV_DONTNEED)
        with measure_memory() as mem:
            input('press enter to use memory')
            mm.move(512 * 1024, 0, 512 * 1024)
            mm.move(0, 512 * 1024, 512 * 1024)
            input('memory used! press enter to continue')
        print(mem)
Now, we fire up drgn.
$ pip3 install drgn
$ sudo apt install linux-image-amd64-dbg
$ sudo venv/bin/drgn
>>> counters = find_task(610602).mm.rss_stat
(struct percpu_counter [4]){
[...]
OK, we have an array of 4 percpu_counters, just like the last patch said.
It's an array of
these page counters.
We care about MM_ANONPAGES because we're using an anonymous mapping, so we want index 1.
>>> counters[1]
(struct percpu_counter){
.lock = [...]
.count = (s64)1709,
.list = (struct list_head){
.next = (struct list_head *)0xffff9d720bb35048,
.prev = (struct list_head *)0xffff9d720bb35098,
},
.counters = (s32 *)0x3da7f0c20ad0,
}
We already see the cached value that statm/psutil report:
>>> counters[1].count
(s64)1709
How do we get the per-CPU values, though?
Conveniently, when we loaded up drgn, it automatically imported a bunch of helpers, including
percpu_counter_sum.
>>> percpu_counter_sum(counters[1])
1709
Cool, these are the same number! Now, I hit enter to use memory in my python program and check the counters again.
>>> counters[1].count
(s64)1933
>>> percpu_counter_sum(counters[1])
1965
Great, the numbers are different!
These are measured in pages. To convert to bytes, we multiply by my page size of 4096.
So the cached/statm number is 1933 × 4096 = 7.551 MiB
and the sum of the per-CPU counters gives us 1965 × 4096 = 7.676 MiB.
Perhaps more importantly, the difference is (1965 - 1933) × 4096 = 0.125 MiB,
which is exactly the difference between what we were allocating above (1.0 MiB)
and what we were seeing in statm (0.875 MiB)!
The per-CPU counters also explain why the test flaked on our 128-core CI instances way more than on our laptops.
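Here's the page arithmetic from above as a few lines of Python, in case you want to plug in your own counter values. The helper name pages_to_mib is made up, and the 4096-byte page size is the one from my machine.

```python
PAGE_SIZE = 4096  # page size on my machine, in bytes

def pages_to_mib(pages, page_size=PAGE_SIZE):
    """Convert a page count to MiB."""
    return pages * page_size / (1024 * 1024)

cached = 1933  # counters[1].count, what statm reports
summed = 1965  # percpu_counter_sum(counters[1])

print(round(pages_to_mib(cached), 3))  # 7.551
print(round(pages_to_mib(summed), 3))  # 7.676
print(pages_to_mib(summed - cached))   # 0.125
print(1.0 - pages_to_mib(summed - cached))  # 0.875
```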
Accurate RSS
You may be wondering how to get accurate, uncached memory usage.
Going back to the proc(5) manpage,
Some of these values are inaccurate because of a kernel-internal scalability optimization. If accurate values are required, use
/proc/pid/smaps or /proc/pid/smaps_rollup instead, which are much slower but provide accurate, detailed information.
psutil parses that for you here
inside memory_full_info.
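If you'd rather read the file yourself, here's a minimal stdlib-only sketch that pulls the Rss line out of smaps_rollup. It is not how psutil does it. The function name parse_rss_kib and the sample text are made up for illustration, and I'm assuming the usual "Rss: <number> kB" line format.

```python
import os

def parse_rss_kib(smaps_rollup_text):
    """Return the value of the Rss line (in kB) from smaps_rollup text."""
    for line in smaps_rollup_text.splitlines():
        if line.startswith("Rss:"):
            return int(line.split()[1])
    raise ValueError("no Rss line found")

# Made-up sample in the shape of /proc/pid/smaps_rollup
sample = """\
00400000-7ffc0000 ---p 00000000 00:00 0    [rollup]
Rss:                7860 kB
Pss:                5120 kB
"""
print(parse_rss_kib(sample))  # 7860

# On Linux, read the real thing for the current process.
if os.path.exists("/proc/self/smaps_rollup"):
    with open("/proc/self/smaps_rollup") as f:
        print(parse_rss_kib(f.read()), "kB")
```

Expect this to be noticeably slower than reading statm, as the manpage warns, so it's better suited to a test than to a hot path.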
Bonus
If you want to learn actually useful things about Linux memory management
like how to save memory by not garbage collecting, read the bonus section at the end of
my fork(2) blog post.