Python's Default Hash Function
June 03, 2020
In python, you can create your own class and stick instances of it in a set
or use them as keys of a dict
without overriding any dunder methods.
But what actually happens when you do this?
How does python hash your object?
We can trace the CPython object type and find this code
/* bottom 3 or 4 bits are likely to be 0; rotate y by 4 to avoid
excessive hash collisions for dicts and sets */
y = (y >> 4) | (y << (8 * SIZEOF_VOID_P - 4));
in either
https://github.com/python/cpython/blob/2.7/Objects/object.c#L1090-L1092
or https://github.com/python/cpython/blob/3.8/Python/pyhash.c#L136-L138.
y
is a pointer to the object, so we're rotating the address by 4 and using that as the hash.
Why does CPython do this?
Fact check
First, let's confirm that we've even traced the code correctly.
>>> class Thing: pass
...
>>> thing = Thing()
Now we need a pointer to our object. Conveniently, https://docs.python.org/3/library/functions#id
CPython implementation detail: This is the address of the object in memory.
So
>>> y = id(thing)
>>> hex(y)
'0x7f9c44ede0d0'
OK, we have our pointer (and hey, like the comment says, the bottom 4 bits are 0; remember that a hexit represents 4 bits). Now let's see what our object hashes to.
>>> hash(thing)
8769321754125
>>> hex(hash(thing))
'0x7f9c44ede0d'
Yep, that looks like the bottom 4 bits have rotated to the top
(this could be written 0x07f9c44ede0d
, but hex()
drops leading 0s).
If you want to perform this calculation yourself rather than eyeballing the hex,
remember that int
s are not a fixed size,
so you need to zero out extraneous high bits.
>>> hex((y >> 4) | (y << (8*8-4)) & 0xffffffff)
'0x7f9c44ede0d'
OK, so we've found the right code for hashing objects that don't override __hash__
.
Now, why does CPython do this?
Hash tables
dict
s and set
s are basically hash tables.
A hash table is basically an array of its entries.
You allocate some number of buckets to store stuff in.
Let's say you want to store 3 things in a dict
so you allocate,
like, 5 buckets.
thing
hashes to 8769321754125, so where do you store it?
At index 8769321754125 % 5 = 0
, of course.
You need some plan for collision resolution,
but that's beyond the scope of this blog post.
For now, let's focus on the fact that modulo effectively leaves only the low bits.
To build a good hash table, you want to have as few collisions as possible. So a good hash function is both different for different keys (and the memory addresses of objects are completely unique, so we're covered there) and, pre-modulo, results in different low bits. This is why the comment says
/* bottom 3 or 4 bits are likely to be 0; rotate y by 4 to avoid
excessive hash collisions for dicts and sets */
Memory alignment
Are the addresses of objects likely to have the bottom 3 or 4 bits be 0?
>>> hex(id(Thing()))
'0x7f9c44ede290'
>>> hex(id(Thing()))
'0x7f9c44ede210'
>>> hex(id(Thing()))
'0x7f9c44ede2d0'
>>> hex(id(Thing()))
'0x7f9c44ede250'
>>> hex(id(Thing()))
'0x7f9c44ede290'
>>> hex(id(Thing()))
'0x7f9c44ede210'
Yes. Why?
In a word, "alignment".
In slightly more words: 64-bit CPUs are optimized for accessing 8 bytes of memory at a time,
aligned to an 8 byte boundary (0-7, 8-15, 16-23, ...).
Therefore, many tools conspire to make sure that my data starts on an 8 byte boundary
(in this case, the most notable are probably the C compiler and the malloc
implementation).
So the address of an object is likely to be 0 modulo 8, which is another way of saying the bottom 3 bits are 0.
But why are the bottom 4 always 0 on my machine? Python aligns objects to 16-byte boundaries on 64-bit machines: https://github.com/python/cpython/blob/3.8/Objects/obmalloc.c#L846. Why? Because https://github.com/python/cpython/commit/f0be4bbb which links to this bug which says
ubsan complains about unaligned access when structs include "long double". [...]
This is because (on x86 anyway), long double is 16-bytes long and requires that alignment, but obmalloc only gives a 8-byte alignment. (glibc malloc() gives 16-byte alignment.)
Unrelated fun with ctypes
import ctypes
class Five:
def __init__(self):
self.number = 5
five = Five()
pyobject = ctypes.cast(id(five)+16, ctypes.POINTER(ctypes.py_object))
pyobject.contents.value['number'] = 7
print(five.number)
Just in case you ever find yourself in a position where you have the id of something in CPython but not a reference or you are working with an object defined in a C extension...