Blog

Python's Default Hash Function

June 03, 2020

In python, you can create your own class and stick instances of it in a set or use them as keys of a dict without overriding any dunder methods. But what actually happens when you do this? How does python hash your object?

We can trace the CPython object type and find this code

/* bottom 3 or 4 bits are likely to be 0; rotate y by 4 to avoid
   excessive hash collisions for dicts and sets */
y = (y >> 4) | (y << (8 * SIZEOF_VOID_P - 4));

in either https://github.com/python/cpython/blob/2.7/Objects/object.c#L1090-L1092 or https://github.com/python/cpython/blob/3.8/Python/pyhash.c#L136-L138. y is a pointer to the object, so we're rotating the address by 4 and using that as the hash. Why does CPython do this?

Fact check

First, let's confirm that we've even traced the code correctly.

>>> class Thing: pass
... 
>>> thing = Thing()

Now we need a pointer to our object. Conveniently, https://docs.python.org/3/library/functions#id

CPython implementation detail: This is the address of the object in memory.

>>> y = id(thing)
>>> hex(y)
'0x7f9c44ede0d0'

OK, we have our pointer (and hey, like the comment says, the bottom 4 bits are 0; remember that a hexit represents 4 bits). Now let's see what our object hashes to.

>>> hash(thing)
8769321754125
>>> hex(hash(thing))
'0x7f9c44ede0d'

Yep, that looks like the bottom 4 bits have rotated to the top (this could be written 0x07f9c44ede0d, but hex() drops leading 0s). If you want to perform this calculation yourself rather than eyeballing the hex, remember that ints are not a fixed size, so you need to zero out extraneous high bits.

>>> hex((y >> 4) | (y << (8*8-4)) & 0xffffffff)
'0x7f9c44ede0d'

OK, so we've found the right code for hashing objects that don't override __hash__. Now, why does CPython do this?

Hash tables

dicts and sets are basically hash tables. A hash table is basically an array of its entries. You allocate some number of buckets to store stuff in. Let's say you want to store 3 things in a dict so you allocate, like, 5 buckets.

thing hashes to 8769321754125, so where do you store it? At index 8769321754125 % 5 = 0, of course. You need some plan for collision resolution, but that's beyond the scope of this blog post. For now, let's focus on the fact that modulo effectively leaves only the low bits.

To build a good hash table, you want to have as few collisions as possible. So a good hash function is both different for different keys (and the memory addresses of objects are completely unique, so we're covered there) and, pre-modulo, results in different low bits. This is why the comment says

/* bottom 3 or 4 bits are likely to be 0; rotate y by 4 to avoid
   excessive hash collisions for dicts and sets */

Memory alignment

Are the addresses of objects likely to have the bottom 3 or 4 bits be 0?

>>> hex(id(Thing()))
'0x7f9c44ede290'
>>> hex(id(Thing()))
'0x7f9c44ede210'
>>> hex(id(Thing()))
'0x7f9c44ede2d0'
>>> hex(id(Thing()))
'0x7f9c44ede250'
>>> hex(id(Thing()))
'0x7f9c44ede290'
>>> hex(id(Thing()))
'0x7f9c44ede210'

Yes. Why? In a word, "alignment". In slightly more words: 64-bit CPUs are optimized for accessing 8 bytes of memory at a time, aligned to an 8 byte boundary (0-7, 8-15, 16-23, ...). Therefore, many tools conspire to make sure that my data starts on an 8 byte boundary (in this case, the most notable are probably the C compiler and the malloc implementation).

So the address of an object is likely to be 0 modulo 8, which is another way of saying the bottom 3 bits are 0.

But why are the bottom 4 always 0 on my machine? Python aligns objects to 16-byte boundaries on 64-bit machines: https://github.com/python/cpython/blob/3.8/Objects/obmalloc.c#L846. Why? Because https://github.com/python/cpython/commit/f0be4bbb which links to this bug which says

ubsan complains about unaligned access when structs include "long double". [...]

This is because (on x86 anyway), long double is 16-bytes long and requires that alignment, but obmalloc only gives a 8-byte alignment. (glibc malloc() gives 16-byte alignment.)

Unrelated fun with ctypes

import ctypes

class Five:
    def __init__(self):
        self.number = 5

five = Five()
pyobject = ctypes.cast(id(five)+16, ctypes.POINTER(ctypes.py_object))
pyobject.contents.value['number'] = 7
print(five.number)

Just in case you ever find yourself in a position where you have the id of something in CPython but not a reference or you are working with an object defined in a C extension...