Concurrency Through the Ages

June 01, 2017

How do I make my computer do lots of things fast (especially if the computer runs Linux and the things are being done in Python)?

I/O- or CPU-bound?

First, what is the work that is being done? Does your program spend 99+% of its time waiting on I/O (generally disk or network)? If I gave you a drive or network connection that was twice as fast, would your program finish in about half the time? (Is your program a webserver that needs to handle lots of incoming connections?) If so, it is I/O-bound.

Does your program instead spend 99+% of its time using the CPU to do computation? If I gave you a CPU that was twice as fast, would your program finish in about half the time? (Is your program calculating the average of a bunch of numbers or running a bunch of unit tests for a bunch of code?) If so, it is CPU-bound (and maybe you should think about distributing it across multiple machines instead of trying to achieve parallelism through any of the methods discussed below).

Do you believe your program is equally I/O- and CPU-bound? Well then you are almost certainly wrong.

Blocking

When a process wants to do computation, the kernel runs it for a while and then interrupts it occasionally to see if there are other things that need doing (scheduling).

When a process wants to read/write some data from a disk or the network, it often needs to wait a while for that to happen. The "default" way of achieving this is by making a blocking call, so-named because the kernel's scheduler blocks your process from running until your I/O is done.

fork

The original (ok, just the oldest I care about) way of achieving concurrency: make a new process with fork. The new process gets a complete copy of the memory of the old process that it can do whatever it wants to, so you "never" need to worry about race conditions. On the other hand, the new process needs a complete copy of the memory of the old process, so this is expensive both in terms of CPU cycles and RAM.

In Python, a forking webserver might look like this:

while True:
    conn, _ = sock.accept()
    pid = os.fork()
    if pid == 0: # child process
        request = conn.recv(4096)
        ... # process the request
        conn.sendall(response)
        sys.exit()

When nothing is happening, the parent is blocked on accept (and there are no child processes). When a request comes in, the parent spawns the child and immediately goes back to accept. The child makes a potentially-blocking call to recv and a blocking call to send.

The kernel schedules each process to run when it needs to. If handling each request is CPU-intensive, you will be able to handle as many requests simultaneously as you have cores. (On the other hand, you will only be able to handle as many requests as you have cores, so see again that bit about distributing instead of doing this.) If your work is not CPU-intensive, you can handle more requests than that... but not as much as you might think. fork uses a lot of CPU cycles and RAM isn't free. This method gobbles both up quickly. (You can save some time if you prefork, but that only partially solves half your problems.)

clone (A.K.A. kernel threads)

What if we solved the RAM problem by just... not making a copy of the memory? Tell the kernel to schedule a separate "thread" of execution within the same process. This works quite well... sometimes. You do run into more race conditions because all of the threads share the same memory.

A threaded webserver might look like this:

while True:
    conn, _ = sock.accept()
    _thread.start_new_thread(handle_conn, (conn,))

def handle_conn(conn):
    request = conn.recv(4096)
    ... # process the request
    conn.sendall(response)
    conn.close()

The kernel schedules each thread as before. clone isn't free, so sometimes you make the threads beforehand ("thread pool", not "pre-clone"). If your work is CPU-intensive, things will run on as many cores as you want in theory. Yes, the Python global interpreter lock will bite you and prevent you from achieving parallelism. If your work is not CPU-bound, threads are a perfectly fine way of achieving concurrency, even in the presence of the GIL (threads blocked by the kernel scheduler on I/O aren't going to run whether the GIL lets them or not).

The kernel still needs to store some bookkeeping information about each thread (sometimes called the "tcb"; see here and apologies in advance for Linux calling threads "processes", calling processes "processes", and calling everything "tasks") and each thread needs its own stack, so memory usage is better but not necessarily great. And, of course, the race condition/shared memory thing means all of your code and all of your libraries and all of their dependencies and so on must be thread-safe.

When one thread blocks and another is ready to run, the kernel switches execution from one to the other. It doesn't know anything about what is running, so it does this very safely: push all of the registers on the stack, switch to the new thread's stack, and pop all of the registers off. Context switching is generally "pretty fast", but in an I/O-bound program with very high-concurrency (static file server), the amount of time spent switching can become significant. You might find that your I/O-bound worker threads can't be run fast enough, making the overall process effectively CPU-bound.

poll/select/epoll (A.K.A. async I/O)

What if we solved the context switching problem by scheduling our own workers? We could ask the kernel when each socket/file had data to be read (or was ready for writing) and manage our own threads of execution.

An async webserver hopefully doesn't look like this:

sel = selectors.DefaultSelector()

def accept(sock):
    conn, _ = sock.accept()
    sel.register(conn, selectors.EVENT_READ, handle_conn)

def handle_conn(conn):
    request = conn.recv(4096)
    ... # process the request
    conn.sendall(response) # cross fingers
    sel.unregister(conn)
    conn.close()

sock.setblocking(False)
sel.register(sock, selectors.EVENT_READ, accept)

while True:
    for key, _ in sel.select():
        callback = key.data
        callback(key.fileobj)

We are crossing our fingers when sending the response because we are hoping that send doesn't block. It usually doesn't, but if there are 10 clients connected and 1 of them receives slowly, send will block, causing the kernel's scheduler to block the entire program. The other 9 clients will not be able to continue until that one finishes receiving what we send.

But wait, I forgot to conn.setblocking(False)! If we had done that, instead of blocking, send would error (EAGAIN). The correct way to do this is to unroll the sendall loop and repeatedly register a callback for selectors.EVENT_WRITE until we've sent all the data. This also means we need to keep track of what each connection has sent so far somewhere. Implementing timeouts, handling disconnects, etc. are all much harder here than in the above 2 versions.

As you can see, this is somewhat more code and quite a bit more complex. There are libraries for dealing with this complexity (for C, libevent, libev, libuv; for Python, twisted, tornado, gevent, eventlet, asyncio, curio) by modeling the interactions as events and running the event loop above for you, but none of them make the complexity disappear. Let's say that you want to make a database query while processing the request. Your database driver can't block or you would lose all concurrency, so your DB driver must now be hooked into your event loop. If you use an HTTP client, redis client, memcache client, those must be hooked in too. If you use an Elasticsearch client, the HTTP client it uses must be hooked in too. And if you accidentally use a library that uses a library that uses a library that does I/O outside of the event loop, you may not notice that it's blocking until after you've run it at high concurrency, since with a single connected client the performance is the same.

All this is to say that it is often easier to make your code thread-safe than it is to hook every last bit of your I/O into an event loop. I'm framing this as an either/or decision because in an async cooperative yielding environment, you only run other code when you would block, so race conditions are less common. On the other hand, the majority of libraries are written in a pre-async world and re-implementing a Postgres driver is no fun. Debugging a Postgres driver is even less fun and deploying a brand new Postgres driver is just a different kind of finger-crossing.

Also, it's probably obvious, but CPU-bound workloads don't do well here. You can have multiple event loops running on different cores in different threads/processes, but you need a really compelling reason to take on that kind of complexity.

Rant

I linked a bunch of async frameworks/event loops above and I consider all but 2 of these to be mistakes. Twisted gets a pass because it came first. People wrote a bunch of libraries for it like twisted.web.client.Agent for HTTP and txpostgres for Postgres. Then came tornado, because callback hell is hell and nobody wanted that. But we couldn't use the existing async libraries because they only worked with Twisted, so people wrote... a few libraries for it, like tornado.httpclient for HTTP and momoko for Postgres.

Then, Python 3 introduced async/await syntax and rolled their own event loop called asyncio. And people wrote some libraries against it too, like aiohttp for HTTP and aiopg for Postgres. But the asyncio API is kind of a tirefire so someone made a better one called curio and it has asks for HTTP and maybe one day it will have a Postgres driver too.

And maybe one day, any of these libraries will develop enough of an ecosystem to be well-supported, feature-rich, and battle-tested. But the longest player at this game (Twisted) has been at it for 14 years (2002) and it's only meeting those requirements if you believe really hard.

But what if we just didn't have this library/ecosystem problem? What if we hooked all of the existing non-async, feature-rich, battle-tested libraries into our event loop? What if we monkey-patched socket.recv and the like to register with our event loop (using stack manipulation magic provided by greenlet), thus enabling libraries written in a pre-async world to work with ours and provide an apparently blocking API at the same time?

Then, our async webserver would look like this:

while True:
    conn, _ = sock.accept()
    eventlet.spawn(handle_conn, conn)

def handle_conn(conn):
    request = conn.recv(4096)
    ... # process the request
    conn.sendall(response)
    conn.close()

(Actually, it could be the exact same if you monkey-patched the _thread and socket libraries.) You'll notice this is basically the same as our threaded version. You'll also notice that we recv and send in the same function, as if we were blocking, without registering a twisted/tornado callback or using a tornado co-routine or messing with an unending stack of asyncio/curio awaits.

This is what eventlet and gevent do. They provide all the concurrency benefits of async I/O and all of the simlpicity of blocking code. What are the problems?

  1. You must monkey-patch before importing anything.
  2. Some C extensions do I/O and can't be monkey-patched.
  3. Some people object to monkey-patching.

#1 is a small price to pay. Sure, things break if you forget to monkey-patch, but unlike in the other accidentally-blocked-the-whole-process cases, here you just eventlet.monkey_patch() instead of, say, debugging to figure out which library is blocking and picking a whole new library and rewriting your integration.

#2 is a real concern. If you want async and you absolutely need a particular C extension, either that module works with some other event loop or you are probably not going to actually have async. If you just want concurrency and the flexibility of using many libraries, you want threading. (Fun fact: psycopg2 is a C extension but supports these anyway.)

#3 can be expanded as "monkey-patching is bad because it makes your code harder to reason about and/or introduces subtle bugs in your code". But it is not particularly easy to reason about async in general and hiding that away makes the overall complexity quite a bit lower. We trade away having to think about callbacks/await all over our application code for complexity contained in eventlet.

As for bugs resulting in doing such an invasive monkey-patch, eventlet has been around for 11 years (2006). Second Life made heavy use of it and while it's true the business and product were not terribly successful, it did not fail for technical reasons. We used eventlet from nearly the start at Mixpanel and it meant that we both had async with minimal effort, could use any (pure-python) libraries we wanted, and most developers didn't have to think about or understand async I/O.

If you are writing networked code (it's 2017, who isn't?) and

  1. you want a language with exceptions like Python or Ruby
  2. you want a language with a healthy ecosystem like Python or Go
  3. you anticipate high concurrency

then you probably want eventlet.

Plug

If this sort of thing interests you, take a look at raylu.net/systems.