I want to address a controversy that has gripped the Rust community for the past year or so: the choice by the prominent async “runtimes” to default to multi-threaded executors that perform work-stealing to balance work dynamically among their many tasks. Some Rust users are unhappy with this decision, so unhappy that they use language I would characterize as melodramatic:

The Original Sin of Rust async programming is making it multi-threaded by default. If premature optimization is the root of all evil, this is the mother of all premature optimizations, and it curses all your code with the unholy Send + 'static, or worse yet Send + Sync + 'static, which just kills all the joy of actually writing Rust.

It’s always off-putting to me that claims written this way can be taken seriously as a technical criticism, but our industry is rather unserious.

What these people advocate instead is an alternative architecture that they call “thread-per-core.” They promise that this architecture will be simultaneously more performant and easier to implement. In my view, the truth is that it may be one or the other, but not both.

(Side note: Some people prefer instead just running single-threaded servers, claiming that they are “IO bound” anyway. What they mean by IO bound is actually that their system doesn’t do enough work to saturate a single core when written in Rust: if that’s the case, of course write a single-threaded system. We are assuming here that you want to write a system that uses more than one core of CPU time.)


One of the biggest problems with “thread-per-core” is its name. All of the multi-threaded executors that users are railing against are also thread-per-core, in the sense that they create an OS thread per core and then schedule a variable number of tasks (expected to be far greater than the number of cores) over those threads. As Pekka Enberg tweeted in response to a comment I made about thread per core:

Thread per core combines three big ideas: (1) concurrency should be handled in userspace instead of using expensive kernel threads, (2) I/O should be asynchronous to avoid blocking per-core threads, and (3) data is partitioned between CPU cores to eliminate synchronization cost and data movement between CPU caches. It’s hard to build high throughput systems without (1) and (2), but (3) is probably only needed on really large multicore machines.

Enberg’s paper on performance, which is called “The Impact of Thread-Per-Core Architecture on Application Tail Latency” (and which I will return to in a moment), is the origin of the use of the term “thread-per-core” in the Rust community. His understanding of the definition of thread-per-core is probably relevant here. He enumerates three different features of a thread-per-core architecture, of which he says only two are absolutely required for high throughput. This is helpful, because the dispute is really only about the third point, not the first two; if you are using async Rust, you are meeting both of those requirements.

The distinction being made is really between two optimizations you can make once you have a thread-per-core architecture, and which are in tension: work-stealing tasks between your threads and sharing as little state as possible between them.


The point of work-stealing is to improve tail latency by ensuring that every thread always has work to do.

A problem that emerges in real systems is that different tasks end up requiring different amounts of work. For example, one HTTP request may require far more work to serve than another HTTP request. As a result, even if you try to balance work up front among your different threads, they can each end up performing different amounts of work because of unpredictable differences between the tasks.

Under maximum load, this means that some threads will be scheduled more work than they can perform, while other threads will sit idle. The degree to which this is a problem depends on the degree to which the amount of work performed by different tasks differs. Work-stealing is a mitigation to this problem: threads with nothing to do “steal” work from the other threads that have too much work, so that they do not sit idle. tokio, async-std, and smol all implement work-stealing with the goal of reducing tail latency and improving CPU utilization.
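As a toy sketch of the idea, assuming mutex-guarded queues purely for brevity (real executors like tokio use lock-free deques and far more sophisticated scheduling), an idle worker can pop work from the back of a sibling’s queue instead of sitting idle:

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    const WORKERS: usize = 2;
    // One queue per worker thread.
    let queues: Arc<Vec<Mutex<VecDeque<u64>>>> =
        Arc::new((0..WORKERS).map(|_| Mutex::new(VecDeque::new())).collect());

    // Deliberately imbalanced initial assignment: all tasks on queue 0.
    for task in 0..100u64 {
        queues[0].lock().unwrap().push_back(task);
    }

    let done = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..WORKERS)
        .map(|id| {
            let queues = Arc::clone(&queues);
            let done = Arc::clone(&done);
            thread::spawn(move || loop {
                // Try the local queue first; the guard drops at the semicolon,
                // so we never hold two queue locks at once.
                let local = queues[id].lock().unwrap().pop_front();
                // If the local queue is empty, steal from the back of another.
                let task = local.or_else(|| {
                    (0..WORKERS)
                        .filter(|&other| other != id)
                        .find_map(|other| queues[other].lock().unwrap().pop_back())
                });
                match task {
                    Some(_work) => {
                        done.fetch_add(1, Ordering::SeqCst);
                    }
                    None => break, // no work anywhere; in this toy, we are finished
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(done.load(Ordering::SeqCst), 100);
    println!("completed = {}", done.load(Ordering::SeqCst));
}
```

Without the stealing branch, worker 1 would exit immediately while worker 0 processed everything; with it, both workers stay busy despite the lopsided initial assignment.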

The problem with work-stealing is that it means a task can run on one thread, pause, and then be started again on another thread: that’s what it means for the work to be stolen. This means that any state used across a yield point in that task needs to be thread-safe. This requirement appears in Rust’s APIs as futures needing to be Send, which can be difficult to ensure for developers who don’t have a clear picture of where their system’s state lives. This is why work-stealing is said to be “harder.”
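A minimal way to see the Send requirement using only the standard library: moving state to another thread (as a work-stealing executor may do with a paused task) compiles only when that state is Send. Arc works; Rc does not:

```rust
use std::rc::Rc;
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Arc<Mutex<_>> is Send, so it can move to another thread.
    let shared = Arc::new(Mutex::new(0u32));
    let handle = {
        let shared = Arc::clone(&shared);
        thread::spawn(move || {
            *shared.lock().unwrap() += 1;
        })
    };
    handle.join().unwrap();
    assert_eq!(*shared.lock().unwrap(), 1);

    // By contrast, this would fail to compile, because Rc is !Send:
    // let local = Rc::new(0u32);
    // thread::spawn(move || drop(local));
    let _local = Rc::new(0u32); // fine, as long as it never crosses threads

    println!("count = {}", *shared.lock().unwrap());
}
```

The same distinction is what the Send bound on spawned futures enforces: any value held across an .await might travel to a different thread, so it must be safe to send there.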

At the same time, if state is moved from one thread to another, this introduces synchronization costs and cache misses, violating the principles of a “share-nothing” architecture, in which each CPU has exclusive access to the state it operates on. This is why work-stealing is said to be “slower.”


The point of share-nothing is to improve tail latency by keeping data in the faster caches that belong to a single CPU core, rather than the slower caches shared by multiple cores.

I want to return to Enberg’s paper, which demonstrates the performance improvements of a share-nothing architecture over a shared-state architecture by benchmarking a new key-value store (which is share-nothing) against memcached (which is shared-state). Enberg shows substantial improvements in tail latency between the two architectures. I like this paper a lot, but I think the way it has been deployed in the Rust community as a soundbite (“71% performance improvement!”) is shallow and unhelpful.

To achieve a share-nothing architecture, Enberg’s key-value store partitions the keyspace over the different threads using a hash function, and partitions incoming TCP connections over the threads using SO_REUSEPORT. Then, it routes requests from the thread managing the connection to the thread managing the relevant section of the keyspace using message passing channels. In contrast, in memcached all of the threads share ownership of the keyspace, which is partitioned, and each partition is protected by a mutex.
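A sketch of that routing, using OS threads and std::sync::mpsc in place of async tasks and SO_REUSEPORT (the Request type and shard count here are invented for illustration): each shard thread exclusively owns its slice of the keyspace, and other threads reach it only through a channel.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::mpsc;
use std::thread;

enum Request {
    Set(String, String),
    Get(String, mpsc::Sender<Option<String>>),
}

fn main() {
    const SHARDS: usize = 4;
    // One channel per shard; each shard thread exclusively owns its map.
    let senders: Vec<_> = (0..SHARDS)
        .map(|_| {
            let (tx, rx) = mpsc::channel::<Request>();
            thread::spawn(move || {
                let mut map = HashMap::new();
                for req in rx {
                    match req {
                        Request::Set(k, v) => {
                            map.insert(k, v);
                        }
                        Request::Get(k, reply) => {
                            let _ = reply.send(map.get(&k).cloned());
                        }
                    }
                }
            });
            tx
        })
        .collect();

    // Route a key to a shard by hashing, as the keyspace is partitioned.
    let shard_of = |key: &str| {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        (h.finish() as usize) % SHARDS
    };

    senders[shard_of("hello")]
        .send(Request::Set("hello".into(), "world".into()))
        .unwrap();
    let (reply_tx, reply_rx) = mpsc::channel();
    senders[shard_of("hello")]
        .send(Request::Get("hello".into(), reply_tx))
        .unwrap();
    assert_eq!(reply_rx.recv().unwrap(), Some("world".into()));
    println!("ok");
}
```

Note that no mutex appears anywhere: each map is touched by exactly one thread, which is the whole point of the architecture.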

Enberg’s paper shows that using channels over using mutexes can achieve lower tail latency. This is presumably because there are fewer cache misses, as each partition, which is accessed over and over again, stays in only one core’s cache. However, I’m not at all convinced that Enberg’s architecture is dramatically easier to implement than memcached’s. Enberg’s goal is to make use of advanced kernel features and a carefully planned architecture to avoid data movement; it’s hard for me to believe this is easier than wrapping data inside a mutex.
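The memcached-style alternative looks like this sketch: every thread can touch every partition, and a mutex per partition serializes access (again, the shard count and names are illustrative, not taken from memcached’s actual implementation):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::{Arc, Mutex};
use std::thread;

fn shard_of(key: &str, shards: usize) -> usize {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    (h.finish() as usize) % shards
}

fn main() {
    const SHARDS: usize = 4;
    // Shared state: any thread may lock any partition.
    let shards: Arc<Vec<Mutex<HashMap<String, String>>>> =
        Arc::new((0..SHARDS).map(|_| Mutex::new(HashMap::new())).collect());

    let handles: Vec<_> = (0..2)
        .map(|i| {
            let shards = Arc::clone(&shards);
            thread::spawn(move || {
                let key = format!("key{i}");
                let idx = shard_of(&key, SHARDS);
                // Brief critical section: lock, mutate, unlock.
                shards[idx].lock().unwrap().insert(key, "value".into());
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }

    let total: usize = shards.iter().map(|s| s.lock().unwrap().len()).sum();
    assert_eq!(total, 2);
    println!("total = {total}");
}
```

This version is arguably the simpler of the two to write, which is the crux of my skepticism: the share-nothing design buys its cache locality with routing machinery, not with less code.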

A key-value store is pretty much the ideal case for a share-nothing architecture, because it is fairly trivial to partition the state of the application among different threads. But if your application is more complicated, and requires mutating state in multiple partitions in a transactional or atomic manner, this requires a lot more attention to implement correctly. There’s a strong analogy between the advocates of a share-nothing architecture and the hype for eventually consistent databases over databases that enforced serializability ten years ago. Yes, this can be more performant, but at the expense of requiring careful consideration to avoid bugs that result from data inconsistency.

It’s also important to note that neither Enberg’s implementation nor memcached use work-stealing. This makes it difficult to relate the core performance claims of Enberg’s paper to Rust’s work-stealing architectures. One wonders what the results would be if work-stealing were simply added to both Enberg’s architecture and memcached’s. In Enberg’s it would increase data movement somewhat, but possibly in a manner which still maximizes CPU utilization. I can’t imagine it would do anything but help memcached.

Enberg has carefully designed the implementation in the paper to try to evenly distribute work in advance, using a balanced partition of the keyspace and SO_REUSEPORT. Despite this, there are several sources of dynamic imbalance that would appear in practice:

  • Hot keys will receive more reads and writes, which causes the thread managing their keyspace to receive more work.
  • Some connections will perform more requests than others, which causes the thread managing those connections to receive more work.

My understanding of the paper is that the benchmarking framework did not replicate these conditions, which would appear in the real world: each connection performs a consistent amount of work, operating on random keys, so it avoids these sources of imbalance. I wonder what benchmarks that include work-stealing would show if these kinds of dynamic imbalance were tested.
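The hot-key effect is easy to see in a toy simulation, assuming an invented traffic mix in which a single key receives a third of all requests: the shard that happens to own that key ends up with far more than its fair share of the load, no matter how evenly the keyspace itself was partitioned.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn shard_of(key: &str, shards: usize) -> usize {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    (h.finish() as usize) % shards
}

fn main() {
    const SHARDS: usize = 4;
    let mut load = [0usize; SHARDS];
    // 10,000 requests: every third one hits a single hot key, the rest
    // are spread over 1,000 distinct keys (all numbers are illustrative).
    for i in 0..10_000 {
        let key = if i % 3 == 0 {
            "hot-key".to_string()
        } else {
            format!("key-{}", i % 1_000)
        };
        load[shard_of(&key, SHARDS)] += 1;
    }
    let max = *load.iter().max().unwrap();
    let mean = 10_000 / SHARDS;
    // The shard that owns the hot key does far more than its share.
    assert!(max > mean + mean / 4);
    println!("per-shard load: {load:?}");
}
```

With a static partition there is no mechanism to relieve that shard; a work-stealing scheduler is precisely such a mechanism.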

One can imagine other ways to architect a share-nothing system that may mitigate these forms of imbalance (such as caching hot keys on additional partitions). And some form of work-stealing may be one such mitigation, even if some tasks stay pinned to certain cores to avoid moving their state.

No one would dispute that carefully architecting your system to avoid moving data between CPU caches will achieve better performance than not doing that, but I have a hard time believing that someone whose biggest complaint is adding Send bounds to some generics is engaged in that kind of engineering. If you’re going to be using shared state anyway, it’s hard to imagine that work-stealing doesn’t improve your CPU utilization under load.