The Waker API II: waking across threads

In the previous post, I provided a lot of background on what the waker API is trying to solve. Toward the end, I touched on one of the tricky problems the waker API has: how do we handle thread safety for the dynamic Waker type? In this post, I want to look at that in greater detail: what we’ve been doing so far, and what I think we should do.

Restating the problem

The goal of this portion of the API is to ensure we can support all of the kinds of waker implementations that are necessary. In particular, we want to be able to support implementations that have special behavior when called from the same thread the waker was originally constructed on. There are two variations on this:

  • The more common variation is to have an optimization specific to waking from the original thread, while still supporting wakes from other threads.
  • A more niche use case is to only support waking from the same thread. In this implementation, the executor is designed for programs that use no multithreading at all, and it’s tightly coupled to a particular reactor design.

We’ve gone through a couple of iterations on this API. The design currently implemented on nightly has two waker types: Waker and LocalWaker. The difference between them is that the latter is not Send or Sync, and will call a specialized wake_local function when it is woken, instead of the default wake function. However, you can always convert a LocalWaker into a Waker using the into_waker method.
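
To make that shape concrete, here is a rough sketch of how the two types relate. This is not the actual standard library definition (the real types are built on the unsafe UnsafeWake/RawWaker machinery mentioned later in this post); the Wake trait and the PhantomData trick below are purely illustrative:

```rust
use std::marker::PhantomData;
use std::rc::Rc;
use std::sync::Arc;

// Hypothetical trait standing in for the real waker machinery.
trait Wake {
    fn wake(&self);
    // Same-thread specialization; by default it is just an ordinary wake.
    fn wake_local(&self) {
        self.wake();
    }
}

// Thread-safe handle: may be cloned, sent, and woken from any thread.
struct Waker {
    inner: Arc<dyn Wake + Send + Sync>,
}

impl Waker {
    fn wake(&self) {
        self.inner.wake();
    }
}

// Wraps the same kind of object, but is !Send and !Sync (the PhantomData of
// an Rc opts it out of both), so it never leaves its original thread and can
// take the specialized wake_local path.
struct LocalWaker {
    inner: Arc<dyn Wake + Send + Sync>,
    _not_thread_safe: PhantomData<Rc<()>>,
}

impl LocalWaker {
    fn wake(&self) {
        self.inner.wake_local();
    }

    // You can always give up the specialization and get a thread-safe Waker.
    fn into_waker(self) -> Waker {
        Waker { inner: self.inner }
    }
}
```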

This is perfectly designed to support the first use case I described above, but the second is a bit trickier. As I outlined in my previous blog post, there are three ways to implement a waker. One is only suitable for no-std embedded environments and isn’t relevant here, so I’ll reiterate the other two (both strategies are sketched in code after this list):

  • In the first, the waker is a TaskId which is used to identify the task to be woken.
  • In the second, the waker is a reference counted pointer to the task itself, which is then put back into the queue of tasks to be woken next.
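
Roughly, the two strategies look like this. This is a hypothetical sketch, not any particular executor’s real implementation; TaskId, IdWaker, ArcWaker, and the mutex-protected ready queue are made-up stand-ins:

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

// Strategy 1: the waker is just an ID naming the task to re-enqueue.
#[derive(Clone, Copy)]
struct TaskId(usize);

struct IdWaker {
    id: TaskId,
    // Shared queue of task IDs that are ready to be polled again.
    ready: Arc<Mutex<VecDeque<TaskId>>>,
}

impl IdWaker {
    fn wake(&self) {
        self.ready.lock().unwrap().push_back(self.id);
    }
}

// Strategy 2: the waker is a reference-counted pointer to the task itself,
// which gets pushed back onto the run queue when woken.
struct Task {
    // The future and its bookkeeping would live here.
}

struct ArcWaker {
    task: Arc<Task>,
    ready: Arc<Mutex<VecDeque<Arc<Task>>>>,
}

impl ArcWaker {
    fn wake(&self) {
        self.ready.lock().unwrap().push_back(self.task.clone());
    }
}
```

Note that in the second strategy, cloning or dropping the waker touches the task’s reference count, which is exactly what makes the thread-safety question below matter.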

The API I just described does not support the second case using a non-atomic Rc. This is because you could convert the LocalWaker into a Waker, move it to another thread, and clone or drop it there, introducing a data race on access to the reference count.

For that reason, the RFC currently proposes to change the API: getting rid of wake_local and using a different strategy instead. In this strategy, into_waker becomes a hook that the implementation can use either to change its wake implementation (when it merely has a same-thread optimization) or to panic (when it is not meant to be called from other threads).
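
To illustrate the idea for the local-only case (a hypothetical sketch; LocalTask, RcWaker, and SendWaker are made-up names, not the RFC’s actual types), the conversion itself becomes the point of failure:

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::rc::Rc;

// Hypothetical task type for a strictly single-threaded executor.
struct LocalTask {
    // The future and its state would live here.
}

// Stand-in for whatever thread-safe waker type the conversion would produce.
struct SendWaker;

// Local-only waker: it holds Rc handles, so it must never cross threads.
struct RcWaker {
    task: Rc<LocalTask>,
    ready: Rc<RefCell<VecDeque<Rc<LocalTask>>>>,
}

impl RcWaker {
    // Waking from the executor's own thread: no synchronization needed.
    fn wake(&self) {
        self.ready.borrow_mut().push_back(self.task.clone());
    }

    // The conversion hook: this executor refuses to produce a thread-safe
    // waker, so misuse fails loudly here instead of racing silently on the
    // Rc reference counts later.
    fn into_waker(self) -> SendWaker {
        panic!("this executor only supports waking from its own thread");
    }
}
```

An executor that merely has a same-thread optimization would instead return a genuine thread-safe waker from this hook, dropping the optimization.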

From an end user’s perspective, the API is largely unchanged: there are two waker types, LocalWaker and Waker, with the same conversions between them. But we now support one additional implementation strategy, so that seems like a win. The problem is this: it is exactly this unchanged portion of the API that carries a lot of the cost for users of the API.

The high costs of distinguishing Waker from LocalWaker

I had the opportunity to use the waker API extensively recently (in creating the [romio][romio] crate). The distinction between Waker and LocalWaker had not existed the last time I had dealt with the futures API, so I was experiencing it very much as a newcomer. And I’m afraid I must admit: I was, at first, quite baffled. A lot of strangeness conspires to make this API exceptionally confusing:

  • You receive a LocalWaker from the executor, rather than a Waker. It’s unclear without a lot of explanation whether you’re supposed to convert it to a Waker (the thing you probably really want) early or not.
  • LocalWaker is not Send or Sync, but Waker is, and there’s a conversion from LocalWaker to Waker. This looks very odd: it’s hard to understand why LocalWaker isn’t threadsafe if it can be converted directly into a threadsafe version.
  • The AtomicWaker API in the futures library takes an &LocalWaker argument. Internally, it converts that immediately to a Waker. This means that a library like romio deals exclusively with &LocalWaker and never directly sees the Waker type. And yet, because the API it uses makes the conversion to Waker, it is incompatible with a local-only executor. This is unintuitive and surprising.
  • Having more versions of things is just inherently more confusing. Especially with multiple ways to construct a Waker/LocalWaker (from Wake or UnsafeWake/RawWaker), there’s now a grid of combinations between different API components, and understanding how they all relate (or don’t) is hard to learn, on top of learning how to use the APIs properly.

It would be much simpler if the API that a future used could just look like this:

```rust
struct Waker { ... }

impl Send for Waker { ... }
impl Sync for Waker { ... }

impl Waker {
    fn wake(&self) { ... }
}
```

This is what the API looked like for years (under different names in earlier iterations). It seemed to work well. So I asked myself: what are we really getting for this additional complexity, and is it worth it?

What are we getting for this?

I talked my concerns through with cramertj, and ultimately we reached these conclusions:

  1. The first use case - the optimization for the same thread - can easily just use TLS: either literally checking that it’s on the same thread or (more likely) storing its thread-local queue in TLS and checking whether that queue exists (see the sketch after this list). In other words, the first use case needs no additional support from the API; LocalWaker isn’t necessary for it to be supported.
  2. The second use case is more interesting. There is one thing that indeed cannot be supported without the API distinction: a task that is reference counted non-atomically with Rc. There are still other ways to implement a single-threaded event loop, however:
    • Using the task ID technique instead of the reference counting technique, and panicking when a wake comes from another thread rather than when the waker moves to it. This strategy works completely fine.
    • Using atomic reference counts. Since your application is single threaded, on x86 at least this should have essentially no overhead compared to non-atomic reference counts.
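
For the first point, here is roughly what the TLS check could look like (a hypothetical sketch: LOCAL_QUEUE, MyWaker, and usize task IDs are made-up stand-ins for a real executor’s state):

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

thread_local! {
    // Hypothetical executor state: a queue that only exists on the
    // executor's own thread. Its mere presence tells a waker that it is
    // being invoked from that thread.
    static LOCAL_QUEUE: RefCell<Option<VecDeque<usize>>> = RefCell::new(None);
}

// Hypothetical waker that works from any thread, but takes a cheaper,
// unsynchronized path when woken on the executor's own thread.
struct MyWaker {
    task_id: usize,
    shared_queue: Arc<Mutex<VecDeque<usize>>>,
}

impl MyWaker {
    fn wake(&self) {
        let woke_locally = LOCAL_QUEUE.with(|q| {
            if let Some(queue) = q.borrow_mut().as_mut() {
                // Same thread as the executor: no synchronization needed.
                queue.push_back(self.task_id);
                true
            } else {
                false
            }
        });
        if !woke_locally {
            // Cross-thread wake: fall back to the synchronized shared queue.
            self.shared_queue.lock().unwrap().push_back(self.task_id);
        }
    }
}
```

A strictly single-threaded, task-ID-based executor could take the same shape but panic in the cross-thread branch instead of falling back to a shared queue.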

So I had to ask myself: is forcing every author of a manually implemented future to deal with this complexity and unintuitiveness worth it, just to allow one particular implementation strategy (out of several) for a niche executor use case? To me the answer was clear: we’re paying a cost in API ergonomics that doesn’t actually buy us very much.

cramertj agreed. We talked about this before the holidays; when I came back I started this blog series, while he went straight to writing a PR against the RFC. That PR would be the last major change to the futures API before stabilization. By eliminating the distinction between Waker and LocalWaker, I think the waker API becomes much more comprehensible.