Futures and Segmented Stacks

This is just a note on getting the best performance out of an async program.

The point of using async IO over blocking IO is that it gives the user program more control over handling IO, on the premise that the user program can use resources more effectively than the kernel can. In part, this is because of the inherent cost of context switching between the userspace and the kernel, but in part it is also because the user program can be written with more specific understanding of its exact requirements.

There are two main axes on which async IO can gain performance over threads with blocking IO:

  • Scheduling time overhead: scheduling tasks in userspace can be substantially faster than scheduling threads in kernel space when implemented well
  • Stack memory overhead: userspace tasks can use far less memory per task than an OS thread uses per thread

Implementing the stack

There’s a lot of nuance to both points, but I want to talk a little bit about the second point right now.

Whether using async “tasks” or blocking “threads,” fundamentally the unit of work needs space to store its immediate state as work is done. For a “task,” as in Rust’s futures model, this is the state stored in the future object that is being polled to completion. For a “thread,” as in OS threads or Go’s goroutines, this is the thread’s stack. Let’s look at the implementation of stacks in more detail, so we can see better how futures can improve on them.

A stack is a defined space in which the executing unit of work can store its state. It’s implemented as a stack data structure: a new stack frame is pushed onto the stack with each function call, and popped from it as the call returns. The interesting question is this: what do you do when the stack runs out of space? There are three basic options:

  1. Stack overflow: The stack has a preallocated amount of space, and when it runs out, an error is raised.
  2. Segmented stacks: When the stack runs out of space, a new segment is allocated for more space, which is cleaned up after it is no longer needed. Functionally, the stack is a linked list.
  3. Copying stacks: When the stack runs out of space, a new, larger stack is allocated, and the contents of the current stack are copied to the new stack. Functionally, the stack is like a Rust Vec type.
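The Vec analogy for the third strategy is worth making concrete. This is only an illustration, not how any runtime actually implements stacks, but a Vec shows the same mechanics: a small initial allocation, and a grow-and-copy step whenever it runs out:

```rust
fn main() {
    let mut stack: Vec<u64> = Vec::with_capacity(4); // small initial "stack"
    for frame in 0..16 {
        // once capacity is exhausted, Vec allocates a larger buffer and
        // copies the existing contents into it -- the "copying stack" move
        stack.push(frame);
    }
    // the stack grew; any pointer into the old buffer would now be stale,
    // which is why a copying stack must be able to rewrite such pointers
    assert!(stack.capacity() >= 16);
    assert_eq!(stack.iter().sum::<u64>(), 120);
}
```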

C’s OS threads use the first strategy. However, this is where the memory overhead problem first appears. If you can only use the preallocated space, it needs to be enough space for program execution outside of pathological cases. Therefore, an OS thread normally allocates a very large stack. Since most tasks will never approach the maximum size of an OS thread, this is wasted space if you use a 1 task to 1 thread system.
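You can see the preallocation directly in Rust’s standard library. As a sketch (the 64 KiB figure here is arbitrary; Rust’s spawned threads default to 2 MiB, and the main thread is typically 8 MiB on Linux), a thread’s stack size is fixed up front, and exceeding it is a fatal stack overflow:

```rust
use std::thread;

fn main() {
    // Preallocate a deliberately small 64 KiB stack for this thread.
    // Shallow work fits comfortably; deep recursion would abort the
    // process with a stack overflow -- strategy 1 in the list above.
    let handle = thread::Builder::new()
        .stack_size(64 * 1024)
        .spawn(|| (0..100u64).sum::<u64>())
        .expect("failed to spawn thread");
    assert_eq!(handle.join().unwrap(), 4950);
}
```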

The second option seems very appealing, but it runs into serious performance problems. The particular problem is that creating a new segment is much more expensive than just pushing a stack frame onto a stack, and you don’t know in advance where segments will need to be created. It’s possible that a function call in a hot loop will straddle the segment boundary, requiring a new segment to be allocated and freed on every iteration of the loop.

For this reason, both Rust and Go ultimately abandoned segmented stacks. (These two links are also a great resource for learning more about the way these three strategies are implemented, for what it’s worth.)

Go went with the third option. In order to copy stacks around, Go’s runtime needs to be able to rewrite pointers that point into the stack, which it can do because it already has a tracing garbage collector. Such a solution wouldn’t work in Rust, which is why Rust moved its greenthreads toward option 1, and then eventually got rid of them. In the long term, Rust’s greenthreads evolved into the futures model, which solves the problem rather differently.

Futures as a perfectly sized stack

One of the real breakthroughs of Rust’s futures model was its so-called “perfectly sized stack.” The state of a Rust future contains all of the state that future needs as it is suspended across yield points. Therefore, futures do not need to worry about running out of space in their “stack,” because they always allocate exactly the maximum amount of space they could ever need. Thus we sidestep the question entirely. However, this does not come without its own problems.
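This is observable directly: the size of an async fn’s future is a compile-time constant that accounts for all state held across its yield points. A minimal sketch (the function and its contents are mine, purely for illustration):

```rust
async fn task() -> u32 {
    let buf = [1u32; 16];          // 64 bytes of state...
    std::future::ready(()).await;  // ...held live across a yield point
    buf.iter().sum()
}

fn main() {
    // The compiler computes the exact maximum space this task can ever
    // need; the future can never "run out of stack" at runtime.
    let fut = task();
    assert!(std::mem::size_of_val(&fut) >= 64);
    println!("task future size: {} bytes", std::mem::size_of_val(&fut));
}
```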

The first is the problem of recursive async functions. The compiler cannot determine how many times a recursive async function will recur, and therefore it cannot determine how much state the future will need to store, since that depends on how deep the recursion goes. Therefore, async functions cannot be recursive without heap allocating the state of the recursion at some point. Whereas a thread will happily use more stack space as you recur, ultimately handling stack overflow with whatever mechanism the runtime has implemented (e.g. an error or a reallocation), with futures you will encounter a compiler error.
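The standard workaround is to box the recursive call, returning a `Pin<Box<dyn Future>>` instead of an anonymous future type. A self-contained sketch (the `block_on` here is a minimal hand-rolled poll loop with a no-op waker, just to drive the future without pulling in an executor):

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// A recursive async fn must heap allocate its recursive call, so each
// level of recursion becomes a separate allocation -- a segment, in effect.
fn fib(n: u64) -> Pin<Box<dyn Future<Output = u64>>> {
    Box::pin(async move {
        match n {
            0 | 1 => n,
            _ => fib(n - 1).await + fib(n - 2).await,
        }
    })
}

// Minimal executor: poll in a loop with a waker that does nothing.
// A real program would use a proper executor instead.
fn noop(_: *const ()) {}
fn noop_clone(_: *const ()) -> RawWaker {
    RawWaker::new(std::ptr::null(), &NOOP_VTABLE)
}
static NOOP_VTABLE: RawWakerVTable =
    RawWakerVTable::new(noop_clone, noop, noop, noop);

fn block_on<F: Future>(fut: F) -> F::Output {
    let waker =
        unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &NOOP_VTABLE)) };
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

fn main() {
    assert_eq!(block_on(fib(10)), 55);
}
```

Without the `Box::pin`, the compiler rejects `fib` because its future type would contain itself.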

In effect, this acts the same way as attempting to define a recursive type, and at a fundamental level it’s basically the same thing. A recursive async function would have to generate a recursive type, which can’t have a statically known size. But if you heap allocate the state of the recursion, that’s just like heap allocating the recursive field of a recursive type: we no longer need to know the depth of recursion to determine the size of the type.
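The recursive type version of the same problem, for comparison (a standard example, not specific to async):

```rust
// Without the Boxes, the compiler rejects this definition outright:
// "recursive type `Tree` has infinite size". The indirection gives the
// type a statically known size regardless of how deep the tree goes.
enum Tree {
    Leaf(u32),
    Node(Box<Tree>, Box<Tree>),
}

fn sum(t: &Tree) -> u32 {
    match t {
        Tree::Leaf(n) => *n,
        Tree::Node(l, r) => sum(l) + sum(r),
    }
}

fn main() {
    let t = Tree::Node(
        Box::new(Tree::Leaf(1)),
        Box::new(Tree::Node(Box::new(Tree::Leaf(2)), Box::new(Tree::Leaf(3)))),
    );
    assert_eq!(sum(&t), 6);
}
```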

It’s worth thinking about what’s happened here, though: essentially, you have created a segmented stack in your future. The point where you heap allocate the recursive future is the segment point in your stack; when your code reaches that point, it performs a new allocation for more future state, instead of having it allocated inline. This has all the potential performance pitfalls of segmented stacks (it’s more expensive to heap allocate than to use space that’s already allocated), but with one important difference: instead of the segment allocation occurring at a random, unpredictable time, you explicitly denote in your code where you want the allocation to occur.

Segmenting your futures for fun and profit

Futures may be the perfectly sized stack, but sometimes the perfect size is actually too big. Consider a future with several branches. Let’s say that in most branches it could follow, it needs only a very small amount of space: maybe a few kilobytes. But if it follows one branch, which it only follows a very small portion of the time, it needs several megabytes of space.

If this is how you’ve written your code, and you just use async/await, you’ll end up allocating a task that is several megabytes in size, every time. You’re not getting any of the memory overhead advantages in the common case, where you need only a few kilobytes of space. You’ve got a perfectly sized stack for every occasion, but in most occasions most of that space is wasted.

But you can explicitly segment your future state by storing the state of that heavy branch in a separate heap allocation: just Box the call to that branch, and it will not be stored inline. Suddenly, your main task is down to a few kilobytes in size, and the extra allocation stores the megabytes of space for the one uncommon branch. It’s worth the extra heap allocation to make the rest of your tasks that don’t take that branch much smaller.
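A sketch of the difference (the functions and sizes here are invented for illustration; 64 KiB stands in for the "several megabytes"):

```rust
use std::mem::size_of_val;

const BIG: usize = 1 << 16; // 64 KiB of state for the rare branch

async fn small() -> u8 {
    std::future::ready(0u8).await
}

async fn big() -> u8 {
    let buf = [0u8; BIG];            // large local...
    std::future::ready(()).await;    // ...held across an await point
    buf[BIG - 1]
}

// Inline: the future must reserve space for its biggest branch, always,
// even when the rare branch is never taken.
async fn inline_branches(rare: bool) -> u8 {
    if rare { big().await } else { small().await }
}

// Segmented: the rare branch is boxed, so the common case stores only a
// pointer's worth of space for it.
async fn boxed_branch(rare: bool) -> u8 {
    if rare { Box::pin(big()).await } else { small().await }
}

fn main() {
    assert!(size_of_val(&inline_branches(false)) >= BIG);
    assert!(size_of_val(&boxed_branch(false)) < 256);
}
```

Both versions behave identically; only the layout of the task's state changes.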

This is essentially a segmented stack, but under the user’s control. Unlike the segmented stacks used by greenthreads, users choose where the segments go. This means users can make sure the segmentation never occurs in a hot loop, but is lifted outside of that loop as needed. So we don’t encounter the performance pitfalls that automatic segmented stacks encountered in earlier versions of Go and Rust.

So if you’re writing an async program and noticing some problems with its memory overhead, one possibility is that you’re allocating large tasks whose space you don’t actually use most of the time. You could consider finding a good position to segment off most of the space so that it only gets allocated when it’s needed. Ideally, we’d have better tools for identifying tasks that have this property, maybe even suggesting where the segmentation should go.