Ringbahn II: the central state machine

Last time I wrote about ringbahn, a safe API for using io-uring from Rust. I wrote that I would soon write a series of posts about the mechanism that makes ringbahn work. In the first post in that series, I want to look at the core state machine of ringbahn which makes it memory safe. The key types involved are the Ring and Completion types.

Note: the API has evolved since my previous post. The strategy I documented in my first blog post of wrapping std types in a Ring type has not worked out well because of divergences between the underlying API on Linux for sockets, files, and so forth. Instead, ringbahn provides its own set of File, TcpListener, and TcpStream types. The Ring type discussed in this post is a low-level building block on which these types, as well as types like Submission, are built.

Also, the implementation has been coming along nicely. I still want to make some big changes to ringbahn before releasing even a version 0.1, but it’s now been far more stress-tested than it was when I wrote the previous post. Most notably, thanks to dpbriggs, I’ve been able to run redis benchmarks against a version of redis-oxide built on top of ringbahn. This increases our confidence that there aren’t lurking memory bugs in the implementation. Much of the implementation has also become much safer (as in, using less unsafe code), as I’ve gotten a clearer idea of the correct internal structure of the code.

Completion: the shared state between the kernel and the user

When an IO event is submitted to an io-uring instance, the userspace program sets a field of the submission queue event called user_data. This field is 64 bits wide, and can contain anything. When the kernel completes the IO event, the completion queue event that the user gets back contains a user_data field carrying the same value that was set when the event was submitted. That is how users can associate a completed event with the submission that produced it.

In order to integrate io-uring with Rust’s future ecosystem, we need a way to store a waker that will be awoken when that completion finishes. Because user_data is only 64 bits, and a Waker is two words (meaning it is 128 bits on 64-bit platforms), we cannot just store the waker in the user data field. That’s fine, because we will need more state anyway (we’ll get to that in a minute).

So instead, ringbahn heap allocates the waker, along with the other state, in a type called a Completion. The address of the completion is stored in the user data field, and used to wake up the task waiting on this event once the kernel has completed it. Ownership of the completion is shared by the Submission future and the event that’s been passed to the kernel; only once both of them are done with the completion is it deallocated.
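To make the address trick concrete, here is a minimal sketch of round-tripping a heap allocation through a 64-bit value. The helper names into_user_data and from_user_data are hypothetical and are not part of ringbahn's API; the sketch only shows that a Box's address fits in user_data and can be turned back into an owning Box exactly once.

```rust
/// Leak the box and return its address as a value that fits in `user_data`.
/// (Hypothetical helper, for illustration only.)
pub fn into_user_data<T>(boxed: Box<T>) -> u64 {
    // A pointer is at most 64 bits wide on the platforms io-uring supports.
    Box::into_raw(boxed) as usize as u64
}

/// Rebuild the owning box from a `user_data` value.
///
/// # Safety
/// `user_data` must have been produced by `into_user_data::<T>` and must be
/// reclaimed at most once, otherwise this is a double free.
pub unsafe fn from_user_data<T>(user_data: u64) -> Box<T> {
    unsafe { Box::from_raw(user_data as usize as *mut T) }
}
```

Between those two calls nothing owns the allocation in the eyes of the compiler, which is why the real completion has to track by hand when both the future and the kernel are finished with it.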

Because the completion has two owners, there are basically two sequences in which each of them will lose interest in the completion:

  • If the kernel loses interest in the completion first, that means the event has completed.
  • If the user code loses interest in the completion first, that means the event has been cancelled.

The first option is the happy path, and represents a rather simple series of steps. When the event completes, the code that processes completions from the kernel accesses the Completion, sets its state to store the result of the completed event, and wakes the waker the Completion had previously been storing. The latter case is more complicated.
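The happy path can be sketched with two states and two functions. This is a simplified model under my own assumptions, not ringbahn's code: it leaves out cancellation and shared ownership, check borrows the state instead of consuming a Completion, and CountingWaker is a hypothetical waker that exists only so the wake-up can be observed.

```rust
use std::io;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::task::{Wake, Waker};

/// Two-state simplification of the completion state (no cancellation yet).
pub enum State {
    Submitted(Waker),
    Completed(io::Result<u32>),
}

/// Run by the code that processes completion queue events:
/// store the result, then wake the task that was waiting on it.
pub fn complete(state: &Mutex<State>, result: io::Result<u32>) {
    // The lock guard is a temporary, so it is released before `wake` runs.
    let previous = std::mem::replace(&mut *state.lock().unwrap(), State::Completed(result));
    if let State::Submitted(waker) = previous {
        waker.wake();
    }
}

/// Run by the future when it is polled: take the result if there is one,
/// otherwise leave the most recent waker behind for `complete` to use.
pub fn check(state: &Mutex<State>, waker: &Waker) -> Option<io::Result<u32>> {
    let mut guard = state.lock().unwrap();
    match std::mem::replace(&mut *guard, State::Submitted(waker.clone())) {
        State::Completed(result) => Some(result),
        State::Submitted(_) => None,
    }
}

/// Hypothetical waker for demonstration: counts how often it was woken.
pub struct CountingWaker(pub AtomicUsize);

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}
```

Polling before the event finishes yields nothing and refreshes the waker; after complete runs, the waker has fired once and the next check hands back the result.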

Cancellation

Cancelling interest in an event (which any Rust code can do trivially by dropping futures that are waiting for the event to complete) has long been the tricky problem for Rust io-uring support. As I discussed previously, ringbahn conceptually passes ownership of buffers to the kernel while it is completing IO with them, so that if interest in the event is cancelled, the user’s program does not operate on those buffers and possibly cause a data race with the kernel. But if the user does cancel interest in that IO, we still want to make sure that the resources are cleaned up after the kernel is done using them. Otherwise, our program would have a memory leak.

Ringbahn ensures resources the kernel owns are cleaned up by allowing a completion to have a cancelled state. In the cancelled state, instead of holding a waker, the completion holds a Cancellation callback. This callback is a struct holding the data the kernel is meant to own, with a callback to clean up those resources when the kernel is finished with them. When an event completes, if a completion is cancelled, instead of waking up a task, the callback is called to clean up those resources.

The Event types that a submission is abstract over are responsible for defining what the cancellation callback for that event would be, but the IO abstractions make sure that the completion is put into the cancelled state. Some kinds of events, which have no resources they need to clean up, construct a null callback that does nothing, whereas others will clean up resources. This doesn’t necessarily mean freeing memory, either: it could also mean returning the buffer to a pool of pre-registered buffers that are being used with io-uring’s buffer registration feature, for example.
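The three kinds of callback mentioned above (do nothing, free the buffer, return it to a pool) can be sketched as follows. This uses a boxed closure for brevity, which is an assumption on my part and not necessarily how ringbahn represents a Cancellation; Pool, the constructor names, State::Finished and complete_cancelled are all hypothetical.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical pool of reusable buffers.
pub type Pool = Arc<Mutex<Vec<Vec<u8>>>>;

/// Simplified cancellation: owns whatever the kernel may still be using,
/// and knows how to release it once the kernel is finished.
pub struct Cancellation {
    cleanup: Option<Box<dyn FnOnce() + Send>>,
}

impl Cancellation {
    /// For events that have no resources to clean up.
    pub fn null() -> Cancellation {
        Cancellation { cleanup: None }
    }

    /// Free the buffer once the kernel is done with it.
    pub fn free(buf: Vec<u8>) -> Cancellation {
        Cancellation { cleanup: Some(Box::new(move || drop(buf))) }
    }

    /// Hand the buffer back to a pool instead of freeing it.
    pub fn recycle(buf: Vec<u8>, pool: Pool) -> Cancellation {
        Cancellation {
            cleanup: Some(Box::new(move || {
                pool.lock().unwrap().push(buf);
            })),
        }
    }

    /// Run the cleanup, if any.
    pub fn run(self) {
        if let Some(cleanup) = self.cleanup {
            cleanup();
        }
    }
}

pub enum State {
    Cancelled(Cancellation),
    Finished,
}

/// What the completion-processing code does when the event behind a
/// cancelled completion finally finishes: no task to wake, just clean up.
pub fn complete_cancelled(state: &mut State) {
    if let State::Cancelled(cancellation) = std::mem::replace(state, State::Finished) {
        cancellation.run();
    }
}
```

The important property is that the buffer lives inside the Cancellation until the kernel reports completion, so dropping the future early never frees memory the kernel might still write to.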

Code

Put together, the definition of a completion is this. Note that all of this code is internal to ringbahn: the library’s users are never responsible for operating a completion directly (that’s what the Ring type is for).

pub struct Completion {
    // The state is ManuallyDrop, because ownership is shared with the kernel
    state: ManuallyDrop<Box<Mutex<State>>>,
}

enum State {
    Submitted(Waker),
    Completed(io::Result<u32>),
    Cancelled(Cancellation),
}

impl Completion {
    // Construct a submitted completion
    pub fn new(waker: Waker) -> Completion { /*..*/ }

    // Check if the completion has actually completed
    pub fn check(self, waker: &Waker) -> Result<io::Result<u32>, Completion> { /*..*/ }

    // Cancel interest in this completion
    pub fn cancel(self, cancellation: Cancellation) { /*..*/ }
}

// Complete a given CQE
pub unsafe fn complete(cqe: CQE) { /*..*/ }

The complete function is unsafe, but the only safety requirement is that the user_data of the CQE passed to it actually contain the address of a completion (or null). I plan to refactor the API later so that this function is safe, and the function for setting user_data is unsafe.

Ring: driving a completion on a driver

As I mentioned previously (and intend to cover at greater length in the next post), ringbahn is abstract over a notion of a driver, which manages how IO on the io-uring is actually scheduled. The Ring type allows IO to be run on a driver by combining a Driver handle with a Completion. It tracks the state of an IO operation that is prepared, submitted, and completed on an io-uring driver.

The driver has the ability to prepare and submit events, and the ring attempts to do exactly that, tracking the state of that process as it goes. The driver is able to apply backpressure by informing the ring that it is not ready to complete these prepare and submit operations; if so, the ring will be left in the previous state, waiting to be awoken and to try again. Once the ring has submitted an event, when it awakes it will check its completion to see if it has completed.

Once an event has completed on the Ring, it is put back into its inert state, allowing it to be used again. The IO handle types like File and TcpStream all take advantage of this, whereas the Submission type (which is just supposed to run one event to completion) does not run more than one event on the ring type.
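The progression through these states, including backpressure, can be modelled with a small sketch. Everything here is a simplification under my own assumptions: the Drive trait below is a hypothetical three-method stand-in rather than ringbahn's real driver interface, waker registration is omitted, the states carry no Completion, and Scripted is a fake driver whose answers are supplied up front.

```rust
use std::collections::VecDeque;
use std::io;
use std::task::Poll;

/// Hypothetical, heavily simplified driver. Each call returns `false`
/// (or `None`) when the driver is not ready, which models backpressure.
pub trait Drive {
    fn try_prepare(&mut self) -> bool;
    fn try_submit(&mut self) -> bool;
    fn try_complete(&mut self) -> Option<io::Result<u32>>;
}

#[derive(Debug, PartialEq)]
pub enum State {
    Inert,
    Prepared,
    Submitted,
}

pub struct Ring<D: Drive> {
    state: State,
    driver: D,
}

impl<D: Drive> Ring<D> {
    pub fn new(driver: D) -> Ring<D> {
        Ring { state: State::Inert, driver }
    }

    pub fn state(&self) -> &State {
        &self.state
    }

    /// Advance as far as the driver allows. On `Pending` the ring stays in
    /// the state it had reached, so the next poll resumes from there.
    pub fn poll(&mut self) -> Poll<io::Result<u32>> {
        loop {
            match self.state {
                State::Inert => {
                    if !self.driver.try_prepare() {
                        return Poll::Pending;
                    }
                    self.state = State::Prepared;
                }
                State::Prepared => {
                    if !self.driver.try_submit() {
                        return Poll::Pending;
                    }
                    self.state = State::Submitted;
                }
                State::Submitted => match self.driver.try_complete() {
                    Some(result) => {
                        // Back to inert, so the ring can run another event.
                        self.state = State::Inert;
                        return Poll::Ready(result);
                    }
                    None => return Poll::Pending,
                },
            }
        }
    }
}

/// Fake driver: answers each call from a script, then always says yes.
pub struct Scripted {
    pub answers: VecDeque<bool>,
}

impl Scripted {
    fn next(&mut self) -> bool {
        self.answers.pop_front().unwrap_or(true)
    }
}

impl Drive for Scripted {
    fn try_prepare(&mut self) -> bool {
        self.next()
    }
    fn try_submit(&mut self) -> bool {
        self.next()
    }
    fn try_complete(&mut self) -> Option<io::Result<u32>> {
        if self.next() { Some(Ok(7)) } else { None }
    }
}
```

With a script of "yes, no", the first poll prepares the event, is refused at submission, and leaves the ring in Prepared; the second poll submits, completes, and returns the ring to Inert.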

Code

The basic definition and API of the Ring type looks like this:

pub struct Ring<D: Drive> {
    state: State,
    driver: D,
}

enum State {
    Inert,
    Prepared(Completion),
    Submitted(Completion),
    Lost,
}

impl<D: Drive> Ring<D> {
    pub fn new(driver: D) -> Ring<D> { /*..*/ }

    pub fn poll(
        self: Pin<&mut Self>,
        ctx: &mut Context<'_>,
        is_eager: bool,
        prepare: impl FnOnce(&mut SubmissionQueueEvent),
    ) -> Poll<io::Result<u32>> { /*..*/ }

    pub fn cancel(&mut self, cancellation: Cancellation) { /*..*/ }
}

The main operation of the Ring type is through the poll function, which attempts to move an event through the states to completion. The Submission type, for example, works by combining a Ring with an implementation of Event, preparing that event with poll, and cancelling it if the Submission gets dropped. The implementation of types like File is more complicated, but does fundamentally the same thing by combining a Ring with a buffer type and a file descriptor and creating a handle that can run IO events repeatedly on that file descriptor.

Conclusion

I fear this has all been a bit dense, but I hope that it’s a decent explanation of the core state machines that make ringbahn work. Using these state machines, we are able to safely map IO handles and arbitrary events over a driver managing an io-uring instance. We do so in a way that, barring unknown implementation bugs, is 100% sound and memory safe, never blocks, and cleans up resources for cancelled events.

In the next post in the series, I’ll look closer at the API that drivers need to implement, and different strategies and considerations that go into implementing drivers.