Ringbahn III: A deeper dive into drivers

In the previous post in this series, I wrote about how the core state machine of ringbahn is implemented. In this post I want to talk about another central concept in ringbahn: “drivers”, external libraries which determine how ringbahn schedules IO operations over an io-uring instance.

The three phases of an io-uring event

When a user wishes to perform an IO operation using io-uring, the “event” that represents this IO goes through three distinct phases:

  1. Preparation: At this phase, the user prepares a “Submission Queue Event”, or “SQE” representing the IO they wish to perform. This will have an op code representing the kind of IO the user is performing, a file descriptor to perform it on (if necessary), flags, and pointers to buffers or other objects that the kernel will use while performing this IO.
  2. Submission: At this phase, all of the SQEs that have been prepared since the previous submit can be submitted to the kernel for processing. Submission and preparation are separate phases so that many SQEs that were prepared separately can be submitted at once.
  3. Completion: In the final phase, the kernel has completed the requested IO, and passed the user program a “Completion Queue Event” or “CQE.” Now the user program can check the result of the IO in that CQE and proceed.

My last post described this lifecycle from the perspective of the state machine which is handling a particular IO event. Now, I want to describe the perspective from the other side of that equation: the code which handles the io-uring instance that event is occurring on.

The Drive trait

The first two phases, preparation and submission, are handled by a driver type that implements the Drive trait. This trait is the public interface a driver library exposes to its users, and it is what the rest of ringbahn consumes.

End users should not need to consume the Drive API directly. It is consumed by ringbahn’s very low-level Ring type, which implements a state machine to track an event through the three phases of its lifecycle; the lower-level Event API and the higher-level IO objects are all built on top of Ring. This is the definition of the trait:

pub trait Drive {
    fn poll_prepare<'cx>(
        self: Pin<&mut Self>,
        ctx: &mut Context<'cx>,
        count: u32,
        prepare: impl FnOnce(iou::SQEs<'_>, &mut Context<'cx>) -> Completion<'cx>,
    ) -> Poll<Completion<'cx>>;

    fn poll_submit(
        self: Pin<&mut Self>,
        ctx: &mut Context<'_>,
    ) -> Poll<io::Result<u32>>;
}

Each method corresponds to one of the first two phases of an event’s lifecycle. When a Ring is ready to prepare an event, it will call poll_prepare to attempt to prepare it. Note that in ringbahn, a Ring can request to prepare more than one SQE at once (currently, if they are not all hard-linked together, this will result in bizarre and incorrect program behavior). The driver will attempt to get a sequence of SQEs of the requested length, and then call the prepare callback with those SQEs. If it calls the prepare callback, it must return the Completion that the callback gave it; otherwise the program will behave incorrectly.

Second, when a Ring is ready for an event to be submitted, it will call poll_submit, requesting that the driver submit all prepared events. The driver can do whatever it likes here (including not actually submitting events), but from this point forward the ring will wait until the event it has prepared is completed; if the event is never submitted, any future waiting on the ring will never make progress.

Both of these methods are “poll” methods, and return the Poll type. This allows the driver to apply backpressure: if there is not space in the submission queue, or there are too many events in flight to submit more, either method could return Poll::Pending and store a waker somewhere to wake when the driver is ready to proceed through that phase for this event.

The drive::complete function

pub fn complete(cqe: iou::CQE) { /* .. */ }

The final phase, the completion phase, is handled by the driver privately. Somehow, the driver must process all of the events in the completion queue. All it has to do is call the drive::complete function on each of them. When called on a CQE submitted through ringbahn, this function wakes the future waiting on the event to complete; if interest in the event has been cancelled, it runs the cancellation callback instead.

Note that although this function is safe, it assumes that the user_data field of the CQE is the address of a completion, as described in the previous post. For that reason, it is not safe to submit events on an io-uring instance managed by ringbahn except through ringbahn’s APIs. The only exception to this rule is the timeout event created by the timeout methods in iou/liburing.

Implementation strategies

Having laid out the API, the implementation strategy for a ringbahn driver is entirely up to the implementer. I’ve separated drivers from ringbahn itself because I can imagine many successful strategies, and I’m not sure which would provide the best performance.

Importantly, the work of implementing a driver has been simplified to these three phases, and the API is safe. Users can write complex business logic using the IO objects and submission APIs in ringbahn and replace or tweak the driver code as needed to meet the performance requirements of their system.

I’ve implemented two drivers so far with very different designs. Both of them likely have bugs and are not ready to use in production, but the general idea of each driver is valid as an implementation which ringbahn can work with.

The demo driver

The demo driver is intended to be a very naive implementation which will give you basically the worst performance possible for io-uring. It uses no unsafe code, which is its only virtue.

A single global io-uring instance is shared by all instances of DemoDriver. A mutex wraps the submission queue, preventing multiple tasks from preparing or submitting events at the same time.

When user code calls poll_prepare, the submission queue is locked and the driver attempts to use space on it to prepare the events. If there is not enough space, we attempt to submit the events already on the queue before preparing.

When user code calls poll_submit (or needs to submit events to make space for preparing new ones), we submit all prepared events, every time. The DemoDriver enables the io-uring NODROP feature, so that if there are too many events in flight, submission will fail with the EBUSY error code. If it does, we use event-listener to put the driver into a queue until more events have completed, and return Poll::Pending.

When the IoUring is initialized, a separate thread is started to process completions. This thread holds the handle to the completion queue, and constantly blocks until at least one event has completed. This thread handles all completions, and wakes all the tasks being processed on other threads.

The maglev driver

The maglev driver on the other hand tries to be more performant. Inspired by async-io, it exposes a block_on function, which blocks on a future and drives the io-uring instance while that future is pending.

In maglev, each thread which runs a future with maglev::block_on has its own IoUring instance, stored in thread-local storage. If a task is awoken on a new thread, its driver will switch to using the IoUring instance of that new thread. This avoids us having to synchronize different threads all operating over the same submission and completion queues.

When user code calls poll_prepare with maglev, it first checks a thread-local semaphore tracking how many events are currently in flight, and then checks whether there is space in the submission queue for this event. If either check fails, it uses event-listener to put the task into a queue that waits until the event can be prepared. Otherwise, it prepares the event.

Maglev does nothing when user code calls poll_submit. Instead, it only submits IO after the future that block_on is processing returns Pending. When it submits IO, it also completes any events that have already finished, without blocking. If completing those events does not wake the future being run with block_on, it blocks until at least one more event has completed, checking each time whether the future has been awoken, until no events remain in flight. If there are no events to complete at all, it simply parks the thread and waits to be awoken by another thread.

This allows us to call io_uring_enter, the syscall that underlies both submission and completion, as few times as possible. This is the same design that underlies all of the major epoll-based reactors (though some use a single epoll reactor shared among multiple threads, unlike maglev). For example, combining maglev with async-executor would allow a user to implement a workstealing multi-threaded executor in which each thread has its own IoUring instance that it manages when it has no more work to do, with this small amount of code:

use async_executor::Executor;
use futures::future;
use once_cell::sync::Lazy;

static EXECUTOR: Lazy<Executor<'static>> = Lazy::new(Executor::new);

// on each thread of the executor, run the executor with maglev:
let shutdown_never = future::pending::<()>();
maglev::block_on(EXECUTOR.run(shutdown_never));
You can see the implementation of smol::spawn to see how close this is to how smol is already implemented. Literally all that would be needed is to change async_io::block_on to maglev::block_on and to use ringbahn’s IO objects instead of smol’s Async type.

Other implementations

Many other implementations are certainly possible, trading off between latency, bandwidth and other concerns. As ringbahn develops, I would be excited to see other people implement drivers, and see benchmarks comparing their performance in different circumstances.

One idea I haven’t explored, for example, is a model which integrates an io-uring instance into an application that mostly performs async IO using epoll. In this use case, the io-uring instance could be registered as just another FD with the epoll instance, which would report that FD as ready as soon as events complete on it, so the user can process their completions.