Ringbahn III: A deeper dive into drivers
In the previous post in this series, I wrote about how the core state machine of ringbahn is implemented. In this post I want to talk about another central concept in ringbahn: “drivers”, external libraries which determine how ringbahn schedules IO operations over an io-uring instance.
The three phases of an io-uring event
When a user wishes to perform an IO operation using io-uring, the “event” that represents this IO goes through three distinct phases:
- Preparation: At this phase, the user prepares a “Submission Queue Event”, or “SQE” representing the IO they wish to perform. This will have an op code representing the kind of IO the user is performing, a file descriptor to perform it on (if necessary), flags, and pointers to buffers or other objects that the kernel will use while performing this IO.
- Submission: At this phase, all of the SQEs that have been prepared since the previous submit can be submitted to the kernel for processing. Submission and preparation are separate phases so that many SQEs that were prepared separately can be submitted at once.
- Completion: In the final phase, the kernel has completed the requested IO, and passed the user program a “Completion Queue Event” or “CQE.” Now the user program can check the result of the IO in that CQE and proceed.
My last post described this lifecycle from the perspective of the state machine which is handling a particular IO event. Now, I want to describe the perspective from the other side of that equation: the code which handles the io-uring instance that event is occurring on.
The `Drive` trait
The first two phases - preparation and submission - are handled by the driver type that implements the `Drive` trait. This is the public interface of any driver library to its users, and it is what the rest of ringbahn consumes.
End users should not need to consume the `Drive` API directly. This API is consumed by ringbahn's very low-level `Ring` type, which implements a state machine to track an event through the three phases of its lifecycle; the lower-level `Event` API and higher-level IO objects are all built on top of `Ring`. But this is the definition of the trait:
```rust
pub trait Drive {
    fn poll_prepare<'cx>(
        self: Pin<&mut Self>,
        ctx: &mut Context<'cx>,
        count: u32,
        prepare: impl FnOnce(iou::SQEs<'_>, &mut Context<'cx>) -> Completion<'cx>,
    ) -> Poll<Completion<'cx>>;

    fn poll_submit(
        self: Pin<&mut Self>,
        ctx: &mut Context<'_>,
    ) -> Poll<io::Result<u32>>;
}
```
Each method corresponds to one of the first two phases of the lifecycle of an event. When a `Ring` is ready to prepare an event, it will call `poll_prepare` to attempt to prepare it. Note that in ringbahn, a `Ring` can request to prepare more than one SQE at once (currently, if they are not all hard-linked together, this will result in bizarre and incorrect program behavior). The driver will attempt to get a sequence of SQEs of the requested length, and then call the `prepare` callback with those SQEs. If it calls the `prepare` callback, it must return the `Completion` that the callback gave it; otherwise the program will behave incorrectly.
Secondly, when a `Ring` is ready for an event to be submitted, it will call `poll_submit`, requesting that the driver submit all prepared events. The driver can do whatever it likes here (including not actually submitting events), but from this point forward the ring will wait until the event it has prepared is completed; if the event is never submitted, any future waiting on the ring will never make progress.
Both of these methods are "poll" methods, and return the `Poll` type. This allows the driver to apply backpressure: if there is no space in the submission queue, or there are too many events in flight to submit more, either method can return `Poll::Pending` and store a waker somewhere, to be woken when the driver is ready to proceed through that phase for this event.
The `drive::complete` function
```rust
pub fn complete(cqe: iou::CQE) { /* .. */ }
```
The final phase, the completion phase, is handled by the driver privately. Somehow, the driver must process all of the events in the completion queue. All it has to do is call the function `drive::complete` on those events. This function, when called by the driver on a CQE submitted through ringbahn, will wake the future waiting on this event to complete. If interest in this event has been cancelled, it will run the cancellation callback instead.
Note that this method is safe, and it assumes that the `user_data` field of the CQE is the address of a completion, as described in the previous post. For that reason, it is not safe to submit events on an io-uring instance managed by ringbahn except through ringbahn's APIs. The only exception to this rule is the timeout event created by the timeout methods in `iou`/`liburing`.
Implementation strategies
Having laid out the API, the implementation strategy for a ringbahn driver is entirely up to the implementer. I’ve created this separation from ringbahn because I can imagine many successful strategies, and I’m not sure what would provide the best performance.
Importantly, the work of implementing a driver has been simplified to these three phases, and the API is safe. Users can write complex business logic using the IO objects and submission APIs in ringbahn and replace or tweak the driver code as needed to meet the performance requirements of their system.
I’ve implemented two drivers so far with very different designs. Both of them likely have bugs and are not ready to use in production, but the general idea of each driver is valid as an implementation which ringbahn can work with.
The demo driver
The demo driver is intended to be a very naive implementation which will give you basically the worst performance possible for io-uring. It uses no unsafe code, which is its only virtue.
A single global io-uring instance is used by all instances of `DemoDriver`. A mutex wraps the submission queue, preventing multiple tasks from preparing or submitting events at the same time.
When user code calls `poll_prepare`, the submission queue is locked and the user attempts to use space on the submission queue to prepare its events. If there is not enough space, we attempt to submit the events on the queue before preparing.
When user code calls `poll_submit` (or needs to submit events to get space to prepare them), we submit all prepared events, every time. The `DemoDriver` sets up its io-uring instance with the `NODROP` feature enabled, so that if there are too many events in flight, this method will return the `EBUSY` error code. If it does, we use event-listener to put the driver into a queue until more events have been completed, and return pending.
When the `IoUring` is initialized, a separate thread is started to process completions. This thread holds the handle to the completion queue, and constantly blocks until at least one event has completed. This thread handles all completions, and wakes all the tasks being processed on other threads.
The maglev driver
The maglev driver, on the other hand, tries to be more performant. Inspired by async-io, it exposes a `block_on` function, which blocks on a future and drives the io-uring instance while that future is pending.
In maglev, each thread which runs a future with `maglev::block_on` has its own `IoUring` instance, stored in thread-local storage. If a task is awoken on a new thread, its driver will switch to using the `IoUring` instance of that new thread. This avoids having to synchronize different threads all operating over the same submission and completion queues.
When user code calls `poll_prepare` with maglev, it first checks a thread-local semaphore tracking how many events are currently in flight, and then checks whether there is space in the submission queue for this event. If either check fails, it uses event-listener to put the task into a queue that will wait until the event can be prepared. Otherwise, it prepares the event.
Maglev does nothing when user code calls `poll_submit`. Instead, it only submits IO after the future that `block_on` is processing returns `Pending`. When it submits IO, it also completes any events that have finished, without blocking. If completing these events does not wake the future being run with `block_on`, it will block until at least one more event has completed, repeatedly, checking each time whether the future has been awoken, until there are no events left to complete. If there are no events to complete, it will simply park the thread and wait to be awoken by another thread.
This allows us to call `io_uring_enter`, the syscall that underlies both submission and completion, as few times as possible. This is the same design that underlies all of the major epoll-based reactors (though some use a single epoll reactor shared among multiple threads, unlike maglev). For example, combining maglev with async-executor would allow a user to implement a work-stealing, multi-threaded executor, in which each thread has its own `IoUring` instance that it manages when it has no more work to do, with this small amount of code:
```rust
use async_executor::Executor;
use futures::future;
use once_cell::sync::Lazy;

static EXECUTOR: Lazy<Executor<'_>> = Lazy::new(|| /* ... */);

// on each thread of the executor, run the executor with maglev:
let shutdown_never = future::pending::<()>();
maglev::block_on(EXECUTOR.run(shutdown_never));
```
You can look at the implementation of `smol::spawn` to see how close this is to how smol is already implemented. Literally all that would be needed is to change `async_io::block_on` to `maglev::block_on`, and to use ringbahn's IO objects instead of smol's `Async` type.
Other implementations
Many other implementations are certainly possible, trading off between latency, bandwidth and other concerns. As ringbahn develops, I would be excited to see other people implement drivers, and see benchmarks comparing their performance in different circumstances.
One idea I haven’t explored, for example, is a model which integrates an io-uring instance into an application that mostly runs async IO using epoll. In this use case, the io-uring instance could be treated as just another FD to perform IO on, like the other FDs being controlled with epoll. The epoll instance would report that FD as ready as soon as events have completed on it, at which point the user could process those completions.