Notes on io-uring

May 6, 2020

Last fall I was working on a library to make a safe API for driving futures on top of an io-uring instance. Though I released bindings to liburing called iou, the futures integration, called ostkreuz, was never released. I don’t know if I will pick this work up again in the future but several different people have started writing other libraries with similar goals, so I wanted to write up some notes on what I learned working with io-uring and Rust’s futures model. This post assumes some level of familiarity with the io-uring API. A high level overview is provided in this document.

First of all: soundness of all safe public APIs is the mandatory minimum for a Rust library. Unsoundness is a nonstarter, and should not be considered as a possible option. If it’s actually impossible to get a sufficient performance profile within the soundness guarantees of Rust, users should be writing the sound version and then working with the Rust project to improve Rust so that they can create sound APIs with the performance profile they need. However, I do not believe that io-uring will actually need any changes to Rust to accommodate writing high performance programs against it.

The problem: completion, cancellation and buffer management

First, let me review the tricky problem of integrating io-uring and Rust.

There are two kinds of APIs for non-blocking IO: readiness and completion. In a readiness API, the program asks the OS to tell it when an IO handle is ready to perform IO, and then the program performs the IO (which does not block, because the handle is ready). In a completion API, the program asks the OS to perform IO on some handle, and receives notice from the OS when the IO is complete. epoll is a readiness API, whereas io-uring is a completion API.

In Rust, futures are supposed to be implicitly cancellable, by simply never polling them again. This works well with readiness-based APIs, because you can trivially ignore the fact that IO is ready for a cancelled future. But if you pass a buffer to have IO performed into it by a completion-based API, the kernel will write to or read from that buffer even if you cancel the future.

This means that if you pass a slice to be used as a buffer for IO and then cancel the future waiting on that IO, you cannot use that slice again until the kernel has actually completed. To put this in code:

// This future will wait on a read from `file` into `buffer`
let future = io_uring.read(&mut file, &mut buffer[..]);

// Cancel the future: we decide we don't care about this IO:
drop(future);

// Oops, this is a data race with the kernel! The kernel
// is going to try to write into the buffer, because cancelling
// that future did not actually cancel the IO:
buffer.copy_from_slice(..);

Therefore, we need some way to ensure that the kernel has logically borrowed the buffer until the kernel has completed its IO, no matter what happens in the userspace program. This is the tricky problem.

Blocking on drop does not work

One common solution proposed is to block on the completion of the IO in the destructor of the future. That way, if the future is cancelled, it will block until the IO is actually completed. This is unsound and a non-starter.

Any object can be leaked in Rust, trivially and safely, making it unsound to rely on a destructor running at the end of a lifetime. As one of the most informed people in the world about Rust’s rules around leaking memory and what users can rely on for soundness, I can promise confidently that users will never be able to rely on destructors running. There is no way to make this work.
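To make the leak concrete, here is a minimal sketch using only the standard library. `InFlightRead` is a hypothetical stand-in for a future that borrows a buffer and expects its destructor to wait for the kernel; the counter only exists so the example can observe whether the destructor ran.

```rust
use std::mem;
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts how many times the destructor below actually ran.
static DROPS: AtomicUsize = AtomicUsize::new(0);

// Hypothetical stand-in for a future that borrows a buffer and whose
// destructor is supposed to block until the kernel is done with it.
struct InFlightRead<'a> {
    _buffer: &'a mut [u8],
}

impl Drop for InFlightRead<'_> {
    fn drop(&mut self) {
        // A real implementation would block on the completion here.
        DROPS.fetch_add(1, Ordering::SeqCst);
    }
}

// Leaks the guard, reuses the buffer, and returns how many destructor
// runs were observed in between.
fn leak_then_reuse(buffer: &mut [u8]) -> usize {
    let before = DROPS.load(Ordering::SeqCst);
    let read = InFlightRead { _buffer: &mut buffer[..] };
    // Entirely safe code: the destructor never runs, yet the borrow ends.
    mem::forget(read);
    // The buffer is usable again even though nothing waited for the "IO".
    buffer[0] = 1;
    DROPS.load(Ordering::SeqCst) - before
}
```

The function compiles without any `unsafe`, and the write to `buffer[0]` happens without the destructor having run, which is exactly the window in which a real kernel could still be writing.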

Moreover, I think it is an incredibly bad idea to attempt to rely on it, even if you accepted the unsoundness (or made the unsoundness “correct” by, e.g., making constructing the future unsafe). Blocking in the destructor seems to be based on the assumption that you don’t actually want to cancel these tasks, but users do want that.

If you block the whole thread in the destructor (really the only thing possible in Rust today), you are now blocking an entire thread on this IO: a terrible performance regression. But even with async destructors, you are blocking this task on the IO completing. But the user’s code on top of your library has cancelled its interest in this IO. Right now, a lot of people are looking at io-uring for things like file system IO, where blocking on this IO might seem reasonable. But io-uring is the future of all IO on Linux, including network IO. If you block the task on completing the IO, timeouts simply stop working.

(This could be mitigated by submitting a cancellation to the io-uring, which will hopefully cancel the kernel IO in a timely manner. But now you’re submitting more events and performing more syscalls to cancel this future, making cancellation still unnecessarily expensive.)

The only performant way to handle cancellation that I can see is this: io-uring will necessarily allocate a waker for the completion event, to trigger a wake by the CQE. The future must also have access to this allocated waker, where it can register that the task does not care about this future any more, so that the CQE processing code will not wake the task when the IO completes. Possibly, the drop could also submit a cancellation notice to the kernel (though I would at least consider only opportunistically submitting the cancellation as a rider on other submissions).
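One way this shared state could look is sketched below. It is an illustration under assumptions, not the design of any existing library: `Completion`, `ReadFuture`, `complete` and `CountingWaker` are all hypothetical names, a mutex stands in for whatever synchronization a real driver would use, and no actual io-uring calls are made.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::task::{Wake, Waker};

// State shared between the future and the CQE-processing code.
enum Completion {
    Waiting(Waker), // IO in flight; wake this task when it finishes
    Cancelled,      // the future was dropped; nobody wants the result
    Done(i32),      // the IO finished with this result code
}

// Hypothetical future handle; a real one would also implement `Future`.
struct ReadFuture {
    state: Arc<Mutex<Completion>>,
}

impl Drop for ReadFuture {
    fn drop(&mut self) {
        // Dropping does not block: it only records the lost interest.
        let mut state = self.state.lock().unwrap();
        if matches!(*state, Completion::Waiting(_)) {
            *state = Completion::Cancelled;
        }
    }
}

// Called by the CQE-processing loop. Returns true if a task was woken.
fn complete(state: &Mutex<Completion>, result: i32) -> bool {
    let mut guard = state.lock().unwrap();
    let previous = std::mem::replace(&mut *guard, Completion::Done(result));
    match previous {
        Completion::Waiting(waker) => {
            drop(guard); // release the lock before waking the task
            waker.wake();
            true
        }
        // Cancelled or already completed: restore the state, wake nobody.
        other => {
            *guard = other;
            false
        }
    }
}

// Test-only waker that counts how often it was woken.
struct CountingWaker(AtomicUsize);

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

// Completes one read whose future is alive and one whose future was
// dropped first, returning the number of wakes seen in each case.
fn wakes_for_live_and_cancelled() -> (usize, usize) {
    let run = |cancel: bool| {
        let counter = Arc::new(CountingWaker(AtomicUsize::new(0)));
        let waker = Waker::from(counter.clone());
        let state = Arc::new(Mutex::new(Completion::Waiting(waker)));
        let future = ReadFuture { state: state.clone() };
        if cancel {
            drop(future);
        }
        complete(&state, 0);
        counter.0.load(Ordering::SeqCst)
    };
    (run(false), run(true))
}
```

In this sketch the drop path costs one lock and one store, with no syscall; an opportunistic kernel-side cancellation could be layered on top without changing the state machine.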

The kernel must own the buffer

Last August, Taylor Cramer made a post proposing that pinning the buffer could be a way to make this work. The idea is this: using a custom buffer type which does not implement Unpin, the buffer would be guaranteed to be dropped before it is invalidated. It’s important to note that this is not the same as being guaranteed to be dropped: it’s still possible it won’t be dropped, but the memory backing the buffer will never be freed if it isn’t dropped. The buffer type would have a destructor which, in Taylor’s words, “deregisters itself” - this isn’t actually possible, so I think what Taylor would’ve meant to say is that the buffer’s destructor, rather than the future’s, blocks on the completion of this task.

However, this is not a solution to the problem. It is not sufficient that the buffer not be invalidated until the IO is complete. While the kernel writing into freed memory would be very bad, it is not the only way this can go wrong. If the user drops the future, they still have the handle to the buffer, into which they can read and write from userspace. This is also a data race with the kernel IO which wants to use that buffer. It’s not sufficient to guarantee that the buffer isn’t dropped, we must also guarantee that the kernel has exclusive access to the buffer.

Logical ownership is the only way to make this work in Rust’s current type system: the kernel must own the buffer. There’s no sound way to take a borrowed slice, pass it to the kernel, and wait for the kernel to finish its IO on it, guaranteeing that the concurrently running user program will not access the buffer in an unsynchronized way. Rust’s type system has no way to model the behavior of the kernel except passing ownership. I would strongly encourage everyone to move to an ownership based model, because I am very confident it is the only sound way to create an API.
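As a rough sketch of what an ownership-passing interface could look like, the following mock moves the buffer into the driver on submission and hands it back only on completion. `MockRing` and its methods are hypothetical and merely simulate the kernel; the point is the shape of the signatures, not the implementation.

```rust
// Hypothetical driver that only simulates the kernel side of io-uring.
struct MockRing {
    next_id: u64,
    // Buffers currently "owned by the kernel", keyed by operation id.
    in_flight: Vec<(u64, Vec<u8>)>,
}

impl MockRing {
    fn new() -> Self {
        MockRing { next_id: 0, in_flight: Vec::new() }
    }

    // Submitting moves the buffer into the ring; the caller keeps only
    // an opaque token and has no way left to touch the memory.
    fn submit_read(&mut self, buffer: Vec<u8>) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.in_flight.push((id, buffer));
        id
    }

    // Simulated completion: the "kernel" fills the buffer with `data`
    // and ownership returns to the caller along with the byte count.
    fn complete(&mut self, id: u64, data: &[u8]) -> Option<(usize, Vec<u8>)> {
        let index = self.in_flight.iter().position(|(op, _)| *op == id)?;
        let (_, mut buffer) = self.in_flight.swap_remove(index);
        let n = data.len().min(buffer.len());
        buffer[..n].copy_from_slice(&data[..n]);
        Some((n, buffer))
    }
}
```

Between `submit_read` and `complete` the caller holds no name for the buffer, so the earlier data race cannot be written in safe code. Losing or forgetting the token simply leaves the buffer with the ring.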

And also, this is actually advantageous. io-uring has a lot of APIs, which are growing in number and complexity, designed around allowing the kernel to manage the buffers for you. Passing the buffers by ownership allows us to access these APIs, and will be the most high performance solution in the long term anyway. Let’s accept that the kernel owns the buffers, and design high performance APIs on top of that interface.

Copying buffers and IO traits

There is one big elephant in the room that I haven’t mentioned yet: the IO traits. Read, Write, and by extension, AsyncRead and AsyncWrite, are all based around an API in which the caller manages the buffers, and the IO object just performs IO into them. This is not consistent with the kernel owning the buffers. The only way to implement these APIs safely is to manage a separate set of buffers, and then copy into the buffers passed in. This is an extra, unnecessary, memcpy of the data being passed in. That’s not great.
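The extra copy can be seen in a small sketch of `std::io::Read` implemented over an internally owned buffer. `RingReader` is a hypothetical type; its `filled` field stands in for a buffer that a completed io-uring read has already populated.

```rust
use std::io::{self, Read};

// Hypothetical handle whose internal buffer stands in for one that a
// completed io-uring read has already filled.
struct RingReader {
    filled: Vec<u8>,
    pos: usize,
}

impl Read for RingReader {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        let available = &self.filled[self.pos..];
        let n = available.len().min(out.len());
        // This memcpy is the price of the caller-managed interface.
        out[..n].copy_from_slice(&available[..n]);
        self.pos += n;
        Ok(n)
    }
}
```

The caller's slice is never handed to the kernel here, which is what keeps the implementation sound; the data simply crosses one more buffer on its way out.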

And yet, I advocated (knowing this) in my async interview that we merge AsyncRead and AsyncWrite as is into std. Why? The reason is simple: AsyncRead and AsyncWrite, like their sync counterparts, represent the interface of reading and writing with a caller-managed buffer. If the only way to manage this safely with some underlying OS interface is to perform an extra copy, so be it. It works out fine, because there is another interface intended for use with callee-managed buffers: AsyncBufRead.

The AsyncBufRead trait perfectly describes the behavior of reading on io-uring: the object you are reading from also manages the buffer for you, and just gives you a reference into its buffer when you want it. Now that io-uring presents a compelling motivation, we could also provide a BufWrite and AsyncBufWrite to represent the write-side of this operation.
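The callee-managed pattern is easiest to show with the synchronous `std::io::BufRead`, used here as an analogue of AsyncBufRead (the async trait has the same fill-then-consume shape, expressed through polling). The helper `count_newlines` is a hypothetical example function, and any `BufRead` implementor such as `std::io::Cursor` can stand in for an io-uring backed reader.

```rust
use std::io::{self, BufRead};

// Counts newline bytes by looking directly into the reader's own buffer.
fn count_newlines<R: BufRead>(reader: &mut R) -> io::Result<usize> {
    let mut count = 0;
    loop {
        // Borrow whatever the reader has buffered; the caller never
        // allocates or supplies a buffer of its own.
        let chunk = reader.fill_buf()?;
        if chunk.is_empty() {
            return Ok(count); // end of input
        }
        count += chunk.iter().filter(|&&b| b == b'\n').count();
        let used = chunk.len();
        // Tell the reader these bytes are finished so it can recycle
        // the underlying space.
        reader.consume(used);
    }
}
```

Because the data is inspected in place, there is no copy between the reader's buffer and the caller, which is why this shape maps naturally onto buffers the kernel manages.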

With both Linux and Windows favoring a completion based API at the lowest level (libraries like mio already use buffer pools on Windows for this exact reason), which necessitates managing the buffers either in the kernel or as near to the kernel as possible, we will probably see frameworks move toward using buffered IO interfaces for the maximal performance. This is fine: we have both interfaces in std already for a reason. They express different use cases, and in some domains at some times one use case dominates over the other.

Of course, maybe some users do want to control the buffers at a higher level, somewhere in the caller, because they get some other optimization that way. This is inherently in tension with the optimizations you get from letting the kernel control the buffers: we can’t trivially let both parties control the lifecycle of the buffers. Possibly, io-uring libraries will be able to recover whatever optimizations those users need by exposing additional APIs.

There are more interesting questions ahead

So I think this is the solution we should all adopt and move forward with: io-uring controls the buffers, the fastest interfaces on io-uring are the buffered interfaces, the unbuffered interfaces make an extra copy. We can stop being mired in trying to force the language to do something impossible. But there are still many, many interesting questions ahead.

io-uring allows a huge range of flexibility in how the IO is actually managed. Do you have a single thread managing all of your completions, or do you manage completions opportunistically as you submit events? Should we use io-uring only for file system IO and wait for completions on an epoll instance, or move everything to io-uring? How do we integrate well with libraries that are still using epoll? How do you want to sequence IO events together (io-uring provides multiple ways)? Do you have a single io-uring for your program, or many? Are io-uring timeouts better than user space timeouts?

I hope in the long run we make it easy for end users to make choices along these lines, providing builders for constructing a reactor that has the behavior they need for their specific use case. Until we figure it out, exciting times are ahead for async IO on Linux.