Two Memory Bugs From Ringbahn

While implementing ringbahn, I introduced at least two bugs that caused memory safety errors, resulting in segfaults, allocator aborts, and bizarre undefined behavior. I’ve fixed both bugs that I could find, and now I have no evidence that there are more memory safety issues in the current codebase (though that doesn’t mean there aren’t, of course). I wanted to write about both of these bugs, because they had an interesting thing in common: they were both caused by destructors.

Bug #1: Destructors running after assignment (double free)

This one is a fairly classic bug that anyone writing unsafe code should know about. The buggy code looked like this:

let data = self.read_buf.buf.as_mut_ptr();
let len = self.read_buf.buf.len();
self.completion.cancel(Cancellation::buffer(data, len));
self.read_buf = Buffer::new();

This code is part of the implementation of cancellation on the Ring type (the API has since changed). In ringbahn, if IO that uses a buffer is currently running, but the program cancels its interest in that IO, the completion for that IO is handed a cancellation object, which will be used to clean up the buffer when the IO completes. This code constructs a cancellation, passes it to the completion, and then replaces the buffer with a new one.

That sounds good, but the problem occurs because of the semantics of assignment. When a field is reassigned in Rust, the destructor is run on the previous value of that field. As a result, the buffer we have passed to the cancellation is freed immediately in this code when we reassign the read_buf field. Then, when the IO completes, we free that buffer again, creating a double free.

The same bug occurred twice in the file: a similar bit of code cancelled interest in write events in the same way.

Solution: ptr::write

The solution is to replace the last line with a call to ptr::write:

let data = self.read_buf.buf.as_mut_ptr();
let len = self.read_buf.buf.len();
self.completion.cancel(Cancellation::buffer(data, len));
ptr::write(&mut self.read_buf, Buffer::new());

The function ptr::write behaves a lot like an assignment: it writes its second argument to the address passed as its first argument. However, unlike assignment, it does not run the destructor of the previous value. This is the most important difference between ptr::write and assignment.

It’s unintuitive, but this is a very important trick to remember when writing unsafe code: if you want to reassign a value, but you don’t want the destructor to run on the previous value, you need to use ptr::write!
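To see the difference concretely, here is a minimal, runnable sketch. The Noisy type and its drop counter are illustrative stand-ins, not ringbahn's types: assignment runs the old value's destructor, while ptr::write overwrites it without dropping it.

```rust
use std::cell::Cell;
use std::ptr;

// A toy type whose destructor records each drop, standing in for Buffer.
struct Noisy<'a> {
    drops: &'a Cell<u32>,
}

impl Drop for Noisy<'_> {
    fn drop(&mut self) {
        self.drops.set(self.drops.get() + 1);
    }
}

// Returns the drop count after a plain assignment, and after a ptr::write.
fn demo() -> (u32, u32) {
    let drops = Cell::new(0);
    let mut slot = Noisy { drops: &drops };

    // Plain assignment: the previous value's destructor runs immediately.
    slot = Noisy { drops: &drops };
    let after_assign = drops.get();

    // ptr::write: the previous value is overwritten without being dropped.
    // (In this sketch that value is leaked; in the cancellation code, the
    // Cancellation object is what frees it later.)
    unsafe { ptr::write(&mut slot, Noisy { drops: &drops }) };
    let after_write = drops.get();

    drop(slot);
    (after_assign, after_write)
}

fn main() {
    let (after_assign, after_write) = demo();
    assert_eq!(after_assign, 1); // assignment ran a destructor
    assert_eq!(after_write, 1);  // ptr::write did not run another one
    println!("after assign: {after_assign}, after ptr::write: {after_write}");
}
```

In the buggy code above, the implicit destructor call played the role of that first drop, freeing the buffer that the cancellation was also going to free.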

Bug #2: Destructors referencing a freed object (use after free)

The second bug occurred in this bit of code:

let mut state = self.state.as_ref().lock();
if matches!(&*state, State::Completed(_)) {
    callback.cancel();
    self.deallocate();
} else {
    *state = State::Cancelled(callback);
}

This code implements the Completion::cancel method that we called in the previous code sample. A completion’s state field is a NonNull<Mutex<State>>, pointing to an enum representing the state of the completion. When a completion is cancelled, we acquire the lock on its state and check whether the IO has completed. If it has not, we store the callback there, to be run when it does complete. But if it has completed (meaning the IO completed concurrently with our attempt to cancel interest in it), we call the callback (via its cancel method) and then deallocate the completion, cleaning up all resources related to this IO event.

The problem is that the state variable is a MutexGuard, wrapping our locked access to the state of our completion. The destructor for this guard will not run until the end of the function, and when it does, it will mutate the state of the Mutex it has locked in order to unlock it. But we have already deallocated that mutex by calling self.deallocate. When the guard’s destructor goes to unlock the mutex, it is a use after free, mutating arbitrary memory that could already be in use for some other purpose.

This bug also occurred twice: in another function in the completion module, we similarly deallocated the completion before dropping the mutexguard.

Solution: mem::drop

The solution is to insert a call to mem::drop before deallocating the completion; that way, the destructor is guaranteed to run before the mutex is deallocated. Now the code looks like this:

let mut state = self.state.as_ref().lock();
if matches!(&*state, State::Completed(_)) {
    callback.cancel();
    drop(state);
    self.deallocate();
} else {
    *state = State::Cancelled(callback);
}

This sequences the destructor in the correct order, so that the write to the mutex state occurs before the mutex is freed.
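The same ordering can be sketched in a self-contained way. This is not ringbahn's code: State, cancel, and the heap allocation here are hypothetical stand-ins, and std::sync::Mutex is used (its lock returns a Result, unlike the guard-returning lock in the snippet above).

```rust
use std::sync::Mutex;

// A toy stand-in for the completion's state enum.
#[allow(dead_code)]
enum State {
    Running,
    Completed(i32),
}

// Returns true if the completion had already completed and was deallocated.
fn cancel(state: *mut Mutex<State>) -> bool {
    unsafe {
        let guard = (*state).lock().unwrap();
        if matches!(&*guard, State::Completed(_)) {
            // The guard's destructor writes to the mutex to unlock it,
            // so it must run while the mutex's memory is still alive...
            drop(guard);
            // ...and only then may we free that memory.
            drop(Box::from_raw(state));
            true
        } else {
            // Not completed: the guard unlocks normally at end of scope,
            // and the allocation stays alive.
            false
        }
    }
}

fn main() {
    let completed = Box::into_raw(Box::new(Mutex::new(State::Completed(7))));
    assert!(cancel(completed)); // unlocked first, then freed

    let running = Box::into_raw(Box::new(Mutex::new(State::Running)));
    assert!(!cancel(running)); // nothing freed
    unsafe { drop(Box::from_raw(running)) }; // clean up this sketch's leak
    println!("ok");
}
```

Swapping the two drops in the completed branch reproduces the bug: the mutex would be freed while still locked, and the guard's destructor would write to freed memory.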

Technically, we could equally well mem::forget the MutexGuard: since we are deallocating the Mutex entirely, we know no other attempts to acquire the lock will occur, and unlocking it is strictly wasted effort. This thought only occurred to me while writing this blog post.

What could be done about this?

It was interesting to me that both of the memory bugs I’ve found in my code occurred because of destructors running. In a certain sense this is not surprising: it’s one thing to reason about the code in front of you, but it’s quite another thing to reason about the code that is implicitly inserted into your program for you by the compiler.

In safe code, destructors are awesome, and one of Rust’s biggest superpowers: as Yehuda Katz wrote a long time ago, the ability to just not worry about resource clean up in most cases is awesome. But unsafe code is a different beast, where the guarantees about aliasing get quite mucked up. It would be great to have a lint that I could turn on in some scopes to warn me if any destructors are being inserted into my code.

(Incidentally, for the type theory fans, this lint would effectively turn Rust’s “affine” types into “linear” types. I think Rust’s choice to make non-copyable types “affine” was the correct choice in general, but this demonstrates that in certain contexts the extra check of linearity is very valuable.)