# Two Memory Bugs From Ringbahn
While implementing ringbahn, I introduced at least two bugs that caused memory safety errors, resulting in segfaults, allocator aborts, and bizarre undefined behavior. I’ve fixed both bugs that I could find, and now I have no evidence that there are more memory safety issues in the current codebase (though that doesn’t mean there aren’t, of course). I wanted to write about both of these bugs, because they had an interesting thing in common: they were both caused by destructors.
## Bug #1: Destructors running after assignment (double free)
This one is a fairly classic bug that anyone writing unsafe code should know about. The buggy code looked like this:
```rust
let data = self.read_buf.buf.as_mut_ptr();
let len = self.read_buf.buf.len();
self.completion.cancel(Cancellation::buffer(data, len));
self.read_buf = Buffer::new();
```
This code is part of the implementation of cancellation on the `Ring` type (the API has since changed). In ringbahn, if IO is currently running which is using a buffer, but the program cancels its interest in that IO, the completion for that IO is handed a cancellation object, which will be used to clean up that buffer when the IO completes. This code constructs a cancellation, passes it to the completion, and then replaces the buffer with a new buffer.
That sounds good, but the problem occurs because of the semantics of assignment. When a field is reassigned in Rust, the destructor is run on the previous value of that field. As a result, the buffer we have passed to the cancellation is freed immediately in this code when we reassign the `read_buf` field. Then, when the IO completes, we free that buffer again, creating a double free.
The same bug occurred twice in the file: a similar bit of code cancelled interest in read events in the same way.
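The underlying behavior is easy to observe in isolation. Here is a minimal, safe sketch (the `Buf` type and `DROPS` counter are invented for illustration, not ringbahn types) showing that a plain assignment runs the destructor of the previous value at the moment of assignment:

```rust
use std::cell::Cell;

thread_local! {
    // Counts how many times Buf's destructor has run (invented for this sketch).
    static DROPS: Cell<u32> = Cell::new(0);
}

struct Buf;

impl Drop for Buf {
    fn drop(&mut self) {
        DROPS.with(|d| d.set(d.get() + 1));
    }
}

// Returns how many drops have run immediately after a plain reassignment.
fn drops_after_reassignment() -> u32 {
    DROPS.with(|d| d.set(0));
    let mut slot = Buf;
    slot = Buf; // assignment runs the destructor of the previous value right here
    let seen = DROPS.with(|d| d.get());
    drop(slot);
    seen
}

fn main() {
    // The old value was dropped as part of the assignment itself.
    assert_eq!(drops_after_reassignment(), 1);
}
```

In the buggy code above, that implicit drop is exactly the first of the two frees.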
## Solution: `ptr::write`
The solution is to replace the last line with a call to `ptr::write`:
```rust
let data = self.read_buf.buf.as_mut_ptr();
let len = self.read_buf.buf.len();
self.completion.cancel(Cancellation::buffer(data, len));
ptr::write(&mut self.read_buf, Buffer::new());
```
The function `ptr::write` behaves a lot like an assignment: it writes the value of the second argument to the address passed as the first argument. However, unlike assignment, it does not run the destructor of the previous value. This is the biggest and most important difference between `ptr::write` and an assignment.
It’s unintuitive, but this is a very important trick to remember when writing unsafe code: if you want to reassign a value, but you don’t want the destructor to run on the previous value, you need to use `ptr::write`!
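For contrast with the previous sketch, here is the same invented `Buf`-and-counter setup showing that `ptr::write` overwrites the slot without running the old value’s destructor:

```rust
use std::cell::Cell;
use std::ptr;

thread_local! {
    // Counts how many times Buf's destructor has run (invented for this sketch).
    static DROPS: Cell<u32> = Cell::new(0);
}

struct Buf;

impl Drop for Buf {
    fn drop(&mut self) {
        DROPS.with(|d| d.set(d.get() + 1));
    }
}

// Returns how many drops have run immediately after overwriting with ptr::write.
fn drops_after_ptr_write() -> u32 {
    DROPS.with(|d| d.set(0));
    let mut slot = Buf;
    unsafe {
        // Overwrites `slot` WITHOUT dropping the previous value.
        // (The old Buf is leaked; in this sketch that is harmless.)
        ptr::write(&mut slot, Buf);
    }
    let seen = DROPS.with(|d| d.get());
    drop(slot);
    seen
}

fn main() {
    // No destructor ran during ptr::write.
    assert_eq!(drops_after_ptr_write(), 0);
}
```

This is precisely what the fix relies on: ownership of the old buffer now rests solely with the cancellation, which frees it exactly once.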
## Bug #2: Destructors referencing a freed object (use after free)
The second bug occurred in this bit of code:
```rust
let mut state = self.state.as_ref().lock();
if matches!(&*state, State::Completed(_)) {
    callback.cancel();
    self.deallocate();
} else {
    *state = State::Cancelled(callback);
}
```
This code implements the `Completion::cancel` method that we called in the previous code sample. A completion’s `state` field is a `NonNull<Mutex<State>>`, pointing to an enum representing the state of the completion. When a completion is cancelled, we acquire a lock on the completion and then check if it has completed. If it has not completed, we store the callback there to run when it has completed. But if it has completed (meaning the IO completed concurrently with our attempt to cancel interest in it), we call the callback (with its `cancel` method) and then deallocate the completion, cleaning up all resources related to this IO event.
The problem is that the `state` variable is a `MutexGuard`, wrapping our locked access to the state of our completion. The destructor for this `state` variable will not run until the function completes, and when it does it will mutate the state of the `Mutex` it has locked. But we have deallocated that mutex when we call `self.deallocate`. This means that when the destructor goes to unlock the mutex, it is a use after free, mutating arbitrary state that could have been used for some other purpose at this point.
This bug also occurred twice: in another function in the completion module, we similarly deallocated the completion before dropping the `MutexGuard`.
## Solution: `mem::drop`
The solution is to insert a call to `mem::drop` before deallocating the completion; that way the destructor is guaranteed to run before the mutex is deallocated. Now the code looks like this:
```rust
let mut state = self.state.as_ref().lock();
if matches!(&*state, State::Completed(_)) {
    callback.cancel();
    drop(state);
    self.deallocate();
} else {
    *state = State::Cancelled(callback);
}
```
This sequences the destructor in the correct order, so that the write to the mutex state occurs before the mutex is freed.
Technically, we could equally well `mem::forget` the `MutexGuard`: since we are deallocating the `Mutex` entirely, we know no other attempts to acquire the lock will occur, and unlocking it is strictly wasted effort. This thought only occurred to me while writing this blog post.
## What could be done about this?
It was interesting to me that both of the memory bugs I’ve found in my code occurred because of destructors running. In a certain sense this is not surprising: it’s one thing to reason about the code in front of you, but it’s quite another thing to reason about the code that is implicitly inserted into your program for you by the compiler.
In safe code, destructors are awesome, and one of Rust’s biggest superpowers: as Yehuda Katz wrote a long time ago, the ability to just not worry about resource clean up in most cases is awesome. But unsafe code is a different beast, where the guarantees about aliasing get quite mucked up. It would be great to have a lint that I could turn on in some scopes to warn me if any destructors are being inserted into my code.
(Incidentally, for the type theory fans, this lint would effectively turn Rust’s “affine” types into “linear” types. I think Rust’s choice to make non-copyable types “affine” was the correct choice in general, but this demonstrates that in certain contexts the extra check of linearity is very valuable.)