Zombie Zen

Canceling I/O in Go Cap'n Proto

This report details an experience I had while writing an RPC system in Go. While Go’s standard I/O libraries make a great many things simple, I found cancellation to be more complex than I would have liked. Parts of this situation have improved in the last couple of Go releases (as I have noted below). I hope this positive trend continues in a way that allows the Go ecosystem to easily propagate cancellation, deadlines, and request values. My intent in this report — as well as the proposal I created back in May 2017 — is to give background and feedback to inform future design decisions. Suggestions for solutions welcome!

(Thanks to Ian Lance Taylor, Damien Neil, Cassandra Salisbury, and Andrew Bonventre for reviewing this report for accuracy and clarity.)

An Overview

For several years, I have been the maintainer of the Go Cap’n Proto library in my spare time. Cap’n Proto specifies both a binary serialization format and an RPC protocol. While this library has shown me a number of places where I think Go can improve (and this may be the first of many experience reports), I’d like to focus on a particular problem that can be explained without any knowledge of the library.

The basic building block of the RPC library is the Conn object. Simplifying a bit, there are two concurrent tasks that operate on Conn:

  1. A goroutine that reads messages from the wire, processes them, then sends back zero or more messages in response. For example, receiving an “RPC return” message consults some internal Conn state, then sends back an “RPC return acknowledged” message before reading the next message. I call this the receive goroutine.
  2. The application sends RPCs, which translate to messages being sent on the wire. The responses from the remote peer are later read by the receive goroutine.

The Closing Act

The problem at hand is how to stop a Conn’s receive loop once it has started. A naive approach would be:
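A minimal sketch of such a naive approach, assuming a hypothetical, heavily simplified Conn that wraps an io.ReadWriteCloser (the type, its fields, and demoNaiveClose are illustrative names, not the library’s actual API). Close simply closes the stream and relies on the blocked Read failing:

```go
package main

import (
	"io"
	"net"
)

// Conn is a hypothetical stand-in for the RPC connection object.
type Conn struct {
	rwc  io.ReadWriteCloser
	done chan struct{} // closed when the receive loop exits
}

// NewConn starts the receive goroutine for rwc.
func NewConn(rwc io.ReadWriteCloser) *Conn {
	c := &Conn{rwc: rwc, done: make(chan struct{})}
	go c.receive()
	return c
}

// receive reads from the wire until Read returns an error.
func (c *Conn) receive() {
	defer close(c.done)
	buf := make([]byte, 4096)
	for {
		if _, err := c.rwc.Read(buf); err != nil {
			return
		}
		// ... process the message and send any responses ...
	}
}

// Close closes the stream and assumes that doing so unblocks the pending Read.
func (c *Conn) Close() error {
	err := c.rwc.Close()
	<-c.done // wait for the receive loop to notice
	return err
}

// demoNaiveClose exercises the sketch over an in-memory net.Pipe, whose
// net.Conn ends do permit Close concurrently with Read.
func demoNaiveClose() bool {
	client, server := net.Pipe()
	defer server.Close()
	c := NewConn(client)
	return c.Close() == nil
}
```

This works over net.Pipe only because net.Conn implementations tolerate a concurrent Close; the sketch makes no attempt to send a final message first.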

However, this approach has two problems:

  1. In the Cap’n Proto RPC specification, implementations are supposed to send an explicit abort message as the last message before intentionally closing a connection. Calling Close shuts down both the reading and writing end of the io.ReadWriteCloser, making it impossible to send the abort message.
  2. Calling Close concurrently with Read on a generic io.ReadCloser is not explicitly declared safe. For example, until Go 1.9, calling Close concurrently with Read on an *os.File (such as with a Unix pipe) would result in a data race (#7970). However, types that implement net.Conn explicitly allow calling Close concurrently with Read.

Another approach could be to close the reading half of the connection first using CloseRead, with the intent to interrupt the Read. This is a bit unwieldy: CloseRead and CloseWrite are only available on TCPConn and UnixConn, and the semantics of how they interact with concurrent operations are not documented as of Go 1.9. More importantly, Read is not the only I/O call in the receive goroutine. Remember that the receive goroutine doesn’t just read messages from the wire: it also sends them. When the CloseRead comes in, the receive goroutine may be in the middle of sending a response to an already received message. It would be desirable to stop it from sending more messages while shutting down.

This is a classic example of what Context is supposed to be used for: propagating cancellation down the call stack. Ideally, I would write:
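As a rough illustration of that wish, here is a sketch that assumes a hypothetical ContextReadWriter interface. Nothing like it exists in the standard library; the interface, the receive function, and the blockedPeer toy implementation are all invented for this example:

```go
package main

import "context"

// ContextReadWriter is hypothetical: a context-aware counterpart to
// io.Reader and io.Writer, which the standard library does not provide.
type ContextReadWriter interface {
	ReadContext(ctx context.Context, p []byte) (int, error)
	WriteContext(ctx context.Context, p []byte) (int, error)
}

// receive is the wished-for receive loop: canceling ctx would interrupt
// whichever I/O call happens to be in progress.
func receive(ctx context.Context, rw ContextReadWriter) error {
	buf := make([]byte, 4096)
	for {
		n, err := rw.ReadContext(ctx, buf)
		if err != nil {
			return err
		}
		// Echoing stands in for "process the message and respond".
		if _, err := rw.WriteContext(ctx, buf[:n]); err != nil {
			return err
		}
	}
}

// blockedPeer is a toy implementation whose reads never produce data.
type blockedPeer struct{}

func (blockedPeer) ReadContext(ctx context.Context, p []byte) (int, error) {
	<-ctx.Done()
	return 0, ctx.Err()
}

func (blockedPeer) WriteContext(ctx context.Context, p []byte) (int, error) {
	if err := ctx.Err(); err != nil {
		return 0, err
	}
	return len(p), nil
}

// demoIdealReceive shows the loop stopping as soon as ctx is canceled.
func demoIdealReceive() bool {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return receive(ctx, blockedPeer{}) == context.Canceled
}
```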

Plumbing the Context through helper functions is tedious but possible. In cases like ReadFull, I would likely have to reimplement the function. The crucial part is actually interrupting the I/O operation. io.Reader and io.Writer do not take in a Context, nor do they provide a simple way to cancel the operation. So how can I accomplish this in Go 1.9?

Starting Simple: Canceling a Write

In the scope of Cap’n Proto, cancelable writes are easy to graft on top of Context-unaware io.Writers, because partial writes corrupt the stream: a cancel signal should be ignored once bytes have hit the wire. Therefore, checking for cancellation before calling Write is enough for this use case. For writers that respect SetWriteDeadline, I can spin up a separate goroutine that listens for the Done signal and sets an immediate deadline to interrupt the Write.
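The technique just described might look roughly like the following. This is a sketch under assumptions, not the library’s code: writeContext, writeDeadliner, and demoWriteCanceled are hypothetical names, and the helper assumes the caller has not set a write deadline of its own, since it clears the deadline on the way out.

```go
package main

import (
	"context"
	"io"
	"net"
	"time"
)

// writeDeadliner is satisfied by net.Conn implementations.
type writeDeadliner interface {
	SetWriteDeadline(t time.Time) error
}

// writeContext (hypothetical helper) writes p to w, giving up if ctx is done.
func writeContext(ctx context.Context, w io.Writer, p []byte) (int, error) {
	// Checking before the Write is enough for writers without deadlines.
	if err := ctx.Err(); err != nil {
		return 0, err
	}
	wd, ok := w.(writeDeadliner)
	if !ok {
		return w.Write(p)
	}
	stop := make(chan struct{})
	watcherDone := make(chan struct{})
	go func() {
		defer close(watcherDone)
		select {
		case <-ctx.Done():
			// An already-expired deadline interrupts the blocked Write.
			wd.SetWriteDeadline(time.Now())
		case <-stop:
		}
	}()
	n, err := w.Write(p)
	close(stop)
	<-watcherDone
	// Clear any deadline the watcher set so later writes are unaffected.
	wd.SetWriteDeadline(time.Time{})
	if err != nil && ctx.Err() != nil {
		err = ctx.Err() // report the cancellation rather than the timeout
	}
	return n, err
}

// demoWriteCanceled blocks on a net.Pipe with no reader, then cancels.
func demoWriteCanceled() bool {
	client, server := net.Pipe()
	defer client.Close()
	defer server.Close()
	ctx, cancel := context.WithCancel(context.Background())
	time.AfterFunc(10*time.Millisecond, cancel)
	_, err := writeContext(ctx, client, []byte("hello")) // nobody reads from server
	return err == context.Canceled
}
```

The watcher goroutine is joined before returning so that a late SetWriteDeadline call cannot leak into a subsequent Write.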

Don’t Interrupt Me; I’m Reading

Canceling a read is much more complicated. Often, I want to cancel the read when there isn’t any data available. io.Readers conventionally return what is buffered instead of waiting for more, so Read returns quickly when data is available; it is the Read with nothing to return that blocks and needs to be interrupted. For readers that implement SetReadDeadline, I can employ the same technique as for writing, but I’m left in a strange place if the io.Reader does not implement SetReadDeadline.

One way I can simulate interrupting the Read call is by always calling Read in another goroutine, and then selecting on Context.Done and a channel that produces the result of the Read call. The caveat is that at some point, the goroutine calling Read needs to be waited upon or else resources leak. In the Cap’n Proto RPC case, canceling the read will likely occur shortly before Conn closes the io.ReadWriteCloser, so any abandoned Read will not need to stay around for long. However, I still have a problem: given a generic io.ReadCloser, I cannot guarantee that it is safe to call Close concurrently with Read. There is fundamentally no way to address this: such an io.Reader cannot be stopped safely.
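A minimal sketch of this select-based approach, assuming a hypothetical helper named abandonableRead (along with readResult and demoAbandonedRead, which are also invented here). It reads into a private buffer so that an abandoned Read cannot scribble on the caller’s slice, and it hands back a channel so the caller can wait for the goroutine once the reader has been closed:

```go
package main

import (
	"context"
	"io"
	"net"
)

type readResult struct {
	n   int
	err error
}

// abandonableRead (hypothetical) performs one Read on r in its own goroutine.
// If ctx is done first, it returns ctx.Err() and abandons the Read; any data
// that Read later returns is lost in this simplified version. The returned
// channel is closed when the goroutine exits.
func abandonableRead(ctx context.Context, r io.Reader, p []byte) (int, <-chan struct{}, error) {
	result := make(chan readResult, 1) // buffered: the goroutine never blocks on send
	finished := make(chan struct{})
	buf := make([]byte, len(p)) // private buffer, copied into p on success
	go func() {
		defer close(finished)
		n, err := r.Read(buf)
		result <- readResult{n, err}
	}()
	select {
	case res := <-result:
		copy(p, buf[:res.n])
		return res.n, finished, res.err
	case <-ctx.Done():
		return 0, finished, ctx.Err()
	}
}

// demoAbandonedRead cancels a Read that has no data, then cleans up.
func demoAbandonedRead() bool {
	client, server := net.Pipe()
	defer server.Close()
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	_, finished, err := abandonableRead(ctx, client, make([]byte, 16))
	client.Close() // unblocks the abandoned Read; safe because client is a net.Conn
	<-finished     // wait so the goroutine does not leak
	return err == context.Canceled
}
```

The cleanup step in the demo is exactly where a generic io.ReadCloser becomes a problem: it depends on Close being safe to call while Read is still blocked.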

There’s one other wrinkle: Read can’t be wrapped in a single function like Write. Write’s contract is to block until its input is written, which means that I can gather up all that I need to write into one byte slice and then write it with a single cancelable call. Read may intentionally return less than requested, so often multiple Read calls are necessary. This is usually handled by routines like io.ReadFull, which I don’t want to give up or duplicate to do context-plumbing. To support my existing io.Reader-based code, as well as to maintain the state of the abandoned goroutine, I had to create an io.Reader that curries the Context:
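One possible shape for such a reader is sketched below. It is an illustration only, covering just the goroutine fallback (no deadline path); ctxReader, its fields, and demoCtxReader are hypothetical names rather than the library’s types. The reader remembers an in-flight background Read so that a later call can pick up its result instead of starting a second one.

```go
package main

import (
	"context"
	"io"
	"net"
)

type readResult struct {
	data []byte
	err  error
}

// ctxReader (hypothetical) binds a Context to an io.Reader so that helpers
// such as io.ReadFull work unchanged. Not safe for concurrent use.
type ctxReader struct {
	ctx     context.Context
	r       io.Reader
	pending chan readResult // non-nil while a background Read is in flight
	rest    []byte          // data read in the background, not yet returned
	err     error           // error to report once rest is drained
}

func (c *ctxReader) Read(p []byte) (int, error) {
	if len(p) == 0 {
		return 0, nil
	}
	if len(c.rest) == 0 {
		if c.pending == nil {
			// Start a background Read into a private buffer.
			c.pending = make(chan readResult, 1)
			buf := make([]byte, len(p))
			go func(ch chan<- readResult) {
				n, err := c.r.Read(buf)
				ch <- readResult{buf[:n], err}
			}(c.pending)
		}
		select {
		case res := <-c.pending:
			c.pending = nil
			c.rest, c.err = res.data, res.err
		case <-c.ctx.Done():
			// Abandon the Read; a later call would pick up its result.
			return 0, c.ctx.Err()
		}
	}
	n := copy(p, c.rest)
	c.rest = c.rest[n:]
	if len(c.rest) > 0 {
		return n, nil
	}
	err := c.err
	c.err = nil
	return n, err
}

// demoCtxReader reads a message split across two writes, then cancels.
func demoCtxReader() (string, bool) {
	client, server := net.Pipe()
	defer client.Close() // also unblocks the abandoned background Read
	defer server.Close()
	go func() {
		server.Write([]byte("he"))
		server.Write([]byte("llo"))
	}()
	ctx, cancel := context.WithCancel(context.Background())
	r := &ctxReader{ctx: ctx, r: client}
	buf := make([]byte, 5)
	if _, err := io.ReadFull(r, buf); err != nil {
		return "", false
	}
	cancel()
	_, err := io.ReadFull(r, buf) // no data is coming; returns once ctx is done
	return string(buf), err == context.Canceled
}
```

Because the Context lives in the struct rather than in the method signature, existing io.Reader-based code needs no changes, at the cost of creating one wrapper per Context.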

What Works

SetReadDeadline and SetWriteDeadline are flexible enough to allow me to graft cancellation and deadline awareness onto io.Readers and io.Writers. However, when I first started looking at this problem (around Go 1.7 and 1.8), this meant pipes (being *os.File) had to be excluded. Starting with Go 1.9, it is now possible to interrupt an *os.File.Read call with *os.File.Close in a safe manner, making the leaky io.Reader fallback safe for any type of file. In Go 1.10, *os.File gains the SetReadDeadline and SetWriteDeadline methods (#22114), which will make pipes work with the timeout interrupt approach. (Go 1.10 also adds os.IsTimeout.)
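For instance, on Go 1.10 or later a pipe read can be interrupted with a deadline, as in the sketch below. The function name demoPipeDeadline is made up for this example, and it assumes a platform where pipes support deadlines; where they do not, SetReadDeadline returns an error, which the function passes back.

```go
package main

import (
	"os"
	"time"
)

// demoPipeDeadline reports whether a Read on an empty pipe fails with a
// timeout once an already-expired read deadline is set.
func demoPipeDeadline() (bool, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return false, err
	}
	defer r.Close()
	defer w.Close()
	// A deadline that has already passed makes Read return immediately.
	if err := r.SetReadDeadline(time.Now()); err != nil {
		return false, err // this file does not support deadlines
	}
	_, err = r.Read(make([]byte, 1))
	return os.IsTimeout(err), nil
}
```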

What Doesn’t Work

My solution works for the narrow problem of propagating Context cancellation and deadlines to specific io.Reader and io.Writer types that I needed, but at the cost of an additional goroutine and complexity. The goroutine seems unnecessary: one could imagine a version of the Go runtime poller that takes in a Done channel instead of a deadline. It’s also not intuitive that you can set an immediate deadline to interrupt a concurrent call. In the first draft of this post, I set deadlines in the future and checked for Context.Done in a loop (thanks to Ian Lance Taylor for pointing out the more efficient implementation above). On the complexity side, there’s a large amount of my custom io.Reader implementation dedicated to handling the abandon-able goroutine. The check for deadline support is fairly boilerplate, and it would be nice to eliminate it.

There’s also complexity that must necessarily be pushed onto users of the library: I have to document that io.Readers passed to the connection must either have a SetReadDeadline method or be safe to call Read concurrently with Close. This stomps on one of the key benefits of Go’s I/O interfaces: anything that implements the interface should just work. Library users now must carefully inspect the io.ReadWriteClosers they pass into my RPC library. And further, it makes it harder to compose I/O types. If the user wanted to create a custom io.ReadWriteCloser that uses *bufio.Reader on another io.Reader, they have to know to wrap SetReadDeadline as well, and it has to “poke through” *bufio.Reader to the underlying io.Reader’s SetReadDeadline. If the I/O operations had a standard way of propagating Context, then this wouldn’t be necessary.
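To make the “poke through” requirement concrete, here is a sketch of what such a user-written wrapper might look like. The bufferedConn type and demoBufferedDeadline are hypothetical; the point is only that SetReadDeadline has to be forwarded by hand, because *bufio.Reader does not expose it.

```go
package main

import (
	"bufio"
	"io"
	"net"
	"time"
)

// bufferedConn is a hypothetical user type that buffers reads from a
// net.Conn while still exposing the connection's read deadline.
type bufferedConn struct {
	*bufio.Reader          // provides the buffered Read
	conn          net.Conn // the underlying connection
}

// Compile-time check that the wrapper is usable as an io.ReadWriteCloser.
var _ io.ReadWriteCloser = (*bufferedConn)(nil)

// SetReadDeadline pokes through the bufio.Reader to the connection.
// Without this method, a deadline-based interrupt cannot reach the Read.
func (b *bufferedConn) SetReadDeadline(t time.Time) error {
	return b.conn.SetReadDeadline(t)
}

func (b *bufferedConn) Write(p []byte) (int, error) { return b.conn.Write(p) }

func (b *bufferedConn) Close() error { return b.conn.Close() }

// demoBufferedDeadline checks that an expired deadline set on the wrapper
// interrupts a buffered Read on the underlying connection.
func demoBufferedDeadline() bool {
	client, server := net.Pipe()
	defer server.Close()
	bc := &bufferedConn{Reader: bufio.NewReader(client), conn: client}
	defer bc.Close()
	bc.SetReadDeadline(time.Now())
	_, err := bc.Read(make([]byte, 1))
	ne, ok := err.(net.Error)
	return ok && ne.Timeout()
}
```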

A Note on Context values

One further observation: the solution I used in Cap’n Proto does not work for Context values. If I needed to propagate a Context value into the io.ReadWriteCloser (say for observability purposes), then I would be out of luck. While I don’t have a concrete need for this within this library, I have seen other places where this would be useful — notably in Google Cloud Storage’s storage.Reader and storage.Writer. The GCS package works around the lack of Context by currying the Context, similar to what this package does. It would be nice to address this use case, but I haven’t thought about it in as much depth.