Accept-Encoding and You

I wanted to add basic HTTP compression to my blog server, and after doing a bit of research, I was surprised to learn how quirky this part of the HTTP spec is. At a high level, this is what happens:

  1. The User Agent makes a request for content and specifies a list of acceptable content encodings in the Accept-Encoding header
  2. The server parses the list of possible encodings
  3. The server chooses an encoding or falls back to identity encoding (no special encoding)
  4. The server writes the chosen encoding to the Content-Encoding header, encodes the content, and sets Content-Length to the number of bytes in the encoded content
  5. The User Agent reads the Content-Encoding header in the response, reads up to Content-Length bytes of data and decodes the data with the corresponding encoding

As with many things in HTTP, this is conceptually straightforward and builds on top of other familiar concepts, but there are a number of quirks. The first thing that gave me pause was the different semantics of the Content-Length header. Go’s net/http handles this for me, but as I was verifying the behavior in curl, I did a double-take when I noticed the discrepancy in the “size” of the content. This is only confusing because most User Agents will decode the response for you, which can make it appear as though the content exceeds the Content-Length.

The main quirk I wanted to flag is the idea of “quality values.” If you look at some naive examples online, you’ll see something that amounts to:

headers['Accept-Encoding'].split(',').some(h => h.trim() == 'gzip')

Or worse:

headers['Accept-Encoding'].includes('gzip')

The spec I linked above is a fair bit more complicated than either of these examples supports. In the simplest form, the first example will work since the header will just be a comma-delimited list of encodings. Each encoding, however, can specify a “quality value” like the standard Accept header, which means you can technically see something more like this:

Accept-Encoding: deflate, zstd, gzip;q=0.8, br;q=0.7, identity;q=0.6

By default, we are to assume all listed encodings have a “quality” of 1.0, and the server is allowed to select among these as it sees fit. If a quality is specified, then we also need to treat the quality values as weights. In this example, if the server doesn’t support deflate or zstd, it is expected to choose the highest-quality encoding that is actually supported, or fall back to identity if nothing else is supported. While not rocket science, this requires a fair bit more effort to parse than the first code example, though the second will still function haphazardly.
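
A quality-aware version isn't much code. This is a sketch, not a full implementation of the spec's grammar; `negotiate` is a name I made up, and `supported` is whatever subset of encodings your server actually implements:

```javascript
// Hypothetical negotiate(): parse Accept-Encoding with q-values and
// pick the highest-weighted encoding the server supports, falling
// back to identity when nothing matches.
function negotiate(acceptEncoding, supported) {
  const entries = (acceptEncoding || '')
    .split(',')
    .map(part => {
      const [name, ...params] = part.trim().split(';');
      let q = 1.0; // unspecified quality defaults to 1.0
      for (const p of params) {
        const [key, value] = p.trim().split('=');
        if (key === 'q') q = parseFloat(value);
      }
      return { name: name.trim().toLowerCase(), q };
    })
    .filter(e => e.name && e.q > 0); // q=0 means "not acceptable"

  // Highest quality first; ties keep the client's original order
  // (Array.prototype.sort is stable).
  entries.sort((a, b) => b.q - a.q);

  for (const { name } of entries) {
    if (supported.includes(name)) return name;
  }
  return 'identity';
}
```

With the header from the example above and a server that only supports gzip and br, this picks gzip, since the unweighted deflate and zstd entries default to 1.0 but aren't supported.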

In practice, either code example listed is good enough to support most of the HTTP wild west, even if they aren’t spec-compliant. Firefox does not specify any quality values, nor does curl in my testing. Similarly, the likelihood of a User Agent including the substring gzip in their Accept-Encoding line for any reason other than specifying that gzip is supported is almost zero. As a server author, you can only support a subset of encodings anyhow, so in the vast majority of cases you’ll probably scan left-to-right and simply pick the first encoding that is possible. You would need to be a particularly evil User Agent to intentionally send this list in the wrong order.

There’s another edge case where the User Agent may forbid the identity encoding altogether by explicitly setting its quality value to 0, and I didn’t bother supporting this for my server as it makes very little sense. That would imply that there is a User Agent that cannot function without encoded content, but content encoding is a fully optional feature in almost all cases, and it is already a layer on top of identity-encoded content.

Anyhow, this is probably not relevant to the vast majority of developers, but I thought it was interesting and wanted to capture my thoughts.