November 3rd, 2005

Efficiently implementing content-negotiation

The only feature of Apache that I miss when using Lighttpd is content negotiation.

In a nutshell, content negotiation takes an abstract resource URL like http://example.org/2005/chart and maps it to one of several files on the filesystem, chosen by comparing the MIME types of the available files against the MIME types in the requester’s Accept: header.

Given that URL, an Accept: header of image/svg+xml; q=1, image/*; q=0.5 and the files /www/example.org/2005/chart.png and /www/example.org/2005/chart.svg, the server would see that there is an image/svg+xml file, which matches the highest preference, and return that along with a Vary: Accept header.
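
The selection step in that example can be sketched in a few lines of Python. This is only a rough illustration, not how Apache does it: the function names (parse_accept, specificity, quality, negotiate) are made up for this post, only the q parameter is interpreted, and a missing Accept: header is treated as */*.

```python
def parse_accept(header):
    """Parse an Accept header into a list of (media_range, q) pairs."""
    ranges = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_range = fields[0].lower()
        if not media_range:
            continue
        q = 1.0  # q defaults to 1 when absent
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value.strip())
                except ValueError:
                    q = 0.0  # assumption: an unparsable q means "not acceptable"
        ranges.append((media_range, q))
    # Assumption: no usable Accept header means anything is acceptable.
    return ranges or [("*/*", 1.0)]


def specificity(media_range, mime_type):
    """Return 3 for an exact match, 2 for type/*, 1 for */*, 0 for no match."""
    if media_range == mime_type:
        return 3
    if media_range == "*/*":
        return 1
    range_type, _, range_subtype = media_range.partition("/")
    if range_subtype == "*" and range_type == mime_type.partition("/")[0]:
        return 2
    return 0


def quality(ranges, mime_type):
    """Return the q of the most specific range matching mime_type, else 0.0."""
    best_spec, best_q = 0, 0.0
    for media_range, q in ranges:
        spec = specificity(media_range, mime_type)
        if spec > best_spec:
            best_spec, best_q = spec, q
    return best_q


def negotiate(accept_header, available):
    """Pick the available MIME type with the highest q, or None if none fit."""
    ranges = parse_accept(accept_header)
    best_type, best_q = None, 0.0
    for mime_type in available:
        q = quality(ranges, mime_type)
        if q > best_q:
            best_type, best_q = mime_type, q
    return best_type


choice = negotiate("image/svg+xml; q=1, image/*; q=0.5",
                   ["image/png", "image/svg+xml"])
```

With the header and files from the example, choice comes out as image/svg+xml: the PNG only matches image/* at q=0.5, while the SVG matches its exact type at q=1.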

The efficiency problems come from needing to know the available files and their MIME types. At the most efficient, an expensive scan for available files will happen for one hit, and be cached for subsequent hits. However, cache consistency is a difficult problem, and many of the solutions are as inefficient as no caching at all. Very recent Linux kernels support the inotify mechanism, which could watch directories efficiently and keep the cache consistent, but it’s not a generally portable solution.
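
One portable compromise, offered as an idea rather than a recommendation, is to cache each directory listing and revalidate it against the directory’s modification time. The sketch below assumes that adding, removing or renaming an entry updates the directory’s mtime (true on typical POSIX filesystems, though coarse timestamp granularity can hide a change made right after a scan). It still costs one stat() per hit, so it only pays off when the listing is much more expensive than that. The names _listing_cache and cached_listing are hypothetical.

```python
import os

# Hypothetical cache: directory path -> (mtime in ns, tuple of entry names)
_listing_cache = {}


def cached_listing(directory):
    """Return the directory's entry names, rescanning only if its mtime changed."""
    # Stat first: if the directory changes between this stat and the scan,
    # the stored mtime is stale and the next call rescans.
    mtime = os.stat(directory).st_mtime_ns
    cached = _listing_cache.get(directory)
    if cached is not None and cached[0] == mtime:
        return cached[1]
    names = tuple(sorted(os.listdir(directory)))
    _listing_cache[directory] = (mtime, names)
    return names
```

A cache like this never needs explicit invalidation, but it never shrinks either; a real server would want a size bound.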

The simplest implementation would take the URL and check whether it’s immediately satisfiable; that path has the same efficiency as normal serving without content negotiation. If it’s not found, the server must perform a directory listing (one open call, some read calls). This gets expensive for huge directories (over 1000 files or so, though the expense depends on the type of filesystem). Candidates are then selected, mapped to MIME types, and ranked according to the criteria in the HTTP spec. Unless there are extremely many alternatives or an absurdly large Accept: header, this isn’t computationally intensive: on the order of O(m * n), for m candidate files and n entries in the Accept: header.
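
Here is a rough Python sketch of the lookup half of that: try the exact path first, and only list the directory on a miss. The function name find_candidates, the rule that a candidate is any file named "<base>.<something>", and the use of Python’s mimetypes table for the type mapping are all assumptions made for illustration. It also does no path sanitising, which a real server would need.

```python
import mimetypes
import os


def find_candidates(docroot, url_path):
    """Return {mime_type: file_path} for files that could satisfy url_path."""
    path = os.path.join(docroot, url_path.lstrip("/"))
    if os.path.isfile(path):
        # Fast path: the URL names a real file, so no negotiation is needed.
        mime_type = mimetypes.guess_type(path)[0] or "application/octet-stream"
        return {mime_type: path}

    # Slow path: list the directory and keep entries named "<base>.<ext>".
    directory, base = os.path.split(path)
    candidates = {}
    try:
        entries = os.scandir(directory)
    except OSError:
        return candidates  # missing or unreadable directory: no candidates
    with entries:
        for entry in entries:
            if not entry.name.startswith(base + "."):
                continue
            mime_type = mimetypes.guess_type(entry.name)[0]
            if mime_type is not None:
                # Keep the first file seen for each type.
                candidates.setdefault(mime_type, entry.path)
    return candidates
```

The resulting dictionary is what would be handed to the ranking step: m keys compared against the n ranges in the Accept: header.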

However, to send a Content-Length: header, at least one stat() call must be made, and to handle dangling symbolic links, a stat() is needed for every file under consideration (though since dangling links are an edge case, this could be implemented as a fallback rather than as normal operation).
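
The fallback idea might look like the following sketch, which assumes the candidates arrive already sorted by preference. The name first_servable is made up. In the normal case it makes exactly one stat() call, on the preferred file, and it only moves on to the next candidate when that call fails (as it does for a dangling symlink) or the target isn’t a regular file.

```python
import os
import stat


def first_servable(paths):
    """Return (path, size) for the first path that is a regular file, else None.

    paths is assumed to be ordered from most to least preferred.
    """
    for path in paths:
        try:
            st = os.stat(path)  # follows symlinks, so a dangling link raises
        except OSError:
            continue  # fall back to the next candidate
        if stat.S_ISREG(st.st_mode):
            return path, st.st_size  # st_size feeds Content-Length
    return None
```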

The biggest issues are the ones dealing with unusually large directories, where a linear scan of the listing can take a long time, and, if caching is performed, how to keep the cache consistent and still gain from it.

Thoughts are always welcome. I’ll probably implement this in Lighttpd at some point.