May 28th, 2005

Choosing a data format

It’s no secret that I hate just about every data format out there. SQL is non-portable. XML is usually Wrong: abused, badly laid out, and not appendable. YAML doesn’t include enough structure. Text files are usually too ad hoc and hard to index. RDF is beautiful, but hard to parse.

If you need microformat interoperability (that is, the ability to yank little pieces of information out of larger documents or data structures), look at XML and especially RDF/XML, and take note of existing (meta)data standards like Dublin Core. RDF has a whole suite of well-specified semantic types, which are importable by reference to the specs. Parsing is exceedingly well defined, but hell to do in an ad-hoc way. This plays nice with some DTD-based XML, and most DTD-less XML. The lack of easy appending isn’t often a problem in cases like this, especially with “documents” that are usually updated whole rather than streams of data. XML is good at corruption detection because of its well-formedness rules.
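As a rough illustration of yanking Dublin Core fields out of a larger document, here is a minimal sketch using Python's standard library. The sample document, its URL and its field values are made up, and the function only understands this one serialization shape (dc:* elements directly inside rdf:Description). Real RDF/XML allows many equivalent spellings, which is exactly why ad-hoc parsing is painful.

```python
import xml.etree.ElementTree as ET

# Hypothetical RDF/XML document; the resource URL and values are invented.
RDF_DOC = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/posts/data-formats">
    <dc:title>Choosing a data format</dc:title>
    <dc:date>2005-05-28</dc:date>
  </rdf:Description>
</rdf:RDF>
"""

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"


def dublin_core_fields(xml_text):
    """Return {field: value} for dc:* children of each rdf:Description.

    Assumption: properties are written as child elements, not attributes.
    """
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    dc_prefix = "{" + DC_NS + "}"
    fields = {}
    for desc in root.iter("{" + RDF_NS + "}Description"):
        for child in desc:
            if child.tag.startswith(dc_prefix):
                name = child.tag[len(dc_prefix):]
                fields[name] = (child.text or "").strip()
    return fields


fields = dublin_core_fields(RDF_DOC)
```

Because the parser rejects anything that is not well-formed, a truncated or corrupted file fails loudly instead of yielding partial data.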

If you need streams of data, look at separate files for each entry, or a text file you can append to. If indexing or querying is important, look at SQL, too; sometimes it’s really the best tool for the job. SQL lets you formally specify data types, which is good, but relations are left up to the query, which can be bad: there’s no universal namespace, so data formats end up being very ad hoc. Binary storage like a SQL database’s isn’t resilient in the case of format upgrades or disk errors, so keep backups. Good ones.
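The append-to-a-text-file option might look something like the sketch below. The record layout (one tab-separated line per entry) and the function names are assumptions chosen for illustration, not a recommendation of a particular format.

```python
import os
import tempfile


def append_entry(path, fields):
    """Append one record as a single tab-separated line."""
    for field in fields:
        if "\t" in field or "\n" in field:
            raise ValueError("fields must not contain tabs or newlines")
    # Mode "a" only ever adds to the end; existing entries are never rewritten.
    with open(path, "a", encoding="utf-8", newline="\n") as log:
        log.write("\t".join(fields) + "\n")


def read_entries(path):
    """Read every record back as a list of field lists."""
    with open(path, encoding="utf-8", newline="\n") as log:
        return [line.rstrip("\n").split("\t") for line in log]


# Hypothetical usage with a throwaway file.
log_path = os.path.join(tempfile.mkdtemp(), "entries.log")
append_entry(log_path, ["2005-05-28", "first entry"])
append_entry(log_path, ["2005-05-29", "second entry"])
entries = read_entries(log_path)
```

Each write touches only the tail of the file, so a crash mid-append can damage at most the last line.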

If you want to import the semantics of various IETF RFCs, look at MIME-formatted files, and at using HTTP as a transport. There are relatively easy conversions between mail and web, and parsers are very easy to implement. Files are text files, so with a human to go through the data, even things like having sectors missing from your disk may not render the data entirely unusable. Searching with linear search programs like grep is easy, and if metadata is in the header of the data, it’s relatively easy to match as well.
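To give a feel for how little work a MIME-style file takes, here is a small sketch using Python's standard email package. The entry itself is invented (the time of day in the Date header in particular is made up), and parse_entry is a hypothetical helper name.

```python
from email import message_from_string

# A hypothetical entry stored as headers, a blank line, then the body.
RAW_ENTRY = (
    "Subject: Choosing a data format\n"
    "Date: Sat, 28 May 2005 12:00:00 +0000\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "It's no secret that I hate just about every data format out there.\n"
)


def parse_entry(raw):
    """Split a MIME-formatted entry into (headers dict, body text)."""
    msg = message_from_string(raw)
    return dict(msg.items()), msg.get_payload()


headers, body = parse_entry(RAW_ENTRY)
```

Since the metadata sits in plain header lines at the top of the file, something like `grep '^Subject:'` over a directory of entries finds it without any parser at all.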

If you want to mark up arbitrary text with arbitrary annotations, look at XML with mixed-content DTDs like XHTML. Mixing sets of annotations is relatively easy for many tasks, and the ability to just mark relevant bits of text with semantic annotation makes for very strong, parseable documents. Pure XHTML 1.0, 1.1 and 2.0 are all easily indexed as well by crawlers and indexers like Google, Lucene and HT://Dig.
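A minimal sketch of what mixed content buys you: the same string yields both the plain text (what an indexer wants) and the annotated spans (what a semantic tool wants). The fragment below is a made-up, namespace-free snippet in the style of XHTML, not a complete XHTML document.

```python
import xml.etree.ElementTree as ET

# Hypothetical mixed-content fragment: running text with inline annotations.
MARKUP = (
    '<p>Files are <em class="format">text</em> files, '
    "so <code>grep</code> works.</p>"
)


def annotations(xml_text):
    """Return (tag, annotated text) for every inline element."""
    root = ET.fromstring(xml_text)
    return [(el.tag, el.text) for el in root.iter() if el is not root]


def plain_text(xml_text):
    """Return the text with all markup stripped, in document order."""
    return "".join(ET.fromstring(xml_text).itertext())


spans = annotations(MARKUP)
text = plain_text(MARKUP)
```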

If you need simple config files, first see if your implementation language itself serves as a suitable choice. If not, look at YAML: its utter simplicity at representing hashes and lists makes it a natural choice for very simple config data. (I find that most software configuration can be simplified to a hash.)
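Here is one way the "use the implementation language" option could look when that language is Python: the config file is a single dict literal, read with ast.literal_eval so that only literals are accepted and no code runs. The keys and values are invented for the example.

```python
import ast

# Hypothetical config file contents: one hash of settings, nothing more.
CONFIG_TEXT = """
{
    "listen_port": 8080,
    "log_file": "/var/log/example.log",
    "allowed_hosts": ["localhost", "example.org"],
}
"""


def load_config(text):
    """Parse a config written as a single Python dict literal."""
    # literal_eval accepts only literals (dicts, lists, strings, numbers...),
    # so function calls or imports in the file are rejected, not executed.
    config = ast.literal_eval(text.strip())
    if not isinstance(config, dict):
        raise ValueError("config must be a single dict literal")
    return config


config = load_config(CONFIG_TEXT)
```

The same hash-of-lists shape maps directly onto YAML if the implementation language turns out to be a poor fit.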

If you’re dealing with encryption, look into the bulk of ASN.1 instead of XML, or at least get a handle on treating data with binary exactness, and not being loose with whitespace and newline translations. Have some idea what your canonical format is.
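The newline problem is easy to demonstrate. In the sketch below, two copies of the "same" text differ only in line endings, and their digests differ as a result. The canonicalize function shows one possible canonical form (LF newlines, no trailing whitespace, a single final newline, UTF-8); that choice is an assumption made for the example, not a standard.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 over the exact bytes given; no normalization."""
    return hashlib.sha256(data).hexdigest()


def canonicalize(text: str) -> bytes:
    """Reduce text to one agreed byte form before hashing or signing."""
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    stripped = [line.rstrip() for line in lines]
    return ("\n".join(stripped).rstrip("\n") + "\n").encode("utf-8")


# The same message as saved on two systems with different newline habits.
unix_copy = "signed message\nsecond line\n"
dos_copy = "signed message\r\nsecond line\r\n"

raw_match = digest(unix_copy.encode("utf-8")) == digest(dos_copy.encode("utf-8"))
canonical_match = digest(canonicalize(unix_copy)) == digest(canonicalize(dos_copy))
```

Here raw_match is False and canonical_match is True: a signature over the raw bytes breaks on a newline translation, while one over the canonical form survives it.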

If you’re building compound documents, especially on the Macintosh, take a hint from MacOS and use a directory instead of a file. OS support for directories as opaque objects is getting better.

In all these cases, figure out what your atomic edit is. If an entire document gets saved at once, XML might be fine. If edits are always appends, XML is going to be ugly. If you need random access and concurrent updates, look at one file per entry, or SQL, since concurrency is well supported in these cases.
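When the atomic edit is "save the whole document", a common approach is to write the new version to a temporary file and then rename it over the old one, so a reader sees either the old document or the new one and never a half-written mix. The sketch below assumes the temporary file lives on the same filesystem as the target; the function name and sample documents are hypothetical.

```python
import os
import tempfile


def save_whole_document(path, text):
    """Replace the file at path with text in a single atomic step."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the rename stays on one
    # filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # the old or the new file, never a mix
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


# Hypothetical usage: two whole-document saves in a throwaway directory.
doc_path = os.path.join(tempfile.mkdtemp(), "post.xml")
save_whole_document(doc_path, "<post><title>Draft</title></post>")
save_whole_document(doc_path, "<post><title>Choosing a data format</title></post>")
with open(doc_path, encoding="utf-8") as saved:
    final_text = saved.read()
```

An append-only stream has no such step, because the atomic edit there is a single new line at the end of the file.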