Tuesday, March 22, 2011

Little structured data

Today we are mostly interested in large data sets, like the megaimages we mentioned recently. Moreover, we are happy with flat unstructured data, which we comb and mine as needed. Personally, I prefer navigation and structure, but that is a matter of taste. Anyway, what is the trend for little data?

When we roll back the time machine to the days of computer assisted instruction (CAI), one of the early systems was Plato. A reason for its failure was that students tended to get lost in its unstructured graph of lessons. This problem was solved in Nievergelt's XS-1 by using a tree data structure, with the property of having a root and only one path between any two nodes.

In those days, structure was particularly important in documents, because the typewriter had let authors write in free style, while for technical documents a regimented style is much more efficient. This was first achieved through style sheets, as in TeX and Cedar's Tioga, then through more comprehensive systems like SGML, where content was decoupled from appearance, with a document type definition (DTD) encoding the enforced document structure. Most computer manuals were written in SGML, often using the FrameMaker application.

Then came the Web with HTML. The marriage of SGML with HTML begot XML. When the researchers at Sun were looking for a system-independent means to exchange data among Java applications running on a heterogeneous network of machines, they chose XML, and Java was quickly endowed with rich XML manipulation libraries.

It did not take long to realize that an XML Document Type Definition (DTD) had a strong similarity to the schema of relational databases, and the jump from XML with DTD to XML with schema allowed for the system-independent exchange of databases.
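As a rough illustration of exchanging table-like data through XML, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The table contents and the element names (bookmarks, bookmark, id) are made up for the example, and ElementTree does not validate against a DTD or a schema, so this shows only the serialization round trip, not the enforcement of structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical table: rows of (id, title). Names are illustrative only.
rows = [(1, "SQLite"), (2, "XML")]


def rows_to_xml(rows):
    """Serialize the rows as one <bookmark> element per row."""
    root = ET.Element("bookmarks")
    for row_id, title in rows:
        item = ET.SubElement(root, "bookmark", id=str(row_id))
        item.text = title
    return ET.tostring(root, encoding="unicode")


def xml_to_rows(text):
    """Parse the XML text back into a list of (id, title) tuples."""
    root = ET.fromstring(text)
    return [(int(item.get("id")), item.text) for item in root.iter("bookmark")]


document = rows_to_xml(rows)     # what the sending system would transmit
restored = xml_to_rows(document)  # what the receiving system would rebuild
```

The receiving side has to parse the whole document before it can use a single row, which is the cost at issue on small devices.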

Unfortunately, exchanging data as XML requires substantial parsing and can be slow on the mobile devices that are the platform of choice today. An alternative method for exchanging a database is to dump it as a sequence of SQL commands, which can be fetched from a completely different system and loaded by simply executing the dump file.

The issue for mobile devices like smartphones is footprint, because large software components eat too much battery power. Today's relational database system du jour is SQLite. It was invented in 2000 by Richard Hipp of General Dynamics, who implemented it on HP-UX for deployment on board guided missile destroyer ships.

With a footprint of only about 275 KB, it can fit everywhere a little structured data is needed, and with the whole database in a single platform-independent file, the data is easy to share, especially when you dump it as a sequence of SQL commands.
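The dump-and-reload idea can be sketched with the sqlite3 module from Python's standard library. The bookmarks table and its contents are hypothetical, and both databases live in memory here; in practice the dump text would be written to a file and fetched by the other system.

```python
import sqlite3

# Source database with a small hypothetical table.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE bookmarks (id INTEGER PRIMARY KEY, title TEXT)")
source.executemany(
    "INSERT INTO bookmarks (title) VALUES (?)",
    [("SQLite",), ("XML",)],
)
source.commit()

# Dump the whole database as a sequence of SQL commands.
dump_sql = "\n".join(source.iterdump())

# Load the dump into a fresh database by simply executing it.
target = sqlite3.connect(":memory:")
target.executescript(dump_sql)

restored = target.execute(
    "SELECT id, title FROM bookmarks ORDER BY id"
).fetchall()
```

The dump is plain SQL text, so the receiving side needs only a database engine that can execute it.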

No wonder it is now showing up under the hood in so many places. Here is a small sample:

  • Mozilla applications for storing bookmarks, cookies, contacts etc.
  • Lightroom and Aperture for storing the photograph metadata
  • Skype, iTunes, Apple Mail, Opera, McAfee Antivirus, etc.

SQLite is so small that it has been integrated in many systems, so it is available without requiring libraries and software installations: HP's webOS, Apple's Core Data, Adobe Integrated Runtime (AIR), PHP, Python, Symbian, Maemo, Android, BlackBerry, and many more. It is open source and can be downloaded from sqlite.org.