Tuesday, October 4, 2016

Metadata

As Carlsson notes, big data is not about "big" but about complexity in format and structure. We can approach the format complexity through metadata, which lets us navigate the data sets and determine what they are about.

Two important requirements on experiments are replicability and reproducibility. Replicability refers to the ability to rerun the exact data experiment and produce exactly the same result; it is an aspect of governance, and it is good practice to always have somebody else check the data and its analysis before it is published. Reproducibility refers to the ability to use different data, techniques, and equipment to confirm the same result as previously obtained. We can be confident in a result only after it has been reproduced independently. These two requirements guide us to the kind of metadata we need.

There are three classes of metadata: context, syntax, and semantics.

The context of data refers to how, when, and where it was collected. The context is usually written in a lab book. If we need to replicate an analysis at a later time, the lab book might be unretrievable; therefore, the context of data has to be stored with the data. This can also save a lot of money and time, because some ancillary data we need for an analysis might already be available from a previous experiment, provided we are able to find it.
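One way to keep the context with the data is a small sidecar file that sits next to each data file. The sketch below is only an illustration: the Context fields, the ".meta.json" naming convention, and the function names are assumptions of mine, not an established standard.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class Context:
    """Hypothetical context record: how, when, and where the data was collected."""
    how: str    # instrument or procedure
    when: str   # ISO 8601 timestamp
    where: str  # location or lab


def sidecar_path(data_path: Path) -> Path:
    # Assumed naming convention: "run1.csv" -> "run1.csv.meta.json"
    return data_path.with_name(data_path.name + ".meta.json")


def write_context(data_path: Path, context: Context) -> Path:
    """Write the context next to the data file so the two travel together."""
    path = sidecar_path(data_path)
    path.write_text(json.dumps(asdict(context), indent=2))
    return path


def read_context(data_path: Path) -> Context:
    """Load the context that was stored with the data file."""
    return Context(**json.loads(sidecar_path(data_path).read_text()))
```

Because the sidecar is plain JSON, it can be indexed and searched later without opening the data file itself, which is what makes ancillary data from an earlier experiment findable.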

The syntax of data refers to the format. Analysts spend a large amount of their time wrangling data. When the format of each time series is clearly described, this tedious work can be greatly simplified. During replication and reproduction, it can also help diagnose such frequent errors as the confusion between metric and imperial units of measure. Ideally, we should also store the APIs to the data with the data, because they are part of its syntax.
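As a rough illustration of how a format description helps with the unit problem, here is a minimal sketch in which each column of a time series declares its unit and values are normalized to metres before analysis. The SCHEMA layout, the column names, and the helper functions are hypothetical.

```python
# Hypothetical syntax descriptor for one time series: column names and units.
SCHEMA = {
    "columns": [
        {"name": "time", "unit": "s"},
        {"name": "altitude", "unit": "ft"},
    ]
}

# Multiplicative factors to metres for the length units this sketch knows about.
TO_METRES = {"m": 1.0, "ft": 0.3048, "in": 0.0254}


def column_unit(schema, name):
    """Look up the declared unit of a column, failing loudly if it is undescribed."""
    for column in schema["columns"]:
        if column["name"] == name:
            return column["unit"]
    raise KeyError(f"column {name!r} is not described in the schema")


def to_metres(values, unit):
    """Convert a list of lengths in the given unit to metres."""
    if unit not in TO_METRES:
        raise ValueError(f"unknown length unit: {unit!r}")
    return [v * TO_METRES[unit] for v in values]
```

The point of failing on an undescribed column or an unknown unit is that a silent guess is exactly how metric and imperial values end up mixed.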

The semantics of data refers to its meaning and is the most difficult metadata to produce. We require a unified framework that researchers in all scientific disciplines can use to create consistent, easily searchable metadata. Ease of use is paramount. Because the ability to share data is so important, we want the process of metadata creation to be as painless as possible. This means that we must start by creating an ontology for each domain in which we create data.

Ontologies evolve with time. A big challenge is to track this evolution in the metadata. For example, if we called a technique "machine learning" but then realize the term is too generic and we should call it "cluster analysis", because this is what we were doing anyway, we also have to update the old metadata. Data curation also applies to the metadata.
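A simple way to picture this curation step is a rename table that maps deprecated terms to their current replacements, applied to old records whenever the ontology changes. The sketch below assumes a flat record with a "technique" field and an "ontology_version" field; both the field names and the table are made up for illustration.

```python
# Hypothetical rename table: deprecated term -> current term.
RENAMED_TERMS = {"machine learning": "cluster analysis"}


def current_term(term, renames):
    """Follow the chain of renames until we reach a term that is still current."""
    seen = set()
    while term in renames:
        if term in seen:
            raise ValueError(f"cycle in rename table at {term!r}")
        seen.add(term)
        term = renames[term]
    return term


def migrate(record, renames, ontology_version):
    """Return a copy of a metadata record with up-to-date terms."""
    updated = dict(record)
    updated["technique"] = current_term(record["technique"], renames)
    updated["ontology_version"] = ontology_version
    return updated
```

Recording the ontology version in each record is a design choice that makes it possible to tell later which records have already been migrated. The original record is left untouched, so the old wording can still be archived if needed.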

Figure: The evolution of terms

Some metadata can be computed from the data itself, for example, the descriptive statistics. At NASA, the automatic extraction of metadata from data content is called data archeology.
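To make the idea of computed metadata concrete, here is a minimal sketch that derives descriptive statistics from a series of values using Python's standard statistics module. The describe function and the choice of fields are my own assumptions; this is not how any particular archive does it.

```python
import statistics


def describe(values):
    """Compute descriptive statistics to store as metadata alongside the data."""
    if len(values) < 2:
        raise ValueError("need at least two values for a sample standard deviation")
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }
```

Storing such a summary with the data lets someone judge whether a data set is relevant, or spot an obvious anomaly such as a suspicious maximum, without loading the full data.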