The Mostly Color Channel: Metadata

Tuesday, October 4, 2016

Metadata

As Carlsson notes, big data is not about "big" but about complexity in format and structure. We can approach the format complexity through metadata, which lets us navigate the data sets and determine what they are about.

Two important requirements on experiments are replicability and reproducibility. Replicability refers to the ability to rerun the exact same data experiment and obtain exactly the same result; it is an aspect of governance, and it is good practice to always have somebody else check the data and its analysis before it is published. Reproducibility refers to the ability to use different data, techniques, and equipment to confirm a previously obtained result. We can be confident in a result only after it has been reproduced independently. These two requirements guide us to the kind of metadata we need.
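One small piece of metadata that can support replication is a fingerprint of the exact inputs of an analysis. The following is a minimal sketch, assuming the data and the analysis parameters can be serialized as JSON; the function name `fingerprint` and the record layout are hypothetical, not something the post prescribes.

```python
import hashlib
import json


def fingerprint(data, parameters):
    """Hash the data and analysis parameters in a canonical JSON form."""
    # sort_keys and fixed separators make the serialization independent of
    # dictionary ordering and whitespace, so equal inputs hash identically.
    canonical = json.dumps(
        {"data": data, "parameters": parameters},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

If the fingerprint stored with a published result matches the one computed before a rerun, the person checking the analysis knows they are starting from the same data and settings.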

There are three classes of metadata: context, syntax, and semantics.

The context of data refers to how, when, and where it was collected. The context is usually written in a lab book. If we need to replicate an analysis at a later time, the lab book might be unretrievable; therefore, the context of the data has to be stored with the data. This can also save a lot of money and time, because some ancillary data we need for an analysis might already be available from a previous experiment, provided we are able to find it.
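Storing the context with the data can be as simple as serializing both into one document. The sketch below is one possible layout, assuming JSON as the container; the `Context` fields and the `bundle`/`unbundle` helpers are hypothetical names chosen for illustration.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Context:
    """Hypothetical context record: how, when, and where data was collected."""
    how: str    # instrument or procedure
    when: str   # ISO 8601 timestamp
    where: str  # location or lab


def bundle(values, context):
    """Serialize the data and its context together as one JSON document."""
    return json.dumps({"context": asdict(context), "data": list(values)})


def unbundle(text):
    """Recover the data and its context from the JSON document."""
    doc = json.loads(text)
    return doc["data"], Context(**doc["context"])
```

Because the context travels inside the same document, a later search over stored experiments can match on the `how`, `when`, or `where` fields without access to the original lab book.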

The syntax of data refers to its format. Analysts spend a large amount of their time wrangling data. When the format of each time series is clearly described, this tedious work can be greatly simplified. During replication and reproduction, it can also help diagnose frequent errors such as confusion between metric and imperial units of measure. Ideally, we should also store the APIs to the data along with it, because they are part of the syntax of the data.
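To illustrate how a format description can catch unit confusion, here is a small sketch in which every column declares its unit and values are converted through that declaration. The column specification layout, the `TO_METRES` table, and the `to_metres` function are assumptions made for this example.

```python
# Hypothetical conversion table: declared unit -> factor to metres.
TO_METRES = {"m": 1.0, "cm": 0.01, "in": 0.0254, "ft": 0.3048}


def to_metres(values, column_spec):
    """Convert a column to metres using the unit declared in its metadata."""
    unit = column_spec["unit"]
    if unit not in TO_METRES:
        # An undeclared or misspelled unit fails loudly instead of silently
        # mixing metric and imperial values.
        raise ValueError(
            f"unknown unit {unit!r} for column {column_spec['name']!r}"
        )
    factor = TO_METRES[unit]
    return [v * factor for v in values]
```

A call such as `to_metres(heights, {"name": "height", "unit": "ft"})` makes the unit an explicit part of the data's syntax, so the analyst does not have to guess it from the magnitude of the values.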

The semantics of data refers to its meaning and is the most difficult metadata to produce. We require a unified framework that researchers in all scientific disciplines can use to create consistent, easily searchable metadata. Ease of use is paramount. Because the ability to share data is so important, we want the process of metadata creation to be as painless as possible. This means that we must start by creating an ontology for each domain in which we create data.

Ontologies evolve with time. A big challenge is to track this evolution in the metadata. For example, if we called a technique "machine learning" but then realize that the term is too generic and that we should call it "cluster analysis", because this is what we were doing anyway, we also have to update the old metadata. Data curation also applies to the metadata.
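The renaming example can be sketched as a small migration over stored metadata records. This is only an illustration under stated assumptions: the record layout with a `technique` list, the `RENAMES` table, and the `former_terms` field are hypothetical, and a blanket rename is only appropriate when every old record really used the technique now named by the new term.

```python
# Hypothetical rename table: old ontology term -> current term.
RENAMES = {"machine learning": "cluster analysis"}


def migrate(record, renames=RENAMES):
    """Return a copy of a metadata record with outdated terms replaced.

    Replaced terms are kept under "former_terms" so that the history of
    the ontology change remains visible in the record.
    """
    updated = dict(record)
    old_terms = record.get("technique", [])
    updated["technique"] = [renames.get(t, t) for t in old_terms]
    replaced = [t for t in old_terms if t in renames]
    if replaced:
        updated["former_terms"] = record.get("former_terms", []) + replaced
    return updated
```

Returning a new record instead of editing in place leaves the original untouched, which makes it easier to review a migration before committing it.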

Some metadata can be computed from the data itself, for example, the descriptive statistics. At NASA, the automatic extraction of metadata from data content is called data archeology.
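As a minimal sketch of metadata computed from the data itself, the function below derives descriptive statistics for a numeric series using Python's standard `statistics` module. The function name `describe` and the choice of fields are assumptions for illustration, not a description of any particular extraction pipeline.

```python
import statistics


def describe(values):
    """Compute descriptive statistics to store as metadata for a series."""
    values = list(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # The sample standard deviation needs at least two points.
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }
```

The resulting dictionary can be stored next to the data, so that a later search can filter data sets by range or size without reading the data itself.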


