The Mostly Color Channel: Finding structure in large data sets

Tuesday, January 3, 2012

Finding structure in large data sets

In color science we do not know enough about the visual system to formulate analytical models. However, over the past two centuries we have accumulated enough knowledge that we do not have to resort to machine learning approaches like hidden Markov models either. We are able to come up with equations that correlate well with what is perceived.

One would then assume that color scientists excel in the branch of mathematical statistics that deals with correlation. Alas, the practice is less glorious, and we tend to rely on Pearson's correlation to gauge the strength of association between pairs of random variables. The caveat is that Pearson's correlation captures only linear association.
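To make the caveat concrete, here is a minimal sketch in plain Python. The helper name `pearson_r` and the synthetic data are assumptions chosen for illustration, not anything from a color data set. It computes Pearson's correlation for a straight line and for a parabola sampled symmetrically around zero.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: 101 points from -5.0 to 5.0 in steps of 0.1.
xs = [i / 10 for i in range(-50, 51)]
linear = [2 * x + 1 for x in xs]   # y depends linearly on x
parabola = [x * x for x in xs]     # y is fully determined by x, but not linearly

r_linear = pearson_r(xs, linear)       # very close to 1
r_parabola = pearson_r(xs, parabola)   # very close to 0

print(f"linear:   r = {r_linear:.3f}")
print(f"parabola: r = {r_parabola:.3f}")
```

In the parabola case y is completely determined by x, yet the coefficient is essentially zero because the positive and negative halves cancel. This is the kind of structure that a purely linear measure misses.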

Building on Shannon's information theory and entropy, in 1957 Linfoot proposed mutual information (MI) as the basis for a general measure of correlation. However, after more than 50 years of research, it is still impractical to fish with MI in large data sets, like those we get from crowdsourcing, to discover the structure of the data. Unless we know what to look for, there is little hope for serendipity.
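For reference, the standard textbook definition of mutual information for two discrete random variables X and Y is given below. The notation is the usual one and is not taken from Linfoot's paper: p(x, y) is the joint distribution, p(x) and p(y) are the marginals, and H denotes Shannon entropy.

```latex
I(X;Y) \;=\; \sum_{x}\sum_{y} p(x,y)\,\log_2 \frac{p(x,y)}{p(x)\,p(y)}
       \;=\; H(X) + H(Y) - H(X,Y)
```

MI is zero exactly when X and Y are independent, whatever the shape of the dependence, which is why it is attractive as a general measure of association. For continuous data the distributions have to be estimated, for example by binning, and the result depends on how that is done.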

Fortunately, Reshef et al. have now come up with a new exploratory data analysis tool they call the maximal information coefficient (MIC). MIC is general, in the sense that it does not assume a linear association, nor any other function type. MIC is also equitable, in that it gives similar scores to equally noisy relationships of different types. For functional relationships, including linear ones, it is roughly equal to the coefficient of determination R², which in the linear case is the square of Pearson's correlation.
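The idea behind MIC can be sketched as follows: lay a grid over the scatter plot, compute the mutual information of the resulting cell counts, normalize by the logarithm of the smaller grid dimension, and keep the best score over all grids whose cell count stays below a budget that grows with the sample size (n^0.6 is the default in the paper). The plain-Python sketch below is a deliberately naive approximation and not the authors' algorithm: it tries only equal-width grids, whereas the published method also optimizes where the grid lines are placed. It can therefore only underestimate the real MIC. The function names and the synthetic data are assumptions made for illustration.

```python
import math
from collections import Counter

def equal_width_bins(values, bins):
    """Assign each value to one of `bins` equal-width cells."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    return [min(int((v - lo) / (hi - lo) * bins), bins - 1) for v in values]

def grid_mi(x_cells, y_cells):
    """Mutual information (in bits) of the empirical grid distribution."""
    n = len(x_cells)
    joint = Counter(zip(x_cells, y_cells))
    px = Counter(x_cells)
    py = Counter(y_cells)
    return sum((c / n) * math.log2(c * n / (px[i] * py[j]))
               for (i, j), c in joint.items())

def naive_mic(xs, ys, exponent=0.6):
    """Naive lower-bound approximation of MIC using equal-width grids only.

    Assumption: the grid budget is n ** exponent, with 0.6 as in the paper;
    the real algorithm additionally searches over grid-line placements.
    """
    budget = len(xs) ** exponent
    best = 0.0
    nx = 2
    while nx * 2 <= budget:
        x_cells = equal_width_bins(xs, nx)
        ny = 2
        while nx * ny <= budget:
            y_cells = equal_width_bins(ys, ny)
            score = grid_mi(x_cells, y_cells) / math.log2(min(nx, ny))
            best = max(best, score)
            ny += 1
        nx += 1
    return best

# Hypothetical data: the same kind of line and parabola used to probe Pearson's r.
xs = [i / 10 for i in range(-50, 51)]
linear = [2 * x + 1 for x in xs]
parabola = [x * x for x in xs]

mic_linear = naive_mic(xs, linear)
mic_parabola = naive_mic(xs, parabola)
print(f"linear:   {mic_linear:.2f}")
print(f"parabola: {mic_parabola:.2f}")
```

Even this crude version scores the noiseless parabola high, although Pearson's correlation for the same data is essentially zero. For real work, the authors' own software is the reference implementation.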

Returning now to our large data sets, Reshef et al. have gone two steps further. First, they developed a larger family of statistics they call MINE, for maximal information-based nonparametric exploration. In mathematical statistics, nonparametric means that no assumption is made about the distribution of the random variables. Second, they make their software available in the form of a Java program.

Their paper, "Detecting Novel Associations in Large Data Sets," appeared in Science, Vol. 334, pp. 1518–1524 (2011).
