Tuesday, January 3, 2012

Finding structure in large data sets

In color science we do not have sufficient knowledge about the visual system to formulate analytical models. However, in the past two centuries we have accumulated enough empirical knowledge that we do not have to resort to machine learning approaches such as hidden Markov models either. We are able to come up with equations that correlate well with what is perceived.

One would then assume that color scientists excel in the mathematical statistics of correlation. Alas, the practice is less glorious, and we tend to rely on Pearson's correlation coefficient to gauge the strength of association between pairs of random variables. The caveat is that Pearson's correlation captures only linear association.
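To make the caveat concrete, here is a small sketch (pure Python, no libraries assumed): a parabola is perfectly determined by x, yet on a symmetric range its Pearson correlation with x vanishes.

```python
def pearson(xs, ys):
    """Pearson's correlation coefficient r for two samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

xs = [k / 10.0 for k in range(-50, 51)]   # symmetric range -5.0 .. 5.0
linear = [2 * x + 1 for x in xs]          # perfectly linear in x
quadratic = [x * x for x in xs]           # perfectly determined by x, but nonlinear

print(pearson(xs, linear))     # r = 1: linear dependence is fully captured
print(pearson(xs, quadratic))  # r vanishes: the association is invisible to r
```

The second value is essentially zero even though y is a deterministic function of x, which is exactly the blind spot Pearson's r leaves open.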

Building on Shannon's information theory and entropy, in 1957 Linfoot proposed an informational measure of correlation based on mutual information (MI). However, after more than 50 years of research it is still too unwieldy a tool for fishing in large data sets, like those we get in crowdsourcing, to discover the structure of the data. Unless we know what to look for, there is little hope for serendipity.
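For readers who have not met MI in practice, here is a hedged sketch (not from the post): a crude estimate of mutual information from an equal-width 2D histogram. Unlike Pearson's r, the estimate comes out clearly positive for the same noiseless parabola; the bin count of 8 is an arbitrary illustration choice.

```python
import math

def mutual_information(xs, ys, bins=8):
    """Estimate MI in bits from an equal-width bins-by-bins 2D histogram."""
    n = len(xs)

    def idx(v, lo, hi):
        # map v into a bin index, clamping the maximum into the last bin
        return min(int((v - lo) / (hi - lo) * bins), bins - 1)

    lox, hix = min(xs), max(xs)
    loy, hiy = min(ys), max(ys)
    joint, px, py = {}, [0] * bins, [0] * bins
    for x, y in zip(xs, ys):
        i, j = idx(x, lox, hix), idx(y, loy, hiy)
        joint[(i, j)] = joint.get((i, j), 0) + 1
        px[i] += 1
        py[j] += 1

    # MI = sum over occupied cells of p(x,y) * log2(p(x,y) / (p(x) p(y)))
    mi = 0.0
    for (i, j), c in joint.items():
        pxy = c / n
        mi += pxy * math.log2(pxy / ((px[i] / n) * (py[j] / n)))
    return mi

xs = [k / 100.0 for k in range(-500, 501)]
print(mutual_information(xs, [x * x for x in xs]))  # clearly positive
```

Even this toy estimator hints at the practical difficulty: the answer depends on the binning, and choosing it well across thousands of variable pairs is part of what makes MI hard to use for blind exploration.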

Fortunately, Reshef et al. have now come up with a new exploratory data analysis tool they call the maximal information coefficient (MIC). MIC is general, in the sense that it does not assume a linear association, nor any other particular function type. MIC is also equitable, in that equally noisy relationships of different types receive similar scores. In the special case of linear associations, it gives results equivalent to the square of Pearson's correlation.
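The core idea can be sketched as follows, with a large caveat: Reshef et al. optimize the grid cell boundaries (and normalize by the best grid of each size), whereas this toy version tries only equal-width grids, so it underestimates MIC for curved relationships. It keeps the two defining ingredients: normalize MI by log2 of the smaller grid dimension, and take the maximum over all grids whose cell count stays below n^0.6, the paper's default bound.

```python
import math

def grid_mi(xs, ys, nx, ny):
    """Mutual information (bits) of the sample under an nx-by-ny equal-width grid."""
    n = len(xs)

    def idx(v, lo, hi, k):
        return min(int((v - lo) / (hi - lo) * k), k - 1)

    lox, hix = min(xs), max(xs)
    loy, hiy = min(ys), max(ys)
    joint, px, py = {}, [0] * nx, [0] * ny
    for x, y in zip(xs, ys):
        i, j = idx(x, lox, hix, nx), idx(y, loy, hiy, ny)
        joint[(i, j)] = joint.get((i, j), 0) + 1
        px[i] += 1
        py[j] += 1
    mi = 0.0
    for (i, j), c in joint.items():
        p = c / n
        mi += p * math.log2(p / ((px[i] / n) * (py[j] / n)))
    return mi

def mic_sketch(xs, ys):
    """Max of MI / log2(min(nx, ny)) over equal-width grids with nx*ny <= n**0.6.
    A simplification: the real MIC searches over all grid partitions."""
    n = len(xs)
    bound = n ** 0.6
    best = 0.0
    nx = 2
    while nx * 2 <= bound:
        ny = 2
        while nx * ny <= bound:
            best = max(best, grid_mi(xs, ys, nx, ny) / math.log2(min(nx, ny)))
            ny += 1
        nx += 1
    return best

xs = [k / 100.0 for k in range(-300, 301)]
print(mic_sketch(xs, [2 * x + 1 for x in xs]))  # near 1 for a noiseless linear law
print(mic_sketch(xs, [x * x for x in xs]))      # also high, though Pearson's r is ~0 here
```

The normalization by log2(min(nx, ny)) is what keeps the score in [0, 1] regardless of grid size, so associations of different shapes can be compared on one scale.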

Returning now to our large data sets, Reshef et al. have gone two steps further. First, they developed a larger family of statistics they call MINE, for maximal information-based nonparametric exploration; in mathematical statistics, nonparametric means that no assumption is made about the random variables' distributions. Second, they have made their software available in the form of a Java program.

Their paper "Detecting Novel Associations in Large Data Sets" is in Science magazine, Vol. 334 (2011), pp. 1518–1524.