Thursday, April 14, 2016

The Shape of Data

Last week I attended a San Francisco Bay Area ACM Chapter event at Pivotal Labs, which now occupies one of the former HP Labs buildings up on Deer Creek Road. The speaker was Gunnar Carlsson and the topic was algebraic topology analytics. I was waiting to write this post until the slides were posted, but they never materialized. Maybe the fancy rendering of a coffee mug metamorphosing into a topologically equivalent donut broke the system.

I must admit that what attracted me most was seeing how Gunnar Carlsson would present a very arcane topic, one requiring intimate familiarity with Betti numbers, functional persistence barcodes, simplicial complexes, and Vietoris-Rips complexes, to the 244 registered attendees, most of whom probably lacked the necessary mathematical background. I was also curious to see if he would reveal some of the fast algorithms which he must have invented to perform the very complex calculations involved.

He did a superb job! After the demonstration that for mathematicians coffee mugs and donuts are equivalent, he had everybody's attention. He then showed some examples of how conventional methods for regression and cluster analysis can fail and lead to completely incorrect conclusions, leaving the task of understanding topological pattern recognition for point cloud data as an exercise.

Gunnar Carlsson started by noting that big data is not about "big" but about complexity in format and structure. Data has shape, and shape matters. Therefore, the task of the data scientist is not to accumulate myriad data, but to simplify the data in such a way that its shape is easily inferred.

Consider, for example, the point cloud on the left side of the figure below. You can import it into your favorite analytics program and perform a linear regression. This simplifies the data to two parameters: a slope and an intercept. However, if you look more carefully, you see that this is an incorrect representation of the data. The point cloud actually lies on two intersecting lines, so the green cross on the right is a more accurate representation of the data's shape.

A linear regression would give an incorrect result
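To make the failure concrete, here is a minimal sketch in Python. The point cloud is hypothetical (noise-free samples from the two lines y = x and y = -x, not the data from the talk), and `fit_line` is an illustrative helper of my own naming.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    x_bar = math.fsum(xs) / n
    y_bar = math.fsum(ys) / n
    sxx = math.fsum((x - x_bar) ** 2 for x in xs)
    sxy = math.fsum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical point cloud: the same x grid sampled on y = x and on y = -x.
grid = [i / 10 for i in range(-10, 11)]
xs = grid + grid
ys = [x for x in grid] + [-x for x in grid]

slope, intercept = fit_line(xs, ys)

# Root-mean-square residual of the fitted line against the cloud.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
rms = math.sqrt(math.fsum(r * r for r in residuals) / len(residuals))

print(f"slope={slope:.3f} intercept={intercept:.3f} rms={rms:.3f}")
```

On this symmetric cloud the fitted slope and intercept both come out at zero, so the regression describes a horizontal line that touches the data only at the crossing point. The two parameters are computed correctly; they just summarize the wrong shape.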

A second example is the confusion between recurrent data and periodic data. People tend to equate them and then use Fourier analysis on recurrent data that is not periodic, getting meaningless results. Recurrence is a concept from chaos theory and does not imply regular cycles. The El Niño Southern Oscillation (ENSO) is an example: it keeps coming back, but not at regular intervals.

The solution is to use topological modeling, as in the figure. If you are older, you need to revisit topology, because the field has started to study point clouds only in the last 15 to 20 years.

The first step in a project is to determine a relevant distance metric. Examples include Euclidean distance, Hamming distance, and correlation distance. The metric should be sensitive to nearby events but less so to faraway events, because the interesting stuff happens close by. Consider, for example, a distance metric based on the statistical moments.
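For readers who have not met all three metrics, here is a minimal sketch of them in plain Python. The function names are my own, and correlation distance is taken here to mean one minus the Pearson correlation, which is a common convention but an assumption about what was meant in the talk.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(math.fsum((x - y) ** 2 for x, y in zip(a, b)))

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def correlation_distance(a, b):
    """One minus the Pearson correlation: 0 for perfectly correlated
    vectors, 2 for perfectly anti-correlated ones."""
    n = len(a)
    a_bar = math.fsum(a) / n
    b_bar = math.fsum(b) / n
    cov = math.fsum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    var_a = math.fsum((x - a_bar) ** 2 for x in a)
    var_b = math.fsum((y - b_bar) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

print(euclidean([0, 0], [3, 4]))
print(hamming("karolin", "kathrin"))
print(correlation_distance([1, 2, 3], [2, 4, 6]))
```

Which metric is relevant depends on the data: Hamming distance suits categorical or binary records, while correlation distance ignores scale and offset and compares only how two vectors move together.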

The output of an algebraic topology analysis is not a set of algebraic formulæ but a network.

For practice, Carlsson recommends the World Values Survey, which contains a lot of interesting data. When you play with the data, it is often useful to consider a topological model of the space of columns rather than of the rows in a data set.
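Working on the space of columns simply means treating each column as a point and measuring distances between columns instead of between rows. A minimal sketch, assuming a tiny made-up table (not World Values Survey data) and correlation distance as the metric:

```python
import math

def correlation_distance(a, b):
    """One minus the Pearson correlation of two numeric vectors."""
    n = len(a)
    a_bar = math.fsum(a) / n
    b_bar = math.fsum(b) / n
    cov = math.fsum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    var_a = math.fsum((x - a_bar) ** 2 for x in a)
    var_b = math.fsum((y - b_bar) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

# Hypothetical toy table: each row is a respondent, each column a question.
rows = [
    [1, 2, 9],
    [2, 4, 7],
    [3, 6, 8],
    [4, 8, 2],
]

# Transpose so that each column becomes one point in the cloud.
columns = [list(col) for col in zip(*rows)]

# Pairwise distance matrix between columns, the input for a topological model.
dist = [[correlation_distance(a, b) for b in columns] for a in columns]

for line in dist:
    print([round(d, 3) for d in line])
```

In this toy table the first two columns are proportional, so their distance is zero, while the third column runs against them and ends up far away. On real survey data, the same matrix would show which questions behave alike across respondents.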