Tuesday, March 18, 2014

Traps in big data analysis

When I was a student, I chose mathematical statistics as one of my majors. At the time, the hot topics were robust statistics, non-parametric methods and optimal stopping times. Descriptive statistics was not part of the curriculum (PowerPoint did not yet exist and there was no need for meaningless 3-D pie charts).

In the student houses where I lived, there were always medical students at the end of their studies who had to get a doctorate. Residencies were grueling, and at that time the least-effort thesis was to punch in some historical medical data. On their way home from the clinic, these students would spend part of the night in the empty punch card rooms, for about 6 months.

Thereafter, they would bring the punch cards to the data center and get back 10 to 20 centimeters of SAS printout, along with the desperation of not knowing how to get from hundreds of cryptic tables to a one-hundred-page thesis.

Many of them ended up knocking on my door with the printout, scratching their heads. Because the students could not tell the data center people what analyses they needed (after all, there never was an experimental design), the data center people just ran each and every function available in SAS. Classic garbage in, garbage out.

So, I had to tell the students to stare at the data and come up with a few hypotheses, then use the ANOVA routines to confirm them and the regression routines to produce a few nice graphs.

Unfortunately, after all these years we are not much better off. Indeed, now we also have to deal with "big data hubris," the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis. Now we have tools like Google Correlate that allow us to correlate tons of apples with megatons of oranges.
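To make the apples-and-oranges point concrete, here is a minimal sketch of what happens when one short series is screened against a large pool of unrelated ones. Everything in it is an illustrative assumption of mine: the series are pure Gaussian noise, the sizes (20 points, 2000 candidates) are arbitrary, and the helper names `pearson` and `best_spurious_match` are hypothetical. It is not how Google Correlate works internally.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def best_spurious_match(n_points=20, n_candidates=2000, seed=0):
    """Return the largest |r| between one noise series and many unrelated ones."""
    rng = random.Random(seed)
    # The "target" series: pure noise, standing in for the quantity of interest.
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_candidates):
        # Each candidate is independent noise, so any correlation is chance.
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        best = max(best, abs(pearson(target, candidate)))
    return best


if __name__ == "__main__":
    print(f"best |r| among pure-noise candidates: {best_spurious_match():.2f}")
```

With so few data points and so many candidates, the best match in this setup typically comes out well above 0.5, even though nothing is related to anything. Searching harder only makes the winner look more convincing, which is why a hypothesis has to come before the search, not out of it.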

An interesting recent paper by David Lazer et al. is a nice summary of how big data analysis allows us to create more statistical garbage: Lazer D, Kennedy R, King G, Vespignani A. Big data. The parable of Google Flu: traps in big data analysis. Science. 2014 Mar 14;343(6176):1203-5. doi: 10.1126/science.1248506. PubMed PMID: 24626916.

The authors conclude: "Big data offer enormous possibilities for understanding human interactions at a societal scale, with rich spatial and temporal dynamics, and for detecting complex interactions and nonlinearities among variables. We contend that these are the most exciting frontiers in studying human behavior. However, traditional 'small data' often offer information that is not contained (or containable) in big data, and the very factors that have enabled big data are enabling more traditional data collection. The Internet has opened the way for improving standard surveys, experiments, and health reporting. Instead of focusing on a 'big data revolution,' perhaps it is time we were focused on an 'all data revolution,' where we recognize that the critical change in the world has been innovative analytics, using data from all traditional and new sources, and providing a deeper, clearer understanding of our world."