The Mostly Color Channel: Traps in big data analysis

Tuesday, March 18, 2014

Traps in big data analysis

When I was a student, I chose mathematical statistics as one of my majors. At the time, the hot topics were robust statistics, non-parametric methods and optimal stopping times. Descriptive statistics was not part of the curriculum (PowerPoint did not yet exist and there was no need for meaningless 3-D pie charts).

In the student houses where I lived, there were always medical students at the end of their studies who had to get a doctorate. Residencies were grueling, and at that time the least-effort thesis was to punch in some historical medical data. On their way home from the clinic, these students would spend part of the night in the empty punch card rooms, for about six months.

Thereafter, they would bring the punch cards to the data center and get back 10 to 20 centimeters of SAS printout, along with the desperation of not knowing how to get from hundreds of cryptic tables to a one-hundred-page thesis.

Many of them ended up knocking on my door, printout in hand and scratching their heads. Because the students could not tell the data center which analyses they needed (after all, there never was an experimental design), the data center people just ran each and every function available in SAS. Classic garbage in, garbage out.

So, I had to tell the students to stare at the data and come up with a few hypotheses, then use the ANOVA routines to confirm them and the regression routines to produce a few nice graphs.
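For readers who have never seen what an ANOVA routine actually computes, here is a minimal sketch in plain Python of the one-way F statistic for a single hypothesis stated in advance. It is an illustration only, not the SAS procedure the students used; the function name `one_way_anova_f` and the three small treatment groups are made up for the example.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation explained by group membership.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation left over inside each group.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # Hypothetical measurements for three treatments (invented data).
    treatments = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
    f_stat, df_b, df_w = one_way_anova_f(treatments)
    print(f"F = {f_stat:.2f} with ({df_b}, {df_w}) degrees of freedom")
    # prints "F = 3.00 with (2, 6) degrees of freedom"
```

With these invented numbers the F value of 3.00 is well below the 5% critical value of about 5.14 for (2, 6) degrees of freedom, so the hypothesis that the treatments differ would not be confirmed. The point of the sketch is the order of operations: one hypothesis first, one test second.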

Unfortunately, after all these years we are not much better off. Indeed, we now also have to deal with "big data hubris," the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis. Now we have tools like Google Correlate that allow us to correlate tons of apples with megatons of oranges.
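The apples-and-oranges problem is easy to reproduce. The following minimal sketch, in plain Python, is not taken from the Lazer et al. paper and has nothing to do with how Google Correlate works internally; it only assumes that we compare one short series of pure noise against many other series of pure noise and keep the best match. The function names and the default sizes (20 points, 1000 candidates) are arbitrary choices for illustration.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_spurious_match(n_points=20, n_candidates=1000, seed=0):
    """Largest |r| between one noise series and many unrelated noise series."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        best = max(best, abs(pearson_r(target, candidate)))
    return best


if __name__ == "__main__":
    best = best_spurious_match()
    print(f"strongest |r| among pure-noise candidates: {best:.2f}")
```

Every series here is independent random noise, yet the winning candidate typically shows a correlation that would look impressive in a report. Searching more candidates only makes the best spurious match stronger, which is why a hypothesis has to come before the search and not after it.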

A recent interesting paper by David Lazer et al. is a nice summary of how big data analysis allows us to create more statistical garbage: Lazer D, Kennedy R, King G, Vespignani A. Big data. The parable of Google Flu: traps in big data analysis. Science. 2014 Mar 14;343(6176):1203-5. doi: 10.1126/science.1248506. PubMed PMID: 24626916.

The authors conclude: "Big data offer enormous possibilities for understanding human interactions at a societal scale, with rich spatial and temporal dynamics, and for detecting complex interactions and nonlinearities among variables. We contend that these are the most exciting frontiers in studying human behavior. However, traditional 'small data' often offer information that is not contained (or containable) in big data, and the very factors that have enabled big data are enabling more traditional data collection. The Internet has opened the way for improving standard surveys, experiments, and health reporting. Instead of focusing on a 'big data revolution,' perhaps it is time we were focused on an 'all data revolution,' where we recognize that the critical change in the world has been innovative analytics, using data from all traditional and new sources, and providing a deeper, clearer understanding of our world."


About this blog

The Internet is an amalgam of forms blurred under epistemological pressures. In Søren Kierkegaard’s words, under this flat shower of leveled information, where everybody is interested in everything and nothing is too trivial or too important, people just accumulate information and postpone decisions indefinitely, i.e., nobody takes action and nobody is responsible for truth — there is no mastery, just gossip. He called this the æsthetic sphere of existence, exhorting us to evolve to the ethical sphere, where we do not just accumulate information but take action and make commitments. Blogs are instruments to overcome flatness by creating opportunities for vertical activities. In this sense this blog is a view from my window — a collection of tidbits I judged relevant to computational color science and in general to the promotion of scientific excellence in areas of strategic importance for the future of research, economy and society.
