Tuesday, April 19, 2016

Different analytics tools for different jobs

When people talk about analytics jobs, they usually have a mental picture of a single job and skill set. They talk about analysts or data analysts (in Silicon Valley they may be called data scientists). In reality, we can classify the users of analytics tools by the kind of job they hold. The builders of these tools must then have the same skills, but at a much deeper level.

The first job type is the office worker. Today, every employee is expected to be able to produce some analytics. The basic tools include the office suites from Microsoft, Google, or Apple. Proficiency in more specialized tools like Adobe Illustrator, InDesign, Acrobat, FileMaker, and Tableau is a plus. The worker is expected to be able to convert data between formats like CSV and Excel. Workers are typically given assignments like “Prepare a presentation explaining our performance and suggesting how it can be improved.” Therefore, office workers must be able to produce visualizations, where visualization here means turning tables into graphics with Adobe Illustrator, Microsoft Excel, or PowerPoint. By the nature of their daily activities, office workers are domain experts.

The second job type is that of a data analyst in a traditional company. The all-round data analyst must be proficient in a relational database system like MySQL, and in Excel. The analyst must also have a good understanding of descriptive statistics. A key skill is being an expert in munging data across applications and file formats; this is also known as data shaping, wrangling, ETL, etc. The statistical expertise need not be deep, but basic A/B testing and Google Analytics experience are expected. Presenting and selling the results of an analysis are very important, which requires the ability to do basic data visualization in Excel and Tableau. The data analyst has to have a good understanding of the company’s products and general all-round skills.
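As a minimal, hedged illustration of the everyday munging and descriptive statistics mentioned above, here is a sketch in Python with pandas; the file name sales.csv is hypothetical.

```python
# A small munging sketch: read a CSV, inspect descriptive statistics,
# and hand the table back as an Excel file for the office suite.
import pandas as pd

df = pd.read_csv("sales.csv")            # hypothetical input file
print(df.describe())                     # basic descriptive statistics
df.to_excel("sales.xlsx", index=False)   # convert CSV to Excel (needs an Excel writer such as openpyxl)
```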

The third job type is that of an analyst in a start-up company, where a typical assignment may sound like "please munge our data." This requires proficiency in the basic tools and the ability to move fast: go for the low-hanging fruit and be able to quickly implement a new analysis or visualization by writing Excel macros, Access programs, or R functions, which in turn requires a good knowledge of the available libraries in Excel, R, or Tableau. The data analyst in a start-up company must be proficient in implementing advanced parsers and creating ad hoc MySQL databases for persistent storage. Basic statistics knowledge, for example, contingency tables and Poisson tests, is also a must. Since a start-up does not have historical data, the analyst must be able to do the ground-truthing themselves. As a lot of the data may come from social networks, this job type also requires the ability to use linguistic functions to clean up unstructured text and extract useful information.
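A minimal sketch of the kind of basic statistics mentioned above, with made-up counts, using SciPy's chi-squared test on a contingency table:

```python
# Hypothetical 2x2 contingency table: clicks vs. no-clicks for two landing pages.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[120, 380],
                  [ 90, 410]])
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p-value={p_value:.3f}, dof={dof}")
```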

An analyst in a data company has a completely different job. Here data is the product: “we are data — data is us.” This requires a formal background in mathematics, statistics, machine learning, or linguistics (natural language processing, NLP). The analyst must be able to discriminate among the various algorithms and understand their parameters. On the bright side, most data is already munged, but the analyst must be able to customize parsers and workflows. Understanding privacy laws is a must, especially the European ones: the internet has no borders, but the laws do, and the fines can be debilitating. The analyst in a data company must have a good sense of emerging techniques, like topological data analysis.

The fifth job type is that of analysts in an enterprise, where they are members of an established data team with experts in various tools. By enterprise here we mean a reasonably sized non-data company that is data-driven, to distinguish it from the second job type. The work is about data, but data is often not central to the product. An example is the fourth industrial revolution, or Industry 4.0. This analyst is a generalist with broad experience, a jack-of-all-trades. For survival, this analyst must be able to find blind spots where niche roles can be played. The role requires extensive experience in munging and aggregating data from all possible sources: SQL and NoSQL, logs, IoT, social networks (Twitter, LinkedIn, Facebook, etc.), news feeds, REST services, data.gov, Google Public Data Explorer, etc.

We can summarize these job types and the skills they require in this table:

Skills for analytics jobs

This is a generalization and it can be debated. For example, graph theory is topology (actually, its historical starting point), but topological data analysis focuses on point clouds from which it builds graphs, while traditional graph theory uses completely different mathematical tools to analyze graphs, which is why I listed them as two separate items. One could also make this list summarizing the skills:

  • Tools of the trade: SQL, R, Java, Scala, Python, Spark, MapReduce, …
  • Basic statistics: distributions, maximum likelihood estimation, statistical tests, regression, …
  • Machine learning: k-nearest neighbors, random forests, … (see the sketch after this list)
  • Linear algebra and multivariate calculus
  • Data munging: imputation, parsing, formatting; aka wrangling, shaping
  • Data visualization and communication: Tableau, ggplot, d3.js
  • Software engineering: logging, performance analysis, REST interfaces, connectors, …
  • Curiosity for emerging technologies, like algebraic topology
  • Thinking like a data scientist: business sense, approximations, teamwork, …
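As a minimal, hedged illustration of one item on the list, here is a k-nearest-neighbors sketch in Python with scikit-learn on a toy, made-up dataset:

```python
# Classify a new point by the majority label of its 3 nearest neighbors.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[170, 65], [180, 85], [160, 55], [175, 80]]  # toy features (e.g., height, weight)
y_train = ["A", "B", "A", "B"]                           # toy class labels

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
print(model.predict([[172, 70]]))
```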

Monday, April 18, 2016

Data Shamans

The benefit of attending meet-ups and conferences is that, compared to papers and webinars, you can hear and understand the questions, and you can talk to the speakers and other audience members during the breaks. Especially at conferences, the formal presentations give you the scientific report, while in the breaks you can learn about all the false turns the researchers took in their endeavors, which have no place in short scientific communications.

As I mentioned last October regarding the ACM Data Science Camp Silicon Valley, the field of advanced analytics is full of hype. Data scientists are perceived like demigods, but in reality, their employment can be insecure and harsh.

Indeed, I often hear from data scientists that they are treated like shamans, i.e., a person regarded as having access to, and influence in, the world of benevolent and malevolent spirits, who typically enters into a trance state during a ritual, and practices divination and healing.

When the organization has a problem it cannot solve and the scientists or engineers are at their wits' end, they collect big data and deposit it at the feet of their data scientists in the hope of getting a miracle by the next day. A problem can only be solved when the causality is known, and correlation does not imply causality. There is no magic algorithm the data scientists can throw at the data to solve the engineering riddle.

In the end, the data scientists have to be able to go back to first principles. However, their training and experience make them more reluctant to project preconceptions onto the data, and their toolbox allows them to formulate hypotheses and test them statistically more efficiently. There are no data shamans.

Not a data shaman

Thursday, April 14, 2016

The Shape of Data

Last week I attended a San Francisco Bay Area ACM Chapter event at Pivotal Labs, which now occupies one of the former HP Labs buildings up on Deer Creek Road. The speaker was Gunnar Carlsson and the topic was algebraic topology analytics. I waited to write this post until the slides were posted, but they never materialized—maybe the fancy rendering of a coffee mug metamorphosing into a topologically equivalent donut broke the system.

I must admit that what attracted me the most to attend was to see how Gunnar Carlsson would give a presentation on a very arcane topic requiring intimate familiarity with Betti numbers, functional persistence barcodes, simplicial complexes, and Vietoris-Rips complexes to the 244 registered attendees, probably mostly lacking the necessary mathematical background. I was also curious to see if he would reveal some of the fast algorithms which he must have invented to perform the very complex calculations involved.

He did a superb job! After the demonstration that for mathematicians coffee mugs and donuts are equivalent, he had everybody's attention. He then showed some examples of how conventional methods for regression and cluster analysis can fail and lead to completely incorrect conclusions, leaving the task of understanding topological pattern recognition for point cloud data as an exercise.

Gunnar Carlsson started by noting that big data is not about "big" but about complexity in format and structure. Data has shape and shape matters. Therefore, the task of the data scientist is not to accumulate myriad data, but to simplify the data in a way that the shape is easily inferred.

Consider for example the point cloud on the left side of the figure below. You can import it into your favorite analytics program and perform a linear regression. This simplifies the data to two parameters: a slope and an intercept. However, if you look more carefully, you see that this is an incorrect representation of the data. Indeed, the point cloud lies on two intersecting lines; therefore, the green cross at the right is a more accurate representation of the data's shape.

A linear regression would give an incorrect result
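Here is a minimal, hedged sketch in Python reproducing the effect described above with synthetic data: points sampled, with a little noise, from the two intersecting lines y = x and y = -x.

```python
# A least-squares line reduces the X-shaped cloud to a slope and an intercept
# that describe neither branch.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([np.linspace(-1, 1, 100), np.linspace(-1, 1, 100)])
y = np.concatenate([x[:100], -x[100:]]) + rng.normal(0, 0.05, size=200)

slope, intercept = np.polyfit(x, y, 1)
print(f"fitted line: slope={slope:.2f}, intercept={intercept:.2f}")
# Both come out near zero: the fit completely misses the two-line structure.
```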

A second example is the confusion between recurrent data and periodic data. People tend to equate them and then use Fourier analysis on recurrent data that is not periodic, getting meaningless results. Recurrence is a concept from chaos theory and does not imply regular cycles; the El Niño Southern Oscillation (ENSO) is an example of a recurrent but aperiodic phenomenon.
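A minimal sketch of the distinction, under the assumption that a chaotic signal (here the logistic map) can stand in for recurrent-but-aperiodic data: the periodic signal concentrates its energy in one spectral line, while the recurrent one spreads it across the spectrum, so reading off "the" frequency is meaningless.

```python
import numpy as np

n = 1024
periodic = np.sin(2 * np.pi * 50 * np.arange(n) / n)   # exactly 50 cycles

recurrent = np.empty(n)                                # chaotic logistic map
recurrent[0] = 0.4
for i in range(1, n):
    recurrent[i] = 4.0 * recurrent[i - 1] * (1.0 - recurrent[i - 1])

for name, signal in [("periodic", periodic), ("recurrent", recurrent)]:
    spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
    share = spectrum.max() / spectrum.sum()
    print(f"{name}: strongest frequency carries {share:.0%} of the spectrum")
```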

The solution is to use topological modeling, like in the figure. If you are older, you need to revisit topology, because the field has started to study point clouds only in the last 15 to 20 years.

The first step in a project is to determine a relevant distance metric. Examples include the Euclidean distance, the Hamming distance, and the correlation distance. The distance metric should be sensitive to nearby events but less so to far-away events, because the interesting stuff happens close by: consider, for example, a distance metric based on the statistical moments.
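A minimal sketch of the three example metrics, using SciPy's distance functions on toy vectors:

```python
from scipy.spatial import distance

u = [1.0, 0.0, 2.0, 3.0]
v = [1.0, 1.0, 2.0, 5.0]

print(distance.euclidean(u, v))     # straight-line distance
print(distance.hamming(u, v))       # fraction of positions that differ
print(distance.correlation(u, v))   # 1 minus the Pearson correlation of u and v
```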

The output of an algebraic topology analysis is not a set of algebraic formulæ but a network.

For practice, Carlsson recommends the World Values Survey, which contains a lot of interesting data. When you play with the data, it is often useful to consider a topological model of the space of columns rather than the rows in a data set.

Adapting to street lighting

Although in Switzerland light pollution is a fraction of what it is in American urban areas, in an interesting study, Florian Altermatt from Zurich and Dieter Ebert from Basel have shown experimentally that moths have adapted to light pollution. They reared moths from 10 different populations from early-instar larvae and experimentally compared their flight-to-light behavior under standardized conditions. Moths from urban populations had a significant reduction in the flight-to-light behavior compared with pristine populations.

This adaptation has direct consequences for ecosystems, because as moths avoid light pollution they will fly less and will pollinate fewer flowers.

If you delve into data science, beware that correlation does not imply causality: if Americans have become couch potatoes and move less than the Swiss, this is not due to the difference in light pollution :-)

Citation: Florian Altermatt and Dieter Ebert, "Reduced flight-to-light behaviour of moth populations exposed to long-term urban light pollution," Biology Letters 12:20160111 (2016). DOI: 10.1098/rsbl.2016.0111. Published 12 April 2016.

Helvetia by Night

Tuesday, April 12, 2016

How to pronounce LIRE

One of the difficulties of the English language is that it is often not clear how to pronounce words correctly. For example, Edinburgh, Gloucester, and Leicester are often mispronounced. Fortunately, a dictionary will teach you how to pronounce a word correctly, and if you do not find it there, you can try Wikipedia.

The situation is a little more hairy in the case of acronyms. Often there is a pun, which can guide you in guessing the pronunciation, but sometimes it is a genuine acronym, and therefore impossible to guess.

In the age of big data, the default approach to crack this riddle is to crowd-source the answer on a social network. Blast the question on Twitter, write a small R program using twitteR, and calculate the statistical mode. You can even get creative and use the PBSmapping package to plot the locations of the responders.
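The post describes an R workflow; as a hedged, minimal stand-in, here is the mode calculation in Python on a hypothetical, hard-coded set of answers:

```python
from collections import Counter

responses = ["lee-reh", "lyre", "lee-reh", "leer", "lee-reh", "lyre"]  # made-up survey answers
mode, count = Counter(responses).most_common(1)[0]
print(f"statistical mode: {mode!r} ({count} of {len(responses)} answers)")
```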

Unfortunately, the crowds are not wise and they might never have crossed paths with the acronym, so they just guess. This is a classical case of garbage in, garbage out. In German, this process is also known by the expression "mit Kanonen auf Spatzen schiessen" (shooting at sparrows with cannons).

People were debating how one might pronounce LIRE, a Java library that provides a simple way to retrieve images and photos based on color and texture characteristics. LIRE is an acronym for Lucene Image Retrieval and is used to build content-based image retrieval (CBIR, rhymes with cyber) systems.

Nothing is easier than shooting a one-line email to the creator, Mathias Lux, in Klagenfurt (do not fear, the Lindwurm is made of stone and will not byte you), and you get the correct answer, no need for confidence intervals: "Ich persönlich gehe von der italienischen Sprechweise aus"— it is pronounced the Italian way: li as in liberty plus re as in record.

Wednesday, March 30, 2016

Big Dada

If you mistyped and were searching for "big data," try this other post. Here we are on 5 February 1916 in the Cabaret Voltaire on the Spiegelgasse, just a few steps from the apartment where Lenin was living in exile and not far from where I lived for a couple of years under a gaslight.

Cabaret Voltaire

Dada represents the total doubt about everything, absolute individualism, and the destruction of the ideals and norms hitherto cast in concrete. For example, in Hugo Ball's Lautgedichte (sound poetry) the utterances are decoupled from semantics. This is how, a century later, naive users misuse Internet search engines and then wonder why their queries just return millions of dada, much to the chagrin of the computational linguists trying to design semantic search engines. This is dada. No, it is a trigram. Can a second-order Markov model help? Not for this trigram: Google Books thinks it does not exist. Coming up with a new sentence is so dada.

The dadaists were displeased with science, which they judged to be elitist and far from any comprehensibility and sensuality. Maybe they were not completely right, considering what was happening in theoretical physics around that time. But certainly, today science is more dada, when fresh laureates dance their Ph.D. You can win a trip to Stanford and visit me, just a few steps away, under an LED lamppost.

Tuesday, March 15, 2016

bottom-up vs. top-down

Last week I wrote about systems of autonomous components, also dubbed bees vs. the beehive. In management, there is the somewhat related concept of bottom-up vs. top-down management.

Bottom-up used to be popular in engineering companies. Engineers work in small groups that push beyond the bleeding edge of technology and invent new product concepts. As the product evolves, more people are recruited and the product is polished by UX experts, manufacturability experts, marketing and sales teams, etc.

In a top-down company, the leader of the company is an expert visionary who has a new technology idea. A team of second level executives is assembled, which repeats the process for the next level, etc., down to the worker bees who develop the product. This was the basis of the utopian concept described in Tommaso Campanella's Città del Sole.

Der Schulmeister und seine Schüler (the schoolmaster and his pupils)

The preference for one paradigm or the other oscillates with time. Both are viable. Things only go wrong when a mediocre person becomes the head honcho of a bottom-up company and transforms it into a top-down one. In a bottom-up company, the management's role is mostly to clear away the obstacles that slow down the engineers. Such managers are typically facilitators more than leaders.

When the polarity of a company is switched from bottom-up to top-down, the management layers typically fail. With a mediocre person at the top, the company is doomed. It can take decades, but in the end, there is no escape from the spiral of death.

Friday, March 11, 2016

systems of autonomous components

In the discussions about big data and scalability, we learned how Gunther's universal scalability law suggests scaling out.

Another way to look at vertical vs. horizontal scaling is to compare a whale to a fish school. A school of small fish has the same biomass as a blue whale. A whale takes about 5 minutes to turn 180º. On the other hand, a school of small fish switches direction in an instant. The blue whale has no escape when under attack!


Leo Lionni's 1963 picture book スイミー (Swimmy) is particularly popular in Japan because it conveys the message that together we are strong, even though we are small. This is related to the concepts of synergy and emergent property.

In the case of big data and scalability, the moral of the story is that you do not want a rigid, powerful central authority, but a plurality of autonomous components orchestrated to play the same symphony, with a minimum of coherency overhead. This resonates with Industry 4.0.

Thursday, March 10, 2016

scale-up and scale-out

In the post on big data, we mentioned Gunther's universal scalability model:

universal scalability law
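In LaTeX form, this is the standard expression of the law as it appears in Gunther's publications:

```latex
S_p = \frac{p}{1 + \sigma\,(p - 1) + \kappa\,p\,(p - 1)}
```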

In this model, p is the number of processors or cluster nodes and Sp is the speedup with p nodes. σ and κ represent the degree of contention in the system and the lack of coherency in the distributed data, respectively. An example of contention is waiting in a message queue (bottleneck saturation); an example of incoherency is updating the processor caches (non-local data exchange).

When we do measurements, Tp is the runtime on p nodes, and Sp = T1 / Tp is the speedup with p nodes. When we have enough data, we can estimate σ and κ for our system and dataset using nonlinear statistical regression.
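A minimal sketch of that estimation step, with made-up runtime measurements, using SciPy's nonlinear least-squares fit (any nonlinear regression routine would do):

```python
import numpy as np
from scipy.optimize import curve_fit

def usl(p, sigma, kappa):
    """Gunther's universal scalability law: speedup as a function of node count."""
    return p / (1 + sigma * (p - 1) + kappa * p * (p - 1))

# Hypothetical measured runtimes T_p (seconds) for several cluster sizes p.
p = np.array([1, 2, 4, 8, 16, 32, 64])
T = np.array([100.0, 52.0, 27.5, 15.0, 8.9, 6.1, 5.4])
S = T[0] / T                          # speedup relative to a single node

(sigma, kappa), _ = curve_fit(usl, p, S, p0=[0.05, 0.001], bounds=(0, 1))
print(f"estimated sigma={sigma:.4f}, kappa={kappa:.6f}")
```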

The model makes it easy to understand the difference between scale-up and scale-out architectures. In a scale-up system, you can increase the speedup by optimizing the contention, for example by adding memory or by bonding network ports. When you play with σ, you will learn that you can increase the speedup, but not the number of processors at which the speedup is maximal, which remains at 48 nodes in the example in Gunther's paper.

USL for the TeraSort experiment

In a scale-out architecture, you play with κ and you learn that you can additionally move the maximum over the number of nodes. In Gunther's paper, they can move the maximum to 95 nodes by optimizing the system to exchange less data.

This shows that scale-up and scale-out are not simply about using faster system components vs. using more components in parallel. In both cases, you have a plurality of nodes, but you optimize the system differently. In scale-up, you find bottlenecks and then mitigate them. In scale-out, you also work on the algorithms to reduce data exchange.
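Setting dSp/dp = 0 in the law above gives the optimum at p* = sqrt((1 - σ)/κ), so the two knobs behave differently: reducing κ moves the peak to a larger cluster, while reducing σ mainly raises the peak. A short sketch with hypothetical σ and κ values:

```python
import math

def usl_speedup(p, sigma, kappa):
    return p / (1 + sigma * (p - 1) + kappa * p * (p - 1))

def optimal_nodes(sigma, kappa):
    """Node count that maximizes the speedup: p* = sqrt((1 - sigma) / kappa)."""
    return math.sqrt((1 - sigma) / kappa)

# Hypothetical parameter sets: reducing the coherency term kappa moves the
# optimum from roughly 50 to roughly 100 nodes.
for sigma, kappa in [(0.02, 4e-4), (0.02, 1e-4)]:
    p_star = optimal_nodes(sigma, kappa)
    print(f"sigma={sigma}, kappa={kappa}: optimum near p={p_star:.0f} nodes, "
          f"S(p*)={usl_speedup(p_star, sigma, kappa):.1f}")
```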

Since the incoherency term is quadratic, you get more bang for the buck by reducing the coherency workload. This leads to adding more nodes instead of increasing the performance of the nodes, the latter usually being a much more expensive proposition.

In big data, scale-out or horizontal scaling is the key approach to achieve scalability. While this is obvious to anybody who has done GP-GPU programming, it is less so for those who are just experienced in monolithic apps.

Tuesday, March 8, 2016

big data

A few years ago, when "big data" became a buzzword, I attended an event from a major IT vendor about the new trends in the sector. There were presentations on all the hot buzzwords, including a standing-room-only session on big data. After the obligatory corporate title slide came a slide with a gratuitous, unrelated stock photo on the left side, while the right side was taken up by the term "big data" in a font large enough to fill that entire half of the slide. Unfortunately, the rest of the presentation contained only platitudes, without any actionable information. My takeaway was that "big data" is a new buzzword that is written in a big font size.

A meaningless graphical representation of big data

After the dust settled, big data became associated with the three characteristics of data volume, velocity, and variety, proposed in 2001 by META's Doug Laney under the heading of "3-D data management" (META is now part of Gartner). Indeed, today Gartner defines big data as high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.

Volume indicates that the amount of data at rest is of petabyte scale, which requires horizontal scaling. Variety indicates that the data is from multiple domains or types: in a horizontal system, there are no vertical data silos. Velocity refers to the rate of flow; a large amount of data in motion introduces complications like latencies, load balancing, locking, etc.

In the meantime, other 'V' terms have been added, like variability and veracity. Variability refers to a change in velocity: a big data system has to be self-aware, dynamic, and adaptive. Veracity refers to the trustworthiness, applicability, noise, bias, abnormality and other quality properties in the data.

A more recent insight is the clear distinction between scale-up and scale-out systems, captured by a powerful model, the universal scalability model, dissected in a CACM paper: N. J. Gunther, P. Puglia, and K. Tomasette, "Hadoop superlinear scalability," Communications of the ACM, 58(4):46–55, April 2015.

This model allows us to state that Big Data refers to extensive data sets, primarily in the characteristics of volume, velocity, and/or variety, that require a horizontally scalable architecture for efficient storage, manipulation, and analysis, i.e., for extracting value.

To make this definition actionable, we need an additional concept. Big Data Engineering refers to the storage and data manipulation technologies that leverage a collection of horizontally coupled resources to achieve nearly linear scalability in performance. Now we can drill down a little.

New engineering techniques in the data layer have been driven by the growing prominence of data types that cannot be handled efficiently in a traditional relational model. The need for scalable access to structured and unstructured data has led to software built on name-value / key-value pairs and on columnar (big table), document-oriented, and graph (including triple-store) paradigms. A triple store, or RDF store, is a purpose-built database for the storage and retrieval of triples through semantic queries. A triple is a data entity composed of subject, predicate, and object, like "Bob is 35" or "Bob knows Fred."
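As a minimal, hedged sketch of the triple data model (the data and the tiny pattern-matching helper are purely illustrative, not a real triple-store API):

```python
triples = [
    ("Bob", "is", "35"),
    ("Bob", "knows", "Fred"),
    ("Fred", "knows", "Alice"),
]

def match(s=None, p=None, o=None):
    """Return the triples matching a (subject, predicate, object) pattern; None is a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

print(match(p="knows"))        # [('Bob', 'knows', 'Fred'), ('Fred', 'knows', 'Alice')]
print(match(s="Bob", p="is"))  # [('Bob', 'is', '35')]
```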

Due to the velocity, it is usually not possible to structure the data when it is acquired (schema-on-write). Instead, data is often stored in raw form. Lazy evaluation is used to cleanse and index data as it is being queried from the repository (schema-on-read). This point is critical to understand because to run efficiently, analytics requires the data to be structured.

The end-to-end data lifecycle categorizes the steps as collection, preparation, analysis, and action. In a traditional system, the data is stored in persistent storage after it has been munged (extract, transform, load, followed by cleansing; a.k.a. wrangling or shaping). In traditional use cases, the data is prepared and analyzed for alerting: schema-on-write. Only afterward are the data or aggregates of the data committed to persistent storage. This is different from high-velocity use cases, where the data is often stored raw in persistent storage: schema-on-read.
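A minimal sketch contrasting the two approaches under simplified assumptions: raw log lines are kept as-is and the schema is applied lazily, only to the records a query actually reads (schema-on-read), instead of being imposed up front at ingest time (schema-on-write).

```python
raw_store = [
    "2016-03-08T10:00:00 sensor=42 temp=21.5",
    "2016-03-08T10:00:01 sensor=17 temp=19.8",
    "2016-03-08T10:00:02 sensor=42 temp=22.1",
]

def parse(line):
    """Apply the schema at read time: split a raw line into named fields."""
    timestamp, *pairs = line.split()
    record = {"timestamp": timestamp}
    record.update(dict(kv.split("=") for kv in pairs))
    return record

# The schema is applied only when the query touches the data.
hot = [r for r in (parse(line) for line in raw_store) if float(r["temp"]) > 21.0]
print(hot)
```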

As noted above, veracity refers to the trustworthiness, applicability, noise, bias, abnormality, and other quality properties of the data. Current technologies cannot assess, understand, and exploit veracity throughout the data lifecycle. This is a big data characteristic that presents many opportunities for disruptive products.

In big data systems, IO bandwidth is often the limiting resource, while processor chips can have idle cores due to the IO gap (the number of cores on a chip keeps increasing, while the number of pins stays constant). Big data engineering therefore seeks to embed local programs such as filtering, parsing, indexing, and transcoding in the storage nodes. This is only possible when the analytics and discovery systems are tightly integrated with the storage system. The analytics programs must be horizontal not only in that they process the data in parallel: operations that access localized data are, when possible, distributed to separate cores in the storage nodes, depending on the size of the IO gap.

With this background, we can attempt a formal definition of big data:

Big Data is a data set (or sets) with characteristics (e.g., volume, velocity, variety, variability, veracity) that, for a particular problem domain at a given point in time, cannot be efficiently processed using current / existing / established / traditional technologies and techniques in order to extract value.

Big data is a relative and not an absolute term. Big data essentially focuses on the self-referencing viewpoint that data is big because it requires scalable systems to handle it, and architectures with better scaling have come about because of the need to handle big data.

The era of a trillion sensors is upon us. Traditional systems cannot extract value from the data they produce. This has stimulated Peaxy to invent new ways for scalable storage across a collection of horizontally coupled resources, and a distributed approach to querying and analytics. Often, the new data models for big data are lumped together as NoSQL, but we can classify them at a finer scale as big table, name-value, document, and graph models, with distributed computing as the common implementation paradigm.

A key attribute of advanced analytics is the ability to correlate and fuse the data from many domains and types. In a traditional system, data munging and pre-analytics are used to extract features that allow integrating with other data through a relational model. In big data analytics, the wide range of data formats, structures, timescales, and semantics we want to integrate into an analysis presents a complexity challenge. The volume of the data can be so large that it cannot be moved to be integrated, at least in raw form. Solving this problem requires a dynamic, self-aware, adaptive storage system that is integrated with the analytics system and can optimize the locus of the various operations depending on the available IO gap and the network bandwidth at every instant.

In big data, we often need to process the data in real time, or at least near real time. Sensors are networked and can even be meshed, where the mesh can perform some recoding functions. The data rate can vary over orders of magnitude in very short time intervals. While streaming technology in communications has been perfected over many decades, in big data the data flow and its variability are still largely unexplored territory. The data rate is not the only big data characteristic presenting variability challenges: variability also refers to changes in format or structure, semantics, and quality.

In view of recent world events—from Edward Snowden's leaks to Donald Trump's call on Bill Gates to close the Internet—the privacy and security issues for big data are stringent but also in great flux due to the elimination of the Safe Harbor framework. According to a 2015 McKinsey study, cybersecurity is the greatest risk for Industry 4.0, with €50 billion in reported annual damage to the German manufacturing industry caused by cyberattacks.

With all these requirements, it is not surprising that there is only a handful of storage products that are both POSIX and HDFS compliant, highly available, cybersecure, fully distributed, and scalable.