Wednesday, March 30, 2016

Big Dada

If you mistyped and were searching for "big data," try this other post. Here we are on 5 February 1916 in the Cabaret Voltaire on the Spiegelgasse, just a few steps from the apartment where Lenin was living in exile and not far from where I lived for a couple of years under a gaslight.

Cabaret Voltaire

Dada represents total doubt about everything, absolute individualism, and the destruction of ideals and norms hitherto cast in concrete. For example, in Hugo Ball's Lautgedichte (sound poems) the utterances are decoupled from semantics. A century later, this is how naive users misuse Internet search engines and then wonder why their queries just return millions of dada, much to the chagrin of the computational linguists trying to design semantic search engines. This is dada. No, it is a trigram. Can a second-order Markov model help? Not for this trigram: according to Google Books, it does not exist. Coming up with a new sentence is so dada.

The dadaists were displeased with science, which they judged to be elitist and far removed from any comprehensibility and sensuality. Maybe they were not completely right, considering what was happening in theoretical physics around that time. But certainly, today science is more dada, when fresh laureates dance their Ph.D. You can win a trip to Stanford and visit me, just a few steps away, under an LED lamppost.

Tuesday, March 15, 2016

bottom-up vs. top-down

Last week I wrote about systems of autonomous components, also dubbed bees vs. the beehive. In management, there is the somewhat related concept of bottom-up vs. top-down management.

Bottom-up used to be popular in engineering companies. Engineers work in small groups that push beyond the bleeding edge of technology and invent new product concepts. As the product evolves, more people are recruited and the product is polished by UX experts, manufacturability experts, marketing and sales teams, etc.

In a top-down company, the leader of the company is an expert visionary who has a new technology idea. A team of second level executives is assembled, which repeats the process for the next level, etc., down to the worker bees who develop the product. This was the basis of the utopian concept described in Tommaso Campanella's Città del Sole.

Der Schulmeister und seine Schüler (The Schoolmaster and His Pupils)

The preference for one paradigm or the other oscillates with time. Both are viable. Things only go wrong when a mediocre person becomes the head honcho of a bottom-up company and transforms it into a top-down one. In a bottom-up company, management's role is mostly to clear away the obstacles that slow down the engineers; such managers are typically facilitators more than leaders.

When the polarity of a company is switched from bottom-up to top-down, the management layers typically fail. With a mediocre person at the top, the company is doomed. It can take decades, but in the end, there is no escape from the spiral of death.

Friday, March 11, 2016

systems of autonomous components

In the discussions about big data and scalability, we learned how Gunther's universal scalability law suggests scaling out.

Another way to look at vertical vs. horizontal scaling is to compare a whale to a fish school. A school of small fish can have the same biomass as a blue whale, yet while the whale takes about 5 minutes to turn 180º, the school switches direction in an instant. The blue whale has no escape when under attack!


Leo Lionni's 1963 picture book スイミー (Swimmy) is particularly popular in Japan because it conveys the message that together we are strong, even though we are small. This is related to the concepts of synergy and emergent property.

In the case of big data and scalability, the moral of the story is that you do not want a rigid, powerful central authority, but a plurality of autonomous components orchestrated to play the same symphony, with a minimum of coherency overhead. This resonates with Industry 4.0.

Thursday, March 10, 2016

scale-up and scale-out

In the post on big data we mentioned Gunther's universal scalability law:

Sp = p / (1 + σ (p − 1) + κ p (p − 1))

In this model, p is the number of processors or cluster nodes and Sp is the speedup with p nodes. σ represents the degree of contention in the system, and κ the lack of coherency in the distributed data. An example of contention is waiting in a message queue (bottleneck saturation); an example of incoherency is keeping the processor caches up to date (non-local data exchange).

When we do measurements, Tp is the runtime on p nodes, and Sp = T1 / Tp is the speedup with p nodes. When we have enough data, we can estimate σ and κ for our system and dataset using nonlinear statistical regression.
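As a minimal sketch of that regression step, here is how one might fit σ and κ with SciPy; the node counts and runtimes below are made-up illustrative numbers, not the measurements from Gunther's paper.

```python
# Minimal sketch: fit the USL parameters sigma and kappa from measured speedups.
# The sample measurements are invented for illustration only.
import numpy as np
from scipy.optimize import curve_fit

def usl(p, sigma, kappa):
    """Universal scalability law: speedup as a function of node count p."""
    return p / (1.0 + sigma * (p - 1) + kappa * p * (p - 1))

# p: node counts; t: measured runtimes T_p (arbitrary units, illustrative only)
p = np.array([1, 2, 4, 8, 16, 32, 64])
t = np.array([100.0, 52.0, 27.5, 15.0, 9.0, 6.5, 6.2])
speedup = t[0] / t                      # S_p = T_1 / T_p

# Nonlinear least-squares fit of sigma and kappa (both constrained >= 0)
(sigma, kappa), _ = curve_fit(usl, p, speedup, p0=(0.05, 0.001),
                              bounds=(0, np.inf))
print(f"sigma = {sigma:.4f}, kappa = {kappa:.6f}")

# Node count at which the fitted model predicts the maximum speedup
p_max = np.sqrt((1 - sigma) / kappa)
print(f"speedup peaks near p = {p_max:.0f} nodes")
```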

The model makes it easy to understand the difference between scale-up and scale-out architectures. In a scale-up system, you increase the speedup by reducing the contention, for example by adding memory or by bonding network ports. When you play with σ, you learn that you can increase the speedup, but not move the node count at which the speedup peaks, which remains at 48 nodes in the example in Gunther's paper.

USL for the TeraSort experiment

In a scale-out architecture, you play with κ, and you learn that you can additionally move the maximum to a higher node count. In Gunther's paper, they move the maximum to 95 nodes by optimizing the system to exchange less data.

This shows that scale-up and scale-out are not simply about using faster system components vs. using more components in parallel. In both cases, you have a plurality of nodes, but you optimize the system differently. In scale-up, you find bottlenecks and mitigate them. In scale-out, you also work on the algorithms to reduce data exchange.
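A quick back-of-the-envelope check makes this concrete: in the USL the speedup peaks near p* = √((1 − σ) / κ), so σ has only a weak influence on where the peak sits, while κ moves it strongly. The σ and κ values below are invented for illustration; they happen to reproduce roughly the 48-to-95-node shift described above.

```python
# Illustrative only: how the peak location p* = sqrt((1 - sigma) / kappa)
# responds to tuning sigma (contention) versus kappa (incoherency).
from math import sqrt

def peak(sigma, kappa):
    return sqrt((1 - sigma) / kappa)

sigma, kappa = 0.05, 0.0004          # made-up baseline values
print(round(peak(sigma, kappa)))     # ~49 nodes
print(round(peak(sigma / 2, kappa))) # halving contention: still ~49 nodes
print(round(peak(sigma, kappa / 4))) # quartering incoherency: ~97 nodes
```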

Since the incoherency term is quadratic, you get more bang for the buck by reducing the coherency workload. This leads to adding more nodes instead of increasing the performance of the individual nodes, the latter usually being a much more expensive proposition.

In big data, scale-out or horizontal scaling is the key approach to achieve scalability. While this is obvious to anybody who has done GP-GPU programming, it is less so for those who are just experienced in monolithic apps.

Tuesday, March 8, 2016

big data

A few years ago, when "big data" became a buzzword, I attended an event from a major IT vendor about the new trends in the sector. There were presentations on all the hot buzzwords, including a standing-room-only session on big data. After the obligatory corporate title slide came a slide with a gratuitous, unrelated stock photo on the left, while the term "big data" filled the entire right half in a huge font. Unfortunately, the rest of the presentation contained only platitudes, without any actionable information. My takeaway was that "big data" is a new buzzword that is written in a big font size.

A meaningless graphical representation of big data

After the dust settled, big data became associated with the three characteristics of data volume, velocity, and variety, proposed in 2001 by META's Doug Laney as the dimensions of "3-D data management" (META is now part of Gartner). Indeed, today Gartner defines big data as high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.

Volume indicates that the amount of data at rest is of petabyte scale, which requires horizontal scaling. Variety indicates that the data is from multiple domains or types: in a horizontal system, there are no vertical data silos. Velocity refers to the rate of flow; a large amount of data in motion introduces complications like latencies, load balancing, locking, etc.

In the meantime, other 'V' terms have been added, like variability and veracity. Variability refers to a change in velocity: a big data system has to be self-aware, dynamic, and adaptive. Veracity refers to the trustworthiness, applicability, noise, bias, abnormality and other quality properties in the data.

A more recent insight is the clear distinction between scale-up and scale-out systems, captured by a powerful model, the universal scalability law, dissected in a CACM paper: N. J. Gunther, P. Puglia, and K. Tomasette. Hadoop superlinear scalability. Communications of the ACM, 58(4):46–55, April 2015.

This model allows us to state that Big Data refers to extensive data sets, primarily characterized by volume, velocity, and/or variety, that require a horizontally scalable architecture for efficient storage, manipulation, and analysis, i.e., for extracting value.

To make this definition actionable, we need an additional concept. Big Data Engineering refers to the storage and data manipulation technologies that leverage a collection of horizontally coupled resources to achieve nearly linear scalability in performance. Now we can drill down a little.

New engineering techniques in the data layer have been driven by the growing prominence of data types that cannot be handled efficiently in a traditional relational model. The need for scalable access to structured and unstructured data has led to software built on name–value / key–value pair, columnar (big table), document-oriented, and graph (including triple-store) paradigms. A triple store or RDF store is a purpose-built database for the storage and retrieval of triples through semantic queries. A triple is a data entity composed of subject, predicate, and object, like "Bob is 35" or "Bob knows Fred."
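To make the triple idea concrete, here is a toy in-memory sketch; it is not the API of any real RDF store, just a list of (subject, predicate, object) tuples with wildcard pattern matching.

```python
# Toy in-memory triple store: not a real RDF engine, just the idea of
# storing (subject, predicate, object) tuples and matching patterns.
triples = [
    ("Bob", "age", "35"),
    ("Bob", "knows", "Fred"),
    ("Fred", "age", "42"),   # invented example data
]

def match(subject=None, predicate=None, obj=None):
    """Return all triples matching the given pattern; None is a wildcard."""
    return [(s, p, o) for (s, p, o) in triples
            if (subject is None or s == subject)
            and (predicate is None or p == predicate)
            and (obj is None or o == obj)]

print(match(subject="Bob"))        # everything known about Bob
print(match(predicate="knows"))    # who knows whom
```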

Due to the velocity, it is usually not possible to structure the data when it is acquired (schema-on-write). Instead, data is often stored in raw form. Lazy evaluation is used to cleanse and index data as it is being queried from the repository (schema-on-read). This point is critical to understand because to run efficiently, analytics requires the data to be structured.
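As a minimal sketch of schema-on-read, assuming a hypothetical file of raw JSON lines and an invented "temperature" field: the records are stored untouched and only parsed, cleansed, and typed when a query needs them.

```python
# Schema-on-read sketch: store raw records untouched, apply structure lazily
# at query time. The file name and field layout are hypothetical.
import json

def read_raw(path):
    """Yield raw JSON lines without imposing any schema at ingest time."""
    with open(path) as f:
        for line in f:
            yield line

def query_temperatures(path, threshold):
    """Parse, cleanse, and type records only when the query needs them."""
    for line in read_raw(path):
        try:
            record = json.loads(line)
            temp = float(record["temperature"])   # typing happens here
        except (ValueError, KeyError, TypeError):
            continue                              # cleansing: skip bad records
        if temp > threshold:
            yield record

# for r in query_temperatures("sensors.jsonl", 30.0):
#     print(r)
```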

The end-to-end data lifecycle categorizes the steps as collection, preparation, analysis, and action. In a traditional system, the data is stored in persistent storage after it has been munged (extract, transform, load, followed by cleansing; a.k.a. wrangling or shaping). In traditional use cases, the data is prepared and analyzed for alerting: schema-on-write. Only afterward are the data or aggregates of the data given persistent storage. This is different from high-velocity use cases, where the data is often stored raw in persistent storage: schema-on-read.

As noted above, veracity refers to the trustworthiness, applicability, noise, bias, abnormality, and other quality properties of the data. Current technologies cannot yet assess, understand, and exploit veracity throughout the data lifecycle. This is a big data characteristic that presents many opportunities for disruptive products.

Because IO bandwidth is often the limiting resource in big data systems, while processor chips can have idle cores due to the IO gap (the number of cores on a chip keeps increasing while the number of pins stays constant), big data engineering seeks to embed local programs like filtering, parsing, indexing, and transcoding in the storage nodes. This is only possible when the analytics and discovery systems are tightly integrated with the storage system. The analytics programs must be horizontal not only in the sense that they process the data in parallel: where possible, operations that access localized data are pushed out to cores in the storage nodes, depending on the size of the IO gap.
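The sketch below simulates this "move the code to the data" idea with threads standing in for storage nodes; there is no real storage API here, only the pattern of filtering next to the data and shipping back small aggregates.

```python
# Sketch: each (simulated) storage node filters and reduces its local shard,
# and only small aggregates travel over the "network". A real system would
# ship the function to the node holding the shard instead.
from concurrent.futures import ThreadPoolExecutor

shards = [                      # pretend each list lives on a different node
    [3, 18, 7, 42, 11],
    [25, 2, 39, 8],
    [14, 50, 6, 33],
]

def local_filter_and_count(shard, threshold):
    """Runs next to the data: filter locally, return only a tiny aggregate."""
    return sum(1 for x in shard if x > threshold)

with ThreadPoolExecutor() as pool:
    partial_counts = pool.map(lambda s: local_filter_and_count(s, 10), shards)

print(sum(partial_counts))      # only the per-node counts crossed the network
```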

With this background, we can attempt a formal definition of big data:

Big Data is a data set (or sets) with characteristics (e.g., volume, velocity, variety, variability, veracity) that, for a particular problem domain at a given point in time, cannot be efficiently processed using current / existing / established / traditional technologies and techniques in order to extract value.

Big data is a relative, not an absolute, term. It essentially rests on the self-referential observation that data is big because it requires scalable systems to handle it, and architectures with better scaling have come about because of the need to handle big data.

The era of a trillion sensors is upon us. Traditional systems cannot extract value from the data they produce. This has stimulated Peaxy to invent new ways to provide scalable storage across a collection of horizontally coupled resources, and a distributed approach to querying and analytics. Often, the new data models for big data are lumped together as NoSQL, but we can classify them at a finer scale as big table, name–value, document, and graph models, with distributed computing as the common implementation paradigm.

A key attribute of advanced analytics is the ability to correlate and fuse the data from many domains and types. In a traditional system, data munging and pre-analytics are used to extract features that allow integrating with other data through a relational model. In big data analytics, the wide range of data formats, structures, timescales, and semantics we want to integrate into an analysis presents a complexity challenge. The volume of the data can be so large that it cannot be moved to be integrated, at least in raw form. Solving this problem requires a dynamic, self-aware, adaptive storage system that is integrated with the analytics system and can optimize the locus of the various operations depending on the available IO gap and the network bandwidth at every instant.
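As a minimal sketch of that traditional pattern, with invented records and an invented device_id key: a common feature is extracted from raw log lines so they can be joined relationally with another source.

```python
# Sketch of feature extraction followed by a relational-style join.
# Both data sources and the shared "device_id" key are invented.
sensor_logs = [
    {"raw": "dev-17|2016-03-08T10:00|21.5"},
    {"raw": "dev-23|2016-03-08T10:00|19.8"},
]
maintenance = {"dev-17": "serviced 2016-02-01", "dev-23": "overdue"}

def extract_features(entry):
    """Pre-analytics: turn a raw log line into typed, joinable fields."""
    device_id, timestamp, temp = entry["raw"].split("|")
    return {"device_id": device_id, "timestamp": timestamp, "temp": float(temp)}

# Fuse the two domains on the extracted key
for entry in sensor_logs:
    features = extract_features(entry)
    features["maintenance"] = maintenance.get(features["device_id"], "unknown")
    print(features)
```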

In big data, we often need to process the data in real time or at least near real time. Sensors are networked and can even be meshed, where the mesh can perform some recoding functions. The data rate can vary over orders of magnitude within very short time intervals. While streaming technology has been perfected over many decades in communications, in big data the data flow and its variability are still largely unexplored territory. The data rate is not the only big data characteristic presenting variability challenges: variability also refers to changes in format or structure, semantics, and quality.

In view of recent world events, from Edward Snowden's leaks to Donald Trump's call on Bill Gates to close the Internet, the privacy and security requirements for big data are stringent but also in great flux due to the elimination of the Safe Harbor framework. According to a 2015 McKinsey study, cybersecurity is the greatest risk for Industry 4.0, with €50 billion in annual reported damage to the German manufacturing industry caused by cyberattacks.

With all these requirements, it is not surprising that there is only a handful of storage products that are POSIX and HDFS compliant, highly available, cybersecure, fully distributed, scalable, etc.

Wednesday, March 2, 2016

Architecture vs. baling wire and chewing gum

Over thirty years ago, I was working on the Dragon project. The full-custom VLSI chips were so complex that we were not able to use the commercial tools of the time. Thus, part of the team was working on the development of a set of tools to design the chips; I was working on the design rule checker.

Since we were concomitantly writing the tools, designing the chips, writing the compiler, and porting the operating system, we all had to work in parallel. For the tools, this was made possible by giving each tool its own file format; the data was moved from one workflow step to the next in a file. For example, the design rule checker parsed the layout file and emitted the files for the logic and timing simulators.

The challenge was that this was not really a development project but a research project: we were inventing the algorithms while implementing the tools that used them. One consequence was that the file formats kept changing, which broke the parsers. This was painful, especially because we had hard deadlines with the MOSIS service we were using.

One day, during another not-so-happy meeting, I asked the question: "Why are we using files at all? We are just dealing with different representations of the same chip. There should be just one data structure, and the design tools should decorate it, analyze it, or render it."

The idea stuck and two colleagues designed and implemented the Core data structure. The layout editor would add geometry to the Core data structure and the various design rule checkers and simulation programs would analyze the data structure elements and add their own decorations. Almost immediately, a designer added a schematics editor and we wrote a program to generate the documentation by traversing the Core.
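The following is only an illustrative sketch of that decoration pattern, in Python rather than the original implementation language; the class and attribute names are invented and do not reflect the actual Core implementation.

```python
# Illustrative sketch of the single-data-structure idea: every tool decorates
# the same in-memory design object instead of parsing and emitting files.
# Names are made up; this is not the original Core implementation.

class CoreElement:
    def __init__(self, name, geometry):
        self.name = name
        self.geometry = geometry       # added by the layout editor
        self.decorations = {}          # each tool adds its own annotations

    def decorate(self, tool, data):
        self.decorations[tool] = data

design = [CoreElement("nand2_x1", geometry=[(0, 0), (4, 3)])]

# A design rule checker decorates elements instead of writing a report file...
for element in design:
    element.decorate("drc", {"violations": []})

# ...and a simulator or documentation generator traverses the same structure.
for element in design:
    print(element.name, element.decorations)
```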

In a way, the original system with the parsers was built with baling wire and chewing gum, while the Core data structure introduced a sophisticated architecture. The various programs became simpler and so we were able to do nightly builds of the entire system. The entire design process became much more efficient and the tools more powerful.

Today, it has become very easy to quickly rig up a system based on open source software. A large number of powerful libraries are available, and many produce very compelling eye candy.

Not too long ago, I had to build a prototype where a user could take a mobile phone picture of a store receipt and see on the display the receipt with the prices of a competing store. Since the demo for the customer was imminent, I had to work like a beaver to set up an OCR server, figure out how to get through the firewall, build a price translation table for the two stores, and set up a website to display the doctored receipt.

It took me two weeks, and the wabi-sabi result looked like a prototype. Concomitantly, over a weekend, a bunch of young programmers rigged up an app that took a picture, replaced the top quarter with a bitmap of the top of the competing store's receipt (changing the store block while leaving the item list unchanged), and displayed the result. This was even worse than baling wire and chewing gum. However, they spent time making the GUI look slick.

At the end of the day, they got a lot of compliments, but since there was no architecture, they would never have been able to turn their rig into a working system; in particular, doing the price translation is difficult.

Unfortunately, today's plethora of libraries makes it not only possible to write good applications, but also to quickly rig something up without understanding what is going on. There is much less interest in well-architected, powerful software than there used to be. Fortunately, the economy works in waves, and the next downturn will flush out the system. The smart samurai does not fight his enemy: he patiently sits downstream by the river, waiting for the body of his enemy to float by.

waiting by the river