Monday, April 24, 2017

Juggling Tools

Discussions about imaging invariably mention imaging pipelines. A simple pipeline to transform the image data to a different color space may have three stages: a lookup table to linearize the signal, a linear approximation to the second color space, and a lookup table to model the non-linearity of the target space. As an imaging product evolves, engineers add more pipeline stages: tone correction, gamut mapping, anti-aliasing, de-noising, sharpening, blurring, etc.

In the early days of digital image processing, researchers quickly realized that imaging pipelines should be considered harmful because, due to discretization, the resulting image space became increasingly sparse at each stage. However, in the early 1990s, with the arrival of the first digital cameras and consumer color printers, imaging pipelines came back. After some 25 years of experience, engineers have become more careful with the pipelines, but they are still a trap.
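The sparsity can be made concrete with a small sketch. It assumes 8-bit data and two purely illustrative gamma values (2.2 to linearize, 1/1.8 for the target space), and it leaves out the linear stage in the middle. None of these numbers come from a real product. It simply counts how many distinct code values survive each quantized lookup table.

```python
# Minimal sketch: count the distinct 8-bit codes that survive a chain of
# quantized lookup tables. The gamma values are illustrative assumptions.

def make_lut(gamma, levels=256):
    """Build a lookup table for a power law, rounded back to integer codes."""
    top = levels - 1
    return [round(top * (i / top) ** gamma) for i in range(levels)]

def apply_lut(lut, values):
    """Push every code value through one pipeline stage."""
    return [lut[v] for v in values]

linearize = make_lut(2.2)      # stage 1: assumed source non-linearity
encode = make_lut(1 / 1.8)     # stage 3: assumed target non-linearity
# Stage 2 (the linear approximation) is omitted here, i.e. treated as identity.

inputs = list(range(256))
after_stage1 = apply_lut(linearize, inputs)
after_stage3 = apply_lut(encode, after_stage1)

distinct_in = len(set(inputs))
distinct_1 = len(set(after_stage1))
distinct_3 = len(set(after_stage3))

print(distinct_in, distinct_1, distinct_3)
```

Every code that two inputs share after a stage is lost for good: a later stage can only map the surviving codes around, never recover the ones that collapsed.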

In data analytics, people often make a similar mistake. There are also three basic steps, namely data wrangling, statistical analysis, and presentation of the result. As development progresses, the analysis becomes richer; when the data is a signal, it is filtered in various ways to create different views, statistical analyses are applied, the data is modeled, classifiers are deployed, estimates and inferences are computed, etc. Each step is often considered a separate task, encapsulated in a script that parses a comma-separated values (CSV) data file, calls one or more functions, and then writes out a new CSV file for the next stage.

The pipeline is not a good model to use when architecting a complex data processing endeavor.

I cannot remember if it was 1976 or 1978 when at PARC the design of the Dorado was finished and Chuck Thacker hand-wrote the first formal note on the next workstation: the Dragon. While the Dorado had a bit-sliced processor in ECL technology, the Dragon was designed as a multi-processor full-custom VLSI system in nMOS technology.

The design was much more complex than any chip design that had been previously attempted, especially after the underlying technology was switched from nMOS to CMOS. It became immediately evident that it was necessary to design new design automation (DA) tools that could handle such big VLSI chips.

A system based on full-custom VLSI design was a sequence of iterations of the following steps: design a circuit as a schematic, lay out the symbolic circuit geometry, check the design rules, perform logic and timing analysis, create a MOSIS tape, debug the chip. Using stepwise refinement, the process was repeated at the cadence of MOSIS runs. In reality, the process was very messy, because, at the same time, the physicists were working on the CMOS fab, the designers were creating the layout, the DA people were writing the tools, and the system people were porting the Cedar operating system. Just in the Computer Science Laboratory alone, about 50 scientists were working on the Dragon project.

The design rule checker Spinifex played a somewhat critical role, because it parsed the layout created with ChipNDale, analyzed the geometry, flagged the design rule errors, and generated the various input files for the logic simulator Rosemary and the timing simulator Thyme. Originally, Spinifex was an elegant hierarchical design rule checker, which made it possible to verify all the geometry for a layout in memory. However, with the transition from nMOS to CMOS, the designers moved more and more to a partially flat design, which broke Spinifex. The situation was exacerbated by the endless negotiations between designers and physicists to allow for exceptions to the rules, leading to a number of complementary specialized design rule checkers.

With 50 scientists on the project, ChipNDale, Rosemary, and Thyme were also rapidly evolving. With the time pressure of the tape-outs, there were often inconsistencies in the various parsers. As the whipping boy in the middle of all this, one morning, while showering, I had an idea. The concept of a pipeline was contra naturam compared to the work process. The Smalltalk researchers on the other end of the building had an implementation process where a tree structure described some gestalt and methods would be written that decorate this representation of the gestalt.

In the following meeting, I proposed to define a data structure representing a chip. Tools like the circuit designer, the layout design tool, and the routers would add to the structure while tools like the design rule checkers and simulators would analyze the structure, with their output being further decorations added to the data structure. Even the documentation tools could be integrated. I did not expect this to have any consequence, but there were some very smart researchers in the room. Bertrand Serlet and Rick Barth implemented this paradigm and project representation and called it Core.
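The idea can be illustrated with a toy sketch. This is not how Core was implemented (it was not written in Python); the names `Cell`, `add_layout`, `check_design_rules`, and `report`, as well as the minimum-width rule, are hypothetical and chosen only to show one shared structure that some tools extend and others analyze, with results stored as further decorations.

```python
# Toy sketch of a shared, decorated structure. All names are hypothetical.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Cell:
    """One node of the chip hierarchy; tools attach results as decorations."""
    name: str
    children: list["Cell"] = field(default_factory=list)
    decorations: dict[str, Any] = field(default_factory=dict)

    def walk(self):
        """Yield this cell and all cells below it."""
        yield self
        for child in self.children:
            yield from child.walk()

MIN_WIDTH = 2  # hypothetical design rule, arbitrary units

def add_layout(chip, widths):
    """Producer tool: decorates cells with (made-up) wire widths."""
    for cell in chip.walk():
        if cell.name in widths:
            cell.decorations["wire_widths"] = widths[cell.name]

def check_design_rules(chip):
    """Analyzer tool: reads the layout decoration, adds its findings."""
    for cell in chip.walk():
        found = cell.decorations.get("wire_widths", [])
        cell.decorations["drc_errors"] = [w for w in found if w < MIN_WIDTH]

def report(chip):
    """Documentation tool: summarizes what the analyzers left behind."""
    return {c.name: c.decorations["drc_errors"]
            for c in chip.walk() if c.decorations.get("drc_errors")}

chip = Cell("chip", [Cell("alu"), Cell("cache")])
add_layout(chip, {"alu": [3, 1], "cache": [2, 4]})
check_design_rules(chip)
summary = report(chip)
print(summary)
```

No tool here parses another tool's output file. Each one reads and writes the same structure, so a change in one tool cannot silently break a parser in the next.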

The power was immediately manifest. Everybody chipped in: Christian Jacobi, Christian Le Cocq, Pradeep Sindhu, Louis Monier, Mike Spreitzer and others joined Bertrand and Rick in rewriting the entire tool set around Core. Bob Hagman wrote the Summoner, which summoned all Dorados at PARC and dispatched parallel builds.

Core became an incredible game changer. While before there was never an entirely consistent system, now we could do nightly builds of the tools and the chips. Besides, the tools were no longer broken at the interfaces all the time.

The lubricant of Silicon Valley is the brains wandering from one company to another. When one brain wandered to the other side of Coyote Hill, the core concept gradually became an important architectural paradigm that is at the basis of some modern operating systems.

If you are a data scientist, do not think in terms of scripts for pipelines connected by CSV files. Think of a core structure representing your data and the problem you are trying to solve. Think about literate programs that decorate your core structure. When you make the core structure persistent, think rich metadata and databases, not files with plain tables. Last but not least, your report, too, should be generated automatically by the system.
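As one possible reading of this advice, here is a minimal sketch that keeps decorations in a database together with their metadata, using Python's standard `sqlite3` module. The table layout, the helper names `decorate`, `lookup`, and `report`, and the sensor data are all assumptions made up for illustration.

```python
# Minimal sketch: a persistent store of decorations with metadata.
# Schema, helper names, and sample data are illustrative assumptions.
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for real persistence
conn.execute(
    "CREATE TABLE decoration ("
    " node TEXT NOT NULL, key TEXT NOT NULL, value TEXT NOT NULL,"
    " tool TEXT NOT NULL, units TEXT,"
    " PRIMARY KEY (node, key))"
)

def decorate(node, key, value, tool, units=None):
    """Attach a JSON-encoded value to a node, recording who produced it."""
    conn.execute(
        "INSERT OR REPLACE INTO decoration VALUES (?, ?, ?, ?, ?)",
        (node, key, json.dumps(value), tool, units),
    )

def lookup(node, key):
    """Return the value and its metadata, or None if it is absent."""
    row = conn.execute(
        "SELECT value, tool, units FROM decoration WHERE node = ? AND key = ?",
        (node, key),
    ).fetchone()
    if row is None:
        return None
    value, tool, units = row
    return {"value": json.loads(value), "tool": tool, "units": units}

def report():
    """Generate report lines directly from the stored decorations."""
    rows = conn.execute(
        "SELECT node, key, value, units, tool FROM decoration ORDER BY node, key"
    )
    return [f"{node}.{key} = {json.loads(value)} {units or ''} (from {tool})"
            for node, key, value, units, tool in rows]

# The wrangling step decorates the node with cleaned samples ...
decorate("sensor_a", "samples", [20.0, 21.0, 22.0], tool="wrangler", units="degC")
# ... and the analysis step reads them and adds its own decoration.
samples = lookup("sensor_a", "samples")["value"]
decorate("sensor_a", "mean", sum(samples) / len(samples), tool="stats", units="degC")

lines = report()
print("\n".join(lines))
```

Units and provenance travel with every value, which is exactly what a plain CSV table between two scripts tends to lose.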

data + structure = knowledge