Wednesday, September 28, 2016

Extreme color naming experiment finds locus for luma–chroma transformation

From time to time, a new physiological experimentation technique or a significant new instrument is developed, leading to breakthrough discoveries. In the case of color vision, this usually entails a doctoral student and their significant other—and maybe some additional dedicated colleagues or the professor—to undergo the cruel ordeal of having their pupils dilated (mydriasis) and their ciliary muscle paralyzed to avoid accommodation (cycloplegia), then get strapped to a headrest while biting a dental impression mount, to make observations in repeated interminable sessions for months on end, all in the name of science.

In their recent paper The elementary representation of spatial and color vision in the human retina, Ramkumar Sabesan et al. report on a seminal study to locate where in the human visual system (HVS), the luma-chroma encoding occurs in the parvocellular pathway (midget ganglion cells).

This study by Ramkumar Sabesan et al. represents the first time cone photoreceptors of known spectral type have been individually targeted and activated with light in the living human retina.

I cannot believe, it has been thirty years since I drew this diagram:

cognitive model

Although the above diagram looks like a model for the HVS, it was more a plan for my implementation of a color workbench. To keep the head cool and prevent it from overheating, our brain evolved to minimize the usage of energy. This is accomplished by having pipelines where at each stage the information gets recoded to make it more complex but more compact. This is at the cost of speed: while a two-photon catch the shift in electron density takes less than a femtosecond, the entire photo-cycle lasts a picosecond and at the end of the pipeline, adaptation can take seconds and color naming minutes.

An important feature were the bidirectional arrows: we have a feedback loop with control moving down and information moving up. Because of the sequence of recoding and the feedback, the receptors in the retina are not like pixels in a CCD sensor

  • Receptive field: area of visual field that activates a retinal ganglion (H.K. Hartline, 1938)
  • Center-surround fields allow for adaptive coding (transmit contrast instead of absolute values)
  • Horizontal cells presumed to inhibit either its bipolar cell or the receptors: opponent response in red–green and yellow–blue potentials (G. Svaetichin, 1956)
  • Retinal ganglion can be tonic or phasic: pathway may also be organized by information density or bandwidth

The last item comes from a table of the parvocellular and magnocellular pathways Lucia Rositani-Ronchi compiled for me at the 1993 AIC meeting in Budapest:

Originating retinal ganglion cells
Temporal resolution
Slow (sustained responses, low conduction velocity)
Fast (mostly transient responses, some sustained, high conduction velocity)
Modulation dominance
Adaptation occurs at high frequencies
Adaptation occurs at all frequencies
Receives mostly opponent type input from cones sensitive to short and long wavelengths
Receives mostly combined (broadband) input from M and L cones, both from the center and from the surround of receptive fields
Contrast sensitivity
Low (threshold > 10%)
High (threshold < 2%)
LGN cell saturation
Linear up to about 64% contrast
At 10%
Spatial resolution
High (small cells)
Low (large cells)
Spatio-temporal resolution
When fixation is strictly foveal, extraction of high spatial frequency information (test gratings), reflecting small color receptive fields
Responds to flicker
Long integration time
Short integration time
Relation to channels
Could be a site for both a lightness channel as for opponent-color channels. The role depends on the spatio-temporal content of the target used in the experiment
Might be a site for achromatic channels because the spectral sensitivity is similar to Vλ, it is more sensitive to flicker, and has only a weak opponent color component
Possible main role in the visual system
Sustain the perception of color, texture, shape, and fine stereopsis
Sustain the detection of movement, depth, and flicker; reading of text

We have four retinal pigments (erythrolabe, chlorolabe, cyanolabe, rhodopsin) attached by a lysine to a protein backbone. These four pigments are sensitized to photons at 4 energy levels (wavelengths): L, M, S, and rods. The energy levels are not numbers but distributions, namely the probabilities for a photon catch with that chromatophore.

A 3-dimensional signal with L, M, S is not efficient because we need a high spatial resolution but the chromatic information can be at a lower resolution. This is reflected in the modulations transfer functions for the HVS and is exploited for example in image encoding, where we transform an RGB signal into a color opponent signal and then down-sample the chroma images:

CIELAB separations

In 1993, it was not known where this transformation occurs in the HVS. In fact, there is quite a bit of processing in the retina, and many details are still unknown.


In their recent paper The elementary representation of spatial and color vision in the human retina, Ramkumar Sabesan et al. report on a seminal study to locate where in the HVS, the luma-chroma encoding occurs in the parvocellular pathway.

Using an adaptive optics scanning laser ophthalmoscope (AO-SLO), the authors studied 174 L-cone, 99 M-cone, and 12 S-cone samples by stimulating them individually with a 543 nm, 500 ms pulse and asking two subjects to report the perceived color name. The names were restricted to red, green, blue, yellow, white, and not seen.

The subjects reported achromatic sensations 61.8% of the time. When red was reported (22.5% of seen trials), it was more likely to be driven by L- than M-cones, whereas green (15.7%) was more likely to come from the excitation of M-cones. Thus, L-cones tended to signal both white and red, whereas M-cones tended to signal both white and green. The observation that these color percepts roughly align with the predictions of large-field cone-isolating stimuli suggests that the same opponent neuronal circuits may be implicated in both paradigms. This finding also supports the idea that the visual system can learn the spectral identity of individual cones.

The apparent segregation of color categories into distinct populations of cells is suggestive of a parallel representation of color and achromatic sensations. Moreover, these results imply that, for a large number of cones, their individual activation is not sufficient to produce a color. (Remember that in this experiment single cones are excited; in free vision, most cones are activated and the eye saccades, presenting a point in the visual field to several cones.)

The authors found that the cones most likely to generate strong spectral opponency in a parvocellular neuron, that is, those surrounded by cones of opposing type, were not more likely to generate red or green percepts. Rather, all these examples, when stimulated in isolation, drove achromatic percepts on a majority of the trials.

There is little doubt that the long-duration supra-threshold stimulation of individual cones here influences the firing of a number of different ganglion cell types. In particular, a multi-electrode study demonstrated that the activation of a single cone simultaneously evoked responses in both midget (parvocellular) and parasol (magnocellular) ganglion cells. The results may be particularly informative in differentiating proposals about the role of parvocellular neurons in achromatic spatial and color vision.

The study confirms the old result that the red-green system samples the visual world at a lower resolution than the achromatic system. The new results from the studies reported in the present paper are consistent with the idea that the HVS represents these two pieces of information with separate pathways that emerge as early as the photoreceptor synapse: one chiefly concerned with high-resolution achromatic vision and a second, lower-resolution color system.

The luma-chroma transformation with chroma subsampling is very important in image processing. In your opinion, does this new result allow the design of better imaging pipelines? Does this allow us to design better retinex algorithms? Join the conversation in the Trellis group.

Citation: R. Sabesan, B. P. Schmidt, W. S. Tuten, A. Roorda, The elementary representation of spatial and color vision in the human retina. Sci. Adv. 2, e1600797 (2016).

Thursday, September 1, 2016

Industrie 4.0: Entwicklungsfelder für den Mittelstand

In den 80er Jahren verschob sich die US-amerikanische industrielle Herstellung nach Fernost, vorwiegend China. Es war ein unaufhaltbares Ereignis, denn der Stand der industriellen Fertigung hatte sich seit Anfang Jahrhundert nicht mehr weiterentwickelt.

Nachdem ich in den 60er und 70er Jahre mehrere Werke in der Schweiz und Deutschland besucht hatte, alle sauber wie ein Spital und Qualitätskontrollen an jedem Schritt, und die Fachkräfte im angewandten Physik Institut kennenlernte, die jedes noch so verrückt schwierige Apparat-Teilchen mit unglaublich engen Toleranzen herstellen konnten, war mir der erste Besuch einer US-amerikanischen Fabrik ein grosser Schock, den ich wohl nie vergessen werde.

Es war die Burroughs B1700 Fabrik in Goleta bei Santa Barbara, ende 1977. Die Fabrikationshalle war laut, da die Arbeiterinnen jede ein Transistor Radio an hatte, jeder auf einem anderen Sender. Der Fussboden war schmutzig, die Arbeitsflächen voller Chips- und Kekskrümeln. Die Lötstrasse in der Mitte der Halle sah aus, als ob sie aus einem Charlie Chaplin Film herausgefallen war. Keine Arbeits-Station hatte eine Qualität-Kontrolle: am Ende der verknoteten Herstellungsstrasse war ein grosser Tisch an dem sich die fertigen Schaltkreis-Platinen stapelten. Eine Arbeiterin nahm eine nach der anderen die Platinen und klopfte sie mit einem Bleistift nach kalten Lötstellen ab, denn auf dem Ende des Bleistifts war ein Schreibmaschinen-Radiergummi aufgeschoben.

Nun wusste ich, weshalb wir die Computer in die Schweiz erst im Geschäft aufbauen mussten und dann die Defekte suchen mussten, bevor wir sie den Kunden liefern konnten. Praktisch jede Platine musste nachgelötet werden. Ein Jahrzehnt später, war die Fabrik bei Xerox in Rochester immer noch auf dem gleichen Niveau, nur wurde dort mehr mit dem Hammer gearbeitet um bei der manuellen Montage die fehlenden Toleranzen zu überwältigen.

Mit diesem Hintergrund, stand der Verlagerung nach Fernost mit den niedrigen Löhnen nichts im Wege. Ich glaubte, dass sich die industrielle Fertigung in den USA nie erholen würde. Dies ist ganz anders als in Europa oder Japan, wo das gesellschaftliche Schwergewicht auf Schulung der Arbeiter und Verbesserung der Herstellungs-Technologien liegt. Das Resultat ist, dass zum Beispiel der Zughersteller Stadler Rail einen Grossauftrag aus den USA erhalten konnte: Das Bahnunternehmen Caltrain hat bei Stadler 16 Doppelstocktriebzüge für 551 Millionen Dollar bestellt. Die Züge werden zwischen San Francisco und dem Silicon Valley fahren, wie der Tages Anzeiger vom 16. August berichtete.

Aber, es bewegt sich nun doch etwas im Silicon Valley. Praktisch jede Auto Firma hat jetzt eine Niederlassung im Silicon Valley, aber nicht nur die klassischen Firmen: die Chinesischen Firmen die die amerikanischen Produkte anfertigen, investieren hier in Forschungs- und Entwicklungs-Labors um die neuesten Technologien zu konzepieren. Während deutsche Hersteller immer noch 7 Jahre brauchen um ein neues elektronisches Auto zu entwickeln, kann man im Silicon Valley frischgebackene Doktoren von Stanford und UC Berkeley anstellen, die die gleiche Arbeit in ein paar Jahre machen, denn mit Industrie 4.0 haben wir eine Revolution.

Gerade letztes Wochenende hatte es im New York Times einen Artikel, wie die General Electric ein Labor mit 1400 Software-Entwicklern in San Ramon aufgestellt hat, um das System Predix zu schmieden, welches nich nur eine Grundlage für Industrie 4.0 sein soll, sondern auch es ermöglichen soll, die Wartung der Produkte im Einsatz vorherzusagen.

Das Fraunhofer-Institut für Produktionstechnik und Automatisierung IPA baut nun das Applikationszentrum Industrie 4.0 auf. Jedes Unternehmen sollte die Studie Industrie 4.0: Entwicklungsfelder für den Mittelstand anfordern und sie sorgfältig lesen. Sie ist von Herrn Markus Bressner, markus.bressner(at) erhältlich, Telefon +49 711 970-1808.

Industrie 4.0: Entwicklungsfelder für den Mittelstand

Thursday, August 18, 2016

Typesetting Sweave documents with bibliographies

When you just have to make a quick plot, you can script R on the console, but when you do an experiment and want to be able to reproduce it, scripts are no good. I architect a solution and then use an IDE like RStudio to write a program. Once I have the wrangling and helper functions working the way I need, I copy them from their R files into a Sweave template and continue from there.

Today I got stuck because I was using a bibliography and RStudio could not typeset it.

It turns out, that RStudio just appears to run the LaTeX typesetting command, not the pdflatexmk script, so Biber is not called. When you want to work fast and just point and click instead of typing on the command line, you double-click on the tex file and typeset it in TeXShop.

Unfortunately, that does not work out-of-the-box because the Sweave package is part of the R distribution but not the TeXLive distribution. The solution is to make a copy of the two Sweave style files in your local TeXLive texmf directory. Here are the four steps for the current MacOS versions:

  1. go to /Library/Frameworks/R.framework/Resources/share/texmf/tex/latex
  2. copy Rd.sty and Sweave.sty
  3. go to ~/Library/texmf/tex/latex and paste the two files
  4. do a sudo texhash

Now you can typeset your Sweave documents both in Studio and TeXShop. The latter is handy only when you need to redo the bibliography, glossary, or index: you can keep working just in RStudio.

The very first line in your LaTeX file, before the class declaration, should be

% !TEX TS-program = pdflatexmk

Quantile-Quantile Plot

Tuesday, August 9, 2016

A new open forum for scientists working on color

The American Association for the Advancement of Science (AAAS) is setting up a new platform for scientific collaboration called Trellis. A key feature is that you can upload papers you want to discuss and anybody in the group can read the paper online: the AAAS takes care of all the copyright issues with the journal publisher. Another feature is that you can control how much noise you get from Trellis. Messages can have up to 2500 characters, but if you have longer text you can create a PDF and upload it like a paper. The same holds for images and videos.

Many AAAS members are teachers: if you do crowd-sourced experiments, you can easily find subjects. The AAAS is also involved in policy making in Washington, in case you need help with that. You can also announce conferences and other scientific events, and use the shared calendar.

Trellis is currently in pilot phase for educators, policy makers, and Section T of AAAS (Information, Computing, and Communication). There are still a few rough edges that are being ironed out. That might be why you have not yet heard of Trellis.

I am setting up a group called Computational Color Science in Section T. However, we are trying out something new. We are making it a completely open group, i.e., anybody with the URL can sign up and participate, without having to be an AAAS member. To join, go to You can invite anybody else you want by giving them the URL and they can sign up.

Monday, August 8, 2016

Retina from iPS cells

On 29 July 2016, the Japan News by the Yomiuri Shinbun reported that the transplant of iPS cells has been approved in Kobe.

The ethics committee of the Kobe City Medical Center General Hospital has approved a surgery plan to transplant retina developed from donor-derived induced pluripotent stem (iPS) cells to an eye-disease patient. The hospital in Kobe will aim to carry out the surgery, part of the world’s first clinical study of iPS cells, in the first half of 2017, after getting the green light from the government.

The subject of the treatment will be a patient with age-related macular degeneration, a serious disease that can lead to blindness. The Riken Center for Developmental Biology, or CDB, will generate retina from iPS cells, supplied by Kyoto University’s Center for iPS Cell Research and Application, from donors with no blood ties to the patient.

Link to the article

Thursday, July 21, 2016

Virtual energy production and delivery at IWB

In our post on the virtual print factory, we saw how an art director can be simulated and used in a simulated press check to optimize print production. There are other applications that can be revolutionized through the online simulation of production. Today, we have a look at the future of energy production and delivery. Society is improved by making it more efficient.

In 1852 Basel, the private company Gasindustrie was founded and in 1867 it was nationalized. Over the years other utilities were added: water delivery, water production, electricity, long-distance heating, refuse processing, and a fiber network for broadband internet and telephony. The name became IWB, forIndustrielle Werke Basel. It was privatized in 2010 (CEO David Thiel), but all shares belong to the Canton Basel-City. IWB is responsible for the supply of energy, water, and telecom; it has a mandate to optimize its operations (smart IWB 2020).

During industrialization, like in most countries, Switzerland's main energy source was coal. After World War I, not having coal mines, Switzerland boosted the education of engineers, who could then electrify the country. For example, the Crocodile locomotive was an engineering feat that could pull up a freight train on the Gottardo line. Actually, the regenerative braking energy from two trains could pull up one train on the other side of the Alps. When in the 1930s the regime in Germany started flexing its muscle and using its coal to wield power, Switzerland invested considerable brain power to wean away from coal as much as possible. For example, the cantonal buildings in Zurich are heated with heat pumps extracting heat from the Limmat.

The ETH cranked out generation after generation of skilled engineers who designed hydroelectric dams, turbines, and power distribution systems. Many plants were of the pump type, consisting of an upper and a lower reservoir: during the day water falls and generates power, while at night cheap electricity is imported from fixed throughput plants to pump the water back up.

This history is reflected in IWB's energy sources. In 2015, the energy sources for electricity in percent were

other renewable

In the 4th quarter of 2015, on the European domestic market, the cost of a kilowatt-hour (kWh) of power was 3.3 cents. However, in Basel, at the public car charge boxes, the consumer price varies between 45 and 70 cents per kWh. This is an opportunity to increase efficiency. Smart IWB 2020 aims at reducing and stabilizing the end-user energy costs.

The old big central power plants will remain and keep producing and storing energy. New small decentralized systems have been built to produce and store energy at the regional level. Now, end-users are starting to produce, store, and consume energy. Excess energy is shared at the neighborhood level.

This is how a 1 MW lithium-metal-oxide battery in Zurich looks like:

1 MW lithium-metal-oxide battery in Zurich

End-users must also store energy in some form to even out the network dependence during the day. There may be excess solar energy in the afternoon and a lack of energy during a cold winter night.

each house has a facility to store excess energy

This is made possible by the new control network shown in dark gray in the figure below (for an animation see here). IWB can collect data everywhere on the network and feed it to its simulation that allows it to optimize the overall energy generation and conversion system.

Smart IWB 2020. The control network is shown in dark grey

Electricity companies have been using simulations for many years. For robustness, a distribution system cannot have a tree topology, because a failure at a node will black out the entire subtree. The required mesh topology is difficult to manage because the system has to be kept in equilibrium, otherwise, a failure will cascade to a blackout of the entire network.

What is new with smart IWB 2020, is that the regulation is no longer made by dropping more water when the network frequency drops under 49.8 Hz and by pumping up water when the frequency rises over 50.2 HZ. As the figure above shows, there are many more sources for electricity that have to be synchronized and balanced out.

In 2000, Germany introduced a law to subsidize renewable energies by guaranteeing the producers a profit, i.e., by taking out a major part of their risk to conduct business. In 2004, the European Union liberalized the power market, adding to the mix the windmill farms in Denmark among others. In Germany alone, renewable energy production surged from 6,277 GWh in 2000 to 153,000 GWh in 2015.

The availability of this low-cost renewable energy from the north wrecked havoc in the business model of the Swiss generators, who were generating expensive electricity during the day by draining the high reservoirs and importing cheap electricity at night to pump up the water from the low reservoirs. Today, solar plants in Germany deliver the maximum energy around noon, exactly the time when pump plants in the Alps used to generate the highest profits.

According to Alpiq CEO Jasmin Staiblin (SFR 25 April 2016), the producer Alpiq can generate only ¼ of its hydropower at a profit, while ½ breaks even and ¼ is sold at a loss. On average, to Alpiq, hydropower generation costs 6.5 cents per kWh, twice the European market price. Even at its newest generation facilities with the latest turbine designs, the cost is 3.8 respectively 4.5 cents per kWh. Alpiq expects that next year or the year after, the open market price will sink to 2 cents per kWh or even slightly less.

The numerous nuclear power plants in France and Switzerland, while also causing losses to the hydroelectric generators, cannot compete with the renewable sources. At the Gösgen nuclear power plant, production costs in 2014 were 3.4 cents per kWh. In 2015 they were 5.1 cents per kWh, but this was due to accounting changes and costs are supposed to sink again, but 3.4 > 2. According to GE Chief Productivity Officer Philippe Cochet in Fairfield CT (NZZ 13 January 2016), before the 2008 financial crisis, in Europe each year 7 GW of new power generation capacity was sold; after the crisis, sales dropped to 1.3 GW and in the past two years sales were less than 1 GW.

The solution is to use online simulations to not just optimize electric power generation, but all energy management: electricity, hot steam for electricity production, hot water for heating, and warm water for washing. Heat is produced by burning refuse, natural gas, biomass (wood refuse), etc. It is also recovered from data centers, instead of dispersing it in the atmosphere through air conditioning chillers. This photograph by Mathias Leemann shows the refuse burning plant of Basel.

Refuse burning plant Basel; photo by Mathias Leemann

Heat can be stored in water, soil, and stones, as has been done since Roman times. A more contemporary method used by IWB is the use of fuel cells. When there is excess electricity, electrolysis of water is used to generate hydrogen. Hydrogen is also produced from natural gas when consumption is low. This hydrogen is easy to store. When electricity prices are high, hydrogen fuel cells are used to generate electricity.

Coordinating and timing all these sources, stores, carriers, and consumers of energy is a very complex task. When IWB will sell its electricity at the public charge boxes (photograph by Simon Havlik) around the Canton for a much lower price than today's 45 to 70 cents per kWh, cars based on burning fossil fuels will disappear very fast. Such is the impact of smart energy management.

IWB charge box; photograph by Simon Havlik

So far, we have seen how the online simulation of a complex energy provision system can considerably reduce the cost of energy. However, this does not yet help with the goal of the 2000 Watt Society. If we build our houses with recycled glass shards in the outer concrete walls, then use 12 cm of insulation and cover it with 16 cm of solid wood on the inside, and also give up private ownership of cars, we might achieve a 3500 Watt Society, said ZHAW sustainability expert Prof. Andreas Hofer (SRF 26 November 2015).

50 years ago, people got by with less than 2000 Watt. Where is the problem? It is not at the individual level but at the society level. We have become much more mobile: if you live in Lugano, you not longer go to San Moritz for an extended weekend, but to Paris. Also, we have become digital packrats. All over the world, we have huge server farms that store all that digital media we never consume but is valuable for social network companies to dissect our lives and sell us stuff we do not really need.

Back to the virtual print factory:

virtual print factory

The output of the prepress stage is a PDF file. The two presses take raster images, therefore the computer in front of the press has to do the ripping and is called the digital front-end. In John L. Recker et al.; Font rendering on a GPU-based raster image processor; Proc. SPIE 7528 (January 18, 2010), the authors calculated that over a year of usage, the regular front-end RIP consumed 38,723 kWh and generated 23,234 Kg of CO2, while for the GPU-RIP they built, the corresponding numbers are 10,804 kWh and 6,483 Kg.

This is the kind of innovation that is required to achieve the 2000 Watt Society at the society level rather than at the individual level. There is still a lot of work to do. We recently wrote that the internet of things is a power guzzler: fortunately the cited report has some good advice.

Tuesday, July 19, 2016

Structure in Unstructured Data, Part 2

In the first part, we took a random walk from unstructured data to multimedia files, JPEG compression, a DCT-inspired classifier, and deep learning. We saw that the crux of supervised machine learning is the training.

There are two reasons for needing classifiers. We can design more precise algorithms if we can specialize them for a certain data class. For humans, the reason is that our immediate memory can hold only 7±2 chunks of information. This means that we aim to break down information into categories each holding 7±2 chunks. There is no way humans can interpret the graphical representation of graphs with billions of nodes.

As already Immanuel Kant noted, categories are not natural or genetic entities, they are purely the product of acquired knowledge. One of the functions of the school system is to create a common cultural background, so people learn to categorize according to similar rules and understand each other's classifications. For example, in the biology class, we learn to organize botany according to the 1735 Systema Naturæ compiled by Carl Linnæus.

As we know from Jean Piaget's epistemological studies with children, there is assimilation when a child responds to a new event in a way that is consistent with an existing classification schema. There is accommodation when a child either modifies an existing schema or forms an entirely new schema to deal with a new object or event. Piaget conceived intellectual development as an upward expanding spiral in which children must constantly reconstruct the ideas formed at earlier levels with new, higher order concepts acquired at the next level.

The data scientist's social role is to further expand this spiral.

For data, this means that we want to cluster it (recoding by categorization). Further, we want to connect the clusters in a graph so we can understand its structure (finding patterns). At first, clustering looks easy: we take the training set and do a Delaunay triangulation, the dual graph of the Voronoi diagram. After building the graph with the training set, for a new data point, we just look in which triangle it falls and know its category. Color scientists are familiar with Delaunay triangulations because they are used for device modeling by table lookup. Engineers use them to build meshes for finite element methods.

The problem is that the data is statistical. There is no clear-cut triangulation and points from one category can lie in a nearby category with a certain probability. Roughly, we build clusters by taking neighborhoods around the points and the intersect them to form the clusters. The crux is to know what radius to pick for the neighborhoods because the result will be very different.

This is where the relatively new field of algebraic topology analytics comes into play. It has only been about 15 years that topology has started looking at point clouds. Topology, an idea of the Swiss mathematician Leonhard Euler, studies the properties of shape independent of coordinate systems, dependent only on a metric. The topological properties are deformation invariant (a donut is topologically equivalent to a mug). Finally, topology constructs compressed representations of shape.

The interesting element of shape in point clouds are the k-th Betti numbers βk, the number of k-dimensional "holes" in a simplicial complex. For example, informally β0 is the number of connected components, β1 the number of roundish holes, and β2 the number of cavities.

Algebraic topology analytics relieves the data scientist from having to guess the correct radius of the point neighborhoods by considering all radii and retaining only those that change the topology. If you want to visualize this idea, you can think of a dendrogram. You start with all the points and represent them as leaves; as the radii increase, you walk up the hierarchy in the dendrogram.

This solves the issue of having to guess a good radius to form the clusters, but you still have the crux of having to find the most suitable distance metric for your data set. This framework is not a dumb black-box: you still need the skills and experience of a data scientist.

The dendrogram is not sufficiently powerful to describe the shape of point clouds. The better tool is the set of k-dimensional persistence barcodes that show the Betti numbers in function of the neighborhood radii for building the simplicial complexes. Here is an example from page 347 in Carlsson's article cited below:

(a) Zero-dimensional, (b) one-dimensional, and (c) two-dimensional persistence barcodes

With large data sets, when we have a graph, we do not necessarily have something we can look at because there is too much information. Often we have small patterns or motifs and we want to study how a higher order graph is captured by a motif. This is also a clustering framework.

For example, we can look at the Stanford web graph at some time in 2002 when there were 281,903 nodes (pages) and 2,312,497 edges (links).

Clusters in the Stanford web graph

We want to find the core group of nodes with many incoming links and the tied together periphery groups that are tied together and also up-link to the core.

A motif that works well for social network kind of data is that of three interlinked nodes. Here are the motifs with three nodes and three edges:

Motifs for social networks

In motif M7 we marked the top node in red to match the figure of the Stanford web.

Conceptually, given a higher order graph and a motif Mi, the framework searches for a cluster of nodes S with two goals:

  1. the nodes in S should participate in many instances of Mi
  2. the set S should avoid cutting instances of Mi, which occurs when only a subset of the nodes from a motif are in the set S

The mathematical basis for this framework are motif adjacency matrices and the motif Laplacian. With these tools, a conductance metric in spectral graph theory can be defined, which is minimized to find S. The third paper in the references below contains several worked through examples for those who want to understand the framework.

Further reading:

Monday, July 18, 2016

Structure in Unstructured Data, Part 1

In the context of big data, we read a lot about structured versus unstructured data. So far, so good. Things get a little murky and confusing when advanced analytics—which refers to analytics for big data—joins the conversation. The confusion comes from the subtle difference between "structured data" and "structure of data," which contain almost the same words. Both concepts are key to advanced analytics, so they often come up together. In this post, I will try shed some light on this murkiness to illuminate it.

The categorization in structured, semi-structured, and unstructured data comes from the storage industry. Computers are good at chewing on large amounts of data of the same kind, like for example the readings from a meter or sensor, or the transactions on cash registers. The data is structured in the sense that each record has the same fields at the same locations, for example on an 80 or 96 column punched card, if you want a visual image. This structure is described in a schema.

Databases are optimized for storing structured data. Since each record has the same structure, the location of the i-th record on the disk is i times the record length. Therefore, it is not necessary to have a file system: a simple block storage system is all that is needed. When instead of the i-th record we need the record containing a given value in a given field, we have to scan the entire database. If this is a frequent operation in a batch step, we can accelerate it by first sorting the records by the values in this field, which allows us to use binary search, which is logarithmic instead of linear.

Because an important performance metric is the number of transactions per second, database management systems use auxiliary structures like index files and optimized query systems like SQL. In a server-based system, when we have a query, we do not want to transfer the database record by record to the client: this leads to server-based queries. When there are many clients, often the same query is issued from various clients, therefore, caching is an important mechanism to optimize the number of transactions per second.

Database management systems are very good at dealing with transactions on structured data. There are many optimization points that allow for huge performance gains, but it is a difficult art requiring highly specialized analysts.

With cloud computing, it has become very easy to quickly deploy an application. The differentiation is no longer by the optimization of the database, but in being able to collect and aggregate user data so it can be sold. This process is known as monetization and an example is click-streams. The data is to a large extent in the form of logs, but their structure is often unknown. One reason is that the schemata often change without a notification because the monetizers infer them by reverse engineering. Since the data is structured with an unknown schema, it is called semi-structured. With the Internet of Things (IoT), also known as Web of Things, Industrial Internet, etc., a massive source of semi-structured data is coming to us.

This semi-structured data is high-volume and high-velocity. This breaks traditional relational databases because data parsing and schema inference become a performance bottleneck. Also, the indexing facilities may not be able to cope with the data volume. Finally, the traditional database vendor's pricing models do not work for this kind of data. The paradigms for semi-structured data are column based storage and NoSQL (not only SQL).

The ubiquity of smartphones with their photo and video capabilities and connectedness to the cloud has brought a flood of large data files. For example, when the consumer insurance industry thought it can streamline its operations by having insured customers upload images of damages instead of keeping a large number of claim adjusters in the field, they got flooded with images. While an adjuster knows how to document a damage with a few photographs, consumers take dozens of images because they do not know what is essential.

Photographs and videos have a variety of image dimensions, resolutions, compression factors, and duration. The file sizes vary from a few dozen kilobytes to gigabytes. They cannot be stored in a database other than as a blob, for binary large object: the multimedia item is stored as a file or an object and the database just contains a file pathname or the address of an object.

In juxtaposition to conventional structured data, the storage industry talks about unstructured data.

Unstructured data can be stored and retrieved, but there is nothing else that can be done with it when we just look at it as a blob. When we looked at analytics jobs, we saw that analysts spend most of their time munging and wrangling data. This task is nothing else than structuring data because analytics is applied to structured data.

In the case of semi-structured data, this consists in reverse engineering the schema, convert dates between formats, distinguish numbers and strings from factors, and dealing correctly with missing data. In the case of unstructured data, it is about extracting the metadata by parsing the file. This can be a number of tags like the color space, or it can be a more complex data structure like the EXIF, IPTC, and XMP metadata.

A pictorial image is usually compressed with JPEG and stored in a JFIF file. The metadata in a JPEG image consists of segments beginning with a marker, the kind of the marker, and if there is a payload, the length and the payload itself. An example of a marker kind is the type (baseline or progressive) followed by width, height, number of components, and their subsampling. Other markers are the Huffman tables (HT), the quantization tables (DQT), a comment, and application-specific markers like the color space, color gamut, etc.

This illustrates that unstructured data contains a lot of structure. Once the data wrangler has extracted and munged this data, it is usually stored in R frames or in a dedicated MySQL database. These allow processing with analytics software.

Analytics is about finding even deeper structure in the data. For example, a JPEG image is first partitioned in 8×8 pixel blocks, which are each subjected through a discrete cosine transformation (DCT). Pictorially, the cosine basis (the kernels) looks like this:

the kernels of the discrete cosinus transform (DCT)

The DCT transforms the data into the frequency domain, similar to the discrete Fourier transform, but in the real domain. We do this to decorrelate the data. In each of the 64 dimensions, we determine the number of bits necessary to express the values without perceptual loss, i.e., in dependence of the modulation transfer function (MTF) of the combined human visual system (HVS) and viewing device. These numbers of bits are what is stored in the discrete quantization table DQT, and we zero out the lower order bits, i.e., we quantize the values. At this point, we have not reduced the storage size of the image, but we have introduced many zeros. Now we can analyze statistically the bit patterns in the image representation and determine the optimal Huffman table, which is stored with the HT marker, and we compress the bits, reducing the storage size of the image through entropy coding.

Like we determine the optimal HT, we can also study the variation in the DCT-transformed image and optimize the DQT. Once we have implemented this code, we can use it for analytics. We can compute the energy in each of the 64 dimensions of the transformed image. As a proxy for energy, we can compute the variance and obtain a histogram with 64 abscissa points. The shape of ordinates gives us an indication of the content of the image. For example, the histogram will tell us, if an image is more likely scanned text or a landscape.

We have built a classifier, which gives us a more detailed structural view of the image.

Let us recapitulate: we transform the image to a space that gives a representation with a better decorrelation (like transforming from RGB to CIELAB). Then we perform a quantization of the values and study the histogram of the energy in the 64 dimensions. We start with a number of known images and obtain a number of histogram shapes: this is the training phase. Then we can take a new image and estimate its class by looking at its DCT histogram: we have built a classifier.

We have used the DCT. We can build a pipeline of different transformations followed by quantizations. In the training phase, we determine how well the final histogram classifies the input and propagate back the result to give a weight to each transformation in the pipeline by adjusting the quantization. In essence, this is an intuition for what happens in deep learning.

A disadvantage of machine learning is that the training phase takes a long time and if we change the kind of input we have to retrain the algorithm. For some applications, you can use your customers as free workers, like for OCR you can use the training set as captcha, which your customers will classify for free. For scientific and engineering applications you typically do not have the required millions of free workers. In the second part, we will look at unsupervised machine learning.

Wednesday, July 13, 2016


One way of categorizing computer users is to partition them into consumers and producers. Consumers follow their friends on social networks, watch movies, and read the news. Producers create the contents for the consumers, either as chroniclers or as copywriters.

The former enter small amounts of text into the device, so they typically give the finger to a smartphone or tablet; this finger often being a thumb or an index (thumbing). The latter need to be efficient when entering bulk data, so they typically use a desktop computer or a laptop because they come with a keyboard, allowing them to type with all ten fingers, without looking at the keyboard (touch typing).

Although a producer will mostly be touch typing, the user interfaces are mostly graphical and use a paradigm known as WIMP, for windows, icons, mice, and pointing. A mode or context change requires removing one hand from the keyboard to grab the mouse. Since this takes a longer time than moving a finger to a different key, GUIs have keyboard shortcuts. Mousing is exacerbated by today's big screens, which make it harder to locate the pointer.

Hypertext is based on links. A link is followed by clicking on it, which requires moving a hand from the keyboard to the mouse and finding the pointer. This can be annoying in activities like doing research using a search engine while summarizing the results and typing them into a text editor.

Life is easier when each link is labeled with a letter and a link can be followed by pressing that letter on the keyboard. This is what you can do with ZimRim, a free application from a Silicon Valley data science startup of the same name.

The result screen of ZimRim; click to enlarge

ZimRim's result screen is a scroll-free view with all 10 links appearing on one screen: on most laptops / desktops you do not need to scroll up and down to see and compare the links. A user can compare all 10 results in one glance and decide which are best fit for their query. It is clutter free with a uniform look and currently ad-free.

Results are opened in separate tabs so as to keep the results page open as the reference to open other links so you do not have to press "back" button. If results do not open, users should look for "popup blocked" message below the address bar and allow popups from this domain. Some browsers mistakenly block opening new tabs for result links thinking of those as potential popup ads.

ZimRim makes whole search experience "mouse optional" as a bonus for producers although consumers / mouse users can click the usual way.

Wednesday, June 22, 2016

9,355,339: system and method for color reproduction tolerances

In the early days of digital imaging for office use, there was a religion that the response of the system had to be completely neutral. The motivation came from a military specification that a seventh generation copy had to still look good and be readable. If there would be a deviation from neutrality, every generation would enhance this deviation.

When a Japanese manufacturer entered the color copier market, they enhanced the images to improve memory colors and boost the overall contrast. They were not selling to the government and rightfully noted that commercial users do not do multiple generation copies.

Color is not a physical phenomenon, it is an illusion and color imaging is about predicting illusions. Visit your art gallery and look carefully at an original Rembrandt van Rijn painting. The perceived dynamic range is considerably larger than the gamut of the paints he used because he distorted the colors depending on their semantics. With Michel Eugène Chevreul's discovery of simultaneous contrast, the impressionists then went wild.

When years later I interviewed for a position in a prestigious lab, I gave a talk on how—based on my knowledge of the human visual system (HVS)—I could implement a number of algorithms that greatly enhance the images printed on consumer and office printers. The hiring engineering managers thought I was out of my mind and I did not get the job. For them, the holy grail was the perfect neutral transmission function.

This neutral color reproduction is a figment of imagination in the engineer's minds in the early days of digital color reproduction. The scientists who earlier invented the mechanical color reproduction did not have this hang-up. As Evans observed on page 599 of [R.M. Evans, “Visual Processes and Color Photography,” JOSA, Vol. 33, 11, pp. 579–614, November 1943], under the illuminator condition color constancy mechanisms in the human visual system (HVS) correct for improper color balance. As Evans further notes on page 596, we tend to remember colors rather than to look at them closely; for the most part, he notes, careful observation of stimuli is made only by trained observers. Evans concludes that it is seldom necessary to obtain an exact color reproduction of a scene to obtain a satisfying picture, although it is necessary that the reproduction shall not violate the principle that the scene could have thus appeared.

We can interpret Evans’ consistency principle (page 600) as what is important is the relation among the colors in a reproduced image, not their absolute colorimetry. A color reproduction system must preserve the integrity of the relation among the colors in the palette. In practice, this suggests that three conditions should be met. The first is that the order in a critical color scale should not have transpositions, the second is that a color should not cross a name boundary, the third is that the field of reproduction error vectors of all colors should be divergence-free. The intuition for the divergence condition is that no virtual light source is introduced, thus supporting color constancy mechanisms in the HVS.

We all know the Farnsworth-Munsell 100 hue test. The underlying idea is that when an observer has poor color discrimination, either due to a color vision deficiency or due to lack of practice, this observer will not be able to sort 100 specimen varying only in hue. In the test, the number of hue transpositions is counted to score an observer.

Farnsworth-Munsell 100 hue test

We can considerably reduce the complexity of a color reproduction system if we focus on the colors that actually are important in the specific images being reproduced. We can minimally choose the quality of the colorants, paper, halftoning algorithm, and the number of colorants to just preserve the consistency of that restricted palette.

First, we determine the palette, then we reproduce a color scale with a selection of the above parameters and we give the resulting color scale image a partial Farnsworth-Munsell hue test. This is the idea behind invention 9,355,339. The non-obvious step is how to select color scales and how to use a color management system to simulate a reproduction. The whole process is automated.

A graphic artist manually identifies the locus of the colors of interest, which is different for each application domain, like reproductions of cosmetics or denim clothes. A scale encompassing these critical colors is built and printed in the margin. In the physical application, the colors are measured and the transpositions are counted.

complexion color scale

The more interesting case is a simulation. There is no printing, just the application of halftoning algorithms and color profiles. The system extracts the final color scale and identifies the transpositions compared to the original. Why is the simulation more important?

Commercial printing is a manufacturing process and workflow design is crucial to the financial viability of the factory. Printing is a mass-customization process, thus every day in the shop is different. This makes it impossible to assume a standard workload.

For each manufacturing task, there are different options using different technologies. For example, a large display piece can be printed on a wide bubblejet printer, or the fixed background can be silkscreen printed and the variable text can be added on the wide bubblejet printer. Another example is that a booklet can be saddle stitched, perfect bound, or spiral bound.

print fulfillment diagam

In an ideal case, a print shop floor can be designed in a way to support such reconfigurable fulfillment workflows, as shown in this drawing (omitted here are the buffer storage areas).

example of a print fullfillment floor rendering; the buffer zones are omitted

However, this would not be commercially viable. In manufacturing, the main cost item is labor. Different fulfillment steps require different skill levels (at different salary levels) and take a different amount of time.

Additionally, not all print jobs are fulfilled at the same speed. A rush job is much more profitable than a job with a flexible delivery time. To stay in business, the plant manager must be able to accommodate as many rush jobs as possible.

This planning job is similar to that of scheduling trains in a saturated network like the New Railway Link through the Alps (NRLA or NEAT for Neue Eisenbahn-Alpentransversale, or simply AlpTransit). For freight, it connects the ports of Genoa and Rotterdam, with an additional interchange in Basle (the Rotterdam–Basel–Genoa corridor). For passengers, it connects Milan to Zurich and beyond. There are two separate basis tunnel systems: the older Lötschberg axis and the newer Gotthard axis. In the latter, freight trains travel at least at 100 km/h and passenger trains at least at 200 km/h. These are the operational speeds, the maximal speed is 249 km/h, limited by the available power supply.

The trains are mostly of the freight type, traveling at the lower operational speed. A smaller number of passenger trains travels at twice the operational speed. A freight train can be a little late, but a passenger train must always be on time lest people miss their connections, which is not acceptable.

The trains must be scheduled so that passenger trains can pass freight trains when these are at a station and can move to a bypass rail. However, there can be unforeseen events like accidents, natural disasters, and strikes that can delay the trains on their route. The manager will counter these problems by accelerating trains already in the system above their operational speed when there is sufficient electric power to do so. This way, when the problem is solved, there is additional capacity available.

The problem with modifying the schedule is that one cannot just accelerate trains: new bypass stations have to be determined and the travel speeds have to be fine-tuned. Train systems are an early implementation of industry 4.0 because the trains also automatically communicate between each other to avoid collisions and to optimize rail usage. For AlpTransit this required solving the political problem of forcing all European countries to adopt a new ERTMS/ETCS (European Train Control System) Level 2, to which older locomotives cannot be upgraded.

The regular jobs and rush jobs in a print fulfillment plant are similar. The big difference is that the train schedule is the same every day, while in printing each day is completely different. The job of the plant designer is to predict the bottlenecks and do a cost analysis to alleviate these bottlenecks. In particular, deadlocks have to be identified and mitigated. There are two main parameters: the number and speed of equipment, and the amount of buffer space. Buffering is highly nonlinear and cannot be estimated by eye or from experience. The only solution is to build a model and then simulate it.

We used Ptolemy II for the simulation framework and wrote Java actors for each manufacturing step. To find and mitigate the bottlenecks, but especially to find the dreaded deadlock conditions, we just need to code timing information in the Java actors and run Montecarlo simulations.

We used a compute cluster with the data encrypted at rest and using Ganymed SSH-2 for authentication with certificates and encryption on the wire. Each actor could run on a separate machine in the cluster. The system allows the well-dimensioned design of the plant, its enhancement through modernization and expansion, and the daily scheduling.

So far, the optimization is just based on time. In a print fulfillment plant, there are also frequent mistakes in the workflow definition. The workflow for a job is stored in a so-called ticket (believe me, reaching a consensus standard was more difficult than for ERTMS/ETCS). One of the highest costs in a plant is an error in the ticket, which causes the job to be repeated after the ticket has been amended. With this, risk mitigation through ticket verification is a highly valuable function, because it allows a considerable cost reduction for not having to allocate insurance expenses.

While in office printing A4 or letter size paper are the norm, commercial printers use a possibly large paper size to save printing time and, with it, cost. This means there are ganging and imposition, folding, rotating, cutting, etc. It is easy to make an imposition mistake and pages end up in the wrong document or at the wrong place. Similarly, paper can be cut at the wrong point in the process or folded incorrectly.

Once we have a simulation of the print fulfillment factory, we can easily solve these workflow problems, thus reducing risk and with it insurance cost. The data for print jobs is stored in portable document format (PDF) files. For each workstation in a print fulfillment plant, we take the Java actor implemented for the simulation and add input and output ports for PDF files. We then implement a PDF transformer for each workstation that applies the work step to the PDFs. There can be multiple input and output PDFs. For example, a ganging workstation takes several PDF files and outputs a new PDF file for each imposed sheet.

Most errors happen when a ticket is compiled. After simulating the workflow, the operator simply checks the resulting PDF files. A mistake is immediately visible and can be diagnosed by looking at the intermediate PDF files. A more subtle error source is when workstations negotiate workflow changes in the sense of the industry 4.0 technology. Before the change can be approved, the workflow has to be simulated again and the difference between the two output PDF files has to be computed.

A more valuable, but also more complex workflow change is accommodating rush jobs by taking shortcuts. For example, if there is a spot color in the press, can we reuse the same color or do we have to clean out the machine to change or remove the spot color? Another example is the question of using dithering instead of stochastic halftoning to expedite a job. Finally, earlier we mentioned the possibility of running a fixed background through a silkscreen press and printing only the variable contents on a bubblejet press.

In a conventional setting, any such change requires doing a new press-check and having the customer come in to approve it. In practice, this is not always realistic and the owner will use his best judgment so self-approve the proof.

9,355,339 automates this check and approval. The ICC profiles are available and can be used to compute the perceived colors in each case. The transposition score for the color scale (there can be more than one) can predict the customer's approval of the press-check.

9,355,339 automates the press-check and approval

Thus, we have created a Java actor that simulates a human and can predict the human's perception. This is an industry 4.0 application where we not only have semantic models for the machines but also for the humans, and the machines can take the humans into consideration.

simulating the human in the loop