Thursday, January 28, 2016

Impact of New Developments of Colour Science on Imaging Technology

Yesterday afternoon, at the Stanford Center for Image Systems Engineering, Dr. Joyce Farrell hosted Prof. M. Ronnier Luo for an update on the latest activities at the International Commission on Illumination (CIE), of which he is the Vice-President. He focussed on the aspects relevant to imaging.

Division 7, terminology, has been disbanded because it has finished its work. The e-ILV can be accessed at this link.

There is a new CIE 2006 physiologically based observer model with XYZ functions transformed from the CIE (2006) LMS functions. These functions are linear transformations of the cone fundamentals of Stockman and Sharpe, the 10° LMS fundamental colour matching functions. In the plot below, you can see the 2° XYZ CMFs transformed from the CIE (2006) LMS cone fundamentals. Note the different shapes around 450 nm compared to the 1931 and 1964 observer models.

XYZ CMFs transformed from the CIE (2006) LMS cone fundamentals

The new model is a pipeline in whose stages the age-related parameters can be set. The 10° LMS functions are corrected for the absorption of the ocular media and the macular pigment, and take into account the optical densities of the cone visual pigments, all for a 10° viewing field, yielding the low-density absorbance functions of these pigments. From these low-density absorbance functions one can then derive the 2° cone fundamentals by taking into account the absorption of the ocular media and the macula and the densities of the visual pigments for a 2° viewing field.

There is also a new luminous efficiency function V(λ), which has changed mostly in the blue region.

There are new scales for whiteness and blackness, which correspond to those in the NCS system. They are based on the comprehensive CAM16 appearance model. Considering a hue leaf of CIELAB in cylindrical coordinates, the south–east ↘ diagonal scale is whiteness–depth and the north–east ↗ is blackness–vividness. These new scales are particularly useful in imaging for adjusting complexion. The skin colors of Asian and Caucasian people vary along the whiteness–depth scale and those of African people vary along the blackness–vividness scale.
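As an illustration of the two diagonal directions in a hue leaf, here is a minimal Python sketch. It assumes the CIELAB-based depth and vividness definitions published by Roy Berns (Euclidean distances from the white point and from the black point in the lightness–chroma plane). These may well differ from the CAM16-based scales presented in the talk, and the function names are made up for this example.

```python
import math


def vividness(lightness: float, chroma: float) -> float:
    """Distance from the black point (L* = 0, C* = 0) in a hue leaf."""
    return math.hypot(lightness, chroma)


def depth(lightness: float, chroma: float) -> float:
    """Distance from the white point (L* = 100, C* = 0) in a hue leaf."""
    return math.hypot(100.0 - lightness, chroma)


if __name__ == "__main__":
    # A light, weakly chromatic color and a darker, more chromatic one.
    for L, C in [(80.0, 15.0), (45.0, 35.0)]:
        print(f"L*={L} C*={C} depth={depth(L, C):.1f} vividness={vividness(L, C):.1f}")
```

Moving south-east from white increases depth, and moving north-east from black increases vividness, which is the geometry described above.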

Next, Ronnier explained the new color rendering index (CRI) that works also for LED light sources. He also presented a very compelling demonstration of the apparatus used to develop the standard. The new color rendering index is called CRI 2010 and IESNA-TM40. It is based on the measurement of 99 test samples.

I was a little disappointed that the new CRI is still based on colorimetry and not on spectral data. Using colorimetry is an analytical process and having a much larger number of samples helps. However, it does not allow a full characterization of a light source, as we learned many years ago with the tri-band fluorescent lamps. They use less energy, but at the cost of quality.

In this case, I am not too much of a fan of the energy reduction because in practice when you reduce the cost of running a light, people will just deploy more lights and in the end you do not save energy. This is so in consumer applications and does not hold for industrial applications.

Our environment is not made out of BCRA tiles and usually, we are not in aperture mode. We perceive complex images and the light from a set of spot lamps modulates our ambient. While in the case of OLED or fluorescent lamps we might have diffuse light, with LEDs and conventional halogen spot lamps we have more of a set of directed sources with a rapid fall-off.

The rooms in my house are painted in a fusion Italian and Japanese style. The colors are vivid (Italian style), but the paints have a very peaked spectrum so the color is modulated by the illumination (Japanese style). We use older high-quality LED sources with two different green phosphors (the additional one is based on Europium), which we dim. The visual effect is similar to candlelight, except for the correlated color temperature (CCT).

From my experience, I think that a CRI model should include the difference between the spectral distributions of the light source and the reference illuminant. I would also like to have two different reference distributions, A for mood light and D for work light. For thousands of years, we have evolved performing work in daylight and relaxing in blackbody radiator light from fires, oil lamps, and candles. When we want to be in a cozy mood, we pull out the candles, which is also common in upscale restaurants. Candles are more expensive and dangerous than LEDs in houses built from flammable materials.

Should the new CRI also have a provision for the blue hour? Ronnier concluded his presentation stating that the new research topic is tunable white.

Monday, January 25, 2016

The Talented Silicon Valley

The Silicon Valley is not an institution; institutions tend to be rigid. There have been several attempts to clone the Silicon Valley as an institution, for example, Sophia Antipolis in France and Tsukuba Science City in Japan, but they have not been successful, at least when compared to the impact on society that the Silicon Valley has.

The Silicon Valley is a biotope, which is relentlessly evolving. If you want an economic force like the Silicon Valley, you have to create a habitat for your own ecological system.

If we look at Silicon Valley's evolution, it started with first class educational institutions like Stanford University (est. 1891, motto "die Luft der Freiheit weht" freely following videtis illam spirare libertatis auram) and UC Berkeley (est. 1868, motto "fiat lux"), available capital, and intrapreneurial professors like Frederick Terman (1900–1982), who is credited (with William Shockley) with being the father of Silicon Valley. Terman's doctoral advisor was Vannevar Bush and his notable students included Russell and Sigurd Varian and the HP triad William Hewlett, David Packard and Bernard M. Oliver.

People are the living beings in the Silicon Valley biotope. The brightest minds are attracted and nurtured. Attraction is not accomplished with money, but through the recognition and grooming of talent: people are selected only on the basis of their ability to create insanely great products and are nurtured to fulfill their intellectual potential. In the Silicon Valley, people do not try to predict the future: they have the passion for building it.

Nurturing takes place through an intellectual climate where ideas can flow freely and people see each other as challenging colleagues rather than enemies, even when they are competitors. The open flow of ideas happens through myriad conferences, seminars, meet-ups, dojos and incubators, and even cafes. For example, adjacent to the Samsung R&D building is the famous Hacker Dojo, on the site of HP's first building (Redwood Building on 395 Page Mill Road) is the AOL incubator, and on University Avenue SAP has transformed the New Varsity Theater into Hana Haus.

For an individual, it might not be a tragedy when they are employed below their intellectual potential. The ability to accomplish tasks much faster than their co-workers will yield some freedom and allow for less supervision of their work. However, the intelligent people will be missing in important functions in a company. The society as a whole develops a problem when less intelligent people have to step into senior management positions. Fulfilled potential is called talent, and the Silicon Valley is good at developing talent through mentoring.

Last but not least, this open intellectual climate and talent development make any work very productive and efficient, because when you need to know something, you know whom to ask. You do not have to spend days googling the Internet for an answer that may be incorrect. You get your answer immediately; when it is complex, maybe at the cost of a coffee or a beer.

This is life in the biotope. A characteristic of the ecosystem has always been its rapid evolution. During the cold war and the quest to outbrain the Russians, high-risk research was possible because the government agencies paid cost plus and it was not necessary to worry about commercializing products for the consumer market. When world politics changed, institutions like SRI, IBM Almaden, Xerox PARC, SLAC, HP Labs and NASA Ames were eclipsed, but the brains wandered down the road to new institutions, taking with them expired patents and deep knowledge. In the Silicon Valley, the talent is preserved.

While in Rochester the scientists who invented digital photography were lost to humanity when Eastman Kodak faltered and then faded, their colleagues at HP Labs just modified their commute from Palo Alto to Cupertino and are still working on the iPhone camera and imaging system. Few know that Siri was born at SRI and is now evolving at Nuance in the skilled hands of PARC alumni. Maybe, Google could start a self-driving car project due to the available engineers who built navigation systems for submarines.

In fact, the fewer than a thousand scientists who worked at PARC in its first 20 years have created the largest pot of wealth in Silicon Valley, as documented by Henry Chesbrough, the executive director of the Center for Open Innovation at the Haas School of Business at UC Berkeley. A beautiful example of ecological brain recycling! The Silicon Valley is a biotope that promotes talent.

A group of leading color science researchers congregated in the Silicon Valley to openly ponder about the future of color science

Update: a related article just appeared in the HBR: Renaissance Florence Was a Better Model for Innovation than Silicon Valley Is [paywall]

Tuesday, January 19, 2016

The Fourth Industrial Revolution

In the last couple of decades we have been ambulating in a buzzword fog, with terms that started from the erudite ubiquitous computing to the folksy data mining. Then the buzzwords became increasingly silly with cloud, virtualization, big data, data lakes, social networks, mobility, internet of things, and the like.

Something is going on, but in this buzzword fog it can be difficult to discern what is really happening. Starting tomorrow in Davos and Klosters, the World Economic Forum plans to shine some light on this cacophony and elucidate the current technological events.

First and foremost, a new term to replace the buzzword fog: The Fourth Industrial Revolution. The First Industrial Revolution used water and steam power to mechanize production. The Second used electric power to create mass production. The Third used electronics and information technology to automate production. Now a Fourth Industrial Revolution is building on the Third, the digital revolution that has been occurring since the middle of the last century. It is characterized by a fusion of technologies that is blurring the lines between the physical, digital, and biological spheres.

The Industrial Revolutions

The inexorable shift from simple digitization (the Third Industrial Revolution) to innovation based on combinations of technologies (the Fourth Industrial Revolution) is forcing companies to reexamine the way they do business. The bottom line, however, is the same: everybody needs to understand their changing environment, challenge the assumptions of their operating teams, and relentlessly and continuously innovate.

The important thing here is not to be a Luddite and to keep learning. Just as our great-grandparents had to learn to manufacture car engines instead of buggy whips, we have to learn to aggregate information, services, and tools to produce more efficient tools for increasing the efficiency of society. As always, not everything is rosy (for example, at the upcoming Super Bowl event here in the Valley, the FBI fears a biohazard delivered by drone swarms flying over the Santa Clara stadium), but we have to stay focused and determined.

We must develop a comprehensive and globally shared view of how technology is affecting our lives and reshaping our economic, social, cultural, and human environments. The World Economic Forum annual meeting starting tomorrow will shine some light and let us see what we should learn and where we should go next.

Friday, January 15, 2016

The nights are still blue

A few years ago (20 September 2012 to be exact), we wrote about our blue nights. We still have not gotten used to them, but we let the rhododendrons outside the picture window grow all the way to the roof.

Yesterday, I was catching up on my reading and I came across the interesting article LED light pollution: Can we save energy and save the night? by Mark Crawford in the January 2016 issue of SPIE Professional. When we buy light bulbs for the home, we look at the spectral distribution of the light they emit and buy models that have natural spectra. Crawford reports that LEDs designed for street lighting are optimized differently: they typically have a correlated color temperature of 6500 K and are dominated by a narrow, short-wavelength emission band together with a broader long-wavelength emission band.

This results in excessive light pollution, as illustrated in the figure below. This image of Milan was acquired after the transition to LED technology in the downtown area. The illumination levels appear to be similar or even brighter in the city than the suburbs, and the amount of blue light is now much higher, which suggests a greater impact on the ability to see the stars, human health and the environment. Since the European Space Agency’s NightPod device was installed on the ISS in 2012, astronauts have been taking systematic night images. It incorporates a motorized tripod that compensates for the station’s speed and the motion of the Earth below. Before that, motion could blur images even though astronauts compensated with high-speed films and manual tracking. This NASA/ESA image was taken by Samantha Cristoforetti.

City center of Milano. NASA/ESA image taken by Samantha Cristoforetti

Do we really need so much street light at night? When I was a kid, all establishments had to close before midnight and half an hour later the street lighting would go off, until six o'clock in the morning. It does not have to be that drastic, but I do not think our lives would be any different if the street lights were dimmed to well under 0.25 lux after midnight. We no longer walk, and all vehicles have headlights by law. For the few pedestrians, a faint light is more than enough because we adapt to the light level: the pupils dilate and eventually we switch to scotopic vision. When I was a kid and walked home after the street lights were off, I could see extremely well at full Moon, which is just 0.25 lux. When the Moon is not full, you just walk slower and enjoy the Milky Way.

I am not a Luddite. In Palo Alto, the heart of the Silicon Valley, in the Barron Park neighborhood everybody turns off the porch lights at night, and these people are the cream of the digerati. The neighborhood's park is called Bol Park, after Stanford physicist and research associate Cornelis Bol, the inventor of the high-intensity mercury vapor lamp.

Vincent van Gogh: Nuit étoilée (Saint-Rémy-de-Provence), 1889

Tuesday, December 15, 2015

OLED TVs are here

OLED displays for handheld devices have been around for a while, but large TVs have been rare due to manufacturing difficulties. The LG group has been one of the few manufacturers. At the end of October, Panasonic introduced a 65 inch 4K ultrahigh definition TV in Europe. It combines LG's OLED panel with Panasonic's image processing technology.

Despite a high price of 10,000€, the TVs are selling well. If in the past the rule was that your display can never be too bright, now the rule is that your display's gamut is never too big.

Link to article

Wednesday, October 28, 2015

Data Science Camp

Last Saturday (October 24, 2015) was SF Bay ACM's annual Data Science Camp Silicon Valley. The venue was the Town Hall in PayPal's intergalactic headquarters on 2161 North 1st Street in San Jose, nestled between eBay and Apple's future Car Division, just one block from Comet, Foveon, Lumileds, Peaxy, and Toyota's research lab.

From the voting on the sessions, it appears that the event attracted data scientists from all over the US, with a large number of the participants being people who had taken the Coursera classes on big data, machine learning, and data science and were now wondering how to progress from Coursera to a job (82 votes). As Sara Kalisin from Intel noted, when companies try out analytics, they do not really know what to do with the result and end up not staffing the project because the benefit is less than the employee's salary. In addition to the session on "Coursera to job," Sara also led a session with the title "How to showcase data science impact" (15 votes).

At the beginning of the day, Joseph Rickert and Robert Horton of Microsoft gave the tutorial "Introduction to R for Machine Learning." R, whose roots go back to the S language developed at Bell Labs in 1976, has become the most widely used data analysis software. It is undergoing astonishing growth and today has about 36,820 functions. The CRAN repository has a solution for almost all data analysis problems.

Originally, R was a statistical package with statistical tools acting on observations stored in arrays. There were different packages, like Weka, for mining large data sets stored in files with machine learning to classify patterns and make predictions. However, today R has all the machine learning functionality on top of the original statistical tools. This has been possible because today a serious workstation has at least 64 GB of RAM, which allows big data to be stored in arrays.

When the data is too large to fit in memory, it can be partitioned into blocks which can be processed sequentially or in parallel. However, this capability is not available with the free version of R and requires the expensive commercial enterprise version. Robert Horton announced that SQL Server 2016 will support the server-side execution of R functions. This means that the data will no longer have to be moved across the network for analysis.
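The block-wise idea itself is independent of R. As a minimal sketch in Python (not the mechanism used by the commercial R version, whose internals were not covered in the tutorial), a statistic such as the mean can be computed from per-block partial results, so that only one block has to be in memory at a time. The helper names `blocks` and `blockwise_mean` are invented for this example.

```python
from itertools import islice


def blocks(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        block = list(islice(iterator, size))
        if not block:
            return
        yield block


def blockwise_mean(values, block_size):
    """Mean of `values`, holding only one block in memory at a time."""
    total = 0.0
    count = 0
    for block in blocks(values, block_size):
        # Each block contributes a partial sum and a partial count.
        total += sum(block)
        count += len(block)
    return total / count if count else float("nan")


if __name__ == "__main__":
    print(blockwise_mean(range(1, 101), block_size=7))
```

Because the partial sums and counts are independent of each other, the blocks could just as well be handed to parallel workers and the partial results combined at the end.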

After the sponsored lunch, John Park from HP led a double session with the title "Malware Classification + ML + Crowd Sourcing" (46+44 votes). The amount of malware injected every day is mind-boggling. He uses an algorithm called Nilsimsa Hash on the binary files and uses natural language processing and classifiers trained through crowd-sourcing to find the malware.

Another very popular session with the title "I have data. Now What? IOT wearable space in life sciences. How to analyze the data for individual users. How do we build that" (117 votes) was led by Lee Courtney from Qurasense. Representatives from large international financial institutions and the health care industry participated in this interactive session. The only company permanently storing all data and continuously mining all of it was a manufacturer of set-top boxes for the cable industry.

For everybody else, storing the data was just too dangerous because of the flood of malware, while Hadoop has no security. This requires ingesting the data, mining it, then deleting it. Because the HDFS data ingest is very slow and each file must be stored in three copies, as little data as possible is preserved. At the end of the session, Lee Courtney summarized the top three unsolved problems for big data as

  1. no security
  2. no security
  3. poor ingest performance

As a note, there are file systems with excellent security and supporting HDFS. They have a POSIX interface, so it is not necessary to move the data at all.

Moving the data is a big problem that will not go away, as illustrated by this figure by DataCore. Until about eight years ago, processors kept getting faster. However, when the wall of physical transistor shrinking was hit, the clock rates actually became slower to control the heat generation and allow for more cores on each die. With the increasing number of cores (for example, a typical workstation processor has eight cores each with two hyperthreads, for a total of 16 logical cores), less IO bandwidth is available to each core: the number of pins on the chip is still the same, and there is a growing IO gap.

The IO gap is rapidly increasing

Some people believe that cloud computing and virtual machines solve the problem. However, for storage this is an illusion. Indeed, according to the Hennessy/Patterson rules of thumb, for general computing the utilization rate is about 0.20–0.25 [HOHY14]. For storage servers, the utilization rate is 0.6–0.8 [AFG+10], therefore statistical multiplexing is less useful because with the OS and hypervisor overheads a processor is maxed out. The IO gap comes on top of this!

[AFG+10] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, et al. A view of cloud computing. Communications of the ACM, 53(4):50–58, 2010.

[HOHY14] Md. Iqbal Hossain (Older) and Md. Iqbal Hossain (Younger). Dynamic scaling of a web-based application in a cloud architecture. Master’s thesis, School of Information and Communication Technology, KTH Royal Institute of Technology, Stockholm, 2014.

Thursday, October 22, 2015

Photonics for data access and delivery

When we use the cloud for heavy duty computations, we quickly find out that although storage is inexpensive, we never know how fast we can access our data. If we want guaranteed performance in terms of IOPS, the price quickly goes up. This has to do with the distance the data has to travel. In a warehouse-scale datacenter, we have two cabling lengths: up to a couple meters for vertical networking in a rack and up to a couple hundred meters for horizontal networking between racks. There are some game-changing technology developments on the horizon regarding the horizontal networking.

For the vertical cabling, copper can be used, but for the horizontal cabling fiber optics has to be used due to dispersion, as shown in the figure below.

dispersion in cabling

Since the processor is electronic, at each end of the optical fiber cable we need an expensive transducer. Chip companies have invested substantial resources trying to grow lasers on CMOS chips to feed an optical cable, but the physical manufacturing process was very messy and never worked well.

The breakthrough came a dozen years ago with the development of nanotechnology to fabricate devices based on the optical bandgap. This made it possible to take a PowerPC processor, make it flat, then use nanotechnology to grow an optical resonance ring on top of it. Now the laser source can be external to the chip and make a round in the ring while being modulated from the PowerPC. This was a seminal breakthrough.

At the end of July, Jürg Leuthold, professor of photonics and communications at ETH Zurich, and his colleagues had another seminal breakthrough. Hitherto, miniaturization of the modulator has been limited by the wavelength of the laser. In order to beat that limit and to make the device even smaller, the light is first turned into so-called surface plasmon polaritons (SPP). Plasmon polaritons are a combination of electromagnetic fields and electrons that propagate along a surface of a metal strip. At the end of the strip, they are converted back to light once again. The advantage of this detour is that plasmon polaritons can be confined in a much smaller space than the light they originated from. The signal is created by modulating the plasmon-polaritons in an interferometer.

By applying a voltage, the refractive index and hence the velocity of the plasmons in one arm of the interferometer can be varied, which in turn changes their amplitude of oscillation at the exit. After that, the plasmons are re-converted into light, which is fed into a fiber optic cable for further transmission.
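To get a feeling for how an index change turns into an amplitude change, the following Python sketch models an ideal, lossless two-arm interferometer with equal splitting. This is a textbook simplification and not the device model from the paper; the index change, arm length, and wavelength in the usage example are hypothetical values chosen so that the phase shift comes out as π.

```python
import math


def phase_shift(delta_n: float, arm_length_m: float, wavelength_m: float) -> float:
    """Phase difference between the arms when the effective index of one
    arm changes by delta_n (wavelength is the free-space wavelength)."""
    return 2.0 * math.pi * delta_n * arm_length_m / wavelength_m


def relative_output_intensity(delta_phi: float) -> float:
    """Output intensity relative to the input for an ideal interferometer:
    the two arms recombine with amplitude proportional to cos(delta_phi / 2)."""
    return math.cos(delta_phi / 2.0) ** 2


if __name__ == "__main__":
    # Hypothetical numbers: 10 µm arm, 1.55 µm light, index change of 0.0775.
    dphi = phase_shift(delta_n=0.0775, arm_length_m=10e-6, wavelength_m=1.55e-6)
    print(f"phase shift: {dphi:.4f} rad")
    print(f"relative output: {relative_output_intensity(dphi):.6f}")
```

With no index change the arms recombine in phase and the output is at its maximum; a phase shift of π makes them cancel, which is the "off" state of the modulator.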

This is the cheapest modulator ever built. It is very simple, consisting of a gold layer on glass that is only 150 nm thick and an organic material whose refractive index changes when an electric voltage is applied and that thus modulates the plasmons inside the interferometer. Such a modulator is much smaller than conventional devices and consumes very little energy: only a few thousandths of a watt at a data transmission rate of 70 gigabits per second. This corresponds to merely a hundredth of the consumption of commercial models.

Source: Nature Photonics 9, 525–528 (2015), doi:10.1038/nphoton.2015.127.

New technologies take a few years to evolve from universities to industrial research labs and then to industry. One of the accelerators of this process is CERN, in particular, the Large Hadron Collider (LHC) project. The detectors produce about 25 PB of data each year, which travel through more than 100,000 optical links to obtain the required bandwidth. The distance from the detectors to the control rooms is about 150 m and the connectors have to be able to withstand the 3.8 Tesla magnetic field in the detector and enormous levels of radiation.

For this application, scientists at CERN have developed so-called “versatile-link” optical components with a minimal energy consumption. By 2018 Marcel Zeiler and his colleagues will have new modulators (also based on interferometry, of course) that can withstand the harsh environment in the LHC.

Source: SPIE Professional October 2015.

Although in a datacenter the radiation is negligible and the magnetic fields are very far from Teslas, the experience is that CERN technology transitions to industry very fast, so we should not be surprised to see new generation versatile optical links in a year or two at most. The capability of economically moving data for hundreds of meters on 100 Gigabit Ethernet (100GbE) links renders old architectures like Hadoop moot because there is no reason for moving the data to HDFS for MapReduce.

Wednesday, September 30, 2015

Retina's transcriptome

From Santiago Ramón y Cajal's time we have known that there are five types of neuronal cells in the retina: rods & cones, horizontal cells, bipolar cells, amacrine cells, and retinal ganglion cells.

there are five types of neuronal cells in the retina

With Sharpe et al. (L. T. Sharpe, A. Stockman, H. Jägle, and J. Nathans. Opsin genes, cone photopigments and color vision. Color vision: From genes to perception, pages 3–51, 1999) we learned that the spectral sensitivity of the pigments in the cones is controlled by the not-so-robust order of the visual pigment genes in the sex chromosome and that color vision deficiency has to do with the L peak moving towards the M peak or vice versa.

The mechanisms behind color vision deficiencies

However, the genome only allowed the prediction of a predisposition for color vision deficiency, not of the spectral color performance. The reason is that it is not the genes themselves that determine the spectral peaks but their expression in the transcriptome, i.e., the messenger RNA (mRNA), which of course cannot be studied in vivo.

In a recent paper (E. Z. Macosko, A. Basu, R. Satija, J. Nemesh, K. Shekhar, M. Goldman, I. Tirosh, A. R. Bialas, N. Kamitaki, E. M. Martersteck, et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell, 161(5):1202–1214, 2015), Macosko et al. describe the application of a new technique called Drop-seq, which has allowed them to analyze the gene activity of 44,808 cells from 14-day-old mice retinae. "Gene activity" here means that they analyzed the transcriptomes of these 44,808 retinal cells and identified 39 transcriptionally distinct cell populations, each corresponding to one of a group of closely related cell types.

Drop-seq generates a library of STAMPs (single-cell transcriptomes attached to micro-particles). They used Seurat, a recently developed R package for single-cell analysis, to study this STAMP library. In a first step, they performed a principal component analysis on the largest libraries, then they reduced the 32 statistically significant principal components to two dimensions using t-distributed stochastic neighbor embedding (tSNE).
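For readers who have not met principal components before, here is a minimal pure-Python sketch of the first step only: finding the direction of largest variance by power iteration on the covariance matrix. It is a toy illustration under my own assumptions, not Seurat's implementation, and the function name and the toy data are made up; the tSNE step is not shown.

```python
import math
import random


def first_principal_component(rows, iterations=200, seed=0):
    """Unit vector along the direction of largest variance of `rows`
    (a list of equal-length numeric sequences), via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    v = [rng.random() for _ in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance at all: keep the current vector
            break
        v = [x / norm for x in w]
    return v


if __name__ == "__main__":
    # Toy "expression" data in which the second gene is twice the first.
    data = [(t, 2 * t) for t in range(10)]
    print(first_principal_component(data))
```

In a real analysis one would keep several components (32 in the paper) and feed the projected data to a tSNE implementation.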

Subsequently they projected the remaining cells in the data into the tSNE analysis. Then they combined a density clustering approach with post hoc differential expression analysis to divide the 44,808 cells among 39 transcriptionally distinct clusters, obtaining this illustration:

two-dimensional representation (tSNE) of global gene expression relationships among 44,808 cells

Finally, they organized the 39 cell populations into larger categories (classes) by building a dendrogram of similarity relationships among the 39 cell populations. For now, the result is that they can say a lot about the amacrine cells that was not known before. However, it will take more research to formulate an interpretation for the visual system.

Wednesday, August 12, 2015

9,098,487: categorization based on word distance

One of the correlates for the social appreciation or value of scientists is the Gini coefficient. Indeed, poor people cannot afford technologies that make their lives more comfortable (and would not be able to amortize the investment anyway because their labor has little value). Rich people also cannot necessarily amortize the investment in new technology, because they can just hire the poor to do the work for them for a low pay. What drives technology is a low Gini coefficient, because it is a broad middle class that can and does invest in new technologies that make their lives more efficient.

Worldwide, over the past two centuries the Gini coefficient has been on the rise, yet there has been incredible technological progress. This means we have to look at a smaller geographical scale. Indeed, for the late 2000s, the United States had the 4th highest measure of income inequality out of the 34 OECD countries measured, after taxes and transfers had been taken into account. As it happens, except for the medical sciences, the American science establishment is only a pale shadow of what it was half a century ago.

In this context, we were naive when three years ago we embarked to solve an efficiency problem for customers processing medical bills, invoices and receipts. These artifacts are still mostly on paper and companies spend a huge amount of time having minimum wage people ingest them for processing in their accounting systems. In the datasets we received from the customers, the medical bills were usually in good physical shape, but the spelling checkers in the OCR engines had a hard time cracking the cryptic jargon. Receipts were in worse physical shape, printed on poor printers and crumpled, and often imaged by taking a picture with a smart phone.

Surprisingly, invoices were also difficult to process. They often had beverage stains and sometimes they had been faxed several times using machines that looked like they had mustard on the platen and mayonnaise under the lid. But the worst was that in the datasets we received, the form design of the invoices was very inconsistent: the fields and their labels were all over the place, many gratuitous and inconsistent abbreviations were used, etc.

Considerable effort had been spent in the 80s to solve this problem: Dan Bloomberg developed mathematical morphology methods to clean up the scanned images (Meg Withgott coined the phrase "document dry cleaning"), and Rob Tow, David Hecht, and others invented the glyph technology to mark forms so that the location and semantics of each field could be looked up. Maybe due to the increasing Gini coefficient, they were not commercially successful. However, because this time we had actual customers, we decided to give it another go.

Note that OCR packages already have pretty good built-in spelling checkers, so we are dealing with hard cases. The standard approach used in semantic analysis is based on predetermining all situations for a lexeme and storing them in NoSQL databases. In our applications this turned out to be too slow: we needed a response time under 16 µs.

Looking at our dataset, we had a number of different issue types:

  • synonym: a word or phrase that means exactly or nearly the same as another word or phrase in the same language
  • abbreviation: a shortened form of a word or phrase
  • figure of speech: a word or phrase used in a nonliteral sense to add rhetorical force to a written passage
  • misspelling: a word or phrase spelled incorrectly
  • mistyping: like misspelling, but depends on the distance between the keys on a keyboard or the stroke classes in the OCR engine
  • metonym: a word, name, or expression used as a substitute for something else with which it is closely associated
  • synecdoche: a figure of speech in which a part is made to represent the whole or vice versa
  • metalepsis: a figure of speech in which a word or phrase from figurative speech is used in a new context
  • kenning: a compound expression in Old English and Old Norse poetry with metaphorical meaning
  • acronym: an abbreviation formed from the initial letters of other words and pronounced as a word

In addition to computational speed, we wanted to be able to address each of these constructs explicitly. In the following we call a lexeme "category" because the process used is a categorization more than a linguistic analysis.

We solved the problem by introducing a metric, so that we could deal with everything as distances and intervals. For the metric we chose the edit distance, also known as Levenshtein distance: the number of single-character edits (additions, deletions, and substitutions) needed to change a first word in the category into a second word in the category. This metric can be computed very fast. We also tried the Damerau-Levenshtein distance, which additionally counts the transposition of two adjacent characters as a single edit, but in this application it did not make a difference.
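As an illustration, here is a minimal sketch of the classic Levenshtein distance using the usual two-row dynamic program. It counts single-character insertions, deletions and substitutions, so a swap of two adjacent letters costs two edits. The function name is chosen for this example and is not taken from our implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic Levenshtein distance between two strings."""
    # Keep the shorter string in the inner loop to minimize row length.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca by cb (or keep)
            ))
        previous = current
    return previous[-1]
```

The cost is proportional to the product of the two string lengths, which stays small for the short words and phrases found on forms.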

With a metric, everything was simple. We took the lexemes in each category and computed the center of gravity to determine the prototype for each category and the diameter of the category. The category then received a label that is the word or phrase used in the document typing application.

interval of a lexeme

With misspellings the diameters could be small, but with synonyms and figures of speech the intervals could be quite large and could overlap. The intersections could easily be computed. Because in all three of our datasets the dictionaries were small, we easily resolved the overlaps visually by constructing a graph in which each node had the category diameter as the value and the edges between two nodes had their Levenshtein distance as a weight. Then we plotted the graph with Gephi and split up the overlapping categories into smaller ones with the same label.

With this, document typing became very fast: for each word or phrase we looked into which category it fell, and in there we looked for a match. When there was one, we replaced it with the lexeme's label; when not, we added it to the category and logged it for manual verification.
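The following is a rough sketch of how such a categorizer could look, not the patented implementation. Several choices are assumptions made for the example: the prototype is taken to be the medoid (the member with the smallest total distance to the others), the diameter is the largest pairwise distance, a word "falls into" a category when its distance to the prototype does not exceed the diameter, and an unmatched word is provisionally given the category label. The names `Category` and `type_word` are hypothetical.

```python
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance computed with a single rolling row."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        diag, row[0] = row[0], i
        for j, cb in enumerate(b, 1):
            diag, row[j] = row[j], min(row[j] + 1, row[j - 1] + 1,
                                       diag + (ca != cb))
    return row[-1]

class Category:
    def __init__(self, label, members):
        self.label = label
        self.members = set(members)
        # Assumption: the "center of gravity" is the medoid of the members.
        self.prototype = min(
            sorted(self.members),
            key=lambda w: sum(edit_distance(w, m) for m in self.members))
        # Assumption: the diameter is the largest pairwise distance.
        self.diameter = max(
            (edit_distance(a, b) for a, b in combinations(self.members, 2)),
            default=0)

    def contains(self, word):
        return edit_distance(word, self.prototype) <= self.diameter

def type_word(word, categories, review_log):
    """Return the category label for word, or word itself if none applies."""
    candidates = [c for c in categories if c.contains(word)]
    if not candidates:
        return word  # outside every category: leave unchanged
    best = min(candidates, key=lambda c: edit_distance(word, c.prototype))
    if word not in best.members:
        # New variant: remember it and queue it for manual verification.
        # The prototype and diameter are not recomputed in this sketch.
        best.members.add(word)
        review_log.append((word, best.label))
    return best.label

categories = [
    Category("invoice", ["invoice", "invoise", "invoce"]),
    Category("amount", ["amount", "amout", "ammount"]),
]
review_log = []
label = type_word("invioce", categories, review_log)
```

Here `label` is the category label "invoice", and `review_log` holds the new variant "invioce" for manual verification.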

Patent 9,098,487 was filed November 29, 2012 and issued August 4, 2015.

Friday, July 17, 2015

What can be addressed vs what can be seen

Remember the early days of color in consumer PCs? They started out with 256 colors or the slightly fewer "web safe" colors and progressed to "thousands of colors" (or "high color"). Just a short time later, 24-bit display controllers became sufficiently cheap to make full (or "true") color viable for consumer PCs. The marketing slogan at the time was pushing the display capability as "over 16 million colors."

We color scientists cringed at this claim and tried to explain to the marketeers that the human visual system can only discriminate 4 to 7 million colors. Consequently, the marketing collateral should use the adjective "addressable": the product can address 16,777,216 colors.

A similar issue is surfacing in the storage technology sector. Consumers are told they can store terabytes (TB) of images or video (for free or almost) and businesses are offered petabyte (PB) size storage, while storage vendors claim their systems can scale to exabyte (EB) scale. In reality this is not what can be used but what can be addressed.

For example, my home NAS is a RAID of two 4 TB drives. The parts cost about $500; could I have saved this money and assembly time by using a free cloud service? The answer is a simple no, because on my VDSL connection it would take me 2 years to upload 4 TB of data.

In business applications the situation is similar. When you have PBs of data, you can no longer do backups because they take too long. You have to replicate the data and even with the latest technologies copying a TB takes 30 minutes, which is a long time if you have EB scale data. Even reformatting a drive with too many IO failures takes too long in a PB scale datacenter and self-encrypting drives are the only viable solution.
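The arithmetic behind these two examples can be sketched as follows. The assumptions are mine: decimal units (1 TB = 10^12 bytes), a 365-day year, and a single serial stream with no parallel copies. The function names are made up for this illustration.

```python
SECONDS_PER_DAY = 86_400

def copy_days(terabytes, minutes_per_tb=30.0):
    """Days needed to copy the data serially at 30 minutes per TB."""
    return terabytes * minutes_per_tb * 60 / SECONDS_PER_DAY

def implied_mbit_per_s(terabytes, years):
    """Sustained line rate implied by moving the data in the given time."""
    bits = terabytes * 1e12 * 8
    seconds = years * 365 * SECONDS_PER_DAY
    return bits / seconds / 1e6

pb_days = copy_days(1_000)             # 1 PB: about 21 days
eb_years = copy_days(1_000_000) / 365  # 1 EB: about 57 years
uplink = implied_mbit_per_s(4, 2)      # 4 TB in 2 years: about 0.5 Mbit/s
```

Under these assumptions, a single copy stream needs roughly three weeks for a PB and more than half a century for an EB, which is why replication has to be parallel and continuous.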

How can we find a measure for the practically usable storage size as opposed to the addressable storage? The previous paragraph suggests that a possible metric is the time required to maintain the high availability (HA) of your data. You may have enough bandwidth to ingest a large volume of data, but hopefully you are also reading and processing it. To this you have to add the bandwidth the storage system needs to maintain at least three replicas of the data.

This sounds easy, but it is not realistic. In modern storage the key measure is the scalability. A recent CACM paper is an excellent introduction on how to study your system using Gunther's Universal Scalability Law (USL). This paper analyzes the speedup of TeraSort on AWS EC2. The figure shows the modeled speedup with parameters σ = −0.0288, κ = 0.000447 for c1.xlarge instances optimized for compute with 8 virtual Xeon cores, 7 GiB memory, 4 × 420 GB instance storage and high network performance.

When considering the practical scalability of systems, we are interested only in the linear portion of the curve. Gunther's USL teaches us that for the out-of-the-box version of the Hadoop TeraSort MapReduce workload you can only scale linearly to 48 nodes, then you are out of steam (in their experiments the authors limited the TeraSort dataset to 100 GB).

TeraSort scalability on EC2
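To reproduce the curve, here is a small sketch that assumes the standard form of the USL, C(N) = N / (1 + σ(N − 1) + κN(N − 1)), and plugs in the parameters quoted above. The function and variable names are chosen for this example.

```python
def usl_speedup(n, sigma, kappa):
    """Relative capacity C(N) under the Universal Scalability Law."""
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

SIGMA, KAPPA = -0.0288, 0.000447  # parameters quoted for c1.xlarge

# Evaluate the model for 1 to 200 nodes and locate the maximum.
curve = {n: usl_speedup(n, SIGMA, KAPPA) for n in range(1, 201)}
peak_n = max(curve, key=curve.get)
peak_speedup = curve[peak_n]
```

With these parameters the modeled speedup peaks at 48 nodes, consistent with the limit mentioned above. The negative σ is what makes the modeled speedup at that point larger than the node count.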

Note that I explicitly wrote "Hadoop TeraSort MapReduce workload." This is because there is no single magic number. When you plan the configuration of your system, you have to carefully determine your workloads and measure their performance to estimate the parameters for the USL model.

The volume of your data is probably given by your application. The USL model allows you to optimize the system configuration so it has the required elasticity. The CACM paper shows how you optimize your application. By optimizing parameters like timeouts and cache sizes the authors were able to increase the speedup by 30% and extend the linear portion from 48 nodes to 95 nodes. This is a substantial improvement.

It is a good exercise to put the USL model in a graphing application (e.g., the free Grapher in the Utilities folder of MacOS) and animate the parameters.

The linear parameter σ (i.e., the coefficient of the linear term) is a correlate of contention. This is the parameter optimized in scale-up systems or vertical systems. An example of a "knob" you can turn in the system to control contention is the bandwidth. This can be the LAN speed, the number of local drives, etc. What you learn by animating the linear parameter is that by optimizing a scale-up system you can increase the speed-up. However, the linear portion does not change: the range of nodes in which the system scales is constant.

Recently the cost of enterprise class 2 TB SSDs has dropped below $1000, which due to the greater longevity brings the total cost of ownership down to that of rotating disks. Consequently many have tried to replace their HDDs with SSDs, and the Internet is full of academic papers, newsletters and blog posts from people lamenting that their storage systems have not become faster. First, they were just changing one component of contention, so scalability is not improved. Second, in a system you cannot just change out one component and hope for better performance, especially if it was not a bottleneck: you have to optimize the whole system using a USL model.

The quadratic parameter κ is a correlate of coherency. This accounts for the cost of message passing, for example for cache consistency and locking. This is the parameter optimized in scale-out systems or horizontal systems. What you learn by animating the quadratic parameter is that by optimizing a scale-out system you can increase the speed-up and also the linear portion increases: the range of nodes in which the system scales is variable.
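If you do not have a graphing application at hand, the effect of the two parameters can also be checked numerically. This sketch assumes the standard USL form, whose maximum lies at N* = sqrt((1 − σ) / κ), and uses that maximum as a stand-in for the end of the useful scaling range. The halved values are arbitrary illustrations, not measurements.

```python
import math

def usl_peak(sigma, kappa):
    """Node count at which the USL speedup curve reaches its maximum."""
    return math.sqrt((1 - sigma) / kappa)

baseline = usl_peak(-0.0288, 0.000447)     # TeraSort parameters: about 48
half_sigma = usl_peak(-0.0144, 0.000447)   # contention halved: about 47.6
half_kappa = usl_peak(-0.0288, 0.0002235)  # coherency halved: about 67.8
```

Halving σ barely moves the peak, while halving κ pushes it out by a factor of about 1.4, which is the behavior described above for scale-up versus scale-out tuning.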

Optimizing intra-node communication is a scheduling problem and is difficult. However, the payoffs can be huge. For example, there have been reports in the literature that when instead of managing the TRIM operation at the SSD drive level it is managed at the system level, very substantial speed-ups and scalability extensions are achieved.

Software defined storage (SDS) systems are typically scale-out systems. Looking at the recent business results of storage companies, it is clear that scale-up system sales are declining in favor of scale-out system sales. Large vendors are investing large sums in the development of SDS technologies.

The USL model should encompass more than just the storage system. The big payoffs come when the algorithms are part of the model. For example, Hadoop is from a time of slow networks and today Spark is the better framework for distributed computing.

If you are into big data, now is the time to become intimately familiar with Gunther's Universal Scalability Law.