Thursday, January 19, 2017

Unable to complete backup. An error occurred while creating the backup folder

For the past four years, I have been backing up my laptop to a G-Technology FireWire disk connected to the hub in my display. It worked without a hitch until a few days ago, when I started to get the error message

Time Machine couldn’t complete the backup to “hikae”.

Unable to complete backup. An error occurred while creating the backup folder.

The message appeared without a time pattern, so it was not clear what it could be. The drive could not be unmounted and had to be force-ejected and power-cycled and then worked again until the next irregular event, maybe one backup out of ten.

When I ran Disk Utility to see if something was wrong with the drive, it told me the boot block was corrupted. Fixing it did not make the Time Machine problem go away, so the force-ejects must have corrupted the boot block rather than the other way around. Time to find out what is going on.

The next time it happened, I tried to eject the drive from Disk Utility, which gave me the message

Disk cannot be unmounted because it is in use.

Who on Earth would be using it? Did Time Machine hang? Unix to the rescue, let us get the list of open files

sudo lsof /Volumes/hikae

The user is root and the processes are mds and mds_stores, working on index files. They are indexing the drive for Spotlight. Why on Earth would an operating system index a backup drive by default? Let us get rid of that.

sudo mdutil -i off /Volumes/hikae

However, in this state, the command returns "Error: unable to perform operation. (-400) Error: unknown indexing state." This might mean Spotlight has crashed or is otherwise hanging.

I force-ejected and power-cycled the drive. This time, mdutil works:

2017-01-18 17:10:00.657 mdutil[25737:7707511] mdutil disabling Spotlight: /Volumes/hikae -> kMDConfigSearchLevelFSSearchOnly
Indexing and searching disabled.

For the past two days, I have no longer experienced the problem.

A question for the product manager: why does Spotlight index backup drives by default?

If you prefer using a GUI, drag and drop your backup drive icon into the privacy pane of the Spotlight preference window (I did not try this):

Tell Spotlight not to index your backup drive

Wednesday, January 11, 2017

Designing and assessing near-eye displays to increase user inclusivity

Today Emily Cooper, Psychological and Brain Sciences Department at Dartmouth College, gave a talk on designing and assessing near-eye displays to increase user inclusivity. A near-eye display is a wearable display, for example, an augmented reality (AR) or a virtual reality (VR) display.

With most near-eye displays, it is not possible or not recommended to wear glasses. Some displays, like the HTC Vive, offer lenses to correct the accommodation. We want to integrate flexible correction directly into near-eye displays. This can be achieved with a liquid polymer lens whose membrane can be tuned.

In her lab, for the refraction self-test, the presenter uses an EyeNetra auto-refractometer, which is controlled with a smartphone.

The near-eye display correction is as good as with contact lenses, both in sharpness and in fusion correction. Therefore, it is not necessary to make users wear their correction glasses.

Two factors determine the image quality of a near-eye display: accommodation and vergence. With incorrect vergence, users get tired after about 20 minutes and their reaction time slows down.

The solution is to use tunable optics to match the user's visual shortcomings.

A different problem is presbyopia, a reduction of the accommodation range. For people older than 45 years, an uncorrected stereo display provides better image quality than one that corrects the accommodation. However, tunable optics provide better vergence for older people.

A harder problem is low vision, regardless of age. In her lab, Emily Cooper investigated whether consumer-grade augmented reality displays are good enough to help users with low vision.

She used the HoloLens, in which the depth camera in the NIR domain is the key feature to address this problem. Her proposal is to overlay the depth information as a luminance map over the image so that near objects are light and far objects are dark. This allows the users to get by with their residual vision.

Instead of a luminance overlay, a color overlay also works: the hue of each segment is shifted from warm to cold colors depending on its distance. She also tried to encode depth with flicker, but it does not work well.

With the HoloLens, it is possible to integrate OCR in the near-eye display and then read all text in the field of view using the 4 speakers in the HoloLens, making the sound come from the location where the text is written.

Saturday, December 31, 2016

Business backs the basics

The last third of the year has been very busy and I did not have a chance to stay current with my reading. Consequently I do not have anything to write.

Editors write editorials, which are rarely read. Indeed, editorials are useful mostly for the editors themselves, because writing them forces them to structure their journal or conference. Unfortunately, they are usually written under time pressure and are not always well-rounded. Still, better than my writer's block: here is an editorial written by Subra Suresh and Robert A. Bradway about their CEOs and Leaders for Science retreat at Sunnylands in Rancho Mirage, as published in Science, 14 Oct 2016, Vol. 354, Issue 6309, p. 151, DOI: 10.1126/science.aal1580.

Earlier this year, a number of leaders from major U.S. corporations gathered at Sunnylands in California to discuss the critical importance of basic scientific research. For decades, the private sector has withdrawn from some areas of basic research, as accelerating market pressures, the speed of innovation, and the need to protect intellectual property in a global marketplace made a Bell Labs–style, in-house model of discovery and development hard to sustain. However, the leaders who gathered for the “CEOs and Leaders for Science” retreat (which we convened) agreed that basic research will make or break corporations in the long term. Why?

Long-term basic research, substantially funded by the U.S. government, underlies some of industry's most profitable innovations. Global positioning system technology, now a staple in every mobile phone, emerged from Cold War Defense Department research and decades of National Science Foundation explorations. As well, long-term public–private partnerships in basic research have driven U.S. leadership, from information technology to drug development and medical advancement. For example, the Human Genome Project combined $14.5 billion in federal investment with a private-sector initiative, generating nearly $1 trillion in jobs, personal wealth for entrepreneurs, and taxes by 2013. Such endeavors created a science ecosystem that in turn generated the talent pipeline upon which it depended.

Although for-profit corporations still invest in proprietary product development and expensive clinical trials, industry finds itself unable to invest in basic research the way it once did. The need for increased corporate secrecy, market force–driven short-term decision-making, and narrowing windows to monetize new technologies have whittled away industry's willingness and ability to conduct basic research. This change threatens U.S. preeminence in research. For instance, the nation may lose its ability to attract and retain the finest talent from around the world. A good fraction of the students who earn advanced degrees in science and technology in the United States come from abroad because of the nation's scientific excellence. For decades, American companies could attract and retain the finest talent from around the world. But if the U.S. loses its edge in research, it may also lose this vital resource of expertise and innovation.

Consequently, business leaders assembled at Sunnylands resolved to use their individual and collective credibility, and their stature as heads of enterprises that fuel the economy, to advocate for greater government support for basic scientific research to revitalize the science ecosystem. However, they will need to lift sagging public opinion because many Americans now see basic research as a luxury rather than a necessity. A 2015 Pew poll found that the share of Americans who view publicly funded basic research as “not worth it” rose from 18 to 24% between 2009 and 2014. At the same time, those who believe private investment is enough to ensure scientific progress also increased from 29 to 34%.

With that in mind, the CEOs will partner with academic leaders to educate the public about the importance of basic research. Together, they will advocate for this in meetings with federal officials, through various media channels, and by asking presidents in the Association of American Universities to identify corporate leaders in their respective communities to join the effort. The hope is that this concerted action positions basic research atop the next U.S. president's agenda.

History has shown that investments in basic research are the primary engine by which humanity has advanced, and major economic gains—often unanticipated when the research was initially funded—have been realized. In the United States, that will require a long-term commitment from the government, complementing the ongoing investment of risk capital and key industry sectors.

America's leadership role in scientific innovation is an inherited responsibility and an economic imperative. It must not be neglected.

Credit: Emily Gadek. This file is licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license.

Friday, November 11, 2016

App adds makeup to faces on video conferences

In a potential boost for the government’s drive to get more people telecommuting, cosmetics company Shiseido Co. has developed an app that makes users look as if they are wearing makeup. It amounts to an instant makeover for the unfortunate worker called to appear on screen from home at an awkward hour.

Read the article in the Japan Times.

Yoko's lips

Thursday, October 13, 2016

Facebook Surround

Yesterday afternoon, Brian Cabral, Director of Engineering at Facebook, gave a talk at the Stanford Center for Image Systems Engineering (SCIEN) with the title "The Soul of a New Camera: The design of Facebook's Surround Open Source 3D-360 video camera." Here is his abstract:

Around a year ago we set out to create an open-source reference design for a 3D-360 camera. In nine months, we had designed and built the camera and published the specs and code. Our team leveraged a series of maturing technologies in this effort. Advances and availability in sensor technology, 20+ years of computer vision algorithm development, 3D printing, rapid design prototyping and computational photography allowed our team to move extremely fast. We will delve into the roles each of these technologies played in the designing of the camera, giving an overview of the system components and discussing the tradeoffs made during the design process. The engineering complexities and technical elements of 360 stereoscopic video capture will be discussed as well. We will end with some demos of the system and its output.

The design goals for the Surround were the following:

  • High-quality 3D-360 video
  • Reliable and durable
  • Fully spherical
  • Open and accessible
  • End-to-end system

These goals cannot be achieved by strapping together GoPro cameras because they get too hot and it is very difficult to make them work reliably. Monoscopic is old and no longer interesting. The challenge for VR is to do it stereoscopically: we are interested in a stereoscopic 3D-360 capture.

They are using 14 Point Grey cameras with wide angle lenses around the equator and a camera with a fisheye on the north pole. For the south pole they are using two fisheyes to get rid of the pole holding the Surround.

A rolling shutter is much worse in 3D than in 2D, so it is necessary to use a global shutter, at the expense of SNR. Brian Cabral discussed the various trade-offs between number and size of cameras, spatial resolution, wide angle vs. fisheye lenses and physical size.

Today, rapid prototyping has made a lot of progress: we can just try things out in the lab. For this application, the hardware is easy, but stitching the images together is difficult. The solution is to use optical flow and to simulate slit cameras.

No attempt is made to compress the data. The images are copied completely raw to a RAID of SSD drives. The rendering then takes 30 seconds per frame.

The Surround has been used for a multi-million-dollar shoot at Grand Central Station. The camera is being open sourced because so far it is only 1% of the solution, and making it open will encourage many people to contribute to the remaining 99%.

At the end of the presentation, two VR displays were available to experience the result. I did not quite dare to strap in front of my eyes a recalled smartphone that could explode at any time, so I passed on the demo. However, the brave people commented that you can rotate your head but not move sideways, because then the image falls apart. It was also noted that the frame rate should be at least 90 Hz. Finally, people reported vergence problems and slight nausea.

Facebook Surround kit

Dataset metadata for search engine optimization

Last week I wrote a post on metadata. Google is experimenting with a new metadata schema it calls Science Datasets that will allow it to make public datasets more discoverable.

The mechanism is under development and they are currently soliciting interested parties with the following kinds of public data:

  • A table or a CSV file with some data
  • A file in a proprietary format that contains data
  • A collection of files that together constitute some meaningful dataset
  • A structured object with data in some other format that you might want to load into a special tool for processing
  • Images capturing the data
  • Anything that looks like a dataset to you

In your metadata schema you can use any of the dataset properties, but it should contain at least the following basic properties: name, description, url, sameAs, version, keywords, and variableMeasured. If your dataset is part of a corpus, you can reference the corpus in the includedInDataCatalog property.

There are also properties for download information, temporal coverage, spatial coverage, citations and publications, and provenance and license information.
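As a sketch, such a description can be expressed as schema.org JSON-LD embedded in the dataset's landing page. The property names come from the schema discussed above; the dataset name, URLs, and catalog below are entirely hypothetical.

```python
import json

# Minimal schema.org Dataset description in JSON-LD with the basic
# properties named above. All values are made-up placeholders.
metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Coastal water temperature 2010-2016",
    "description": "Hourly sea-surface temperature readings from buoy X17.",
    "url": "https://example.org/datasets/coastal-temp",
    "sameAs": "https://doi.org/10.9999/example",
    "version": "1.2",
    "keywords": ["oceanography", "temperature", "time series"],
    "variableMeasured": "sea surface temperature",
    # If the dataset is part of a corpus, reference the catalog:
    "includedInDataCatalog": {
        "@type": "DataCatalog",
        "name": "Example Ocean Data Catalog",
    },
}

# The resulting JSON-LD block is what a crawler would pick up.
print(json.dumps(metadata, indent=2))
```

A usage note: the JSON-LD goes into a script tag of type application/ld+json on the page that describes the dataset.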

This is a worthwhile effort to make your research and public datasets more useful to the community.


Thursday, October 6, 2016

Progress in wearable displays

Yesterday afternoon, Bernard Kress, Partner Optical Architect at Microsoft Corp, in the HoloLens project, gave a talk at the Stanford Center for Image Systems Engineering (SCIEN) with the title "Human-centric optical design: a key for next generation AR and VR optics." Here is the abstract:

The ultimate wearable display is an information device that people can use all day. It should be as forgettable as a pair of glasses or a watch, but more useful than a smartphone. It should be small, light, low-power, high-resolution and have a large field of view (FOV). Oh, and one more thing, it should be able to switch from VR to AR.

These requirements pose challenges for hardware and, most importantly, optical design. In this talk, I will review existing AR and VR optical architectures and explain why it is difficult to create a small, light and high-resolution display that has a wide FOV. Because comfort is king, new optical designs for the next-generation AR and VR system should be guided by an understanding of the capabilities and limitations of the human visual system.

There are three kinds of wearable displays:

  • Smart eyewear: extension of eyewear. Example: Google Glass
  • Augmented reality (AR) and mixed reality (MR): extension of the computer. An MR display has a built-in 3d scanner to create a 3d model of the world
  • Virtual reality (VR): extension of the gaming console

Bernard surveyed wearable displays from their inception to future projections. The speed of the presentation and the amount of material made it impossible to follow the talk unless you are an expert in the field. After the presentation, Bernard told me that his PowerPoint file is about 250 MB!

My takeaway was that the biggest issue in wearable displays is cost. So far, the optics engineers designed with cameras in mind and over-designed. The current breakthrough is that the optics engineers are starting to understand the human visual system (HVS), so they can design systems that are just good enough for our MTF. Bernard claims that so far the industry has been mostly about hype, but in 2017 products will take off and the new challenge is "show me the money."

By Microsoft Sweden [CC BY 2.0], via Wikimedia Commons

Tuesday, October 4, 2016


As Carlsson notes, big data is not about "big" but about complexity in format and structure. We can approach the format complexity through metadata, which allows us to navigate the data sets and determine what they are about.

Two important requirements on experiments are replicability and reproducibility. Replicability refers to the ability to rerun the exact same experiment on the same data and obtain exactly the same result; it is an aspect of governance, and it is good practice to always have somebody else check the data and its analysis before publication. Reproducibility refers to the ability to use different data, techniques, and equipment to confirm a previously obtained result. We can be confident in a result only after it has been reproduced independently. These two requirements guide us to the kind of metadata we need.

There are three classes of metadata: context, syntax, and semantics.

Context of data refers to how, when, and where it was collected. The context is usually written in a lab book. If we need to replicate an analysis at a later time, the lab book might be unretrievable, therefore the context of data has to be stored with the data. This can also be a big money and time saver because some ancillary data we need for an analysis might already be available from a previous experiment; we need to be able to find it.

The syntax of data refers to its format. Analysts spend a large amount of their time wrangling data. When the format of each time series is clearly described, this tedious work can be greatly simplified. During replication and reproduction, it can also help diagnose frequent errors such as confusing metric and imperial units of measure. Ideally, we should also store the APIs to the data with the data, because they are part of its syntax.

The semantics of data refers to its meaning and is the most difficult metadata to produce. We require a unified framework that researchers in all scientific disciplines can use to create consistent, easily searchable metadata. Ease of use is paramount: because the ability to share data is so important, we want the process of metadata creation to be as painless as possible. This means that we must start by creating an ontology for each domain in which we create data.

Ontologies evolve with time. A big challenge is to track this evolution in the metadata. For example, if we called a technique "machine learning" but then realize the term is too generic and we should call it "cluster analysis," because that is what we were doing anyway, we also have to update the old metadata. Data curation applies to the metadata as well.

the evolution of terms

Some metadata can be computed from the data itself, for example, the descriptive statistics. At NASA, the automatic extraction of metadata from data content is called data archeology.

Friday, September 30, 2016

Navigating instead of searching

I believe it was on a day at the end of March or beginning of April 1996, when out of the blue I received the assignment to write a report on why the world wide web would be important for my employer at the time. I was given only two weeks to write a blurb and a slide deck. I had not thought about this matter and it was a struggle to write something meaningful in such a short time, without any time to read up. It ended up as an exercise in automatic writing: just write ahead and never look back and revise.

At the end, I delivered the blurb, but management decided the web was just a short-lived fad like citizen band (CB) radio that would go away shortly, and put the blurb in the technical report queue for external publication. I did not think much about it because such is life in a research lab. However, with all the hype of the time on the "Internet Tsunami" by Bill Gates, the "Dot in the Dot Com" by Scott McNealy, and all the others—while my employer remained silent and kept everybody curious—the report W3 + Structure = Knowledge was requested hundreds of times (at that time, a typical tech report was requested fewer than ten times). Subsequently, I received quite a few requests to present the companion slide deck.

In my struggle to write something in two weeks, I typed about the need to structure information on the world wide web so it can be easily navigated, instead of searching for information.

little dancer leaping over the world wide web

It appears today we are at such a disruptive juncture again. This time, it is not about websites: it is about data (some prefer to call it big data, but size does not really matter). Solid state drives are now inexpensive, can hold 60 TB of data in each 3.5" drive, and have access times similar to RAM. In addition, we have all the shared data in the various clouds.

Today, we are not interested in finding data; we are perhaps willing to navigate to data, but preferably we would like the data to anticipate our need and come to us in digested, actionable form. Actually, we are not interested in the data itself: we are interested in data in context, that is, knowledge. We want the data to come to us and ask whether it is OK to take an anticipated action based on a compiled body of knowledge: wisdom.

This is an emergent property because all the pieces have fallen together. Our mobile devices and the internet of things constantly gather data. There are open source implementations of deep learning algorithms. CUDA 8 lets us run them on inexpensive Pascal GPGPUs like the Tesla P100. Algebraic topology analytics lets us build networks that compile knowledge about the data. Digital assistant technology brings this knowledge at our service.

Another key ingredient is the skilled workforce. Mostly Google, but also Facebook, Apple, Amazon, etc. have been aggressively educating their workforce in advanced analytics, and as brains move from company to company in the Silicon Valley, these skills are diffused in the industry.

In a recent New York Times article, G.E.'s Jeffrey R. Immelt explained how he is taking advantage of this talent pool in a new Silicon Valley R&D facility employing 1,400 people. Microsoft is creating a new group, the AI and Research Group, by combining the existing Microsoft Research group with the Bing and Cortana product groups, along with the teams working on ambient computing. Together, the new AI and Research Group will have some 5,000 engineers and computer scientists.

This is the end of search engines. This is the end of metadata: we want wisdom based on all the data.

Here is a revised version of the post I wrote in July.

The assumption is that, to be useful, technology has to enable society to become more efficient so that the quality of life increases. The increase has to be at least one order of magnitude.

Structured data

In the context of big data, we read a lot about structured versus unstructured data. So far, so good. Things get a little murky and confusing when advanced analytics—which refers to analytics for big data—joins the conversation. The confusion comes from the subtle difference between "structured data" and "structure of data," which contain almost the same words (their edit distance is 3). Both concepts are key to advanced analytics, so they often come up together. In this post, let us try to shed some light on this confusion.

The categorization in structured, semi-structured, and unstructured data comes from the storage industry. Computers are good at chewing on large amounts of data of the same kind, like for example the readings from a meter or sensor, or the transactions on cash registers. The data is structured in the sense that each record has the same fields at the same locations, for example on an 80 or 96 column punched card, if you want a visual image. This structure is described in a schema.

Databases are optimized for storing structured data. Since each record has the same structure, the location of the i-th record on the disk is i times the record length. Therefore, it is not necessary to have a file system: a simple block storage system is all that is needed. When instead of the i-th record we need the record containing a given value in a given field, we have to scan the entire database. When this is a frequent operation in a batch step, we can accelerate it by first sorting the records by the values in this field, which allows us to use binary search, which is logarithmic instead of linear.
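The two access patterns above can be sketched in a few lines: direct access by record number via offset arithmetic, and binary search over records sorted by key. The record format, keys, and payloads below are made up for illustration.

```python
import struct

RECORD_FMT = "<i8s"                       # key: int32, payload: 8 bytes
RECORD_SIZE = struct.calcsize(RECORD_FMT)

# A tiny "database": one flat block of fixed-length records, sorted by key.
records = [(3, b"gamma   "), (7, b"eta     "), (12, b"mu      ")]
block = b"".join(struct.pack(RECORD_FMT, k, p) for k, p in records)

def read_record(i):
    """Fetch the i-th (0-based) record directly: offset = i * record length."""
    off = i * RECORD_SIZE
    return struct.unpack(RECORD_FMT, block[off:off + RECORD_SIZE])

def find_by_key(key, n):
    """Binary search on the key-sorted block: O(log n) record reads
    instead of a full linear scan."""
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        k, payload = read_record(mid)
        if k == key:
            return (k, payload)
        if k < key:
            lo = mid + 1
        else:
            hi = mid
    return None
```

Because every record has the same size, no file system metadata is needed to locate a record: block storage plus multiplication suffices.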

Because an important performance metric is the number of transactions per second, database management systems use auxiliary structures like index files and optimizing query systems like SQL. In a server-based system, when we have a query, we do not want to transfer the database record by record to the client: this leads to server-based queries. When there are many clients, often the same query is issued from various clients, therefore, caching is an important mechanism to optimize the number of transactions per second.

Database management systems are very good at dealing with transactions on structured data. There are many optimization points that allow for huge performance gains, but it is a difficult art requiring highly specialized analysts.

Semi-structured data

With cloud computing, it has become very easy to quickly deploy a consumer application. The differentiation is no longer achieved by optimizing the database, but by being able to collect and aggregate user data so it can be sold. This process is known as monetization; an example is click-streams. The data is to a large extent in the form of logs, but their structure is often unknown. One reason is that the schemata often change without notification, and the monetizers have to infer them by reverse engineering. Since the data is structured but the schema is unknown, it is called semi-structured. With the Internet of Things (IoT), also known as the Web of Things, Industrial Internet, etc., a massive source of semi-structured data is coming towards us.

This semi-structured data is high-volume and high-velocity. This breaks traditional relational databases because data parsing and schema inference become a performance bottleneck. Also, the indexing facilities may not be able to cope with the data volume. Finally, the traditional database vendor's pricing models do not work for high volumes of less costly data. The paradigms for semi-structured data are column based storage and NoSQL (not only SQL).

In big data scenarios, structured data can have high-volume and high-velocity. Although it may be fully structured, e.g., rows of double precision floating point values from a set of sensors (a time series), a commercial database system might lose data when reindexing. Even a NoSQL database might be too slow. In this case, this structured data is treated as unstructured and each column is stored in a separate file for concurrent writes. Typically, the content of such a file is a time series.
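The column-per-file approach can be sketched as follows: each sensor column becomes its own flat file of float64 values, so independent writers append without contending on a shared database. The sensor names and readings are made up.

```python
import os
import struct
import tempfile

# Hypothetical sensor readings; in practice these arrive continuously.
readings = {
    "temperature": [21.5, 21.7, 21.6],
    "pressure":    [1013.2, 1013.1, 1012.9],
}

outdir = tempfile.mkdtemp()
for column, values in readings.items():
    path = os.path.join(outdir, column + ".f64")
    # Append-only writes per column: each writer owns its own file,
    # so there is no index to update and no lock to take.
    with open(path, "ab") as f:
        f.write(struct.pack("<%dd" % len(values), *values))

# Reading a column back is a single sequential scan of its file.
with open(os.path.join(outdir, "temperature.f64"), "rb") as f:
    raw = f.read()
temps = list(struct.unpack("<%dd" % (len(raw) // 8), raw))
```

The price of this scheme is that all structure beyond "a sequence of doubles" lives outside the files, which is exactly why such data gets treated as unstructured.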

Unstructured data

The ubiquity of smartphones, with their photo and video capabilities and connectedness to the cloud, has brought a flood of large data files. For example, when the consumer insurance industry thought it could streamline its operations by having insured customers upload images of damages, instead of keeping a large number of claim adjusters in the field, it got flooded with images. While an adjuster knows how to document a damage with a few photographs, consumers take dozens of images because they do not know what is essential.

Photographs and videos have a variety of image dimensions, resolutions, compression factors, and durations. The file sizes vary from a few dozen kilobytes to gigabytes. They cannot be stored in a database other than as a blob (binary large object): the multimedia item is stored as a file or an object, and the database just contains a file pathname or the address of an object. In general, not just images and video are stored in blobs; therefore, we use the more generic term digital items. Examples of digital items in engineering applications are drawings, simulations, and documentation. In their 2006 paper, Jim Gray et al. found that databases can efficiently store digital items of up to 256 KB [1].

For digital items, in juxtaposition to conventional structured data, the storage industry talks about unstructured data.

Unstructured data can be stored and retrieved, but there is nothing else that can be done with it when we just look at it as a blob. When we look at analytics jobs, we see that analysts spend most of their time munging and wrangling data. This task is nothing else than structuring data because analytics is applied to structured data.

In the case of time series data, the wrangling is easy, as long as the columns have the same length. If not, a timestamp is needed to align the time series elements and introduce NA values where data is missing. An example of misaligned data is when data from various sources is blended.
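Aligning on timestamps can be sketched in a couple of lines: take the union of the timestamps of two sources and insert None (NA) wherever one source has no reading. The timestamps and values are made up.

```python
# Two time series from different sources, as timestamp -> value maps.
a = {0: 1.0, 10: 1.1, 20: 1.2}
b = {0: 5.0, 20: 5.2, 30: 5.3}

# Align on the union of timestamps; None stands in for NA where a
# source has no reading at that time.
timestamps = sorted(set(a) | set(b))
aligned = [(t, a.get(t), b.get(t)) for t in timestamps]
# aligned == [(0, 1.0, 5.0), (10, 1.1, None), (20, 1.2, 5.2), (30, None, 5.3)]
```

This is essentially what an outer join on the timestamp column does when blending data from various sources.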

In the case of semi-structured data, wrangling entails reverse engineering the schema, converting dates between formats, distinguishing numbers and strings from factors, and dealing correctly with missing data. In the case of unstructured data, it is about extracting metadata by parsing the file. This can be a set of tags like the color space, or a more complex data structure like EXIF, IPTC, or XMP metadata.

Structure in time series

A time series is just a bunch of data points, so at first one might think there is no structure. In a way, statistics, with its aim to summarize data, can describe the structure in raw data. It can infer its distribution and its parameters, model it through regression, etc. These summary statistics are the emergent metadata of the time series.
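As a small sketch of this idea, the descriptive statistics of a series can be computed and stored alongside it as its emergent metadata. The series below is made up.

```python
import statistics

# A hypothetical time series of sensor readings.
series = [3.1, 2.9, 3.4, 3.0, 3.2, 2.8, 3.3]

# Summary statistics describing the structure of the raw data;
# these become metadata stored with the series.
emergent_metadata = {
    "n": len(series),
    "min": min(series),
    "max": max(series),
    "mean": statistics.mean(series),
    "stdev": statistics.stdev(series),
    "median": statistics.median(series),
}
```

A fuller version would also fit a distribution or a regression model and record the estimated parameters as metadata.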

Structure in images

A pictorial image is usually compressed with the JPEG method and stored in a JFIF file. The metadata in a JPEG image consists of segments, each beginning with a marker, the marker kind, and, if there is a payload, its length and the payload itself. An example of a marker kind is the frame type (baseline or progressive), followed by width, height, number of components, and their subsampling. Other markers carry the Huffman tables (DHT), the quantization tables (DQT), a comment, and application-specific data like the color space, color gamut, etc. This illustrates that unstructured data contains a lot of structure. Once the data wrangler has extracted and munged this data, it is usually stored in R data frames, or in a dedicated Hive or MySQL database, which allow processing with analytics software.
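The segment structure can be walked with a few lines of code. This is only a sketch of the marker mechanics, not a full JPEG parser (it stops at the start-of-scan marker, after which entropy-coded data follows), and the tiny hand-made stream exists only for illustration.

```python
import struct

def jpeg_segments(data):
    """Walk JPEG marker segments: each starts with 0xFF, a marker kind
    byte, and (for most markers) a 2-byte big-endian length that counts
    itself plus the payload."""
    STANDALONE = {0xD8, 0xD9} | set(range(0xD0, 0xD8))  # SOI, EOI, RSTn
    pos = 0
    while pos + 1 < len(data):
        assert data[pos] == 0xFF
        kind = data[pos + 1]
        if kind in STANDALONE:
            yield kind, b""
            if kind == 0xD9:                 # EOI: end of image
                return
            pos += 2
        else:
            (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
            yield kind, data[pos + 4:pos + 2 + length]
            if kind == 0xDA:                 # SOS: entropy data follows
                return
            pos += 2 + length

# A tiny hand-made stream: SOI, a COM (comment) segment, EOI.
stream = (b"\xff\xd8"
          + b"\xff\xfe" + struct.pack(">H", 2 + 5) + b"hello"
          + b"\xff\xd9")
segments = list(jpeg_segments(stream))
# segments == [(0xD8, b""), (0xFE, b"hello"), (0xD9, b"")]
```

Extracting tags like the color space or the quantization tables is then a matter of interpreting the payloads of the corresponding marker kinds.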

Deeper structure in images

Analytics is about finding even deeper structure in the data. For example, a JPEG image is first partitioned in 8×8 pixel blocks, which are each subjected to a DCT. Pictorially, the cosine basis (the kernels) looks like in this figure:

the kernels of the discrete cosine transform (DCT)

The DCT transforms the data into the frequency domain, similar to the discrete Fourier transform, but in the real domain. We do this to decorrelate the data. In each of the 64 dimensions, we determine the number of bits necessary to express the values without perceptual loss, i.e., depending on the MTF of the combined HVS and capture and viewing devices. These numbers of bits make up the quantization table stored in the DQT segment, and we zero out the low-order bits, i.e., we quantize the values. At this point, we have not reduced the storage size of the image, but we have introduced many zeros. Now we can analyze statistically the bit patterns in the image representation, determine the optimal Huffman table, which is stored with the DHT marker, and compress the bits, reducing the storage size of the image through entropy coding.

Just as we determine the optimal HT, we can also study the variation in the DCT-transformed image and optimize the DQT. Once we have implemented this code, we can use it for analytics: we can compute the energy in each of the 64 dimensions of the transformed image. As a proxy for the energy, we can compute the variance and obtain a histogram with 64 abscissa points. The shape of the ordinates gives us an indication of the content of the image; for example, the histogram will tell us whether an image is more likely scanned text or a landscape.

We have built a rudimentary classifier, which gives us a more detailed structural view of the image.

Let us recapitulate: we transform the image to a space that gives a representation with a better decorrelation (like transforming from RGB to CIELAB, then from Euclidean space to cosine space). Then we perform a quantization of the values and study the histogram of the energy in the 64 dimensions. We start with a number of known images and obtain a number of histogram shapes: this is the training phase. Then we can take a new image and estimate its class by looking at its DCT histogram: we have built a rudimentary classifier.
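The training and classification phases can be sketched as follows; the toy stand-in "images" (binary noise for text, a nearly flat patch for a landscape) and the nearest-signature decision rule are illustrative assumptions:

```python
import numpy as np

def dct_basis(n=8):
    """Orthonormal DCT-II basis matrix."""
    k = np.arange(n)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def energy_signature(image, c=dct_basis()):
    """Variance of each of the 64 DCT coefficients over all 8x8 blocks."""
    h, w = image.shape
    blocks = np.array([c @ image[y:y + 8, x:x + 8] @ c.T
                       for y in range(0, h - 7, 8)
                       for x in range(0, w - 7, 8)])
    return blocks.reshape(len(blocks), 64).var(axis=0)

# Training phase: one signature per known class (hypothetical toy images).
rng = np.random.default_rng(0)
flat = rng.normal(0.5, 0.01, (32, 32))             # low-energy, landscape-like stand-in
busy = rng.integers(0, 2, (32, 32)).astype(float)  # high-energy, text-like stand-in
training = {"landscape": energy_signature(flat), "text": energy_signature(busy)}

# Classification phase: assign a new image to the nearest signature.
new_image = rng.integers(0, 2, (32, 32)).astype(float)
sig = energy_signature(new_image)
label = min(training, key=lambda k: np.linalg.norm(training[k] - sig))
print(label)
```

The busy stand-in spreads energy across all 64 dimensions, while the flat one concentrates it near the DC term, which is exactly the shape difference the histogram exploits.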

An intuition for deep learning

We have used the DCT. To generalize, we can build a pipeline of different transformations followed by quantizations. In the training phase, we determine how well the final histogram classifies the input and propagate the result back by adjusting the quantization tables, i.e., we fine-tune the weights making up the table elements. In essence, this is an intuition for what happens in deep learning or, more formally, in convolutional neural networks (CNNs).

In the case of image classification and object recognition, the CNN equivalent of a JPEG block is called the receptive field, and the generalization of the JPEG quantization table is called a CNN filter or kernel. The kernel elements are the weights, just as they are the coordinates in cosine space in the case of JPEG (this is where the analogy ends: while in JPEG the kernels are the basis elements of the cosine space, in CNN the kernels are the weights and the basis is unknown). The operation of applying each filter in parallel to each receptive field is a convolution, and the result of the convolution is called the feature map. While in JPEG the kernels are the discrete cosine basis elements, in CNN the filters can be arbitrary feature identifiers. While in JPEG the kernels are applied to each block, in CNN they "slide" by a number of pixels called the stride (in JPEG this would be the same as the block size; in CNN it is usually smaller than the receptive field).

The feature map is the input for a new filter set, and the process is iterated. However, to obtain faster convergence, after the convolution it is necessary to introduce a nonlinearity, in a so-called ReLU layer. Further, the feature map is down-sampled using what in CNN is called a pooling layer; this helps avoid overfitting the training set. At the end, we have the fully connected layer, which gives, for each class under consideration, the probability that the initial image belongs to that class:

CC Aphex34 (Wikimedia Commons)
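The convolution, ReLU, and pooling layers above can be sketched as follows; the Sobel-like 3×3 filter and the array sizes are illustrative assumptions:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image by `stride` pixels; no padding."""
    kh, kw = kernel.shape
    h = (image.shape[0] - kh) // stride + 1
    w = (image.shape[1] - kw) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)  # receptive field times filter
    return out

def relu(x):
    return np.maximum(x, 0.0)  # the nonlinearity

def max_pool(x, size=2):
    """Down-sample by keeping the maximum of each size x size tile."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.random.default_rng(1).normal(size=(8, 8))
kernel = np.array([[1., 0., -1.], [2., 0., -2.], [1., 0., -1.]])  # a vertical-edge filter
feature_map = max_pool(relu(conv2d(image, kernel, stride=1)))
print(feature_map.shape)  # (3, 3)
```

An 8×8 input convolved with a 3×3 filter at stride 1 gives a 6×6 feature map, which pooling halves to 3×3; in a real CNN, a whole bank of such filters runs in parallel.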

The filters are determined by training. We start with a relatively large set of images for which we have manually determined the probabilities of belonging to the various classes: this is the ground truth, and the process is called groundtruthing. The filters can be seeded with random patterns (in practice we use the filters from a similar classification problem: transfer learning or pre-training, see the figure below), and we apply the CNN algorithm to the ground-truth images. At the end, we look at the computed fully connected layer and compare it with the ground truth, obtaining the loss function. Now we propagate the errors back to adjust the weights in the filters that caused the largest losses. The sequence of forward pass, loss function, back-propagation, and weight update is called an epoch. The training phase is run over a large ground truth for many epochs.

Evolution of pre-trained conv-1 filters with time; after 5k, 15k, and 305k iterations, from [2]:

Evolution of pre-trained conv-1 filters with time; after 5k, 15k, and 305k iterations, from Agrawal et al.

CNN and all the following methods are only possible because of the recent progress in GPGPUs.

A disadvantage of machine learning is that the training phase takes a long time, and if we change the kind of input, we have to retrain the algorithm. For some applications, you can use your customers as free workers: for OCR, you can serve the training set as CAPTCHAs, which your customers will classify for free. For scientific and engineering applications, you typically do not have the required millions of free workers. This is the motivation for unsupervised machine learning.

Assimilation and accommodation

So far, we took a random walk from unstructured data to multimedia files, JPEG compression, a DCT-inspired classifier, and deep learning. We saw that the crux of supervised machine learning is the training.

There are two reasons for needing classifiers. We can design more precise algorithms if we can specialize them for a certain data class. For humans, the reason is that our immediate memory can hold only 7 ± 2 chunks of information [3]. This means that we aim to break information down into categories, each holding 7 ± 2 chunks. There is no way humans can interpret the graphical representation of graphs with billions of nodes.

As Immanuel Kant already noted, categories are not natural or genetic entities; they are purely the product of acquired knowledge. One of the functions of the school system is to create a common cultural background, so that people learn to categorize according to similar rules and understand each other's classifications. For example, in biology class we learn to organize botany according to the 1735 Systema Naturæ compiled by Carl Linnæus.

As we know from Jean Piaget's epistemological studies with children, there is assimilation when a child responds to a new event in a way that is consistent with an existing classification schema. There is accommodation when a child either modifies an existing schema or forms an entirely new schema to deal with a new object or event. Piaget conceived intellectual development as an upward expanding spiral in which children must constantly reconstruct the ideas formed at earlier levels with new, higher order concepts acquired at the next level.

The data scientist's social role is to further expand this spiral. Concomitantly, data scientists have to be well aware of the cultural dependencies of acquired knowledge.

For data, this means that we want to cluster it (recoding by categorization). Further, we want to connect the clusters in a graph so we can understand its structure (finding patterns). At first, clustering looks easy: we take the training set and do a Delaunay triangulation, the dual graph of the Voronoi diagram. After building the graph with the training set, for a new data point, we just look in which triangle it falls and know its category. Color scientists are familiar with Delaunay triangulations because they are used for device modeling by table lookup. Engineers use them to build meshes for finite element methods.
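This table-lookup classification can be sketched with SciPy's Delaunay triangulation; the five training points, their labels, and the majority vote among the enclosing triangle's vertices are illustrative assumptions:

```python
import numpy as np
from scipy.spatial import Delaunay

# Hypothetical training set: 2-D points, one class label per point.
points = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [0.5, 0.5]])
labels = np.array([0, 0, 1, 1, 0])

tri = Delaunay(points)  # the dual graph of the Voronoi diagram

def classify(p):
    """Find the triangle containing p and vote among its vertex labels."""
    s = tri.find_simplex(p)
    if s < 0:
        return None  # outside the triangulation
    vertex_labels = labels[tri.simplices[s]]
    return np.bincount(vertex_labels).argmax()

print(classify(np.array([0.2, 0.1])))
```

After the triangulation is built from the training set, classifying a new point is a point-location query rather than a search over all training points.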

The problem is that the data is statistical. There is no clear-cut triangulation, and points from one category can lie in a nearby category with a certain probability. Roughly, we build clusters by taking neighborhoods around the points and then intersecting them to form the clusters. The crux is knowing what radius to pick for the neighborhoods, because the outcome will be very different.

Algebraic topology analytics

This is where the relatively new field of algebraic topology analytics comes into play [4]. Topology has been looking at point clouds for only about 15 years. Topology, an idea of the Swiss mathematician Leonhard Euler, studies the properties of shape independently of coordinate systems, depending only on a metric. Topological properties are deformation invariant (a donut is topologically equivalent to a mug). Finally, topology constructs compressed representations of shape.

The interesting elements of shape in point clouds are the k-th Betti numbers βk, the numbers of k-dimensional "holes" in a simplicial complex. For example, informally, β0 is the number of connected components, β1 the number of roundish holes, and β2 the number of cavities.

Algebraic topology analytics relieves the data scientist from having to guess the correct radius of the point neighborhoods by considering all radii and retaining only those that change the topology. If you want to visualize this idea, you can think of a dendrogram. You start with all the points and represent them as leaves; as the radii increase, you walk up the hierarchy in the dendrogram.
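The 0-dimensional case (β0, the number of connected components, i.e., clusters) can be sketched with a union-find over the neighborhood graph at increasing radii; the two-cluster point set is an illustrative assumption:

```python
import numpy as np

def betti0(points, radius):
    """Number of connected components (beta_0) of the graph that joins
    two points whenever their neighborhoods of the given radius overlap."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(points[i] - points[j]) < 2 * radius:
                parent[find(i)] = find(j)  # merge the two components
    return len({find(i) for i in range(n)})

# Two hypothetical clusters; beta_0 drops as the radius grows.
pts = np.array([[0., 0.], [0.1, 0.], [0., 0.1], [5., 5.], [5.1, 5.]])
for r in (0.01, 0.2, 10.0):
    print(r, betti0(pts, r))
```

The radii at which β0 drops (here from 5 to 2 to 1) are exactly the topology-changing radii that the framework retains; sweeping them traces the dendrogram from leaves to root.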

This solves the issue of having to guess a good radius to form the clusters, but you still have the crux of having to find the most suitable distance metric for your data set. This framework is not a dumb black-box: you still need the skills and experience of a data scientist.

The dendrogram is not sufficiently powerful to describe the shape of point clouds. The better tool is the set of k-dimensional persistence barcodes, which show the Betti numbers as a function of the neighborhood radius used to build the simplicial complexes. Here is an example from page 347 of Carlsson's article [4]:

(a) Zero-dimensional, (b) one-dimensional, and (c) two-dimensional persistence barcodes


With large data sets, when we have a graph, we do not necessarily have something we can look at, because there is too much information. Often we have small patterns, or motifs, and we want to study how the organization of a higher-order graph is captured by a motif [5]. This is also a clustering framework.

For example, we can look at the Stanford web graph at some time in 2002 when there were 281,903 nodes (pages) and 2,312,497 edges (links).

Clusters in the Stanford web graph

We want to find the core group of nodes with many incoming links, and the periphery groups whose nodes are tied together and also link up to the core.

A motif that works well for social-network data is that of three interlinked nodes. Here are the motifs with three nodes and three edges:

Motifs for social networks

In motif M7 we marked the top node in red to match the figure of the Stanford web.

Conceptually, given a higher order graph and a motif Mi, the framework searches for a cluster of nodes S with two goals:

  1. the nodes in S should participate in many instances of Mi
  2. the set S should avoid cutting instances of Mi, which occurs when only a subset of the nodes from a motif are in the set S

The mathematical basis for this framework is the motif adjacency matrix and the motif Laplacian. With these tools, a conductance metric from spectral graph theory can be defined and minimized to find S. The paper [5] in the references below contains several worked-through examples for those who want to understand the framework.
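For the simplest case, an undirected triangle motif, the motif adjacency matrix can be computed with a single matrix product (the directed three-node motifs M1–M7 of [5] need case-by-case masks; the toy graph here is an illustrative assumption):

```python
import numpy as np

# Adjacency matrix of a small undirected graph: a triangle (0, 1, 2)
# plus a pendant edge (2, 3) that participates in no triangle.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

# Motif adjacency matrix for the undirected triangle motif:
# W[i, j] = number of triangles the edge (i, j) participates in.
# (A @ A)[i, j] counts the common neighbors of i and j; multiplying
# elementwise by A keeps only pairs that are themselves connected.
W = (A @ A) * A
print(W)
```

Spectral clustering on W instead of A then favors cuts that do not sever triangles, which is goal 2 above.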


  1. R. Sears, C. van Ingen, and J. Gray. To BLOB or not to BLOB: Large object storage in a database or a filesystem? Technical Report MSR-TR-2006-45, Microsoft Research, One Microsoft Way, Redmond, WA 98052, April 2006.
  2. P. Agrawal, R. Girshick, and J. Malik. Analyzing the performance of multilayer neural networks for object recognition. Pages 329–344, Springer International Publishing, Cham, September 2014.
  3. G. A. Miller. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63(2):81–97, 1956.
  4. G. Carlsson. Topological pattern recognition for point cloud data. Acta Numerica, 23:289–368, May 2014.
  5. A. R. Benson, D. F. Gleich, and J. Leskovec. Higher-order organization of complex networks. Science, 353(6295):163–166, 2016.

Wednesday, September 28, 2016

Extreme color naming experiment finds locus for luma–chroma transformation

From time to time, a new physiological experimentation technique or a significant new instrument is developed, leading to breakthrough discoveries. In the case of color vision, this usually entails a doctoral student and their significant other (and maybe some additional dedicated colleagues, or the professor) undergoing the cruel ordeal of having their pupils dilated (mydriasis) and their ciliary muscles paralyzed to avoid accommodation (cycloplegia), then getting strapped to a headrest while biting a dental impression mount, to make observations in repeated, interminable sessions for months on end, all in the name of science.

In their recent paper The elementary representation of spatial and color vision in the human retina, Ramkumar Sabesan et al. report on a seminal study to locate where in the human visual system (HVS) the luma-chroma encoding occurs in the parvocellular pathway (midget ganglion cells).

This study represents the first time cone photoreceptors of known spectral type have been individually targeted and activated with light in the living human retina.

I cannot believe it has been thirty years since I drew this diagram:

cognitive model

Although the above diagram looks like a model for the HVS, it was more a plan for my implementation of a color workbench. To keep the head cool and prevent it from overheating, our brain evolved to minimize energy usage. This is accomplished by having pipelines in which, at each stage, the information gets recoded to make it more complex but more compact. This comes at the cost of speed: in a photon catch, the shift in electron density takes less than a femtosecond and the entire photocycle lasts a picosecond, while at the end of the pipeline, adaptation can take seconds and color naming minutes.

An important feature was the bidirectional arrows: we have a feedback loop, with control moving down and information moving up. Because of the sequence of recodings and the feedback, the receptors in the retina are not like pixels in a CCD sensor:

  • Receptive field: area of visual field that activates a retinal ganglion (H.K. Hartline, 1938)
  • Center-surround fields allow for adaptive coding (transmit contrast instead of absolute values)
  • Horizontal cells are presumed to inhibit either their bipolar cell or the receptors: opponent response in red–green and yellow–blue potentials (G. Svaetichin, 1956)
  • Retinal ganglion can be tonic or phasic: pathway may also be organized by information density or bandwidth

The last item comes from a table of the parvocellular and magnocellular pathways Lucia Rositani-Ronchi compiled for me at the 1993 AIC meeting in Budapest:

  • Originating retinal ganglion cells. Parvocellular: midget cells. Magnocellular: parasol cells.
  • Temporal resolution. Parvocellular: slow (sustained responses, low conduction velocity). Magnocellular: fast (mostly transient responses, some sustained, high conduction velocity).
  • Modulation dominance. Parvocellular: adaptation occurs at high frequencies. Magnocellular: adaptation occurs at all frequencies.
  • Cone input. Parvocellular: receives mostly opponent-type input from cones sensitive to short and long wavelengths. Magnocellular: receives mostly combined (broadband) input from M and L cones, both from the center and from the surround of receptive fields.
  • Contrast sensitivity. Parvocellular: low (threshold > 10%). Magnocellular: high (threshold < 2%).
  • LGN cell saturation. Parvocellular: linear up to about 64% contrast. Magnocellular: saturates at 10% contrast.
  • Spatial resolution. Parvocellular: high (small cells). Magnocellular: low (large cells).
  • Spatio-temporal resolution. Parvocellular: when fixation is strictly foveal, extraction of high spatial frequency information (test gratings), reflecting small color receptive fields. Magnocellular: responds to flicker.
  • Integration time. Parvocellular: long. Magnocellular: short.
  • Relation to channels. Parvocellular: could be a site both for a lightness channel and for opponent-color channels; the role depends on the spatio-temporal content of the target used in the experiment. Magnocellular: might be a site for achromatic channels, because the spectral sensitivity is similar to Vλ, it is more sensitive to flicker, and it has only a weak opponent-color component.
  • Possible main role in the visual system. Parvocellular: sustains the perception of color, texture, shape, and fine stereopsis. Magnocellular: sustains the detection of movement, depth, and flicker, and the reading of text.

We have four retinal pigments (erythrolabe, chlorolabe, cyanolabe, rhodopsin), each attached by a lysine to a protein backbone. These four pigments are sensitized to photons at four energy levels (wavelengths): L, M, S, and rods. The energy levels are not single numbers but distributions, namely the probabilities of a photon catch with that chromophore.

A 3-dimensional L, M, S signal at full resolution is not efficient, because only the achromatic information needs high spatial resolution, while the chromatic information can be at a lower resolution. This is reflected in the modulation transfer functions of the HVS and is exploited, for example, in image encoding, where we transform an RGB signal into a color-opponent signal and then down-sample the chroma images:

CIELAB separations
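This luma-chroma encoding with chroma down-sampling can be sketched as follows, using the BT.601 YCbCr opponent transform as a stand-in for the CIELAB separations shown above:

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """BT.601 full-range luma-chroma transform of an RGB image in [0, 1]."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luma: full spatial resolution
    cb = 0.5 * (b - y) / (1.0 - 0.114)      # blue-yellow opponent channel
    cr = 0.5 * (r - y) / (1.0 - 0.299)      # red-green opponent channel
    return y, cb, cr

def subsample(c):
    """4:2:0 chroma subsampling: average each 2x2 block."""
    return 0.25 * (c[0::2, 0::2] + c[1::2, 0::2] + c[0::2, 1::2] + c[1::2, 1::2])

rgb = np.random.default_rng(2).random((8, 8, 3))
y, cb, cr = rgb_to_ycbcr(rgb)
print(y.shape, subsample(cb).shape)  # luma at full resolution, chroma at half
```

Storing the two chroma planes at half resolution in each dimension cuts the data to half the original size with little perceptual loss, which is the efficiency argument made above.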

In 1993, it was not known where this transformation occurs in the HVS. In fact, there is quite a bit of processing in the retina, and many details are still unknown.


Using an adaptive optics scanning laser ophthalmoscope (AO-SLO), the authors studied 174 L-cone, 99 M-cone, and 12 S-cone samples by stimulating them individually with a 543 nm, 500 ms pulse and asking two subjects to report the perceived color name. The names were restricted to red, green, blue, yellow, white, and not seen.

The subjects reported achromatic sensations 61.8% of the time. When red was reported (22.5% of seen trials), it was more likely to be driven by L- than M-cones, whereas green (15.7%) was more likely to come from the excitation of M-cones. Thus, L-cones tended to signal both white and red, whereas M-cones tended to signal both white and green. The observation that these color percepts roughly align with the predictions of large-field cone-isolating stimuli suggests that the same opponent neuronal circuits may be implicated in both paradigms. This finding also supports the idea that the visual system can learn the spectral identity of individual cones.

The apparent segregation of color categories into distinct populations of cells is suggestive of a parallel representation of color and achromatic sensations. Moreover, these results imply that, for a large number of cones, their individual activation is not sufficient to produce a color. (Remember that in this experiment single cones are excited; in free vision, most cones are activated and the eye saccades, presenting a point in the visual field to several cones.)

The authors found that the cones most likely to generate strong spectral opponency in a parvocellular neuron, that is, those surrounded by cones of opposing type, were not more likely to generate red or green percepts. Rather, all these examples, when stimulated in isolation, drove achromatic percepts on a majority of the trials.

There is little doubt that the long-duration supra-threshold stimulation of individual cones here influences the firing of a number of different ganglion cell types. In particular, a multi-electrode study demonstrated that the activation of a single cone simultaneously evoked responses in both midget (parvocellular) and parasol (magnocellular) ganglion cells. The results may be particularly informative in differentiating proposals about the role of parvocellular neurons in achromatic spatial and color vision.

The study confirms the old result that the red-green system samples the visual world at a lower resolution than the achromatic system. The new results from the studies reported in the present paper are consistent with the idea that the HVS represents these two pieces of information with separate pathways that emerge as early as the photoreceptor synapse: one chiefly concerned with high-resolution achromatic vision and a second, lower-resolution color system.

The luma-chroma transformation with chroma subsampling is very important in image processing. In your opinion, does this new result allow the design of better imaging pipelines? Does this allow us to design better retinex algorithms? Join the conversation in the Trellis group.

Citation: R. Sabesan, B. P. Schmidt, W. S. Tuten, A. Roorda, The elementary representation of spatial and color vision in the human retina. Sci. Adv. 2, e1600797 (2016).