Wednesday, October 28, 2015

Data Science Camp

Last Saturday (October 24, 2015) was SF Bay ACM's annual Data Science Camp Silicon Valley. The venue was the Town Hall in PayPal's intergalactic headquarters at 2161 North 1st Street in San Jose, nestled between eBay and Apple's future Car Division, just one block from Comet, Foveon, Lumileds, Peaxy, and Toyota's research lab.

From the voting on the sessions, it appears that the event attracted data scientists from all over the US, with a large number of participants being people who had taken the Coursera classes on big data, machine learning, and data science and were now wondering how to progress from Coursera to a job (82 votes). As Sara Kalisin from Intel noted, when companies try out analytics, they often do not really know what to do with the result and end up not staffing the project because the benefit is less than the employee's salary. In addition to the session on "Coursera to job," Sara also led a session titled "How to showcase data science impact" (15 votes).

At the beginning of the day, Joseph Rickert and Robert Horton of Microsoft gave the tutorial "Introduction to R for Machine Learning." R, whose roots go back to the S language developed at Bell Labs in 1976, has become the most widely used data analysis software. It is undergoing astonishing growth and today has about 36,820 functions. The CRAN repository offers a solution for almost any data analysis problem.
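
As a rough illustration (not from the tutorial), here is one crude way to obtain a comparable function count on your own machine, by tallying the objects exported by every installed package; the exact number depends on which packages are installed:

    pkgs <- rownames(installed.packages())            # packages on this machine
    n_fun <- sum(vapply(pkgs, function(p)
      tryCatch(length(getNamespaceExports(p)),        # exported objects per package
               error = function(e) 0L),
      integer(1)))
    n_fun                                             # rough total across packages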

Originally, R was a statistical package with statistical tools acting on observations stored in arrays. Separate packages, such as Weka, were used to mine large data sets stored in files, applying machine learning to classify patterns and make predictions. Today, however, R offers all the machine learning functionality on top of the original statistical tools. This has become possible because a serious workstation now has at least 64 GB of RAM, which allows big data to be stored in arrays.
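
A minimal sketch of this duality, using only base R and the built-in iris data (an illustration, not material from the tutorial): a classical statistical model and a machine-learning style clustering, side by side, on data held entirely in memory.

    # Classical statistics: linear regression on observations held in memory.
    fit <- lm(Petal.Length ~ Sepal.Length, data = iris)
    summary(fit)$coefficients

    # Machine learning on the same in-memory data: unsupervised clustering.
    set.seed(1)
    cl <- kmeans(iris[, 1:4], centers = 3)
    table(cl$cluster, iris$Species)   # how well the clusters match the species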

When the data is too large to fit in memory, it can be partitioned into blocks that are processed sequentially or in parallel. However, this capability is not available in the free version of R; it requires the expensive commercial enterprise version. Robert Horton announced that SQL Server 2016 will support server-side execution of R functions, which means that the data will no longer have to be moved across the network for analysis.
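
The commercial external-memory algorithms are not shown here, but the underlying block-processing idea can be sketched in plain base R. The file name big.csv and the numeric column x below are hypothetical:

    # Stream a large CSV in blocks and keep a running sum, so the full data
    # set never has to fit in memory at once.
    con <- file("big.csv", open = "r")
    cols <- strsplit(readLines(con, n = 1), ",")[[1]]   # header line
    total <- 0; n <- 0
    repeat {
      chunk <- tryCatch(
        read.csv(con, header = FALSE, col.names = cols, nrows = 10000),
        error = function(e) NULL)                       # NULL once input is exhausted
      if (is.null(chunk) || nrow(chunk) == 0) break
      total <- total + sum(chunk$x)                     # per-block work
      n <- n + nrow(chunk)
    }
    close(con)
    total / n                                           # mean of x over all blocks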

After the sponsored lunch, John Park from HP led a double session titled "Malware Classification + ML + Crowd Sourcing" (46+44 votes). The amount of malware injected every day is mind-boggling. He applies the Nilsimsa hash, a locality-sensitive hash, to the binary files and uses natural language processing and classifiers trained through crowd-sourcing to find the malware.
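
The Nilsimsa digest itself is not reproduced here; the sketch below (an illustration, not John Park's code) shows only the comparison step, assuming 64-character hex digests produced by some existing Nilsimsa implementation. Because the hash is locality-sensitive, similar binaries yield digests that agree in most bit positions.

    # Conventional Nilsimsa score: number of matching bits minus 128,
    # so 128 means identical digests and -128 means completely different.
    hex_to_bits <- function(h) {
      bytes <- strtoi(substring(h, seq(1, nchar(h), 2), seq(2, nchar(h), 2)), 16L)
      as.vector(sapply(bytes, function(b) as.integer(intToBits(b))[1:8]))
    }
    nilsimsa_score <- function(d1, d2) {
      sum(hex_to_bits(d1) == hex_to_bits(d2)) - 128
    }
    # Scores close to 128 flag near-duplicate binaries, which can then be
    # grouped and routed to the crowd-sourced classifiers.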

Another very popular session, titled "I have data. Now What? IOT wearable space in life sciences. How to analyze the data for individual users. How do we build that" (117 votes), was led by Lee Courtney from Qurasense. Representatives from large international financial institutions and the health care industry participated in this interactive session. The only company permanently storing all of its data and continuously mining it was a manufacturer of set-top boxes for the cable industry.

For everybody else, storing the data is simply too dangerous because of the flood of malware and because Hadoop has no security. The consequence is a workflow of ingesting the data, mining it, and then deleting it. Because HDFS data ingest is very slow and each file must be stored in three copies, as little data as possible is preserved. At the end of the session, Lee Courtney summarized the top three unsolved problems for big data as

  1. no security
  2. no security
  3. poor ingest performance

As a note, there are file systems with excellent security that also support HDFS. They have a POSIX interface, so it is not necessary to move the data at all.

Moving the data is a big problem that will not go away, as illustrated by this figure by DataCore. Until about eight years ago, processors kept getting faster. However, when the wall of physical transistor shrinking was hit, clock rates actually became slower to control heat generation and to allow for more cores on each die. With the increasing number of cores (a typical workstation processor, for example, has eight cores with two hyperthreads each, for a total of 16 hardware threads), less IO bandwidth is available to each core: the number of pins on the chip is still the same, so there is a growing IO gap.


The IO gap is rapidly increasing
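
A back-of-the-envelope illustration with made-up but plausible numbers: if the aggregate I/O bandwidth of a socket stays roughly fixed while the core count quadruples, the share available to each core shrinks accordingly.

    B_{\text{per core}} = \frac{B_{\text{socket}}}{N_{\text{cores}}},
    \qquad
    \frac{50\ \text{GB/s}}{4\ \text{cores}} = 12.5\ \text{GB/s per core}
    \;\longrightarrow\;
    \frac{50\ \text{GB/s}}{16\ \text{cores}} \approx 3\ \text{GB/s per core}.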

Some people believe that cloud computing and virtual machines solve the problem. However, for storage this is an illusion. Indeed, according to the Hennessy/Patterson rules of thumb, for general computing the utilization rate is about 0.20–0.25 [HOHY14]. For storage servers, the utilization rate is 0.6–0.8 [AFG+10]; statistical multiplexing is therefore much less useful, because once the OS and hypervisor overheads are added, the processor is already maxed out. The IO gap comes on top of this!
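
A rough worked example of why the higher utilization hurts consolidation (illustrative numbers only): with average utilization u, roughly 1/u workloads can share a core before it saturates, and the OS and hypervisor overheads eat into the little headroom that remains in the storage case.

    \frac{1}{u_{\text{general}}} \approx \frac{1}{0.22} \approx 4.5 \ \text{workloads per core},
    \qquad
    \frac{1}{u_{\text{storage}}} \approx \frac{1}{0.7} \approx 1.4 \ \text{workloads per core}.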

[AFG+10] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, et al. A view of cloud computing. Communications of the ACM, 53(4):50–58, 2010.

[HOHY14] Md. Iqbal Hossain (Older) and Md. Iqbal Hossain (Younger). Dynamic scaling of a web-based application in a cloud architecture. Master’s thesis, School of Information and Communication Technology, KTH Royal Institute of Technology, Stockholm, 2014.

Thursday, October 22, 2015

Photonics for data access and delivery

When we use the cloud for heavy-duty computations, we quickly find out that although storage is inexpensive, we never know how fast we can access our data. If we want guaranteed performance in terms of IOPS, the price quickly goes up. This has to do with the distance the data has to travel. In a warehouse-scale datacenter, we have two cabling lengths: up to a couple of meters for vertical networking within a rack and up to a couple of hundred meters for horizontal networking between racks. There are some game-changing technology developments on the horizon regarding the horizontal networking.

For the vertical cabling, copper can be used, but for the horizontal cabling fiber optics has to be used due to dispersion, as shown in the figure below.

dispersion in cabling

Since the processor is electronic, an expensive transducer is needed at each end of the optical fiber cable. Chip companies have invested substantial resources in trying to grow lasers on CMOS chips to feed an optical cable, but the manufacturing process was very messy and never worked well.

The breakthrough came a dozen years ago with the development of nanotechnology to fabricate devices based on the optical bandgap. This made it possible to take a PowerPC processor, planarize it, and then use nanotechnology to grow an optical resonance ring on top of it. The laser source can now be external to the chip; its light circulates in the ring while being modulated by the PowerPC. This was a seminal breakthrough.

At the end of July, Jürg Leuthold, professor of photonics and communications at ETH Zurich, and his colleagues achieved another breakthrough. Hitherto, miniaturization of the modulator had been limited by the wavelength of the laser. In order to beat that limit and make the device even smaller, the light is first turned into so-called surface plasmon polaritons (SPP). Plasmon polaritons are a combination of electromagnetic fields and electrons that propagate along the surface of a metal strip. At the end of the strip, they are converted back into light. The advantage of this detour is that plasmon polaritons can be confined in a much smaller space than the light they originated from. The signal is created by modulating the plasmon polaritons in an interferometer.

By applying a voltage, the refractive index and hence the velocity of the plasmons in one arm of the interferometer can be varied, which in turn changes their amplitude of oscillation at the exit. After that, the plasmons are re-converted into light, which is fed into a fiber optic cable for further transmission.
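
The textbook Mach-Zehnder relation (a generic sketch, not taken from the paper) captures the chain of effects: the voltage changes the refractive index in one arm, the index change shifts the relative phase, and the phase shift sets the output amplitude.

    I_{\text{out}} = I_{\text{in}} \cos^{2}\!\left(\frac{\Delta\varphi}{2}\right),
    \qquad
    \Delta\varphi = \frac{2\pi}{\lambda}\,\Delta n(V)\,L,

where Δn(V) is the voltage-induced index change in the active arm and L is the length over which it acts.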

This is the cheapest modulator ever built. It is very simple, consisting of a gold layer on glass, only 150 nm thick, and an organic material whose refractive index changes when an electric voltage is applied and that thus modulates the plasmons inside the interferometer. Such a modulator is much smaller than conventional devices, and it consumes very little energy: only a few thousandths of a watt at a data transmission rate of 70 gigabits per second. This corresponds to merely a hundredth of the consumption of commercial models.
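
As a back-of-the-envelope check using the figures quoted above (taking, say, 3.5 mW as "a few thousandths of a watt"; the exact value is in the paper), the energy per transmitted bit is

    E_{\text{bit}} = \frac{P}{R}
    \approx \frac{3.5\times 10^{-3}\ \text{W}}{70\times 10^{9}\ \text{bit/s}}
    = 5\times 10^{-14}\ \text{J/bit}
    = 50\ \text{fJ per bit}.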

Source: Nature Photonics 9, 525–528 (2015), doi:10.1038/nphoton.2015.127.

New technologies take a few years to evolve from universities to industrial research labs and then to industry. One of the accelerators of this process is CERN, in particular the Large Hadron Collider (LHC) project. The detectors produce about 25 PB of data each year, which travel through more than 100,000 optical links to obtain the required bandwidth. The distance from the detectors to the control rooms is about 150 m, and the connectors have to be able to withstand the 3.8 tesla magnetic field in the detector and enormous levels of radiation.

For this application, scientists at CERN have developed so-called "versatile link" optical components with minimal energy consumption. By 2018, Marcel Zeiler and his colleagues will have new modulators, also based on interferometry of course, that can withstand the harsh environment in the LHC.

Source: SPIE Professional October 2015.

Although in a datacenter the radiation is negligible and the magnetic fields are nowhere near teslas, experience shows that CERN technology transitions to industry very fast, so we should not be surprised to see a new generation of versatile optical links in a year or two at most. The capability of economically moving data for hundreds of meters over 100 Gigabit Ethernet (100GbE) links renders old architectures like Hadoop moot, because there is no longer a reason to move the data into HDFS for MapReduce.