Friday, July 17, 2015

What can be addressed vs what can be seen

Remember the early days of color in consumer PCs? They started out with 256 colors or the slightly less "web safe" colors and progressed to "thousands of colors" (or "high color"). Just a short time later, 24-bit display controllers became sufficiently cheap to make full (or "true") color viable for consumer PCs. The marketing slogan at the time was pushing the display capability as "over 16 million colors."

We color scientists cringed at this claim and tried to explain to the marketeers that the human visual system can only discriminate 4 to 7 million colors, consequently the marketing collaterals should use the adjective "addressable": the product can address 16,777,216 colors.

A similar issue is surfacing in the storage technology sector. Consumers are told they can store terabytes (TB) of images or video (for free or almost) and businesses are offered petabyte (PB) size storage, while storage vendors claim their systems can scale to exabyte (EB) scale. In reality this is not what can be used but what can be addressed.

For example, my home NAS is a RAID of two 4 TB drives. The parts cost about $500; could I have saved this money and assembly time by using a free cloud service? The answer is a simple no, because on my VDSL connection it would take me 2 years to upload 4 TB of data.

In business applications the situation is similar. When you have PBs of data, you can no longer do backups because they take too long. You have to replicate the data and even with the latest technologies copying a TB takes 30 minutes, which is a long time if you have EB scale data. Even reformatting a drive with too many IO failures takes too long in a PB scale datacenter and self-encrypting drives are the only viable solution.

How can we find a measure for the practically usable storage size as opposed to the addressable storage? The previous paragraph suggests that a possible metric is the time required to maintain the high availability (HA) of your data. You may have enough bandwidth to ingest a large volume of data, but hopefully you are also reading and processing it. To this you have to add the bandwidth the storage system needs to maintain at least three replicates of the data.

This sounds easy, but it is not realistic. In modern storage the key measure is the scalability. A recent CACM paper is an excellent introduction on how to study your system using Gunther's Universal Scalability Law (USL). This paper analyzes the speedup of TeraSort on AWS EC2. The figure shows the modeled speedup with parameters σ = −0.0288, κ = 0.000447 for c1.xlarge instances optimized for compute with 8 virtual Xeon cores, 7 GiB memory, 4 × 420 GB instance storage and high network performance.

When considering the practical scalability of systems, we are interested only in the linear portion of the curve. Gunther's USL teaches us that for the out-of-the-box version of the Hadoop TeraSort MapReduce workload you can only scale linearly to 48 nodes, then you are out of steam (in their experiments the authors limited the TeraSort dataset to 100 GB).

TeraSort scalability on EC2

Note that I explicitly wrote "Hadoop TeraSort MapReduce workload." This is because there is no single magic number. When you plan the configuration of your system, you have to carefully determine your workloads and measure their performance to estimate the parameters for the USL model.

The volume of your data is probably given by your application. The USL model allows you to optimize the system configuration so it has the required elasticity. The CACM paper shows how you optimize your application. By optimizing parameters like timeouts and cache sizes the authors were able to increase the speedup by 30% and extend the linear portion from 48 nodes to 95 nodes. This is a substantial improvement.

It is a good exercise to put the USL model in a graphing application (e.g., the free Grapher in the Utilities folder of MacOS) and animating the parameters.

The linear parameter σ (i.e., the coefficient of the linear term) is a correlate of contention. This is the parameter optimized in scale-up systems or vertical systems. An example of a "knob" you can turn in the system to control contention is the bandwidth. This can be the LAN speed, the number of local drives, etc. What you learn by animating the linear parameter is that by optimizing a scale-up system you can increase the speed-up. However, the linear portion does not change: the range of nodes in which the system scales is constant.

Recently the cost of enterprise class 2 TB SSDs has dropped below $1000, which due to the greater longevity brings the total cost of ownership down to that of rotating disks. Consequently many have tried to replace their HDDs with SSDs, and the Internet is full of academic papers, newsletters and blog posts from people lamenting that their storage systems have not become faster. First, they were just changing one component of contention, so scalability is not improved. Second, in a system you cannot just change out one component and hope for better performance, especially if it was not a bottleneck: you have to optimize the whole system using a USL model.

The quadratic parameter κ is a correlate of coherency. This accounts for the cost of message passing, for example for cache consistency and locking. This is the parameter optimized in scale-out systems or horizontal systems. What you learn by animating the quadratic parameter is that by optimizing a scale-out system you can increase the speed-up and also the linear portion increases: the range of nodes in which the system scales is variable.

Optimizing intra-node communication is a scheduling problem and is difficult. However, the payoffs can be huge. For example, there have been reports in the literature that when instead of managing the TRIM operation at the SSD drive level it is managed at the system level, very substantial speed-ups and scalability extensions are achieved.

Software defined storage (SDS) systems are typically scale-out systems. Looking at the recent business results of storage companies, it is clear that scale-up system sales are declining at the expense of scale-out system sales. Large vendors are investing large sums in the development of SDS technologies.

Using Gunther's USL model should not only encompass the storage system. The big payoffs come when the algorithms are part of the model. For example, Hadoop is from a time of slow networks and today Spark is the better framework for distributed computing.

If you are into big data, now is the time to become intimately familiar with Gunther's Universal Scalability Law.