Thursday, October 3, 2013

disk service time

In the last post we saw how hard disk drives (HDD) are color coded. I hinted at how to choose the color of an HDD, suggesting that for the main disk a solid state drive (SSD) is actually a better choice, but I left things fuzzy. The reason is that there is no single metric; you have to determine what your work day looks like. Fortunately, there is one thing that is no longer an issue: capacity.

When microprocessors hit the performance wall, the solution was to introduce multiple cores. This means that you must parallelize your color analysis algorithms so the threads can run concurrently in different hyper-threads. The problem becomes how you write out the image with the enhanced colors.

One solution could be to have a separate thread that collects the results of each thread and writes the overall result file. However, this thread would become a bottleneck, so it is not a good approach. A different solution could be to have each thread writing a file, then combine the results in a post-processing step. This is also not a good approach, because the merge step would be performed in a single thread.

Fortunately, high performance computing has always been parallel, so there are good solutions at hand. The most popular one is the Message Passing Interface (MPI), for which there are several open source implementations. The parallel I/O feature is sometimes called MPI-IO, and it can challenge a disk system. Consequently, in color image processing we now have I/O problems at a level previously known only for web services and database transaction systems.

With this, computing the number of required terabytes is the easy part. The hard metric to determine is the maximum number of I/Os serviced per second (IOPS), or its inverse, the disk service time. Along with the disk utilization rate (U), the disk service time determines the I/O response time for an application. The disk service time is the sum of the seek time T, the rotational latency L, and the internal transfer time X: TS = T + L + X.

The EMC book "Information Storage and Management" has an example on page 43. Assume a disk with a seek time of 5 ms for random I/O, a rotational speed of 15,000 rpm, and a 40 MB/s internal transfer rate with a 32 KB block size. At 15,000 rpm the disk makes 250 revolutions per second, so the average rotational latency is half a revolution, or 0.5/250 s = 2 ms, and transferring one block takes 32 KB / 40 MB/s = 0.8 ms. Then TS = 5 + 2 + 0.8 = 7.8 ms, yielding a maximum number of I/Os serviced per second of 1 / TS = 1 / (7.8 ⋅ 10⁻³) = 128 IOPS.

The application response time R increases with the disk utilization U. In the above example, at 96% utilization we obtain a response time of R = TS / (1 - U) = 7.8 / (1 - 0.96) = 195 ms. The response time can be reduced by keeping the disk utilization rate below 70%, but then the number of IOPS each disk delivers is also reduced.
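
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function and parameter names are my own and do not come from the EMC book. It assumes that the average rotational latency is half a revolution and that 1 MB = 1000 KB, which reproduces the 0.8 ms transfer time used in the example.

```python
def service_time_ms(seek_ms, rpm, transfer_mb_per_s, block_kb):
    """Disk service time TS = T + L + X, in milliseconds."""
    revolutions_per_s = rpm / 60.0
    # Average rotational latency L: half a revolution, converted to ms.
    latency_ms = 0.5 / revolutions_per_s * 1000.0
    # Internal transfer time X for one block (assumes 1 MB = 1000 KB).
    transfer_ms = block_kb / (transfer_mb_per_s * 1000.0) * 1000.0
    return seek_ms + latency_ms + transfer_ms


def max_iops(ts_ms):
    """Maximum I/Os serviced per second: the inverse of the service time."""
    return 1000.0 / ts_ms


def response_time_ms(ts_ms, utilization):
    """Response time R = TS / (1 - U), valid for 0 <= U < 1."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return ts_ms / (1.0 - utilization)


# Values from the example: 5 ms seek, 15,000 rpm, 40 MB/s, 32 KB blocks.
ts = service_time_ms(seek_ms=5.0, rpm=15000, transfer_mb_per_s=40.0, block_kb=32.0)
print(f"TS = {ts:.1f} ms")                                  # TS = 7.8 ms
print(f"IOPS = {max_iops(ts):.0f}")                         # IOPS = 128
print(f"R at 96% = {response_time_ms(ts, 0.96):.0f} ms")    # R at 96% = 195 ms
print(f"R at 70% = {response_time_ms(ts, 0.70):.0f} ms")    # R at 70% = 26 ms
```

With the same disk, dropping the utilization from 96% to 70% brings the response time from 195 ms down to about 26 ms, which is why the 70% figure matters for response-time sensitive applications.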

When DC is the number of disks required to meet capacity and DI is the number of disks required to meet the application IOPS requirement, the number of disks required is DR = max (DC, DI). Continuing the example, if an application requires 1.46 TB and generates a peak workload of 9,000 IOPS, then using a 146 GB, 15,000 rpm drive capable of 180 IOPS gives DC = 1.46 TB / 146 GB = 10 disks. For the IOPS requirement, DI = 9000/180 = 50 disks at full utilization. However, if the application is response-time sensitive, 70% disk utilization is more realistic, resulting in 126 IOPS per disk, or DI = 9000/126 = 72 disks after rounding up, and DR = max (DC, DI) = 72 disks.
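
The sizing rule DR = max (DC, DI) is easy to turn into a small helper. Again, this is a hedged sketch with names of my own choosing; it assumes decimal units (1.46 TB = 1460 GB) and rounds each disk count up to a whole disk.

```python
import math


def disks_required(capacity_gb, peak_iops, disk_gb, disk_iops, utilization=1.0):
    """Return (DC, DI, DR) for a capacity and a peak IOPS requirement."""
    # DC: disks needed to hold the data.
    dc = math.ceil(capacity_gb / disk_gb)
    # DI: disks needed to serve the peak IOPS at the chosen utilization.
    usable_iops = disk_iops * utilization
    di = math.ceil(peak_iops / usable_iops)
    # DR: the larger of the two requirements wins.
    return dc, di, max(dc, di)


# 1.46 TB (entered as 1460 GB), 9,000 IOPS peak, 146 GB drives at 180 IOPS.
print(disks_required(1460, 9000, 146, 180))                   # (10, 50, 50)
print(disks_required(1460, 9000, 146, 180, utilization=0.7))  # (10, 72, 72)
```

The capacity is passed as an integer number of gigabytes on purpose: computing it as 1.46 * 1000 in floating point can land a hair above 1460 and round the capacity count up by one disk.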

In summary, in this case 10 disks would be sufficient from a capacity point of view, but to meet the IOPS requirement at an acceptable response time we need 72 disks. This is why in practice solid state drives (SSD or flash memory), which have no seek time or rotational latency, are often the most economical solution to your storage problems.

In case you think this example is not typical for SOHO use, here is a picture of the City-owned electrical service entrance at the curb of a googler's residence a few blocks from our garage.

Once you wire up every control point and sensor in your house with Cat-6 cabling and run a deep learning algorithm to recognize the cats trespassing on your property on 16,000 CPU cores, you need 38 kV service and an 800 A connection box. Not shown in the picture is the backup generator, which is on private property, as well as the fuse box.

Luckily, at the Mostly Color HQ we do not have cat problems; all we need is the ability to store our color science library and our test media. In the next post we will explore how we can achieve this with an electrical power consumption similar to that of a light bulb.
