Monday, June 10, 2013

The power of crowd sourcing

This morning the local radio station's broadcast of Morning Edition carried a short piece on the new NSA data farm in Utah, which is supposed to go on-line this September. The piece stated that the data farm will store 5 zettabytes, and that the old data farm in Virginia, which will remain on-line, has about 2/3 of that capacity.

These 8 zettabytes are contributed by us aliens, i.e., non-citizens, which makes them crowd-sourced data. How does this compare to the data that the best and brightest scientists in the world can create? The CERN Data Centre has recorded over 100 petabytes of physics data over the last 20 years, and collisions in the Large Hadron Collider (LHC) generated about 75 petabytes of this data in the past three years. The bulk of the data (about 88 petabytes) is archived on tape using the CERN Advanced Storage system (CASTOR), and the rest (13 petabytes) is stored on the EOS disk pool system, which is optimized for fast analysis access by many concurrent users. For the EOS system, the data are stored on over 17,000 disks attached to 800 disk servers. These disk-based systems are replicated automatically after hard-disk failures, and a scalable namespace enables fast concurrent access to millions of individual files.

A zettabyte is 2^70 bytes and a petabyte is a paltry 2^50 bytes, so the NSA's 8 zettabytes against CERN's 100 petabytes indicates that crowd sourcing can yield about 5 orders of magnitude more data than the best scientists can. And while the scientists use the most powerful particle smasher ever built by humankind, the crowd just uses their fingers on plain old keyboards.
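For the skeptical, here is a minimal Python sketch of that comparison. It assumes the binary units used above, takes the Virginia farm to be exactly 2/3 of the Utah farm's 5 zettabytes, and rounds CERN's archive to 100 petabytes; the variable and function names are mine, purely for illustration.

```python
import math

# Binary unit sizes, as used in the post (an assumption; SI units would be powers of 10).
ZETTABYTE = 2**70  # bytes
PETABYTE = 2**50   # bytes

# Figures quoted in the post.
UTAH_ZB = 5                    # capacity of the new Utah farm
VIRGINIA_ZB = UTAH_ZB * 2 / 3  # "about 2/3 of the capacity"
CERN_PB = 100                  # "over 100 petabytes", rounded down


def orders_of_magnitude(larger_bytes, smaller_bytes):
    """Return log10 of the ratio between two byte counts."""
    return math.log10(larger_bytes / smaller_bytes)


nsa_bytes = (UTAH_ZB + VIRGINIA_ZB) * ZETTABYTE
cern_bytes = CERN_PB * PETABYTE
gap = orders_of_magnitude(nsa_bytes, cern_bytes)

print(f"NSA total: {UTAH_ZB + VIRGINIA_ZB:.1f} zettabytes")
print(f"Gap: {gap:.2f} orders of magnitude")
```

The gap comes out just under 5. The unit ratio alone (2^70 / 2^50 = 2^20) is about 6 orders of magnitude; the factor of 8 versus 100 in front of the units brings the total down to roughly 5.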

The more mind-boggling data point is that at some point the NSA may want to synchronize the data in the two farms. To get an idea of the required bandwidth, consider that backing up a 1 terabyte (2^40 bytes) solid state disk to a top-of-the-line external disk over a FireWire 800 connection takes 5 hours, 39 minutes, and 39 seconds…
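To put a rough number on it, the sketch below scales that backup time up to a full farm. It is a back-of-the-envelope estimate only: it assumes a single link sustaining the same throughput as my FireWire 800 backup, binary units, and that the full 5 zettabytes of the Utah farm would have to be moved. None of this reflects how a real data center would replicate data, and the names are hypothetical.

```python
# Measured backup from the post: 1 terabyte (2**40 bytes) in 5:39:39.
TERABYTE = 2**40
ZETTABYTE = 2**70
BACKUP_SECONDS = 5 * 3600 + 39 * 60 + 39

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def transfer_years(total_bytes, bytes_per_second):
    """Years needed to move total_bytes over one link at a constant rate."""
    return total_bytes / bytes_per_second / SECONDS_PER_YEAR


# Sustained throughput implied by the backup, in bytes per second.
rate = TERABYTE / BACKUP_SECONDS

# Assumption: the whole 5 zettabyte Utah farm is copied over a single such link.
years = transfer_years(5 * ZETTABYTE, rate)

print(f"Implied throughput: {rate / 1e6:.0f} MB/s")
print(f"Time to copy 5 zettabytes: {years:,.0f} years")
```

At that rate a single link would need millions of years, which is why any real synchronization would have to rely on massively parallel links or on moving only the changes.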

Servers at the CERN Data Centre collected 75 petabytes of LHC data in the last three years, bringing the total recorded physics data to over 100 petabytes (Image: CERN)