Advances in computing have made it possible to readily amass mind-boggling amounts of data into databases. One advance in particular makes this possible: the growing capacity and shrinking cost of mass storage. The other day I saw a 1 GB memory stick for $39.00 and a 250 GB hard drive for $80.00. How many of you remember the days when a 10 MB hard drive in your computer was leading edge?
Really Big Bytes
Some types of large datasets have very well-defined formats and processing algorithms, and can be readily converted into a user-friendly form. Movie files, still pictures, and audio files are examples. You can watch a movie file consisting of gigabytes of data in a couple of hours. The problems occur when the data can't be converted into an easy-to-examine form, or when you are looking for complex relationships in the data, or when you don't really know what you are looking for. In these cases you begin to realize that handling large sets of data becomes a non-trivial issue.
For one example of a large numerical database of potentially widespread interest, consider stock market historical prices. The New York Stock Exchange has ~2700 issuers. From one source of historical data I found, a five-year daily history of figures such as trading volume and high, low, and closing prices occupied ~50 KB for a single issue, so I could extrapolate a 20-year database of all 2700 issues to be roughly 540 MB, only partially filling a single CD-ROM. If I knew what I was looking for, predictive knowledge gleaned from these data could potentially become very lucrative. Needless to say, I don't know what to look for in this dataset, or I would be doing something very different from writing this blog right now. (This is the "Money" part of today's blog.)
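As a sanity check on that extrapolation, here is a quick Python sketch. The issue count and per-issue size are the rough figures quoted above, so the result is only as good as those estimates:

```python
# Back-of-envelope estimate of the 20-year stock market dataset.
# Rough figures from the text: ~2700 NYSE issues, ~50 KB per issue
# for a five-year daily history.
issues = 2700
kb_per_issue_5yr = 50                              # KB per issue, 5 years
kb_per_issue_20yr = kb_per_issue_5yr * (20 / 5)    # scale to 20 years
total_mb = issues * kb_per_issue_20yr / 1000       # KB -> MB

print(f"{total_mb:.0f} MB")  # prints "540 MB"
```

Well under the ~650 MB capacity of a single CD-ROM, as claimed.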
Not the Money Part
For a somewhat larger, though non-numerical, dataset, consider the human genome, which consists of approximately three billion base pairs, each commonly denoted by the first letter of its nucleotide base (A, T, G, C). If you wrote this out as a text file, one character per base pair, it would occupy 3 GB, filling roughly 5 CD-ROMs. Figuring out what all this information means, and how the 30,000 or so distinct genes it encodes all function and interact, will be keeping scientists busy for decades to come. (This is the "Sex" part. Disappointed?)
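The genome arithmetic works out the same way, assuming one text character (one byte) per base pair and ~650 MB per CD-ROM:

```python
# The genome as a plain text file: one byte per base pair.
base_pairs = 3_000_000_000
total_bytes = base_pairs            # 1 character (A/T/G/C) per base pair
gb = total_bytes / 1e9              # bytes -> GB
cds = total_bytes / 650e6           # assuming ~650 MB per CD-ROM

print(f"{gb:.0f} GB, about {cds:.1f} CD-ROMs")  # prints "3 GB, about 4.6 CD-ROMs"
```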
And the Data
Now, after considering these high-profile databases, consider the following modest laboratory scenario. You are performing reliability testing of a number of devices, and want to monitor their performance during a 1000-hour test. There are a total of 32 devices, and you have to monitor two output variables from each device at the rate of 200 samples/second, with 16-bit resolution. This requires a data acquisition card with 64 inputs and an aggregate sample rate of 12,800 samples/second, not exactly leading-edge performance these days, or even 10 years ago. Because environmental testing facilities are very expensive, you do not want to have to repeat the test, so you will just record everything and analyze the data at the end. How much data do you have to deal with?
Well, 1000 hours × 3600 seconds/hour × 12,800 samples/second × 2 bytes/sample gives you ~92 GB, a dataset 30× as large as the human genome, and almost 200× larger than our hypothetical stock market dataset. It fills something like 150 CD-ROMs. But note that it still fits on your $80.00 hard disk drive, with lots of room to spare. You aren't going to be able to load all this into Excel, however, and do some quick plots to see what you acquired. Exactly what do you do with all these data?
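The whole reliability-test calculation, from device count to total bytes, can be sketched in a few lines of Python using the parameters from the scenario above:

```python
# Data volume for the hypothetical 1000-hour reliability test.
devices = 32
signals_per_device = 2
sample_rate = 200              # samples/second per signal
bytes_per_sample = 2           # 16-bit resolution
hours = 1000

channels = devices * signals_per_device        # 64 DAQ inputs
aggregate_rate = channels * sample_rate        # 12,800 samples/second
total_bytes = aggregate_rate * bytes_per_sample * hours * 3600

print(f"{total_bytes / 1e9:.0f} GB")  # prints "92 GB"
```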
As sensors are increasingly used for monitoring applications, the ability to cheaply store enormous amounts of data will invariably result in the compilation of equally enormous databases. The scenario described above will become more and more common. With some luck, new generations of visualization and data-mining tools will be developed to help sift through the gigabytes of data that are going to be generated on a regular basis. Otherwise, we stand to be buried to our eyeballs in expensively obtained but worthless data.