You’ve probably heard the buzz about Big Data, and stories of data mining technologies (e.g., Hadoop-based solutions) that take business analytics to the HPC level, making it possible to make sense of massive quantities of unstructured data. And you may have heard the buzz from storage vendors about how their highly scalable disk platforms enable all this number crunching.
But if all this data has grown large enough to be branded Big Data, does it make sense to keep it all on disk? And what about long-term storage of the source data that not only drives the analysis today, but is likely to be needed again at some point in the future? How many spinning disks does it take to store all the interactions of consumers on the Web or to store years of satellite images at a half-meter ground sample distance (GSD)?
The unsurprising answer is that Big Data requires scalable active archives to go along with the scalable disk storage systems. Big Data does not live on disk alone. With a lower-cost, long-term medium like tape and intelligent software, active archives can deliver large data sets to the high-performance disk storage systems as needed, when needed, for the intense number crunching. And when the analysis is complete, an active archive can preserve the results for the future. That’s a no-brainer.
But another significant characteristic of Big Data is that much of the source data is fixed content: data that never changes. Consider transaction log files from a bank, satellite images in weather research, and raw footage from a movie set. These files are fixed content from the moment they are created and, when handled properly, will never be modified. As fixed content, these files should be preserved in an archive not just to conserve disk space, but because they are irreproducible.
So my advice to those responsible for managing Big Data is to do what a number of Atempo customers are doing with their raw data today. First, archive all your raw data sets onto a low-cost medium like tape as soon as they’re created, capturing and indexing the relevant metadata so you can search for and retrieve them later. Make a second copy and send it offsite while you’re at it.
Then, if the data sets aren’t needed right away, remove them from the high-performance storage. When needed for analysis, the data sets can be retrieved quickly and easily from the active archive through the file system or via search. Your raw data sets will be secure and immediately available without crowding your expensive high-performance disk systems.
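For readers who like to see a workflow spelled out, here is a minimal Python sketch of the steps above: copy a raw file to an archive, make a second copy, record searchable metadata, and only then release the space on the fast tier. It is an illustration under stated assumptions, not how any particular archiving product works. Plain directories stand in for the tape archive and the offsite location, a JSON-lines file stands in for the metadata index, and the names `archive_raw_file` and `find_in_index` are hypothetical.

```python
import hashlib
import json
import shutil
import time
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_raw_file(src, archive_dir, offsite_dir, index_path, release=False):
    """Copy src to two archive locations, index its metadata, optionally free the source.

    Assumption: archive_dir and offsite_dir are ordinary directories standing in
    for a tape archive and an offsite copy.
    """
    src = Path(src)
    checksum = sha256_of(src)
    stat = src.stat()

    copies = []
    for target_dir in (Path(archive_dir), Path(offsite_dir)):
        target_dir.mkdir(parents=True, exist_ok=True)
        dest = target_dir / src.name
        shutil.copy2(src, dest)  # copy2 also preserves timestamps
        if sha256_of(dest) != checksum:
            raise OSError(f"checksum mismatch after copying to {dest}")
        copies.append(str(dest))

    # Metadata captured at archive time so the file can be found later.
    record = {
        "name": src.name,
        "size_bytes": stat.st_size,
        "modified": stat.st_mtime,
        "sha256": checksum,
        "copies": copies,
        "archived_at": time.time(),
    }
    with Path(index_path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")

    # Remove the original only after both copies are verified and indexed.
    if release:
        src.unlink()
    return record


def find_in_index(index_path, name):
    """Return all index records whose file name matches exactly."""
    with Path(index_path).open(encoding="utf-8") as fh:
        records = [json.loads(line) for line in fh if line.strip()]
    return [r for r in records if r["name"] == name]
```

The ordering is the point of the sketch: the source file is removed only after both copies have passed a checksum comparison and the index entry has been written, which matters because the raw data cannot be reproduced.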
Archiving raw data sets is only one way that folks managing Big Data environments can reduce their Big Data management headaches. At almost every step in analytical workflows there are opportunities to manage data better through active archiving. By taking that first step of archiving raw data sets, you’ll get your Big Data strategy off on the right foot.
Note: Image from http://flowingdata.com/2010/08/17/stacked-area-shows-the-web-is-dead/