Archival: Data Identification, the Missing Link

To archive or not to archive is no longer the question in today's business world. After a spate of financial and other high-profile scandals, federal regulators in various countries now require long-term retention and archival of most business data. Even so, all is not quiet on the archival front.

[Figure: triangle diagram of the archival life cycle]

Let's start at the very beginning - a very good place to start. Structured data archival is a multi-step process, as shown above. The triangles represent the steps of the archival life cycle. The arrows show the process flow, and the contact points between the triangles indicate how the steps relate to one another. The process starts with identifying what can be archived and what needs to be archived. We then transform that data into a forward-compatible format and move it to a suitable storage system to complete the archival. While restoring data is not part of archival per se, the fact that archived data must be easy to restore and read is one of the "duh" cases. Compliance and audit play a dominant role in this. Of course, one can argue that an active archive solution already offers reliable, efficient, online access to the archived data. So why bother to restore? Since the jury is still out on this, "Restore" continues to be an integral part of data archival.

A survey of the available archival technologies shows that while IBM appears to lead the race with Optim integrated data management, the Active Archive Alliance is playing a key role in making online data archive options available to a wider audience. To date, most of the focus has been on the "Transform," "Archive," and "Restore" steps, with very few means available to identify which parts of the business data merit archival. The "Identify" step is largely a manual process, often a matter of subjective interpretation of business rules. In a large global company, determining what data is in active use at any given time, and what can be moved to an offline archive system, is highly challenging. Keeping a large volume of inactive data in the operational databases is extremely costly, yet often obligatory because of federal rules and regulations. Naturally, it makes very good business sense to build a tool that classifies data as active or inactive and so makes archival decisions simple. Technologically, however, such data classification is a big challenge. While IBM WebSphere Content Discovery can assist in determining the relationships between different data sets, it does not help in finding out which data is stale or inactive. To my knowledge, no commercially available tool exists in this space.

Why is it so hard to bell this cat? Can we not simply look at the metadata at the RDBMS level and peg which data is in use and which is stale? The answer is both yes and no. In the major RDBMS products, data elements are stored at the DB block or page level, and the metadata tracks read/write operations on those data elements at the storage-unit level. So yes, tracking the metadata can yield very valuable information on whether a data set is stale. Unfortunately, it's not as simple as it sounds. Read/write operations in an RDBMS are not limited to business queries. They are also triggered by maintenance operations such as building or maintaining indexes, or even by a query that runs a full table scan, and so on. A block that no business user has touched in years can therefore still look recently accessed.
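To make the problem concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `AccessEvent` record, the `source` labels and the 90-day window are assumptions made for illustration, and I am not aware of an RDBMS that exposes block-level access history in this form. The sketch simply compares a naive "last touched" classification with one that counts only business queries. Note that it treats every full table scan as noise, which is itself a debatable assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical labels for what caused a block access. Real RDBMS metadata
# does not necessarily record this distinction.
BUSINESS_SOURCES = {"business_query"}


@dataclass(frozen=True)
class AccessEvent:
    block_id: int
    when: datetime
    source: str  # e.g. "business_query", "index_rebuild", "full_table_scan"


def last_access(events, sources=None):
    """Return {block_id: latest access time}, optionally limited to some sources."""
    latest = {}
    for e in events:
        if sources is not None and e.source not in sources:
            continue  # ignore accesses we consider noise
        if e.block_id not in latest or e.when > latest[e.block_id]:
            latest[e.block_id] = e.when
    return latest


def classify(events, all_blocks, now, stale_after, sources=None):
    """Label each block 'active' or 'inactive' based on its latest counted access."""
    latest = last_access(events, sources)
    result = {}
    for block in all_blocks:
        seen = latest.get(block)
        is_active = seen is not None and now - seen <= stale_after
        result[block] = "active" if is_active else "inactive"
    return result


now = datetime(2024, 1, 1)
events = [
    AccessEvent(1, now - timedelta(days=2), "business_query"),
    AccessEvent(2, now - timedelta(days=400), "business_query"),
    AccessEvent(2, now - timedelta(days=1), "index_rebuild"),
    AccessEvent(3, now - timedelta(days=3), "full_table_scan"),
]
blocks = [1, 2, 3]
window = timedelta(days=90)

# Naive view: any read/write counts, so maintenance work keeps blocks "active".
naive = classify(events, blocks, now, window)
# Filtered view: only accesses labelled as business queries count.
filtered = classify(events, blocks, now, window, BUSINESS_SOURCES)

print(naive)
print(filtered)
```

In the naive view all three blocks look active, because an index rebuild and a full table scan touched blocks 2 and 3 recently. Once only business queries are counted, blocks 2 and 3 come out inactive. The hard part, of course, is obtaining a trustworthy `source` label in the first place.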

So, how do we bell this cat? What do you have to say on this issue?