Building an Active "Forever" Archive: Metadata evolution and the problem of what to keep

Readers of the Active Archive Alliance site, and anyone dealing firsthand with the explosion of data, face many of the same problems over and over. Key questions arise in managing data, or more importantly, in cost-effectively giving users easy access to data as it grows. Questions and concerns such as:

  • How much inactive data is sitting on active production disk?

-- Simply adding more disk is not a long-term strategy.

  • As production disk grows, backup windows also grow.

-- Unless archive and backup solutions are aligned, bottlenecks occur. Backup and restore times become unmanageable.

  • Growing storage creates data silos, often incompatible ones.

-- Collaboration becomes difficult, often impractical. Management becomes unwieldy.

  • Unchecked data growth means increased operating costs and growing power, space, and cooling demands.

The premise of an Active Archive is to keep primary production storage small, or as close to constant as possible. Growth in data is absorbed by lower-cost tiers through a hierarchical storage management (HSM) or Tiered Storage Virtualization strategy, as sketched below. Each of the problems noted above then becomes manageable.
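To make the tiering idea concrete, here is a minimal sketch in Python of an age-based migration pass. The paths, the 90-day inactivity threshold, and the symlink stub are illustrative assumptions, not a description of any particular HSM or vendor product.

```python
import shutil
import time
from pathlib import Path

# Illustrative assumptions: files untouched for 90 days move to a lower-cost tier.
INACTIVE_DAYS = 90
PRODUCTION_ROOT = Path("/mnt/production")
ARCHIVE_ROOT = Path("/mnt/archive")   # lower-cost tier (disk, object, or tape front end)

def tier_inactive_files(now: float | None = None) -> int:
    """Move files whose last access exceeds INACTIVE_DAYS to the archive tier.

    Returns the number of files migrated. A real HSM would leave a stub behind
    so the file remains transparently retrievable; a symlink stands in for that here.
    """
    now = now or time.time()
    cutoff = now - INACTIVE_DAYS * 86400
    moved = 0
    for path in PRODUCTION_ROOT.rglob("*"):
        if not path.is_file():
            continue
        if path.stat().st_atime < cutoff:
            target = ARCHIVE_ROOT / path.relative_to(PRODUCTION_ROOT)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(path), str(target))
            path.symlink_to(target)   # keep the namespace intact for users
            moved += 1
    return moved

if __name__ == "__main__":
    print(f"Migrated {tier_inactive_files()} inactive files to the archive tier")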

But often the problem is that users are unwilling or unable to determine what portion of their data can safely be migrated to lower tiers. They fear that even data untouched for a significant period might become too difficult to access if it is not immediately available on production disk, which brings us back to the problems noted above.

A recent conversation with an IT director of a major pharmaceutical company illustrated the dilemma: “We spent a lot of money on an ‘archive’ solution, which sits at 10 percent utilization,” he said. “Even though it works very well, and there is little latency in getting data back into production environments, the user community can’t agree on which portion of the data should be archived. And so they archive nothing.”

An active archive solution minimizes this dilemma because the content is always available. But even then there remains the problem of determining what should be kept, what tier the data should live on, and how to ensure it will be around forever.

Metadata is the starting point... Metadata evolution is the key.

At the heart of this problem is the difficulty of categorizing data: knowing whether it is something to keep forever, to throw away, to keep on active storage or a faster tier, or to move downstream. Add to this that, for true archives, the indexing and management scheme needs to account for metadata schemas changing over time. Metadata evolves, just as language does. New data types will emerge, and new use cases will be added that must be accommodated, so that a query into the archive pulls back all relevant content, not just a subset of it. Only by employing an indexing scheme that harvests metadata actively and automatically, and does so in the face of constantly evolving metadata, can an active archive persist for the long haul. A rough sketch of such a scheme follows.
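One way to picture an index that tolerates evolving metadata is to store fields as flexible key/value pairs rather than fixed columns, so new fields can be harvested retroactively without a migration. The sketch below, assuming a simple SQLite-backed store with hypothetical table and field names, is illustrative only and not a description of any specific product.

```python
import json
import sqlite3

# Illustrative assumption: metadata lives as key/value rows, so fields added by a
# newer schema or use case need no table migration to become queryable.
SCHEMA = """
CREATE TABLE IF NOT EXISTS objects  (object_id TEXT PRIMARY KEY, location TEXT);
CREATE TABLE IF NOT EXISTS metadata (object_id TEXT, key TEXT, value TEXT,
                                     PRIMARY KEY (object_id, key));
"""

class MetadataIndex:
    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.executescript(SCHEMA)

    def ingest(self, object_id: str, location: str, metadata: dict) -> None:
        """Record an archived object and whatever metadata fields exist today."""
        self.db.execute("INSERT OR REPLACE INTO objects VALUES (?, ?)",
                        (object_id, location))
        for key, value in metadata.items():
            self.db.execute("INSERT OR REPLACE INTO metadata VALUES (?, ?, ?)",
                            (object_id, key, json.dumps(value)))
        self.db.commit()

    def reharvest(self, object_id: str, new_metadata: dict) -> None:
        """Retroactively add fields introduced after the object was archived."""
        for key, value in new_metadata.items():
            self.db.execute("INSERT OR REPLACE INTO metadata VALUES (?, ?, ?)",
                            (object_id, key, json.dumps(value)))
        self.db.commit()

    def find(self, key: str, value) -> list:
        """Return object ids whose metadata matches, old or new fields alike."""
        rows = self.db.execute(
            "SELECT object_id FROM metadata WHERE key = ? AND value = ?",
            (key, json.dumps(value)))
        return [row[0] for row in rows]

# Example: a field harvested years later ("instrument") is queryable without rebuilding.
idx = MetadataIndex()
idx.ingest("scan-001", "tape://vol7/scan-001", {"project": "alpha", "year": 2011})
idx.reharvest("scan-001", {"instrument": "MRI-3T"})
print(idx.find("instrument", "MRI-3T"))   # -> ['scan-001']
```

The design point is that a query over the metadata table returns matches regardless of when a field was harvested, which is what lets the archive stay current as schemas evolve.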

A true active archive is one that can stay alive for decades, or until the data is genuinely no longer relevant. By decoupling production disk from the archive, we gain hardware independence: a refresh of back-end infrastructure does not limit users’ ability to access data, no matter what platform may appear in the future. In the same way, archive schemes must allow metadata to evolve retroactively. Then not only are the files physically available, but the archive itself stays fresh and users can easily find the content they seek. New or old, the archive remains a vital asset, and the problems of isolated data and underutilized or expensive infrastructure are minimized.