Managing hierarchical data: A look at XML repositories
With the proliferation of XML as a new common data format, the problem of managing XML documents has become more critical. New technologies are now available that allow organizations to better manage their information as XML documents. In this article, we'll examine the technology of XML repositories and see how they help drive the future of extensible shared data.
An XML repository is a system for storing and retrieving XML data. This data is usually in the form of XML documents and their associated Document Type Definitions (DTDs) or XML Schemas. Because XML data lends itself to a hierarchical structure rather than a relational structure, it may be difficult to store XML data in traditional relational database systems. The repository itself may be a relational database system, but it is more likely to be a custom storage system built exclusively for XML (or hierarchical) data.
Storage methods vary from one system to the next, and so do the procedures for storing and retrieving data. Documents may be located by key through an index, or by running a query against their content.
Finally, XML repositories may use a variety of access methods. Some systems expose a proprietary API based on COM, CORBA, or Enterprise JavaBeans (EJB), while others rely on an open standard such as ODBC. Most repositories can also be accessed over a network.
Storing and retrieving data
The process of storing XML data consists of two tasks: adding a new XML document to the repository and updating an existing document. Removing a document from the repository is treated as a special case of updating an existing document.
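To make these two tasks concrete, here is a minimal sketch of a key-based store written in Python. The class name DocumentStore and its methods are hypothetical and do not correspond to the API of any actual repository product. Passing None as the new content is simply one way to model removal as a form of update.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional


class DocumentStore:
    """Hypothetical in-memory store; illustrative only."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}

    def add(self, key: str, xml_text: str) -> None:
        """Add a new document; refuse to overwrite an existing one."""
        if key in self._docs:
            raise KeyError(f"document {key!r} already exists")
        ET.fromstring(xml_text)  # raises ParseError if not well-formed
        self._docs[key] = xml_text

    def update(self, key: str, xml_text: Optional[str]) -> None:
        """Replace an existing document, or remove it when xml_text is None."""
        if key not in self._docs:
            raise KeyError(f"document {key!r} does not exist")
        if xml_text is None:
            del self._docs[key]  # removal modeled as a kind of update
            return
        ET.fromstring(xml_text)
        self._docs[key] = xml_text

    def get(self, key: str) -> Optional[str]:
        """Return the stored text, or None if the key is unknown."""
        return self._docs.get(key)


store = DocumentStore()
store.add("memo-1", "<memo><to>Staff</to></memo>")
store.update("memo-1", "<memo><to>All staff</to></memo>")
updated_text = store.get("memo-1")
store.update("memo-1", None)
```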
Because XML data is not based on a traditional relational model, implementing XML repositories using such databases can be complex and cumbersome. For example, in a straightforward mapping, each level of the XML hierarchy typically requires its own relational table. As your XML documents become more complex, your relational database does as well.
Storage systems built around a hierarchical model accept XML data more easily, without the mapping and indexing gymnastics that a relational model demands. Hierarchical systems also offer the added benefit of allowing the use of XQL and XPath expressions for accessing whole and partial documents.
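The following sketch shows how quickly tables accumulate when a document is split across a relational schema. It uses Python's built-in sqlite3 and xml.etree.ElementTree modules. The order document, the table names, and the shred_order function are all invented for illustration, and this is only one of several possible mappings.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample document with three hierarchy levels.
ORDER_XML = """<order id="1001">
  <customer>Acme Corp</customer>
  <item sku="A-1">
    <name>Widget</name>
    <option>blue</option>
    <option>large</option>
  </item>
  <item sku="B-2">
    <name>Gadget</name>
  </item>
</order>"""


def shred_order(xml_text, conn):
    """Split one order document across three tables, one per level."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS orders (
            order_id TEXT PRIMARY KEY, customer TEXT);
        CREATE TABLE IF NOT EXISTS items (
            item_id INTEGER PRIMARY KEY, order_id TEXT, sku TEXT, name TEXT);
        CREATE TABLE IF NOT EXISTS item_options (
            item_id INTEGER, value TEXT);
    """)
    order = ET.fromstring(xml_text)
    order_id = order.get("id")
    conn.execute("INSERT INTO orders VALUES (?, ?)",
                 (order_id, order.findtext("customer")))
    for item in order.findall("item"):
        cur = conn.execute(
            "INSERT INTO items (order_id, sku, name) VALUES (?, ?, ?)",
            (order_id, item.get("sku"), item.findtext("name")))
        # Each nested level needs a foreign key back to its parent row.
        for option in item.findall("option"):
            conn.execute("INSERT INTO item_options VALUES (?, ?)",
                         (cur.lastrowid, option.text))
    conn.commit()


conn = sqlite3.connect(":memory:")
shred_order(ORDER_XML, conn)
table_counts = {
    table: conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    for table in ("orders", "items", "item_options")
}
```

Adding a fourth level to the document would mean a fourth table and another foreign key, which is the growth in complexity described above.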
Retrieving XML data
The method used to retrieve XML documents is related to the storage method. For relational systems, this will usually be through SQL or stored procedures. These methods have the disadvantage of accessing and returning data as a relational set rather than as an XML hierarchical structure.
Hierarchical systems will usually provide an XQL or XPath method for accessing XML data. These technologies more accurately reflect the type of data queries made against XML data. They also provide the data in a hierarchical format.
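As a rough illustration of path-based retrieval, the sketch below uses the limited XPath subset supported by Python's standard xml.etree.ElementTree module. The catalog document is invented for the example, and a real repository would expose its own query interface rather than this one.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
CATALOG_XML = """<catalog>
  <book id="b1"><title>XML Basics</title><author>Lee</author></book>
  <book id="b2"><title>Data Modeling</title><author>Ortiz</author></book>
</catalog>"""

catalog = ET.fromstring(CATALOG_XML)

# Query across the whole document: every title under every book.
all_titles = [title.text for title in catalog.findall("./book/title")]

# Partial-document access: select a single subtree by attribute value.
book = catalog.find(".//book[@id='b2']")

# The result is still a hierarchy, so it can be serialized as XML.
fragment = ET.tostring(book, encoding="unicode").strip()
```

Note that the second query returns an element with its children intact, not a flat row of columns.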
Indexing and validating data
When storing data in relational systems, an external primary key may be attached to each XML document. The storage and retrieval process uses this key to identify which document is being stored or retrieved. More advanced systems extract the primary key from an element or attribute within the XML document itself.
Indexes on data stored in relational tables are based on a single table (or single hierarchy level). Hierarchical systems also allow you to designate an element or attribute as the primary key, and they additionally let you create indexes at different levels based on data within the hierarchy.
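Here is a small sketch of both ideas: a primary key taken from an attribute of the document, and secondary indexes built from elements at different depths. The TinyRepository class, the choice of the root id attribute as the key, and the sample orders are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


class TinyRepository:
    """Hypothetical in-memory repository with path-based indexes."""

    def __init__(self, index_paths):
        self.docs = {}
        # One index per path; each maps an element's text to document keys.
        self.indexes = {path: defaultdict(set) for path in index_paths}

    def store(self, xml_text):
        doc = ET.fromstring(xml_text)
        key = doc.get("id")  # primary key extracted from the root attribute
        if key is None:
            raise ValueError("document has no id attribute")
        self.remove(key)  # drop stale index entries when updating
        self.docs[key] = doc
        for path, index in self.indexes.items():
            for node in doc.findall(path):
                index[node.text].add(key)
        return key

    def remove(self, key):
        doc = self.docs.pop(key, None)
        if doc is None:
            return False
        for path, index in self.indexes.items():
            for node in doc.findall(path):
                index[node.text].discard(key)
        return True

    def lookup(self, path, value):
        """Return the sorted keys of documents whose path matches value."""
        return sorted(self.indexes[path].get(value, set()))


# Index one element directly under the root and one a level deeper.
repo = TinyRepository(index_paths=["./customer", "./item/name"])
repo.store('<order id="1001"><customer>Acme</customer>'
           '<item><name>Widget</name></item></order>')
repo.store('<order id="1002"><customer>Acme</customer>'
           '<item><name>Gadget</name></item></order>')
acme_orders = repo.lookup("./customer", "Acme")
widget_orders = repo.lookup("./item/name", "Widget")
```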
One of the most important aspects of XML documents is the option of data validation. Using a variety of technologies, including DTDs and Schemas, XML parsers are able to determine whether an XML document meets certain data standards. Because repositories are able to understand a DTD or XML Schema, they can provide validation as data is stored and updated.
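The sketch below shows the general shape of validating a document at the moment it is stored. Python's standard library checks well-formedness but does not validate against DTDs or XML Schemas, so a hand-written table of required child elements stands in for the schema here. The REQUIRED_CHILDREN table and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a DTD: each element name maps to the child
# elements it must contain. A real DTD or XML Schema expresses far more.
REQUIRED_CHILDREN = {
    "book": ["title", "author"],
}


def validation_errors(xml_text):
    """Return a list of problems; an empty list means the text passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]
    errors = []
    for element in root.iter():
        for child_name in REQUIRED_CHILDREN.get(element.tag, []):
            if element.find(child_name) is None:
                errors.append(
                    f"<{element.tag}> is missing required <{child_name}>")
    return errors


def store_if_valid(repository, key, xml_text):
    """Store the document only if it passes the checks above."""
    errors = validation_errors(xml_text)
    if errors:
        raise ValueError("; ".join(errors))
    repository[key] = xml_text


repository = {}
store_if_valid(repository, "b1",
               "<book><title>XML Basics</title><author>Lee</author></book>")
problems = validation_errors("<book><title>Untitled</title></book>")
```

Because the check runs inside the store operation, an invalid document never reaches the repository.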
Choosing a relationship
As XML documents continue to become more common, organizations will need to create a repository for managing hierarchical data. These repositories will offer new technology for storing, accessing, and optimizing XML documents.
By Brian Schaffner
October 15, 2001