Storing And Retrieving XML Content
By: Ketan Patel
Jan. 8, 2001 12:00 AM
From the outset, XML separated data from metadata. This structural separation was intentional: it allowed XML to be, at the same time, the logical evolution of the document, a new transaction medium, and a conversation engine that connects applications. How your application uses XML has implications you must consider when choosing a storage strategy. This article examines the storage and retrieval of XML, and the scalability and performance implications of three approaches: storing XML as a document, resolving it to a database, or maintaining it in its native format.
The way XML is used influences which storage and retrieval technology is appropriate for your application, and those decisions in turn affect scalability and performance. When XML is used as a mechanism for information reuse and formatting, it's philosophically a document; when used to exchange tagged data between systems, it's a transaction; and when used to convert and map data between two systems, it's a data conversion engine. These three usage patterns demand significantly different approaches to data storage and management.
Before describing how these usage patterns affect data storage and retrieval, it's important to understand the core strengths of XML - flexibility, extensibility, ease of use, and platform independence - and how these strengths have contributed to such substantially different usage patterns. One of XML's primary benefits is its ability to separate the structure and tags from the content. This separation has spawned a number of evolutionary paths that are visible in the various XML standards initiatives currently under development. These initiatives represent usage patterns that reflect fundamental biases about how XML is used and how it should evolve. The biases can be categorized into three philosophies, matching the usage patterns described above: XML as a document, XML as a transaction, and XML as a data conversion engine.
In this article we explore the impact each philosophy has on the storage and retrieval of XML content and offer guidance on which approach is best suited to your application.
Core Components of an XML Document
Definition of an XML Document
Definition of DTD
When considering how to store an XML document, it's important to understand the general role of the XML DTD and how it will likely be implemented within your specific application. Every valid XML document must have a DTD (a document that is merely well formed does not need one). The DTD can be contained within the document itself as the internal subset, in which case the document can be standalone, or the document can reference an external DTD, known as the external subset. Either way, the DTD presents challenges from a storage perspective.
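To make the two DTD placements concrete, the following is a minimal Python sketch using the standard library's xml.dom.minidom. The contract documents, the file name contract.dtd, and the helper describe_doctype are hypothetical and exist only for illustration.

```python
from xml.dom import minidom

# Hypothetical document carrying its DTD inline (internal subset).
INTERNAL = """<?xml version="1.0" standalone="yes"?>
<!DOCTYPE contract [
  <!ELEMENT contract (party+)>
  <!ELEMENT party (#PCDATA)>
]>
<contract><party>Acme</party></contract>"""

# Hypothetical document pointing at an external DTD (external subset).
# minidom does not fetch contract.dtd by default; it only records the reference.
EXTERNAL = """<?xml version="1.0"?>
<!DOCTYPE contract SYSTEM "contract.dtd">
<contract><party>Acme</party></contract>"""


def describe_doctype(xml_text):
    """Return (root name, external system id, whether an internal subset exists)."""
    doctype = minidom.parseString(xml_text).doctype
    return (doctype.name, doctype.systemId, doctype.internalSubset is not None)


for label, text in (("internal", INTERNAL), ("external", EXTERNAL)):
    print(label, describe_doctype(text))
```

In the first case the declarations travel with every stored copy of the document; in the second, the store must also keep (or be able to resolve) the referenced DTD file.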
Indexing - The Impact on Storage and Retrieval
The index structure for a particular application determines the level of optimization achieved between the storage subsystem and the application. An index holds pointers into the content store for quick access to the stored information. The method for creating these pointers varies significantly, as do the read performance of the different index types and the system cost of creating the index. Clearly, the indexing and storage technology chosen must be efficient for the application accessing the XML objects. In addition, you must consider whether all the information in the XML should be stored. For example, XML content may include images; if these aren't needed by the application, it may be wise to strip them out prior to storage.
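The idea of an index as a set of pointers into the content store can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the in-memory STORE, the order documents, and the helper names build_index and lookup are all hypothetical, and a real storage subsystem would use a persistent index structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical content store: document id -> raw XML text.
STORE = {
    "doc-1": "<order><customer>Acme</customer><total>120</total></order>",
    "doc-2": "<order><customer>Globex</customer><total>75</total></order>",
    "doc-3": "<order><customer>Acme</customer><total>40</total></order>",
}


def build_index(store, tag):
    """Map each value of `tag` to the ids of the documents that contain it."""
    index = {}
    for doc_id, xml_text in store.items():
        root = ET.fromstring(xml_text)  # parsing cost is paid once, at index time
        for element in root.iter(tag):
            index.setdefault(element.text, []).append(doc_id)
    return index


def lookup(index, store, value):
    """Follow the index pointers back into the content store."""
    return [store[doc_id] for doc_id in index.get(value, [])]


customer_index = build_index(STORE, "customer")
print(lookup(customer_index, STORE, "Acme"))
```

The trade-off described above is visible here: building the index requires parsing every document, while each later lookup touches only the documents the index points to.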
Storing XML As a Document
Assume the DTDs and XSL style sheets for a series of XML documents are extremely similar in content - in fact, assume that 90% of the information stored will be duplicate information (a situation common when using standard forms such as legal documents and contracts). Regardless of whether the DTD is internal or external, it's necessary to keep a copy of it for each associated XML document, even though the same DTD is replicated in many of the XML documents within the same data store. This consumes additional storage space over and above the amount incurred by having the transaction in XML (see my article in last month's issue of XML-J, [Vol. 1, issue 7]). The structure stored in the document format is inherently inefficient since all the structures are text-based. Therefore, storing all the XML components in a single document significantly reduces the ability to optimize the storage subsystem.
In the document approach the only index values are those created at the time of filing (such as the file name). These values enable fast access to the entire document and its contents and are typically extracted from the document when it is filed into the storage system. A number of issues with this process affect scalability and performance. First, indexing occurs at the time the document is stored, so if additional or different indexes are needed later, the entire data store must be traversed and every document parsed and read to create the new index. Second, the extraction process is usually CPU- and memory-intensive because parsing is expensive, which encourages the creation of only the most critical lookup indexes. Finally, indexing on DTDs or on changes to them would be next to impossible, further limiting the type of lookup indexes that can be created against the XML documents.
Generally, a lookup index limits the ability to retrieve a result set based on an open-ended filter against the content. That is, an index structure doesn't typically support retrieving all documents within a specific date range unless the date was part of the index. As a result, an index speeds access to XML objects only where the access criteria are well understood in advance. A major benefit of XML is that its structure can be easily changed and extended, and the decision to build an index on specific data elements restricts the use of this characteristic. The price for higher access performance and scalability is reduced flexibility.
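The difference between an indexed lookup and an open-ended filter can be sketched as follows. The filed contracts, the FILING_INDEX built on the customer element, and both helper functions are hypothetical; the point is only that a criterion left out of the filing-time index forces a parse of every stored document.

```python
import xml.etree.ElementTree as ET

# Hypothetical filed documents; only <customer> was indexed at filing time.
FILED = {
    "a.xml": "<contract><customer>Acme</customer><signed>2000-11-02</signed></contract>",
    "b.xml": "<contract><customer>Globex</customer><signed>2000-12-15</signed></contract>",
    "c.xml": "<contract><customer>Acme</customer><signed>2001-01-05</signed></contract>",
}
FILING_INDEX = {"Acme": ["a.xml", "c.xml"], "Globex": ["b.xml"]}


def by_customer(name):
    """Indexed lookup: answered from the index alone, no parsing needed."""
    return FILING_INDEX.get(name, [])


def signed_between(start, end):
    """Unindexed filter: every stored document must be parsed and inspected."""
    hits = []
    for doc_id, xml_text in FILED.items():
        signed = ET.fromstring(xml_text).findtext("signed")
        # ISO 8601 dates (YYYY-MM-DD) compare correctly as plain strings.
        if signed is not None and start <= signed <= end:
            hits.append(doc_id)
    return hits


print(by_customer("Acme"))
print(signed_between("2000-12-01", "2001-01-31"))
```

Making the date query fast would mean adding a second index, which in the document approach requires the same full traversal shown in signed_between.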
Storing XML As a Database
While the approach of taking only the information that the DBMS can accommodate appears perfectly reasonable, it removes the extensibility benefit that's a core component of XML. Since XML DTDs distinguish required data from optional data and can be easily extended to support new data elements, building a structured database against one instance of a DTD limits flexibility. If a mapping function is implemented that maps XML to the database schema, this schema can't adapt to additional information from optional or new content in subsequent instances. If you're mapping content from multiple XML documents, there will clearly be a performance impact associated with maintaining these diverse format maps and performing the associated conversions. Also, converting or ignoring data elements means running the risk of losing that data, its original syntax, or both. Worse, if the syntax of a data element is changed even slightly, it affects future information needs. While an often-discussed approach is to store the unused or unexpected information in a comment or memo field, doing so makes the data in the memo field unavailable to the database schema.
In the database approach, the index values are mapped to the tables being populated. As information is transformed into the database, indexes are generated for quick access to the content. These indexes can be re-created as tables and searched as criteria evolve. However, they are limited to the content maintained in the tables; the original XML content is no longer available for indexing. While this approach provides considerable flexibility in tuning the performance of the storage system and is by far the most mature approach, it also prevents the solution from leveraging XML's extensibility and flexibility.
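A small sketch shows how a fixed mapping loses new content. It uses Python's built-in sqlite3 module with an in-memory database; the orders table, the MAPPED_COLUMNS tuple, the sample documents, and the load helper are all hypothetical and stand in for whatever mapping function a real system would use.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical fixed schema, designed against the first version of the DTD.
MAPPED_COLUMNS = ("customer", "total")


def load(connection, xml_text):
    """Map known elements to columns; return the tags the schema cannot hold."""
    root = ET.fromstring(xml_text)
    values = [root.findtext(column) for column in MAPPED_COLUMNS]
    connection.execute("INSERT INTO orders (customer, total) VALUES (?, ?)", values)
    return [child.tag for child in root if child.tag not in MAPPED_COLUMNS]


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE orders (customer TEXT, total TEXT)")

dropped_v1 = load(
    connection, "<order><customer>Acme</customer><total>120</total></order>"
)
# A later instance adds an optional element the schema was never built for.
dropped_v2 = load(
    connection,
    "<order><customer>Globex</customer><total>75</total><currency>EUR</currency></order>",
)
rows = connection.execute(
    "SELECT customer, total FROM orders ORDER BY customer"
).fetchall()
print(rows, dropped_v2)
```

The second document loads without error, but its currency element never reaches the table, which is the loss of extensibility described above.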
Storing XML in Its Natural Format
One example of a potential access mechanism standard is the XML Query Language (XQL) specification, which enhances the data model for XML documents and provides a set of query operators on that data model. In addition, XQL provides a query language that's similar to SQL in its ability to access information in databases. Unlike SQL, XQL is limited to operating on single documents or a fixed collection of documents. It can select whole documents or subtrees of documents that match conditions defined on document content and structure, and then construct new documents based on the resulting set. The XQL draft specification outlines the use of this query language for retrieving content from various types of XML documents, many of which are central to the storage problem.
There's some speculation as to whether these XML-specific solutions are really available today. Current solutions in the market tend to use object database technology to manage the storage and retrieval of XML. Proponents argue that object databases provide the answer to the XML storage and access problem since they leverage the strengths of XML and don't try to force-fit them into a prestructured solution. While this is an interesting premise (an XML database may be founded on this underlying object technology), two core tenets of object technology - inheritance and polymorphism - are not supported by XML; as a result, access to XML content in an object database is still inefficient. From a storage perspective, representing XML in an object model further increases the size of the content by a factor of three or more. In addition, these object-based solutions require significant system resources and haven't been able to scale to the level of relational database management systems. Since these solutions do offer some advantages for specific XML implementations, they have an important place in the market.
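The select-then-construct pattern described for the query language can be approximated with the limited XPath subset in Python's xml.etree.ElementTree. This is a stand-in, not XQL syntax: the catalog document and the select_books helper are hypothetical, and ElementTree supports only a small part of what a full XML query language would offer.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document.
CATALOG = ET.fromstring(
    "<catalog>"
    "<book year='1999'><title>XML Basics</title></book>"
    "<book year='2000'><title>Storing XML</title></book>"
    "<book year='2000'><title>Querying XML</title></book>"
    "</catalog>"
)


def select_books(catalog, year):
    """Build a new document from the <book> subtrees whose year attribute matches."""
    result = ET.Element("result")
    # Select subtrees by a condition on structure (book) and content (@year).
    for book in catalog.findall(f"./book[@year='{year}']"):
        result.append(book)
    return result


selected = select_books(CATALOG, "2000")
titles_2000 = [title.text for title in selected.iter("title")]
print(ET.tostring(selected, encoding="unicode"))
```

Because the query runs directly against the tree, no mapping to tables is needed and elements the query does not mention are carried along intact inside the selected subtrees.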
Conclusion
What should you do to implement a solution today? The first step is to clearly define the problem you need to solve with XML and identify which, if any, of the above philosophies applies. Without this understanding it's impossible to weigh the strengths and weaknesses of your storage options - storing XML as a document, resolving it to a database, or implementing a solution more specific to your unique needs.
Next, you must identify the important XML characteristics of your application to ensure that the chosen storage technology doesn't compromise the characteristic your application needs most. Is the primary focus human readability or data translation? Is the content inherently fluid or rigid? How critical is the content to your business in its original format? Considering the solution and these characteristics along with the relative strengths and weaknesses of the approaches should help shed light on which approach suits your requirements.
If your requirements are driven by both the need for efficient storage and the need to maintain the flexibility of XML, consider those solutions that focus on XML's natural data schema. Although the specifications for dealing with XML are still in draft form, the characteristics of XML and the need to efficiently store, traverse, and access XML content point to a tree structure as the natural form for XML storage. This is reinforced in part by the DOM representation of XML, the way current tools interact with XML, and the relationship of elements and subelements in a DTD. If this is true, the evolution of an XML data store will have many similarities to systems that manage and traverse information in tree data structures. By focusing on this approach you'll be able to minimize the impact of changing specifications on your final solution.
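As a small illustration of treating XML as a tree, the following Python sketch walks a parsed document depth-first and records the path to every leaf element. The sample order document and the element_paths helper are hypothetical; a tree-oriented store would keep a comparable structure persistently rather than rebuilding it from text.

```python
import xml.etree.ElementTree as ET


def element_paths(element, prefix=""):
    """Depth-first walk yielding (path, text) for every leaf element."""
    path = f"{prefix}/{element.tag}"
    children = list(element)
    if not children:
        yield path, (element.text or "").strip()
    for child in children:
        yield from element_paths(child, path)


# Hypothetical document with nested elements.
DOC = ET.fromstring(
    "<order><customer><name>Acme</name></customer><total>120</total></order>"
)
paths = list(element_paths(DOC))
print(paths)
```

New or optional elements simply show up as additional branches in the walk, so nothing in the traversal has to change when the document structure is extended.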
In Part 3 of this series I'll discuss the semantic issues around resolving content across multiple XML transactions. Although XML defines the translation definition associated with a given transaction, there's no way to enforce the business context associated with the data within the transaction. Namespaces, numeric values, and timestamps all create context-specific issues when looking across multiple transactions or business entities. If you'd like to discuss a particular aspect of this or any other topic, please feel free to e-mail me at kpatel@tilion.com.