Storing And Retrieving XML Content

At the outset XML separated the data from the metadata. This structural separation was intentional - it simultaneously allowed XML to be the logical evolution of a document, a new transaction medium, and the conversation engine that connects applications. Deciding how your application will use XML has implications you must consider when choosing a storage strategy. This article examines the storage and retrieval of XML, and the scalability and performance implications of the different approaches required when storing XML as a document, resolving it to a database, or maintaining its native format.

The way XML is used influences which storage and retrieval technology is appropriate for your application. These decisions, in turn, have an impact on scalability and performance. When XML is used as a mechanism for information reuse and formatting, it's philosophically a document; when used to exchange tagged data between systems, it's a transaction; and when used to convert and map data between two systems, it's a data conversion engine. These three different usage patterns demand significantly different approaches to data storage and management. Before describing how these usage patterns impact data storage and retrieval, it's important to understand the core strengths of XML - flexibility, extensibility, ease of use, and platform independence - and how these strengths have contributed to such substantially different usage patterns.

One of XML's primary benefits is its ability to separate the structure and tags from the content. This separation has spawned a number of evolutionary paths that are visible in the various XML standards initiatives currently under development. These initiatives represent usage patterns that present fundamental biases about how XML is used and how it should evolve. The biases can be categorized into three different philosophies:

  1. XML is about documents, reuse of information, and adaptable presentation: This means you must treat the XML as a whole unit of information. The collection of text, tags, and presentation are a bound set, and storing the information in its original format is a core requirement. Any modifications to the original format can potentially corrupt the information and the intent of the writer of the document.
  2. XML is the integration glue between databases and applications: It's a translation medium between different data representations. The evolutionary focus created by this usage is to extend XML and enhance its role as the quintessential translation medium.
  3. XML is a standard that facilitates the communication of data: Initially, existing tools, applications, and services will adopt XML as a gateway. However, the effort is to evolve XML so applications and services can be built from the bottom up to fully utilize the communications capabilities inherent in XML, and expand its use into new problem sets.

In this article we explore the impact each philosophy has on the storage and retrieval of XML content and shed further insight into which approach is best suited for your application.

Core Components of an XML Document
To discuss the storage impact of XML it's important to understand its core components. As a foundation for this discussion I've taken pertinent definitions directly from the XML 1.0 Specification on the W3C's Web site to ensure we start with the basic facts rather than the hype surrounding XML.

Definition of an XML Document
XML documents are made up of storage units called entities, which contain either parsed or unparsed data. Parsed data is made up of characters, some of which form character data, and some of which form markup. Markup encodes a description of the document's storage layout and logical structure, and XML provides a mechanism to impose constraints on the storage layout and logical structure.

Definition of DTD
The XML document type declaration contains or points to markup declarations that provide a grammar for a class of documents. This grammar is known as a document type definition, or DTD. The document type declaration can point to an external subset (a special kind of external entity) containing markup declarations, can contain the markup declarations directly in an internal subset, or can do both. The DTD for a document consists of both subsets taken together.

When considering how to store an XML document, it's important to understand the general role of the XML DTD and how it will likely be implemented within your specific application. A valid XML document must have a DTD (a document that is merely well formed need not). The DTD can be contained within the XML document itself, in which case the document can be standalone, or the document can reference an external DTD (referred to above as an external subset). Either way this DTD presents challenges from a storage perspective.
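
As a minimal sketch of the internal-subset case, the following Python snippet parses a document that carries its markup declarations inline. The note document, its element names, and the company entity are made up for illustration. Python's standard xml.etree.ElementTree parser reads the internal subset, so the entity is expanded, but it does not validate the document against the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: element names and the entity are illustrative only.
STANDALONE_NOTE = """<?xml version="1.0" standalone="yes"?>
<!DOCTYPE note [
  <!ELEMENT note (to, body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
  <!ENTITY company "Example Corp">
]>
<note><to>&company;</to><body>Hello</body></note>"""


def read_recipient(xml_text: str) -> str:
    """Parse the document and return the text of its <to> element."""
    root = ET.fromstring(xml_text)
    return root.findtext("to")
```

Because the declarations travel with the document, whatever stores this text also stores the DTD, which is the storage cost discussed later in the article.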

Indexing - The Impact on Storage and Retrieval
When considering an approach to storing XML content, the solution must be able to adapt to the indexing and query requirements of the application. For example, some applications need to access the entire XML document, while others may need to access only a specific data element within the XML document. More likely the application will need a combination of the two.
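
The two access patterns can be sketched in Python with the standard xml.etree.ElementTree module. The order document and the function names below are hypothetical; the point is only the contrast between returning the whole document and pulling out individual data elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical order document used only for illustration.
ORDER_XML = """<order id="1001">
  <customer><name>Ada Lovelace</name></customer>
  <item sku="A-17" qty="2">Widget</item>
  <item sku="B-04" qty="1">Gadget</item>
</order>"""


def whole_document(xml_text: str) -> str:
    """Whole-document access: parse, then serialize the complete tree."""
    root = ET.fromstring(xml_text)
    return ET.tostring(root, encoding="unicode")


def customer_name(xml_text: str) -> str:
    """Element-level access: return a single data element."""
    root = ET.fromstring(xml_text)
    return root.findtext("customer/name")


def item_skus(xml_text: str) -> list:
    """Element-level access: collect one attribute from repeated elements."""
    root = ET.fromstring(xml_text)
    return [item.get("sku") for item in root.findall("item")]
```

An application that mixes both patterns needs a store that serves the full text cheaply and can also reach individual elements without re-parsing every document.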

The index structure for a particular application defines the level of optimization achieved between the storage subsystem and the application. An index has pointers into the content store for quick access to the stored information. The method for creating these pointers varies significantly as does the read performance of the different index types and system costs needed to create the index. Clearly, the indexing and storage technology chosen must be efficient for the application accessing the XML objects. In addition, you must consider if all the information in the XML should be stored. For example, XML may include image files; if these aren't needed by the application, it may be wise to strip the images out prior to storage.

Storing XML As a Document
The document approach is based on the premise that the XML document and its accompanying information should be treated as a single entity, perhaps by referencing the object as a file name. By treating the XML along with all the DTD and XSL supporting files as a single entity, the application can resolve the document file name and maintain the references' integrity to any DTD and XSL for this specific instance. Let's review this in a practical example.

Assume the DTDs and XSL style-sheets for a series of XML documents are extremely similar in content - in fact, assume that 90% of the information stored will be duplicate information (a situation common when using standard forms such as legal documents and contracts). Regardless of whether the DTD was internal or external, it's necessary to keep a copy of it for each associated XML document, even though the same DTD is replicated in many of the XML documents within the same data store. This consumes additional storage space over and above the amount incurred by having the transaction in XML (see my article in last month's issue of XML-J, [Vol. 1, issue 7]). The structure stored in the document format is inherently inefficient since all the structures are text-based. Therefore, storing all the XML components in a single document significantly reduces the ability to optimize the storage subsystem.
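
The duplication cost can be estimated with simple arithmetic. The Python sketch below assumes a made-up contract DTD that is embedded, byte for byte, at the start of every document; real documents would differ, and the figures it produces are only as meaningful as that assumption.

```python
# Hypothetical DTD, assumed to be repeated verbatim in every stored document.
DTD = """<!DOCTYPE contract [
  <!ELEMENT contract (party+, clause+)>
  <!ELEMENT party (#PCDATA)>
  <!ELEMENT clause (#PCDATA)>
]>
"""


def make_document(party: str, clause: str) -> str:
    """Build a document that embeds its own copy of the DTD."""
    body = f"<contract><party>{party}</party><clause>{clause}</clause></contract>"
    return DTD + body


def storage_report(documents: list) -> dict:
    """Compare bytes stored now with bytes needed if the DTD were kept once."""
    dtd_size = len(DTD.encode("utf-8"))
    total = sum(len(doc.encode("utf-8")) for doc in documents)
    duplicated = dtd_size * len(documents)
    return {
        "total_bytes": total,
        "dtd_bytes": duplicated,
        "bytes_if_dtd_stored_once": total - duplicated + dtd_size,
    }
```

For a store of N such documents, the saving from keeping the DTD once is (N - 1) times the DTD size, which is why the document approach leaves little room to optimize storage.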

In the document approach the only index values are those created at the time of filing (such as the file name). These values enable fast access to the entire document and its contents and are typically extracted from the document at the time the document is filed into the storage system. A number of issues with this process affect scalability and performance, including the fact that the indexing process occurs at the time the document is stored. This means if additional or different indexes are needed later, the entire data store must be traversed, and all the documents parsed and read to create the new index. In addition, the extraction process is usually quite CPU and memory intensive since parsing is inefficient and, as a result, encourages only the creation of critical lookup indexes. Finally, indexing on DTDs or changes in them would be next to impossible, further limiting the type of lookup indexes that can be created against the XML documents.
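
The re-indexing cost described above can be illustrated with a toy in-memory store in Python. The DocumentStore class, its method names, and the sample paths are assumptions made for this sketch, not an existing product or API. It counts parses to show that an index added after filing forces every stored document to be read and parsed again.

```python
import xml.etree.ElementTree as ET


class DocumentStore:
    """Toy store: whole documents keyed by file name, indexed at filing time."""

    def __init__(self, index_paths):
        self.documents = {}                      # file name -> raw XML text
        self.index_paths = list(index_paths)     # element paths indexed so far
        self.indexes = {path: {} for path in self.index_paths}
        self.parse_count = 0                     # how many parses were needed

    def _index_document(self, name, paths):
        root = ET.fromstring(self.documents[name])
        self.parse_count += 1
        for path in paths:
            value = root.findtext(path)
            if value is not None:
                self.indexes[path].setdefault(value, set()).add(name)

    def file(self, name, xml_text):
        """Store the document and extract index values in the same step."""
        self.documents[name] = xml_text
        self._index_document(name, self.index_paths)

    def add_index(self, path):
        """Adding an index later means traversing and parsing everything."""
        self.index_paths.append(path)
        self.indexes[path] = {}
        for name in self.documents:
            self._index_document(name, [path])

    def lookup(self, path, value):
        return sorted(self.indexes[path].get(value, ()))
```

With two filed documents, calling add_index once raises parse_count from 2 to 4; on a large store the same step becomes a full scan.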

Generally, a lookup index limits the ability to retrieve a result set based on an open-ended filter against the content. That is, an index structure doesn't typically support the ability to retrieve all documents within a specific date range unless the date was part of the index. As a result, an index speeds access to XML objects where the access criteria are well understood in advance. A major benefit of XML is that its structure can be easily changed and extended, and the decision to have an index based on specific data elements restricts the use of this characteristic. The price for higher access performance and scalability is reduced flexibility.

Storing XML As a Database
Another approach to storing XML content is based on the premise that XML is the communication medium for transactional data and that the pertinent data required for storage should be transformed from XML into an appropriate Database Management System (DBMS). The challenge of applying DBMS technology to XML is that a database schema applies some rigidity, and any rigidity in the database schema limits the key features of XML - the ability to extend itself to accept new data elements and data structures and to allow changes to current data elements and structures.

While the approach of taking only the information that the DBMS can accommodate appears perfectly reasonable, it removes the extensibility benefit that's a core component of XML. Since XML DTDs identify the required data versus the optional and can be easily extended to support new data elements, building a structured database against one instance of a DTD limits flexibility. If a mapping function is implemented that maps XML to the database schema, this schema can't adapt to additional information from optional or new content in subsequent instances. If you're mapping content from multiple XML documents, there will clearly be a performance impact associated with maintaining these diverse format maps and performing the associated conversions. Also, converting or ignoring data elements means running the risk of losing that data and/or original syntax. Worse, if the syntax of the data element is changed even slightly, it impacts future information needs. While an often-discussed approach is to store the unused or unexpected information in a comment or memo field, this action eliminates from the database schema the availability of the data contained in the memo field.
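
A minimal Python sketch of such a mapping function, using the standard sqlite3 and xml.etree.ElementTree modules, shows where the rigidity comes from. The orders table, its columns, and the giftwrap element are hypothetical. Elements the schema does not know about are pushed into a memo column, where they survive as text but are no longer available as columns of their own.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical fixed schema: only these child elements have columns.
MAPPED_COLUMNS = ("customer", "total")


def create_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, total REAL, memo TEXT)")
    return conn


def load_order(conn: sqlite3.Connection, xml_text: str) -> list:
    """Map one order into the table; return the tags the schema could not hold."""
    root = ET.fromstring(xml_text)
    customer = root.findtext("customer")
    total = float(root.findtext("total"))
    unmapped = [child for child in root if child.tag not in MAPPED_COLUMNS]
    # Unexpected elements are kept only as serialized text in the memo field.
    memo = "".join(ET.tostring(c, encoding="unicode").strip() for c in unmapped)
    conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (customer, total, memo))
    return [child.tag for child in unmapped]
```

Supporting the new element properly would mean changing the table and the mapping code, which is the flexibility cost the database approach carries.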

In the database approach, the index values are mapped to the tables being populated. As information is transformed into the database, indexes are generated for quick access to the content. These indexes can be re-created as tables and searched as criteria evolve. However, these indexes are limited to the content maintained in the tables while the original XML content is no longer available for indexing. While this approach provides incredible flexibility in tuning the performance of the storage system and is by far the most mature approach, it also restricts the solution from leveraging XML's extensibility and flexibility capabilities.

Storing XML in Its Natural Format
In the near future, storage systems will be able to store XML in its natural format and manage the DTD and XSL data representation of the content. The tools used for this approach will maintain the flexible nature of XML while also providing the optimization needed for storage and retrieval. These tools will represent the context and relationships inherent in the tags, yet find an efficient mechanism to store the elements. While the technology that accomplishes these goals exists today, it's not broadly applicable yet because specifications for accessing the XML data within these systems are still not finalized or widely adopted. The solutions based on these technologies are subject to revision until a standard access mechanism evolves.

One example of a potential access mechanism standard is the XML Query Language (XQL) specification that enhances the data model for XML documents and provides a set of query operators on that data model. In addition, XQL has a query language that's similar to SQL in its ability to access information in databases. Unlike SQL, XQL is limited to operating on single documents or a fixed collection of documents. It can select whole documents or subtrees of documents that match conditions defined on document content and structure, and then construct new documents based on the resulting set. The XQL draft specification outlines the use of this query language for retrieving content from various types of XML documents, many of which are central to the storage problem.
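
The select-and-construct pattern can be approximated in Python. This sketch does not use XQL; it relies on the limited XPath subset supported by the standard xml.etree.ElementTree module as a stand-in, and the catalog document and function names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog used only for illustration.
CATALOG = """<catalog>
  <book year="1999"><title>XML Basics</title><author>Lee</author></book>
  <book year="2000"><title>Storing XML</title><author>Kim</author></book>
  <book year="2000"><title>Query Languages</title><author>Lee</author></book>
</catalog>"""


def select_books(xml_text: str, year: str) -> list:
    """Select the subtrees whose year attribute matches the condition."""
    root = ET.fromstring(xml_text)
    return root.findall(f"book[@year='{year}']")


def build_result(books: list) -> ET.Element:
    """Construct a new document from the selected subtrees (shared, not copied)."""
    result = ET.Element("result")
    for book in books:
        result.append(book)
    return result


def titles(books: list) -> list:
    return [book.findtext("title") for book in books]
```

The query operates on a single parsed document, which matches the restriction noted above that such queries work on one document or a fixed collection.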

There's some speculation as to whether or not these XML-specific solutions are really available today. Current solutions in the market tend to use object database technology to manage the storage and retrieval of XML. Proponents argue that object databases provide the answer to the XML storage and access problem since they leverage the strengths of XML and don't try to force fit them into a prestructured solution.

While an interesting premise (an XML database may be founded on this underlying object technology), two core tenets of object technology - inheritance and polymorphism - are not supported by XML; as a result, access of XML content in an object database is still inefficient.

From a storage perspective this representation of XML in an object model further increases the size of the content by a factor of three or more. In addition, these object-based solutions require significant system resources and haven't been able to scale to the level of relational database management systems. Even so, these solutions do offer some advantages for specific XML implementations, so they have an important place in the market.

Conclusion
As you can see there isn't one right answer for storing and retrieving XML data. Several companies are creating XML storage solutions built from the ground up that promise major performance benefits. Given that most of the standards for querying and indexing into XML are still works in progress, it'll be some time before tools built on stable specifications are available. The draft specifications for XQL and XPath show encouraging signs that significant progress is being made, but most of these efforts are still early works and represent the bleeding edge.

What should you do to implement a solution today? The first step is to clearly define the problem you need to solve with XML and identify which, if any, of the above philosophies applies. Without this understanding it's impossible to consider the strengths and weaknesses of your storage options - storing XML as a document, resolving it to a database, or implementing a solution more specific to your unique needs. Next, you must identify the important XML characteristics of your application to ensure that the chosen storage technology doesn't impact the XML characteristic your application needs most. Is the primary focus human readability or data translation? Is the nature inherently fluid or rigid? How critical is the content to your business in its original format? Considering the solution and characteristics along with the relative strengths and weaknesses of the approaches should help shed light on which approach suits your requirements.

If your requirements are driven by both the need for efficient storage and maintaining the flexibility of XML, consider those solutions that focus on XML's natural data schema. Although the specifications for dealing with XML are still in draft form, the characteristics of XML and the need to efficiently store, traverse, and access XML content point to a tree structure as the natural form for XML storage. Many proof points support the idea that a tree structure is the natural form for XML. This is reinforced in part by the DOM representation of XML, the way current tools interact with XML, and the relationship of elements and subelements in a DTD. If this is true, the evolution of an XML data store will have many similarities to systems that manage and traverse information in tree data structures. By focusing on this approach you'll be able to minimize the impact of changing specifications on your final solution.
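
The tree view of XML can be made concrete with a short Python sketch. It uses the standard xml.etree.ElementTree module rather than a DOM implementation, and the function names are illustrative. A depth-first walk visits each element together with its nesting depth, which is the kind of traversal a tree-oriented store would need to support efficiently.

```python
import xml.etree.ElementTree as ET


def walk(element, depth=0):
    """Depth-first traversal yielding (depth, tag) for every element."""
    yield depth, element.tag
    for child in element:
        yield from walk(child, depth + 1)


def outline(xml_text: str) -> list:
    """Return the document's elements in document order with their depths."""
    return list(walk(ET.fromstring(xml_text)))


def max_depth(xml_text: str) -> int:
    """Depth of the deepest element, counting the root as depth 0."""
    return max(depth for depth, _tag in outline(xml_text))
```

For the document `<a><b><c/></b><d/></a>`, outline returns the elements a, b, c, d at depths 0, 1, 2, 1.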

In Part 3 of this series I'll discuss the semantic issues around resolving content across multiple XML transactions. Although XML defines the translation definition associated with a given transaction, there's no way to enforce the business context associated with the data within the transaction. The use of namespace, numeric values, and timestamps all create context-specific issues when looking across multiple transactions or business entities. If you'd like to discuss a particular aspect of this or any other topic, please feel free to e-mail me at kpatel@tilion.com.
