XML Protocols Can One Size Fit All?
By: Dare Obasanjo
Oct. 3, 2003 12:00 AM
Traditionally, APIs for processing XML have been categorized according to whether they're designed for processing entire XML documents loaded in memory, such as the W3C DOM, or for processing XML in a streaming, forward-only fashion, such as SAX. However, these divisions do not fully represent the various classes of APIs for processing XML.
In a recent article entitled "A Survey of APIs and Techniques for Processing XML," I describe six primary methodologies for processing XML. These six methodologies show that the range of considerations when choosing an API or technique for processing XML extends beyond forward-only access over XML streams versus random access over XML documents stored in memory. Other considerations include whether the XML being processed represents semi-structured documents or rigidly structured data, whether the XML is considered to be strongly or weakly typed, and how easy the API is to use. The purpose of this article is to explore whether a single API could be designed that satisfies the various needs that warrant the existence of six different categories of technologies for processing XML.
Rigidly Structured Data and Semi-Structured Documents
Software applications are usually the primary consumers of XML documents that represent rigidly structured data. The content of such documents is meant primarily for machine processing, while the markup that labels it is targeted at human readers. XML configuration files, log files, and relational database dumps are examples of rigidly structured data. The markup in these documents is mainly of use to human readers who are either editing or debugging an XML application. Such XML documents typically comprise elements and attributes where only the deepest subelements - the leaf nodes - contain character data. Although XML considers the order of elements to be significant, the order of sibling elements in such documents is often not important to the semantics of the document (e.g., the order of the rows in a database dump is often not significant). The following is an example of an XML document representing rigidly structured data:
<items>
Human readers are usually the primary consumers of semi-structured XML documents. In this case, the XML markup assists software applications to process the data. Web pages and business documents are examples of semi-structured documents that are meant primarily for human consumption. Their markup is mainly of use to programs that are processing or displaying the information within the documents. Such XML documents typically comprise elements and attributes where character data appears alongside subelements, and character data is not confined to the leaf nodes. The interleaving of character data with subelements is often described as mixed content. The order of elements in semi-structured documents is often significant (e.g., the order of chapter elements within a book element matters). Features such as entities, processing instructions, and comments are more likely to be used in semi-structured documents to aid authors and readers of the XML. The following is an example of a typical semi-structured XML document:
<p xmlns="http://www.w3.org/1999/xhtml/"> If customer is not
In reality, many uses of XML fall somewhere in the middle, where there is an island of rigid structure within a semi-structured document or an area with "open content" in a rigidly structured document. XML easily accommodates these scenarios because the two models of document are not mutually exclusive.
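The distinction between the two kinds of document can be checked mechanically. The following is a minimal sketch in Python rather than one of the article's .NET listings; the two sample documents and the function name has_mixed_content are invented for illustration and are not reconstructions of the article's examples. It reports whether any element carries character data alongside child elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents, written for this sketch only.
RIGID = """<items>
  <compact-disc>
    <artist>Example Artist</artist>
    <title>Example Title</title>
  </compact-disc>
</items>"""

MIXED = "<p>Ship the order <em>today</em> if stock allows.</p>"


def has_mixed_content(root):
    """Return True if any element holds non-whitespace text next to child elements."""
    for elem in root.iter():
        children = list(elem)
        if not children:
            continue  # leaf node: character data here is not mixed content
        # Text before the first child, plus the text that follows each child.
        texts = [elem.text] + [child.tail for child in children]
        if any(t and t.strip() for t in texts):
            return True
    return False


rigid_is_mixed = has_mixed_content(ET.fromstring(RIGID))
mixed_is_mixed = has_mixed_content(ET.fromstring(MIXED))
print(rigid_is_mixed, mixed_is_mixed)
```

A document with an island of rigid structure inside prose would be reported as mixed by this check, which is consistent with the point that real documents often sit between the two extremes.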
The Relationship Between Data Typing and XML Usage Patterns
Consumers of XML documents that contain rigidly structured data often want to consume the documents as strongly typed XML. Specifically, such applications tend to map the elements, attributes, and character data within the XML document to programming language primitives and data structures so that they can better perform operations on them. This mapping is usually done using either an XML Schema or a mapping language. Listing 1 is an example of a W3C XML Schema document that describes the strongly typed view of the XML document.
Consumers of semi-structured XML documents typically want to consume the documents as weakly typed or untyped content presented as an XML data model. In such cases XML APIs that emphasize an XML-centric data model, such as DOM and SAX, are used to process the document. An XML-centric view of such semi-structured documents is preferable to an object-centric view because such documents typically use features peculiar to XML, such as mixed content and processing instructions, and because the order of elements within the document is significant.
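To make the strongly typed view concrete, here is a rough sketch of mapping rigidly structured XML onto language-level data structures. It is written in Python by hand rather than generated from an XML Schema, and the sample document, the CompactDisc class, and the price element are assumptions made for this example, not the contents of Listing 1.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical sample document for this sketch.
SAMPLE = """<items>
  <compact-disc>
    <price>16.95</price>
    <artist>Example Artist</artist>
    <title>Example Title</title>
  </compact-disc>
</items>"""


@dataclass
class CompactDisc:
    price: Decimal  # character data converted to a numeric type
    artist: str
    title: str


def load_discs(xml_text):
    """Map each compact-disc element to a typed CompactDisc object."""
    root = ET.fromstring(xml_text)
    return [
        CompactDisc(
            price=Decimal(cd.findtext("price")),
            artist=cd.findtext("artist"),
            title=cd.findtext("title"),
        )
        for cd in root.findall("compact-disc")
    ]


discs = load_discs(SAMPLE)
print(discs[0].price + Decimal("1.00"))
```

Once the data is in typed objects, arithmetic on the price works directly, which is the kind of operation the untyped, XML-centric view makes awkward.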
Choosing a Data Model
There are already a number of existing abstractions for XML documents, including the XML DOM, the XML infoset, and the XPath 1.0 data model. Of these the XPath 1.0 data model best meets the requirements set forth in the previous paragraph. The XPath 1.0 data model provides a simple and consistent view of an XML document that is loosely coupled to the text-based nature of the XML 1.0 recommendation. The fact that the XPath 1.0 data model ignores certain aspects of the XML 1.0 recommendation makes it easier to map other domain models to the XPath 1.0 data model. For instance, information such as which quotation characters are used in an attribute or whether character data was directly entered or represented as an entity is not directly exposed in the XPath 1.0 data model. Thus, when exposing a relational database, file system, or in-memory object graph as XML it is easier to do so if the API doesn't require you to expose information that is only pertinent to XML text documents. Examples of such "virtual XML" views of relational and object-oriented data based on the XPath 1.0 data model are Microsoft's SQLXML and the ObjectXPathNavigator on MSDN, respectively.
There is one limitation of the XPath 1.0 data model that makes it less than ideal for use in representing rigidly structured data within XML documents: the lack of support for strong typing. The XQuery and XPath 2.0 data model is the next iteration of the XPath 1.0 data model. The data model is the XPath 1.0 data model with the addition that the data types associated with elements and attributes can be identified using an expanded name (i.e., the xs:QName type). The ability to identify the data type of nodes via the namespace URI and a local name (i.e., an expanded name) provides a loosely coupled mechanism for supporting W3C XML Schema data types and potentially any other type system in which individual types can be identified by an expanded name.
Thus, we have arrived at the XQuery/XPath 2.0 data model as the data model suitable for an API that is meant for processing both rigidly structured XML data and semi-structured XML documents.
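The idea of identifying a node's type by an expanded name can be illustrated with a small sketch. This is not the XQuery/XPath 2.0 data model itself or any real library's API; the TypedNode class, the CONVERTERS registry, and the choice to treat a missing type as plain text are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import Callable, Dict, List, Optional, Tuple

ExpandedName = Tuple[str, str]  # (namespace URI, local name)
XSD_NS = "http://www.w3.org/2001/XMLSchema"


@dataclass
class TypedNode:
    """Hypothetical node: a name, a string value, and an optional type annotation."""
    name: ExpandedName
    value: str = ""
    type_name: Optional[ExpandedName] = None  # None means no type is known
    children: List["TypedNode"] = field(default_factory=list)


# Registry keyed by expanded name, so types from any namespace can be added.
CONVERTERS: Dict[ExpandedName, Callable[[str], object]] = {
    (XSD_NS, "decimal"): Decimal,
    (XSD_NS, "int"): int,
    (XSD_NS, "string"): str,
}


def typed_value(node):
    """Convert the node's text using its type annotation; fall back to the raw string."""
    convert = CONVERTERS.get(node.type_name, str)
    return convert(node.value)


price = TypedNode(("", "price"), "16.95", (XSD_NS, "decimal"))
note = TypedNode(("", "note"), "call back")
print(typed_value(price), typed_value(note))
```

Because the registry only cares about the (namespace URI, local name) pair, the node structure stays loosely coupled to any particular type system, which is the property the data model discussion relies on.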
A Single Model for Forward-Only Access to XML
Listing 2 is an example of using the pull-based XmlReader class in the .NET Framework to obtain the artist name and title of the first compact disc in an items element (Listings 2-7 can be found at www.sys-con/xml/sourcec.cfm). In the same article I pointed out that on close inspection a pull-model parser is a cursor that happens to be restricted to being able only to move forward and not back. Listing 3 is an example that utilizes the .NET Framework's XPathNavigator class to obtain the artist name and title of the first compact disc in an items element.
From the examples in Listings 2 and 3 it doesn't seem that there is much difference between accessing the contents of an XML document using a cursor-model API and using a pull-based API. But looks can be deceiving. In simple cases there is not much difference between the two, but it does get slightly more difficult in complex cases.
The primary programming idiom when using a pull-based parser is to create a loop that continually reads from the XML document until the end of the document is reached and to act solely upon items of interest as they are seen. The same effect can be achieved using a traditional cursor-model API as shown in Listing 4. The output of both DumpTree() methods when passed one of the XML fragments from earlier in the article is shown in Listing 5. From Listing 5 it can be seen that a cursor-model API can be used to walk all the nodes in an XML document in document order in much the same way as a push- or pull-based API.
However, in the example above, the code using the .NET Framework's cursor-based XPathNavigator class is more cumbersome than equivalent code using the XmlReader class. This has less to do with the nature of cursor-based APIs and more to do with the fact that the XPathNavigator class does not have helper methods that make it friendly towards traversing nodes in document order.
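The pull-based idiom described above, a loop that reads forward and acts only on items of interest, can be sketched as follows. This is an analogue using Python's xml.dom.pulldom module, not the XmlReader code from Listing 2, and the sample document is hypothetical.

```python
from xml.dom import pulldom

# Hypothetical sample document for this sketch.
SAMPLE = """<items>
  <compact-disc>
    <price>16.95</price>
    <artist>Example Artist</artist>
    <title>Example Title</title>
  </compact-disc>
  <compact-disc>
    <price>17.55</price>
    <artist>Another Artist</artist>
    <title>Another Title</title>
  </compact-disc>
</items>"""


def first_disc_pull(xml_text):
    """Read events forward-only and stop once the first compact-disc has ended."""
    wanted = {"artist", "title"}
    found = {}
    current = None  # name of the wanted element currently being read, if any
    for event, node in pulldom.parseString(xml_text):
        if event == pulldom.START_ELEMENT and node.tagName in wanted:
            current = node.tagName
            found[current] = ""
        elif event == pulldom.CHARACTERS and current is not None:
            found[current] += node.data
        elif event == pulldom.END_ELEMENT:
            if node.tagName == current:
                current = None
            elif node.tagName == "compact-disc":
                break  # the first disc is complete; nothing later is read
    return found.get("artist"), found.get("title")


first = first_disc_pull(SAMPLE)
print(first)
```

The loop never moves backward and never holds more than the current event, which is the restriction that makes a pull parser look like a forward-only cursor.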
To make the .NET Framework's XPathNavigator more suitable as an API for pull-based processing of XML, you could introduce a base class called ForwardOnlyPathNavigator, which would possess only the forward-only access methods from the XPathNavigator and possibly an additional MoveToNextInDocumentOrder() method that would make it equivalent to most pull-based APIs. This would then unify the streaming, forward-only access model for XML documents with the cursor model.
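As a rough illustration of that proposal, the sketch below shows a cursor that exposes only forward movement in document order. It is a Python analogue over an in-memory ElementTree, not the .NET class; the ForwardOnlyNavigator name, its methods, and the sample document are hypothetical, and a real streaming implementation would read from the parser instead of a loaded tree.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document for this sketch.
SAMPLE = """<items>
  <compact-disc>
    <artist>Example Artist</artist>
    <title>Example Title</title>
  </compact-disc>
</items>"""


class ForwardOnlyNavigator:
    """Hypothetical cursor that can only advance in document order."""

    def __init__(self, root):
        self._nodes = root.iter()  # yields elements depth-first, in document order
        self.current = None

    def move_to_next_in_document_order(self):
        """Advance the cursor; return False once the document is exhausted."""
        try:
            self.current = next(self._nodes)
            return True
        except StopIteration:
            self.current = None
            return False

    @property
    def name(self):
        return self.current.tag

    @property
    def value(self):
        return (self.current.text or "").strip()


def dump_tree(xml_text):
    """Walk every element with the cursor, the way a pull loop would."""
    nav = ForwardOnlyNavigator(ET.fromstring(xml_text))
    seen = []
    while nav.move_to_next_in_document_order():
        seen.append((nav.name, nav.value))
    return seen


walk = dump_tree(SAMPLE)
print(walk)
```

With a single "move to next" method the calling code has the same shape as a pull-parser loop, which is the unification the paragraph above argues for.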
A Single Model for Random Access to XML
Listing 6 is an example of using the XmlDocument class in the .NET Framework to obtain the artist name and title of the first compact disc in an items element. Listing 6 shows a common idiom when accessing XML through a tree-model API: the nodes of interest are requested through a query mechanism, then processed as needed. A similar usage pattern is evident in cursor-based APIs, as Listing 7, which uses the .NET Framework's XPathNavigator class, shows. From these examples it can be seen that a cursor-model API is a satisfactory access mechanism for processing XML documents in memory.
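The query-then-process idiom looks roughly like this in Python's ElementTree, which supports a limited subset of XPath. It is an analogue of the task in Listings 6 and 7, not a translation of them, and the sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document for this sketch.
SAMPLE = """<items>
  <compact-disc>
    <artist>Example Artist</artist>
    <title>Example Title</title>
  </compact-disc>
  <compact-disc>
    <artist>Another Artist</artist>
    <title>Another Title</title>
  </compact-disc>
</items>"""


def first_disc_query(xml_text):
    """Load the whole document, then ask for the nodes of interest by path."""
    root = ET.fromstring(xml_text)
    disc = root.find("compact-disc")  # find() returns the first match or None
    if disc is None:
        return None
    return disc.findtext("artist"), disc.findtext("title")


queried = first_disc_query(SAMPLE)
print(queried)
```

Unlike the forward-only loop, the code states what it wants rather than how to reach it, at the cost of holding the whole document in memory.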
Strongly Typed XML
You would then be justified in expecting that such consumers would be uninterested in XML APIs. However, this is not the case. A number of people have seen the growing benefits of being able to access objects as XML infosets when necessary since it gives them access to a wide range of technologies for processing XML such as rich queries using XPath (this is the basis of the ObjectXPathNavigator described in an Extreme XML column on MSDN). In such cases, a cursor-model API that provides an XML view of an object graph turns out to be quite beneficial. This approach was taken by the aforementioned ObjectXPathNavigator as well as BEA's XML Beans technology.
In some cases this means the ability to nest cursors is important. For instance, many XML Schemas written using the W3C XML Schema Definition Language (XSD) use the wildcards (xs:any and xs:anyAttribute) to enable extensibility of the XML messages being sent. This often leads to some parts of the document being strongly typed while others are untyped. The XmlSerializer in the .NET Framework maps such untyped content to one or more instances of the XmlNode class. Given that an instance of XmlNode can itself provide a cursor over its contents, the cursor over an object that contains one or more XmlNode objects as fields or properties needs to know how to handle nested items that provide their own XML cursors.
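The benefit of an XML view over an object graph can be shown with a simplified sketch. It is not how ObjectXPathNavigator or XML Beans work internally: it copies the objects into an ElementTree instead of navigating them in place, and the Album and Track classes and the to_element helper are invented for this example.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, fields, is_dataclass
from typing import List


@dataclass
class Track:
    title: str
    seconds: int


@dataclass
class Album:
    artist: str
    tracks: List[Track]


def to_element(name, obj):
    """Build an element tree that mirrors a graph of dataclass objects."""
    elem = ET.Element(name)
    if is_dataclass(obj):
        for f in fields(obj):
            elem.append(to_element(f.name, getattr(obj, f.name)))
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            # Name each list item after its class, e.g. Track becomes "track".
            elem.append(to_element(type(item).__name__.lower(), item))
    else:
        elem.text = str(obj)
    return elem


album = Album("Example Artist", [Track("First", 201), Track("Second", 187)])
view = to_element("album", album)

# Query the object graph with a path expression instead of hand-written loops.
titles = [t.text for t in view.findall("tracks/track/title")]
print(titles)
```

A cursor-model API could avoid the copy by walking the objects directly, and a nested cursor would be needed wherever a field already holds XML content of its own.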
Conclusion
The primary criticism of such an API is that it would be a compromise across widely differing usage scenarios and thus may not be optimized for the specifics of a given scenario. If such an API were designed and intended to replace existing models for processing XML, it would have to be carefully designed to offer neither too little functionality, by focusing on the lowest common denominator, nor too much, by trying to include all the functionality of existing API models and thus becoming bloated and difficult to use. It will be interesting to see where the future takes us.