Manipulating And Storing XML

XML is a metalanguage that's used to describe a language grammar. Documents that comply with a grammar formulated in XML use tags to distinguish between the actual content and its semantically relevant markup. XML defines abstract semantics in contrast to operational ones (which can be derived from them). An example of the latter is the presentation-oriented HTML.

XML documents are well suited for publication in different media such as books, CDs and, of course, the Internet. A second and, at present, more important usage is applying XML to Enterprise Application Integration (EAI): messages between applications running different tasks (sometimes even in different companies) are encoded using XML.

XML, a standard for data exchange and document markup, is infusing the discussion about appropriate storage mechanisms with new momentum. After all, manipulating XML in main memory isn't sufficient; you'll want to preserve your XML instances over time.

This article uses some Java programming examples to explore different methods of dealing with persistent XML in the form of files, and explains what considerations are relevant when choosing a database for XML storage.

An Overview
Developed for use on the Internet, XML is a simple standard that sets itself apart from the complex Standard Generalized Markup Language (SGML) ISO standard. Unlike HTML, it can be easily evaluated by programs as well as by people. XML's wide acceptance in the industry, coupled with its interplay with Java, guarantees it an important role in the development of cross-platform solutions.

Currently XML is used primarily in the areas of messaging and content management. While the content management field has historically been occupied by SGML solutions, XML is being compared to technologies such as EDI and CORBA when it comes to message transmission. The first case involves using XML's advantages in the transmission of corporate data; the second places the interprocess communication of application solutions at the forefront. However, it would be wrong to describe XML as true middleware in the same sense as CORBA.

In both of these messaging scenarios, small XML streams are frequently regarded as a complete unit (for example, a datum represented in XML, such as an invoice). In the field of content management, however, large units are considered (XML documents, such as the complete maintenance instructions for an aircraft). The size of these units makes it necessary to process XML documents in parts, which involves making use of their underlying structure.

It's important to remember that XML data and documents follow the same rules. It's only by means of the particular instance and its application domain that people come up with the aforementioned distinction.

In-Depth Look
As highlighted at the beginning of this article, XML can be used to formulate grammars. This type of specification can be made either in the form of a Document Type Definition (DTD, made popular by SGML) or through the application of one of the various XML Schema languages. Listing 1 shows the grammar for an invoice as an XML Schema.
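Listing 1 isn't reproduced here, so the following is only a rough sketch of how a grammar of this kind can be checked from Java. The schema is a deliberately tiny, hypothetical invoice grammar (the element names invoice, owner, and item are assumptions, not the content of Listing 1), and the check uses the standard javax.xml.validation API.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class InvoiceSchemaCheck {
    // Hypothetical, heavily simplified invoice grammar (not the article's Listing 1).
    static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='invoice'>"
        + "<xs:complexType><xs:sequence>"
        + "<xs:element name='owner' type='xs:string'/>"
        + "<xs:element name='item' type='xs:string' maxOccurs='unbounded'/>"
        + "</xs:sequence></xs:complexType>"
        + "</xs:element>"
        + "</xs:schema>";

    // Returns true if the XML text is well formed and follows the grammar above.
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                    SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException e) {
            // Validation errors surface as SAXException.
            return false;
        }
    }
}
```

Because the schema is itself an XML instance, it can be handled with the same tools as the documents it describes.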

Easy Ways to Programmatic Persistence
The advantage of the recently developed schema languages is that they also comply with XML rules. Not only do they permit the definition of data types, they also make it possible to specify inheritance hierarchies. This way it's easy to derive classes of objects from the schemas and find the appropriate models to make them persistent. Here we must clearly distinguish between two approaches: the implementation of specialized serialization mechanisms for individual instances and generic access through the Document Object Model (DOM).

XML instances form complex and dynamic hierarchical structures. They're easily transferred into object-oriented terminology and vice versa. A specialized implementation guarantees fast execution, for example, when generating XML files from Java. Listing 2 illustrates this for the class "invoice". The disadvantage of this method is that the xmlWrite (or equivalent) method must be implemented anew for each and every class of objects that an application requires. In addition, reprogramming is necessary whenever the schema is modified, and vice versa.
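As a rough illustration of the specialized approach (this is not the original Listing 2), a hand-written serializer might look like the sketch below. The class name SimpleInvoice, its fields, and the element names are assumptions made for the example; only the method name xmlWrite is taken from the text.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical invoice class; fields and element names are illustrative only.
class SimpleInvoice {
    private final String owner;
    private final List<String> items = new ArrayList<>();

    SimpleInvoice(String owner) {
        this.owner = owner;
    }

    void addItem(String description) {
        items.add(description);
    }

    // Hand-written serializer: fast, but tied to this one class and its schema.
    void xmlWrite(Writer out) throws IOException {
        out.write("<invoice>");
        out.write("<owner>" + escape(owner) + "</owner>");
        for (String item : items) {
            out.write("<item>" + escape(item) + "</item>");
        }
        out.write("</invoice>");
    }

    // Convenience wrapper that serializes into a String.
    String toXml() {
        StringWriter out = new StringWriter();
        try {
            xmlWrite(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    // Escapes the characters that must not appear literally in element content.
    private static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Every new element in the schema means another line in xmlWrite, which is exactly the maintenance cost described above.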

However, with the appropriate tools, classes can be generated directly as source code from both DTDs and XML Schemas. Though most of these tools are still in the development phase, they offer simple ways to generate XML instances from objects and vice versa. The more popular utilities in this area are Castor, a free implementation, and Breeze XML Studio (from Breeze Factor) in the commercial area. Sun created a Java Specification Request (JSR-031) to describe "a facility for compiling an XML Schema into one or more Java classes that can parse, generate, and validate documents that follow the schema." A series of articles on IBM's developerWorks deals with data binding for Java.

Data binding in this sense refers to methods that allow the marshaling/unmarshaling of Java objects in the form of XML files. This is done by means of accessor and mutator methods that affect the underlying XML document. A direct map of element/attribute names of the XML document is established; that is, if the XML document has an element "owner", the corresponding Java class has the methods "setOwner" and "getOwner".
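The naming convention can be sketched by hand. The class below is a hypothetical stand-in for what a binding tool might generate (it does not reproduce the output of Castor or any other product): getOwner and setOwner read and write the "owner" element of an underlying DOM document, which is parsed with the standard JAXP API.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical bound class for an invoice with an <owner> element.
class BoundInvoice {
    private final Document document;

    private BoundInvoice(Document document) {
        this.document = document;
    }

    // Unmarshal: parse the XML text and wrap the resulting document.
    static BoundInvoice unmarshal(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return new BoundInvoice(doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Accessor mapped to the <owner> element.
    String getOwner() {
        return ownerElement().getTextContent();
    }

    // Mutator: changes the text of <owner> in the underlying document.
    void setOwner(String owner) {
        ownerElement().setTextContent(owner);
    }

    private Element ownerElement() {
        NodeList owners = document.getDocumentElement().getElementsByTagName("owner");
        if (owners.getLength() == 0) {
            throw new IllegalStateException("no <owner> element");
        }
        return (Element) owners.item(0);
    }
}
```

Calling setOwner changes the document itself, so a later marshaling step would write out the new value.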

The second approach is clearly more generic. The XML instance is transferred into a DOM object hierarchy. As a result, the classes used remain identical for each XML instance. On the other hand, the object structure can vary, as when different schemas form the basis for the structure. Figure 1 shows the DOM for an invoice XML instance. Manipulation of the XML instance on disk is done by means of a tree representation of the XML document.
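A minimal sketch of the generic approach, using the DOM and transformation APIs that ship with the JDK: the same three methods work for any XML instance, whatever its schema. The sample works on strings rather than files to stay self-contained, and the class and method names are assumptions made for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class GenericDom {
    // Builds the DOM tree for an arbitrary XML instance.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Counts element nodes by walking the tree; no schema knowledge is needed.
    static int countElements(Node node) {
        int count = node.getNodeType() == Node.ELEMENT_NODE ? 1 : 0;
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            count += countElements(child);
        }
        return count;
    }

    // Writes the (possibly modified) tree back out as XML text.
    static String serialize(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("could not serialize document", e);
        }
    }
}
```

To persist to disk instead, the StringReader and StringWriter could be replaced by file-based streams.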

Based on performance and storage use, advantages can be gained by creating specialized classes for XML data. Unlike XML documents, XML data tends to be small and can be easily stored as a complete set in a system's main memory. For XML documents, it makes sense to use a generic approach because they should also be processible in fragments (a book may need to be versioned in individual chapters or sections). In addition, using a generic approach means not having to rely on specific schemas. This way the algorithms apply to all XML instances independently.

Persisting XML in Databases
So far this article has dealt with manipulating and preserving XML in a programmatic way using object-oriented techniques, Java, and the file system. However, you shouldn't forget that today's business applications require extensive searching, and access- and version-control facilities. Whereas using the file system is clearly an easy method, it's the lack of these features that makes the file system an inferior storage solution for real-life XML instances. Search for an appropriate database-driven solution instead. As of today, there are systems based on well-known relational or object-oriented technology and systems that claim to store XML natively.

The basic differences between object-oriented and relational databases are easily summarized:

  • Object orientation: There's a close connection between data and functions. The object-oriented model sets itself apart by exhibiting a high degree of flexibility when it comes to describing data types and representing relationships.
  • Relational model: It's based on two-dimensional tables. Data is stored in columns and rows with each entry appearing as a row. Relationships between data in different tables are established through the comparison of key values.

Since objects can't be stored as entities in an RDBMS, a paradigm shift from the object-oriented approach to the relational schema is necessary. This shift concerns the operations that are involved in both writing and constructing objects. The object identity, implemented transparently by the ODBMS, must be abandoned in an RDBMS in favor of the database identity (via primary keys). When a decomposed object is accessed, several instances may be reconstructed unintentionally unless appropriate checks are run first. When objects are stored, all references to embedded objects must be converted into foreign keys. Thus it's always the program's responsibility to ensure integrity. In addition, to avoid run-time errors pointers must be initialized manually when an object is loaded.
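One common form of the "appropriate checks" mentioned above is a cache keyed by primary key, so that loading the same row twice yields the same Java instance. The sketch below simulates the table with an in-memory map; the class names, the city field, and the rowsRead counter are all assumptions made for the illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical persistent class identified by a primary key.
class AddressRecord {
    final long id;
    final String city;

    AddressRecord(long id, String city) {
        this.id = id;
        this.city = city;
    }
}

class AddressLoader {
    // Simulated Address table: primary key -> city column.
    private final Map<Long, String> table = new HashMap<>();
    // Objects already reconstructed, keyed by database identity.
    private final Map<Long, AddressRecord> loaded = new HashMap<>();
    int rowsRead = 0;

    void insertRow(long id, String city) {
        table.put(id, city);
    }

    // Returns the one in-memory instance for this primary key.
    AddressRecord load(long id) {
        AddressRecord cached = loaded.get(id);
        if (cached != null) {
            return cached;              // check first: avoid a second instance
        }
        String city = table.get(id);
        if (city == null) {
            throw new IllegalArgumentException("no row with id " + id);
        }
        rowsRead++;
        AddressRecord address = new AddressRecord(id, city);
        loaded.put(id, address);
        return address;
    }
}
```

Without the lookup in loaded, two invoices that reference the same address row would each receive their own copy, and a change to one would not be visible in the other.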

Three user classes would typically be created to represent an Invoice, namely invoice, address, and item. If you want to store invoice "objects" efficiently, five tables are necessary in an RDBMS: Invoice, Invoice2Address (n:m), Invoice2Item (n:m), Address, and Item. In comparison, a generic minimal DOM requires only one class "Node" with just four members: name, value, type, and number of children. Of course this wouldn't allow for all XML language elements. For example, a complete model covering document type, entity, and processing instructions would require approximately 15 classes. Objects from a minimal DOM could be stored in an RDBMS using two tables: Node and Node2Node (n:m).
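To make the two-table layout concrete, the following sketch flattens a standard DOM tree into rows for a Node table and a Node2Node table, both held in memory. The row classes, the surrogate id, and the position column are assumptions added for the illustration; attributes are ignored, in keeping with the minimal model.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// One row of the Node table: the four members plus a surrogate key.
class NodeRow {
    final int id;
    final String name;
    final String value;
    final short type;
    final int childCount;

    NodeRow(int id, String name, String value, short type, int childCount) {
        this.id = id;
        this.name = name;
        this.value = value;
        this.type = type;
        this.childCount = childCount;
    }
}

// One row of the Node2Node table: a parent/child link with its position.
class EdgeRow {
    final int parentId;
    final int childId;
    final int position;

    EdgeRow(int parentId, int childId, int position) {
        this.parentId = parentId;
        this.childId = childId;
        this.position = position;
    }
}

class DomFlattener {
    final List<NodeRow> nodes = new ArrayList<>();
    final List<EdgeRow> edges = new ArrayList<>();

    // Parses the XML text and flattens it, starting at the root element.
    static DomFlattener fromXml(String xml) {
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            DomFlattener flattener = new DomFlattener();
            flattener.flatten(root);
            return flattener;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Adds a row for this node, then recurses into its children (attributes are skipped).
    int flatten(Node node) {
        int id = nodes.size() + 1;
        NodeList children = node.getChildNodes();
        nodes.add(new NodeRow(id, node.getNodeName(), node.getNodeValue(),
                node.getNodeType(), children.getLength()));
        for (int i = 0; i < children.getLength(); i++) {
            int childId = flatten(children.item(i));
            edges.add(new EdgeRow(id, childId, i));
        }
        return id;
    }
}
```

Even a two-element invoice produces three Node rows and two Node2Node rows, because the text content is a node of its own.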

Although administering two to five tables seems like a simple task for an RDBMS, the difficulty lies in the detail: the operation that's necessary to return the child nodes of a root node becomes overly complex with deeply structured XML instances.

Information about the structure of the objects is available only from the client. A query to an RDBMS returns a result set in the form of a table that has to be traversed sequentially to derive the appropriate reactions for each entry. Consequently, the client has to actively request new nodes on a continuous basis, which adversely affects performance. While this is not a problem with a simple Invoice-type document, it clearly becomes one when the XML tree depth rises.
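The cost of this client-driven navigation can be sketched with a simulated Node2Node table. In the hypothetical class below, each call to selectChildren stands for one query sent to the database; no real database or JDBC call is involved, and all names are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ChildQueryCounter {
    // Simulated Node2Node table: parent id -> ordered child ids.
    private final Map<Integer, List<Integer>> node2node = new HashMap<>();
    int queries = 0;

    // Builds a degenerate tree 1 -> 2 -> ... -> depth, one child per level.
    static ChildQueryCounter chainOfDepth(int depth) {
        ChildQueryCounter db = new ChildQueryCounter();
        for (int id = 1; id < depth; id++) {
            db.addChild(id, id + 1);
        }
        return db;
    }

    void addChild(int parentId, int childId) {
        node2node.computeIfAbsent(parentId, k -> new ArrayList<>()).add(childId);
    }

    // Stands for "SELECT child FROM Node2Node WHERE parent = ?": one round trip.
    List<Integer> selectChildren(int parentId) {
        queries++;
        return node2node.getOrDefault(parentId, Collections.emptyList());
    }

    // The client walks the tree itself, asking for children node by node.
    int loadSubtree(int rootId) {
        int loaded = 1;
        for (int childId : selectChildren(rootId)) {
            loaded += loadSubtree(childId);
        }
        return loaded;
    }
}
```

In this naive traversal the number of queries equals the number of nodes, so the traffic grows with the size and depth of the XML tree.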

When objects are loaded and manipulated through an RDBMS, sophisticated locking schemes are necessary to exclude competing modifications. It's difficult to establish efficient locking that guarantees scalability in multiuser environments. Furthermore, the number of accesses to logically related objects increases with the number of users, and in turn raises the complexity of the locking schemes. This can cause significant degradation of a system's overall performance. In addition, data consistency must be ensured via triggers and constraints. Though this works well on the server side, the underlying structure remains hidden from the server and must be introduced on the client, adding more overhead.

The content of an XML element isn't limited to a specific length, which further reduces efficiency: to accommodate content of arbitrary length, the structure of a relational table has to be laid out generously (and thus inefficiently).

Storing XML instances (or corresponding fragments) in binary large objects (BLOBs) is not a solution to the problems of the relational approach. Although this reduces navigational expenses and circumvents the problems that arise with variable structures/element sizes of XML instances, it results in a loss of flexibility. At best when using BLOBs, information contained in the XML instance can be found only with a full-text search due to the loss of structural characteristics.
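The loss of structure can be shown with a small comparison. In the sketch below, fullTextMatch is all that an opaque BLOB offers, while ownerMatch has to parse the instance in the application (here with the JDK's XPath API) before it can ask a structural question. The invoice layout and method names are assumptions made for the example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class BlobSearch {
    // What an opaque BLOB allows: does the term occur anywhere in the text?
    static boolean fullTextMatch(String blob, String term) {
        return blob.contains(term);
    }

    // Structure-aware lookup: is the term the content of the <owner> element?
    static boolean ownerMatch(String xml, String term) {
        try {
            String owner = XPathFactory.newInstance().newXPath()
                    .evaluate("/invoice/owner", new InputSource(new StringReader(xml)));
            return owner.equals(term);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("could not evaluate XML", e);
        }
    }
}
```

A full-text search for a name finds every invoice that mentions it anywhere, including in an item description; only the structural query can tell whether it is the owner.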

Thus, when storing XML in relational databases you shouldn't try to store a generic object model. Instead, make sure the RDBMS can do what it does best - work with sets. Restoring the structure of an XML instance is best left to the application logic.

Of course, the previous paragraphs didn't cover native XML storage. But then again what does native XML storage mean anyway? Frankly, the physical storage layout shouldn't be relevant to the user. Instead the application domain dictates the requirements for the persistence mechanism. Obviously, "native" XML databases try to set themselves apart by supporting a large number of XML-related standards (or those initiatives that seem likely to succeed on their way to becoming a standard), for example, storage access (DOM/SAX), query (XML-QL), and publishing (XSL). This is good. Furthermore, almost all the products labeled "native XML database" or "XML server" seem to allow a great deal of granularity when accessing the persistent storage (on the element or even attribute level). However, only a few of them provide versioning support, which means they aren't well suited for XML-based content management.

In the end all databases allow for support of XML storage, some to a higher degree than others.

Conclusion
Compared to a well-engineered and flexible database, a file system has only limited searching, versioning, programming, and locking abilities. So using a database to make XML instances persistent is only natural. This article highlighted the different systems and their abilities. Whether you'll want to use a generic or a specialized approach (e.g., storing the DOM representation [or something similar] or a specialized object model) remains a question of the particular application domain. There are advantages to both.

Resources

  • JSR-031, XML Data Binding Specification: http://java.sun.com/aboutJava/communityprocess/jsr/jsr_031_xmld.html
  • Castor: http://castor.exolab.org/index.html
  • IBM developerWorks: http://www-4.ibm.com/software/developer/library/data-binding1/index.html?dwzone=xml

About Volker John
Volker John is operations manager at Sörman, a leading provider of XML/SGML-related content-management solutions and XML-based e-business applications. He frequently lectures on XML and its benefits in today's business applications.
