Manipulating And Storing XML
By: Volker John
Jan. 8, 2001 12:00 AM
XML is a metalanguage used to describe the grammar of a language. Documents that comply with a grammar formulated in XML use tags to distinguish between the actual content and its semantically relevant markup. XML defines abstract semantics, in contrast to operational ones (which can be derived from them); the presentation-oriented HTML is an example of the latter. XML documents are well suited for publication in different media such as books, CDs and, of course, the Internet. A second and, at present, more important use is Enterprise Application Integration (EAI): messages between applications running different tasks (sometimes even in different companies) are encoded in XML.
XML, as a standard for data exchange and document markup, is giving new momentum to the discussion about appropriate storage mechanisms. After all, manipulating XML in main memory isn't sufficient; you'll want to preserve your XML instances over time. This article uses some Java programming examples to explore different methods of dealing with persistent XML in the form of files, and explains which considerations are relevant when choosing a database for XML storage.
An Overview
Currently XML is used primarily in two areas: messaging and content management. While the content management field has historically been occupied by SGML solutions, XML is being compared to technologies such as EDI and CORBA when it comes to message transmission. The comparison with EDI concerns XML's advantages in the transmission of corporate data; the comparison with CORBA puts interprocess communication between applications at the forefront. However, it would be wrong to describe XML as true middleware in the same sense as CORBA.
In both messaging cases, small XML streams are frequently regarded as a complete set or entity (XML data, for example an invoice). In content management, by contrast, the units are large (XML documents, such as the complete maintenance instructions for an aircraft). The size of these units makes it necessary to process XML documents in parts only, which means making use of their underlying structure. It's important to remember that XML data and XML documents follow the same rules. The distinction arises only from the particular instance and its application domain.
In-Depth Look
Easy Ways to
Programmatic Persistence
XML instances form complex and dynamic hierarchical structures that map easily onto object-oriented terms and back. The first approach is a specialized implementation in which each class writes itself out as XML. This guarantees fast execution when, for example, generating XML files from Java. Listing 2 illustrates this for the class "invoice". The disadvantage of this method is that the xmlWrite (or equivalent) method must be implemented anew for each and every class of objects that an application requires. In addition, the code must be reworked whenever the schema is modified, and vice versa.
However, with the appropriate tools, classes can be generated directly as source code from both DTDs and XML Schemas. Though most of these tools are still in the development phase, they offer simple ways to generate XML instances from objects and objects from XML instances. The more popular utilities in this area are Castor, a free implementation, and Breeze XML Studio (from Breeze Factor) in the commercial area. Sun created a Java Specification Request (JSR-031) to describe "a facility for compiling an XML Schema into one or more Java classes that can parse, generate, and validate documents that follow the schema." A series of articles on IBM's developerWorks deals with data binding for Java. Data binding in this sense refers to methods that allow the marshaling/unmarshaling of Java objects in the form of XML files. This is done by means of accessor and mutator methods that affect the underlying XML document. Element and attribute names of the XML document are mapped directly; that is, if the XML document has an element "owner", the corresponding Java class has the methods "setOwner" and "getOwner".
The second approach is clearly more generic: the XML instance is transferred into a DOM object hierarchy. As a result, the classes used remain identical for every XML instance. The object structure, on the other hand, can vary, for instance when different schemas form the basis for the structure.
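The specialized, class-specific approach can be sketched as follows. This is a minimal illustration and not the article's Listing 2: the fields of `Invoice`, the `escape` helper, and the element names are assumptions chosen for the example. Only the `xmlWrite` method name and the `getOwner`/`setOwner` pair come from the text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical Invoice class; its fields are assumptions for illustration.
class Invoice {
    private final String number;
    private String owner;
    private final List<String> items = new ArrayList<>();

    Invoice(String number, String owner) {
        this.number = number;
        this.owner = owner;
    }

    // Accessor/mutator pair mirroring an <owner> element, as in data binding.
    String getOwner() { return owner; }
    void setOwner(String owner) { this.owner = owner; }

    void addItem(String description) { items.add(description); }

    // Escape the characters that are significant in XML markup.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Class-specific serialization: fast, but it has to be written for every
    // class and revisited whenever the schema changes.
    String xmlWrite() {
        StringBuilder sb = new StringBuilder();
        sb.append("<invoice number=\"").append(escape(number)).append("\">");
        sb.append("<owner>").append(escape(owner)).append("</owner>");
        for (String item : items) {
            sb.append("<item>").append(escape(item)).append("</item>");
        }
        sb.append("</invoice>");
        return sb.toString();
    }

    public static void main(String[] args) {
        Invoice invoice = new Invoice("4711", "Smith & Co");
        invoice.addItem("Widget");
        System.out.println(invoice.xmlWrite());
    }
}
```

Every further class (an address, an item with a price, and so on) would need its own `xmlWrite`, which is the maintenance cost the text describes.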
Figure 1 shows the DOM for an invoice XML instance. Manipulation of the XML instance on disk is done by means of a tree representation of the XML document. In terms of performance and memory use, there are advantages to creating specialized classes for XML data. Unlike XML documents, XML data tends to be small and can easily be held as a complete set in a system's main memory. For XML documents, it makes sense to use a generic approach because they should also be processible in fragments (a book may need to be versioned in individual chapters or sections). In addition, a generic approach doesn't rely on specific schemas, so the same algorithms apply to all XML instances.
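The generic approach can be sketched with the JAXP and DOM APIs that ship with the JDK. The class name `GenericDomExample`, the `addItem` helper, and the sample invoice markup are assumptions made for this illustration. The point is that the code uses only `Document` and `Element`, whatever the schema of the instance.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper class: schema-independent manipulation through the DOM.
class GenericDomExample {

    // Parse an XML string into a DOM tree.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Serialize a DOM tree back to text, without the XML declaration.
    static String serialize(Document doc) {
        try {
            Transformer transformer =
                    TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize XML", e);
        }
    }

    // Append an <item> element to the root; no invoice-specific class is needed.
    static String addItem(String xml, String description) {
        Document doc = parse(xml);
        Element item = doc.createElement("item");
        item.setTextContent(description);
        doc.getDocumentElement().appendChild(item);
        return serialize(doc);
    }

    public static void main(String[] args) {
        String invoice = "<invoice><owner>Smith</owner></invoice>";
        System.out.println(addItem(invoice, "Widget"));
    }
}
```

The same `parse` and `serialize` methods would work unchanged for a maintenance manual or any other instance, at the cost of holding a general-purpose node tree in memory.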
Persisting XML in Databases
The basic differences between object-oriented and relational databases are easily summarized:
Since objects can't be stored as entities in an RDBMS, a paradigm shift from the object-oriented approach to the relational schema is necessary. This shift affects the operations involved in both writing and reconstructing objects. Object identity, which an ODBMS implements transparently, must be abandoned in an RDBMS in favor of database identity (via primary keys). When a decomposed object is accessed, several instances of it may be reconstructed unintentionally unless appropriate checks are run first. When objects are stored, all references to embedded objects must be converted into foreign keys, so it's always the program's responsibility to ensure integrity. In addition, to avoid run-time errors, pointers must be initialized manually when an object is loaded.
Three user classes would typically be created to represent an Invoice, namely invoice, address, and item. If you want to store invoice "objects" efficiently, five tables are necessary in an RDBMS: Invoice, Invoice2Address (n:m), Invoice2Item (n:m), Address, and Item.
In comparison, a generic minimal DOM requires only one class, "Node", with just four members: name, value, type, and number of children. Of course this wouldn't cover all XML language elements; a complete model covering document type, entities, and processing instructions would require approximately 15 classes. Objects from a minimal DOM could be stored in an RDBMS using two tables: Node and Node2Node (n:m).
Although administering two to five tables seems like a simple task for an RDBMS, the difficulty lies in the detail: the operation needed to return the child nodes of a root node becomes overly complex with deeply structured XML instances. Information about the structure of the objects is available only on the client. A query to an RDBMS returns a result set in the form of a table that has to be traversed sequentially to derive the appropriate reaction for each entry.
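The traversal problem with the two-table layout can be sketched without a real database. In the following illustration the lists `nodeTable` and `node2NodeTable` stand in for the Node and Node2Node tables, and `selectChildren` stands in for one SQL query. All class, field, and method names are assumptions. The Node members "type" and "number of children" are omitted for brevity, and the sample values need no XML escaping.

```java
import java.util.ArrayList;
import java.util.List;

// In-memory stand-in for a Node / Node2Node table pair; not a real RDBMS.
class NodeTableExample {

    // One row of the hypothetical Node table.
    static final class NodeRow {
        final int id;
        final String name;
        final String value; // null for nodes without text content
        NodeRow(int id, String name, String value) {
            this.id = id;
            this.name = name;
            this.value = value;
        }
    }

    // One row of the hypothetical Node2Node table (parent to child).
    static final class EdgeRow {
        final int parentId;
        final int childId;
        EdgeRow(int parentId, int childId) {
            this.parentId = parentId;
            this.childId = childId;
        }
    }

    final List<NodeRow> nodeTable = new ArrayList<>();
    final List<EdgeRow> node2NodeTable = new ArrayList<>();
    int queryCount = 0; // counts simulated round trips to the database

    // Stand-in for one query joining Node2Node and Node for a given parent.
    List<NodeRow> selectChildren(int parentId) {
        queryCount++;
        List<NodeRow> result = new ArrayList<>();
        for (EdgeRow edge : node2NodeTable) {
            if (edge.parentId != parentId) continue;
            for (NodeRow node : nodeTable) {
                if (node.id == edge.childId) result.add(node);
            }
        }
        return result;
    }

    // The client walks each result set row by row and asks again per node.
    String rebuild(NodeRow node) {
        StringBuilder sb = new StringBuilder("<" + node.name + ">");
        if (node.value != null) sb.append(node.value);
        for (NodeRow child : selectChildren(node.id)) sb.append(rebuild(child));
        return sb.append("</").append(node.name).append(">").toString();
    }

    // Small sample instance: an invoice with an address and one item.
    static NodeTableExample sample() {
        NodeTableExample db = new NodeTableExample();
        db.nodeTable.add(new NodeRow(1, "invoice", null));
        db.nodeTable.add(new NodeRow(2, "address", null));
        db.nodeTable.add(new NodeRow(3, "city", "Berlin"));
        db.nodeTable.add(new NodeRow(4, "item", "Widget"));
        db.node2NodeTable.add(new EdgeRow(1, 2));
        db.node2NodeTable.add(new EdgeRow(2, 3));
        db.node2NodeTable.add(new EdgeRow(1, 4));
        return db;
    }

    public static void main(String[] args) {
        NodeTableExample db = sample();
        System.out.println(db.rebuild(db.nodeTable.get(0)));
        System.out.println("queries: " + db.queryCount); // prints "queries: 4"
    }
}
```

In this sketch, rebuilding a four-node instance takes four simulated queries, one per node, because only the client knows that the rows form a tree.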
Consequently, the client has to keep requesting new nodes, which adversely affects performance. While this is not a problem with a simple Invoice-type document, it clearly becomes one as the depth of the XML tree grows.
When objects are loaded and manipulated through an RDBMS, sophisticated locking schemes are necessary to exclude competing modifications. It's difficult to establish efficient locking that guarantees scalability in multiuser environments. Furthermore, the number of accesses to logically related objects increases with the number of users, which in turn raises the complexity of the locking schemes. This can significantly degrade a system's overall performance. In addition, data consistency must be ensured via triggers and constraints. Though this works well on the server side, the underlying structure remains hidden from the server and must be introduced on the client, adding more overhead. The content of an XML element isn't limited to a specific length, which reduces efficiency further: to accommodate it, the relational table has to be laid out generously (and thus inefficiently).
Storing XML instances (or corresponding fragments) in binary large objects (BLOBs) doesn't solve the problems of the relational approach. Although it reduces navigational expense and circumvents the problems that arise from the variable structure and element sizes of XML instances, it results in a loss of flexibility. Because the structural characteristics are lost, information contained in an XML instance stored as a BLOB can be found, at best, only with a full-text search. Thus, when storing XML in relational databases you shouldn't try to store a generic object model. Instead, make sure the RDBMS can do what it does best - work with sets. Restoring the structure of an XML instance is best left to the application logic. Of course, the previous paragraphs didn't cover native XML storage.
But then again, what does native XML storage mean anyway? Frankly, the physical storage layout shouldn't be relevant to the user. Instead, the application domain dictates the requirements for the persistence mechanism. "Native" XML databases clearly try to set themselves apart by supporting a large number of XML-related standards (or initiatives that seem likely to become standards), for example for storage access (DOM/SAX), query (XML-QL), and publishing (XSL). This is good. Furthermore, almost all the products labeled "native XML database" or "XML server" seem to allow fine-grained access to the persistent storage (at the element or even attribute level). However, only a few of them provide versioning support, which means most aren't well suited for XML-based content management. In the end all databases allow for XML storage, some to a higher degree than others.
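The difference between full-text search over an opaque BLOB and element- or attribute-level access can be sketched as follows. XPath from the JDK's `javax.xml.xpath` package is used here purely as a stand-in for structured access. The article itself names XML-QL, DOM, and SAX, not XPath, and the class name and sample invoice are assumptions.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical comparison of BLOB-style search and structured access.
class GranularAccessExample {

    // Sample instance in which "Berlin" occurs in two unrelated places.
    static final String INVOICE =
            "<invoice number=\"4711\">"
          + "<address><city>Berlin</city></address>"
          + "<item>Berlin city map</item>"
          + "</invoice>";

    // BLOB-style access: the content is opaque, so only substring search works.
    static boolean fullTextContains(String blob, String term) {
        return blob.contains(term);
    }

    // Structured access: address a single element or attribute by its path.
    static String evaluate(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Full-text search finds the term but cannot say where it occurs.
        System.out.println(fullTextContains(INVOICE, "Berlin"));
        // Structured access distinguishes the city from the item description.
        System.out.println(evaluate(INVOICE, "/invoice/address/city"));
        System.out.println(evaluate(INVOICE, "/invoice/item"));
        System.out.println(evaluate(INVOICE, "/invoice/@number"));
    }
}
```

The full-text check reports a hit for "Berlin" without distinguishing the customer's city from an item description, while the path expressions return each value separately.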
Conclusion