Programming SAX2 Using MSXML 3.0
Jun. 3, 2001 12:00 AM
While DOM provides a flexible way of manipulating elements in an XML document, it can be quite costly when the XML source document is large. Remember, DOM reads an XML document from disk and builds its elements as nodes in an in-memory tree. The cost of reading the whole file and keeping a copy of it in main memory is hard to justify if you only want to retrieve the value of one or two elements. For this reason, the Simple API for XML (SAX) has been promoted as a better alternative for processing such XML documents programmatically.
What Is SAX?
SAX is simply an interface for processing the elements in an XML document. SAX uses the event-driven processing model. A simple analogy can be used to compare DOM and SAX. In procedural programming you typically write code (in C or Pascal, for example) that executes in a linear fashion. This can be likened to DOM, where your code walks through the elements in an XML document one by one. Now compare this traditional processing model with the event-driven model (Visual Basic and Java are good examples). The code you write in an event-driven programming language isn't executed sequentially. Rather, it's fired (or activated) when certain events happen; your code is written to service these specific events. Programming SAX is like event-driven programming: when certain parts of the document are encountered, events are fired.
Benefits of SAX
SAX doesn't build a tree in memory that contains the XML document. Instead, it scans through the document in a serial fashion and generates events when sections are processed. As such, SAX takes up less memory than DOM and is better suited for processing large XML documents. In addition, you can abort processing a document when you've located a specific piece of information, unlike DOM, which must process the whole document before you can start manipulating it.
The latest release of SAX is version 2.0 (SAX2), which is supported by the Microsoft XML Parser release 3.0 (MSXML 3.0). In this article we look closely at SAX2 programming using Microsoft Visual Basic 6.0 (VB6).
Understanding SAX
Without bogging you down with too much jargon, let's try out the following example in VB6.
- Create a new project in VB6
- Add a reference to the MSXML 3.0 parser (see Figure 1)
- Add the following code to the default form (Form1)
Private Sub Form_Load()
    Dim SAXreader As New SAXXMLReader
    '---For handling parsing events---
    Dim contentHandler As New contentHandler
    Set SAXreader.contentHandler = contentHandler
    '---Parse the XML document---
    SAXreader.parseURL "c:\inetpub\Books.xml"
End Sub
- Add a class module and name it contentHandler (see Figure 2)
- Add the events in Listing 1 to the contentHandler class. Figures 3-5 show the steps for creating an event.
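Listing 1 isn't reproduced in this text, so here's a minimal sketch of what such a contentHandler class might look like. It assumes each event simply reports itself with MsgBox, in the style of the startElement handler shown later in the article; the message wording is an assumption. Note that VB6 requires every member of an implemented interface to be present, even if its body is empty.

```vb
'--- Class module: contentHandler (a sketch, not the original Listing 1) ---
Option Explicit
Implements IVBSAXContentHandler

Private Sub IVBSAXContentHandler_startDocument()
    MsgBox "Event Generated: startDocument"
End Sub

Private Sub IVBSAXContentHandler_endDocument()
    MsgBox "Event Generated: endDocument"
End Sub

Private Sub IVBSAXContentHandler_startElement(strNamespaceURI As String, _
        strLocalName As String, strQName As String, _
        ByVal oAttributes As MSXML2.IVBSAXAttributes)
    MsgBox "<" & strLocalName & "> - Event Generated: startElement"
End Sub

Private Sub IVBSAXContentHandler_endElement(strNamespaceURI As String, _
        strLocalName As String, strQName As String)
    MsgBox "</" & strLocalName & "> - Event Generated: endElement"
End Sub

Private Sub IVBSAXContentHandler_characters(strChars As String)
    MsgBox strChars & " - Event Generated: characters"
End Sub

'--- Members not used in this example; they must still be declared ---
Private Property Set IVBSAXContentHandler_documentLocator( _
        ByVal RHS As MSXML2.IVBSAXLocator)
End Property

Private Sub IVBSAXContentHandler_startPrefixMapping(strPrefix As String, _
        strURI As String)
End Sub

Private Sub IVBSAXContentHandler_endPrefixMapping(strPrefix As String)
End Sub

Private Sub IVBSAXContentHandler_ignorableWhitespace(strChars As String)
End Sub

Private Sub IVBSAXContentHandler_processingInstruction(strTarget As String, _
        strData As String)
End Sub

Private Sub IVBSAXContentHandler_skippedEntity(strName As String)
End Sub
```

The easiest way to get the signatures right is to let the VB6 code editor generate them from the object and procedure drop-down lists, as Figures 3-5 illustrate.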
Finally, here's the XML document:
Books.xml
<?xml version="1.0"?>
<Books>
<Book>
<ISBN>1861004583</ISBN>
<Title>Beginning WAP: WML and WMLScript</Title>
<Author>Wei Meng Lee et al</Author>
<Price>39.99</Price>
</Book>
</Books>
Running the Program
When you run the application the SAXXMLReader will parse the XML document. While processing, it triggers events corresponding to the various sections of the XML document:
Events
- startDocument
- startElement - Books
- startElement - Book
- startElement - ISBN
- characters - 1861004583
- endElement - ISBN
- startElement - Title
- characters - Beginning WAP: WML and WMLScript
- endElement - Title
- startElement - Author
- characters - Wei Meng Lee et al
- endElement - Author
- startElement - Price
- characters - 39.99
- endElement - Price
- endElement - Book
- endElement - Books
- endDocument
For example, when the SAXXMLReader encounters the start tag of the <Books> element, the startElement() event is fired:
Private Sub IVBSAXContentHandler_startElement(strNamespaceURI As String, _
        strLocalName As String, strQName As String, _
        ByVal oAttributes As MSXML2.IVBSAXAttributes)
    MsgBox "<" & strLocalName & ">" & " - Event Generated: startElement"
End Sub
In our case, we use the MsgBox function to display a message each time this event is fired (see Figure 6).
A More Useful Example
The previous example showed how SAX events are fired and how we can add code to handle them. Let's consider the slightly more complex example shown in Listing 2.
Now there are three books in the XML document and the price has been coded as an attribute of the <Book> element. We want to use SAX to calculate the total price of all the books in the document. The code in Listing 3 will achieve that.
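Listings 2 and 3 aren't reproduced in this text, but the idea can be sketched as follows. The sketch assumes the attribute is literally named Price and sits on each <Book> element; both names are assumptions, as is the wording of the final message. Only the members that matter are shown; the remaining IVBSAXContentHandler members would still have to be declared with empty bodies.

```vb
'--- In the contentHandler class module (sketch; other members omitted) ---
Private dblTotal As Double   ' running total of the Price attributes seen

Private Sub IVBSAXContentHandler_startDocument()
    dblTotal = 0
End Sub

Private Sub IVBSAXContentHandler_startElement(strNamespaceURI As String, _
        strLocalName As String, strQName As String, _
        ByVal oAttributes As MSXML2.IVBSAXAttributes)
    Dim i As Long
    If strLocalName = "Book" Then
        ' Walk the attribute list rather than assuming Price is present
        For i = 0 To oAttributes.length - 1
            If oAttributes.getLocalName(i) = "Price" Then   ' assumed name
                ' Val() always treats "." as the decimal separator
                dblTotal = dblTotal + Val(oAttributes.getValue(i))
            End If
        Next i
    End If
End Sub

Private Sub IVBSAXContentHandler_endDocument()
    MsgBox "Total price: " & Format$(dblTotal, "0.00")
End Sub
```

Because the total is accumulated as each <Book> start tag goes by, nothing about the individual books needs to be kept in memory.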
When the application is run, the dialog box in Figure 7 is displayed.
Before we leave this section to discuss the SAX2 interfaces in more detail, let's see how we can interrupt the processing of an XML document using SAX. Suppose we're just interested in locating the title for a book with ISBN 1861003439 (see Listing 4).
When the required title is found and displayed using the MsgBox function, processing is aborted. Note that the code can be quite unwieldy when looking for a specific element. Such is the limitation of SAX: developers are required to maintain their own data structures to keep track of contextual information such as an element's parent and attributes. Listing 4 is a quick hack to look for an element.
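Listing 4 isn't reproduced in this text; the following is one possible sketch of the same idea. It assumes that <ISBN> and <Title> are child elements of <Book> with the ISBN appearing first, as in Books.xml, and that raising a run-time error inside a handler is how the reader is told to stop. Both are assumptions worth checking against the MSXML documentation. Only the relevant members are shown; the rest of the interface still has to be declared.

```vb
'--- In the contentHandler class module (sketch; other members omitted) ---
Private strText As String      ' text collected for the element being read
Private blnWanted As Boolean   ' True once the target ISBN has been seen

Private Sub IVBSAXContentHandler_startElement(strNamespaceURI As String, _
        strLocalName As String, strQName As String, _
        ByVal oAttributes As MSXML2.IVBSAXAttributes)
    strText = ""               ' start collecting afresh for each element
End Sub

Private Sub IVBSAXContentHandler_characters(strChars As String)
    strText = strText & strChars   ' text may arrive in more than one chunk
End Sub

Private Sub IVBSAXContentHandler_endElement(strNamespaceURI As String, _
        strLocalName As String, strQName As String)
    If strLocalName = "ISBN" Then
        blnWanted = (Trim$(strText) = "1861003439")
    ElseIf strLocalName = "Title" And blnWanted Then
        MsgBox strText
        ' Assumed abort mechanism: raise an error to stop the reader
        Err.Raise vbObjectError + 1, "contentHandler", "Title found"
    End If
    strText = ""
End Sub
```

If the error does propagate as assumed, it would surface at the parseURL call, so the form would need its own On Error handling around that line.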
Understanding SAX2
We're now ready to delve into the technical details of SAX2.
Microsoft implements SAX2 as a COM object. Its COM/C++ implementation of SAX2 includes a number of interfaces that map to the Java-based SAX2 standard. These interfaces are exposed through the MSXML parser. Besides the C++ implementation, Microsoft has also created wrappers around the C++/COM interfaces for use from Microsoft Visual Basic.
The following interfaces are supported by MSXML 3.0:
- IMXAttributes
- IMXReaderControl
- IMXWriter
- IVBSAXAttributes
- IVBSAXContentHandler
- IVBSAXDeclHandler
- IVBSAXDTDHandler
- IVBSAXEntityResolver
- IVBSAXErrorHandler
- IVBSAXLexicalHandler
- IVBSAXLocator
- IVBSAXXMLFilter
- IVBSAXXMLReader
For a full description of these interfaces, refer to the documentation provided with the MSXML 3.0 SDK release, downloadable from msdn.microsoft.com/xml/default.asp.
In our earlier example we implemented the IVBSAXContentHandler interface:
Implements IVBSAXContentHandler
The IVBSAXContentHandler interface receives notification of the logical content of a document. This is the main interface you should implement when creating SAX applications.
Also remember that in the default form we have:
Dim SAXreader As New SAXXMLReader
The SAXXMLReader reads an XML document and causes events to be fired.
'---For handling parsing events---
Dim contentHandler As New contentHandler
Set SAXreader.contentHandler = contentHandler
'---Parse the XML document---
SAXreader.parseURL ("c:\inetpub\Books1.xml")
Exit Sub
The IVBSAXContentHandler interface handles events passed by the SAXXMLReader. To load and process the XML document, use the parseURL() method provided by the SAXXMLReader object.
The IVBSAXContentHandler interface supports the following methods, which correspond to the events triggered by the SAXXMLReader:
- characters: Receives notification of character data
- endDocument: Receives notification of the end of a document
- startDocument: Receives notification of the beginning of a document
- endElement: Receives notification of the end of an element
- startElement: Receives notification of the beginning of an element
- ignorableWhitespace: Receives notification of ignorable white space in element content. This method is not called in the current implementation because the SAX2 implementation is nonvalidating
- endPrefixMapping: Ends the scope of a prefix-URI namespace mapping
- startPrefixMapping: Begins the scope of a prefix-URI namespace mapping
- processingInstruction: Receives notification of a processing instruction
- skippedEntity: Receives notification of a skipped entity
*Source - MSXML 3.0 documentation
And the following property:
- documentLocator: Receives an interface pointer to the IVBSAXLocator interface, which provides the column number, line number, public ID, or system ID for the current document event
*Source - MSXML 3.0 documentation
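To make the documentLocator property more concrete, here's a minimal sketch that stores the locator when the reader hands it over and then uses it to report where each element starts. The variable name oLoc and the use of Debug.Print are illustrative choices, and the other interface members are omitted for brevity.

```vb
'--- In the contentHandler class module (sketch; other members omitted) ---
Private oLoc As MSXML2.IVBSAXLocator   ' locator supplied by the reader

Private Property Set IVBSAXContentHandler_documentLocator( _
        ByVal RHS As MSXML2.IVBSAXLocator)
    Set oLoc = RHS   ' keep a reference for use in later events
End Property

Private Sub IVBSAXContentHandler_startElement(strNamespaceURI As String, _
        strLocalName As String, strQName As String, _
        ByVal oAttributes As MSXML2.IVBSAXAttributes)
    If Not oLoc Is Nothing Then
        Debug.Print "<" & strLocalName & "> at line " & oLoc.lineNumber & _
                    ", column " & oLoc.columnNumber
    End If
End Sub
```

The position reported is only meaningful while an event is being handled, so read it inside the handler rather than after parsing has finished.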
Our examples implemented some of the above methods:
- Private Sub IVBSAXContentHandler_characters(strChars As String)
- Private Sub IVBSAXContentHandler_endDocument()
- Private Sub IVBSAXContentHandler_endElement(strNamespaceURI As String, strLocalName As String, strQName As String)
- Private Sub IVBSAXContentHandler_startDocument()
- Private Sub IVBSAXContentHandler_startElement(strNamespaceURI As String, strLocalName As String, strQName As String, ByVal oAttributes As MSXML2.IVBSAXAttributes)
Error Handling in SAX
To implement error handling in SAX, add a new class module and name it ErrorHandler (see Figure 8).
The code for the ErrorHandler class is in Listing 5.
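Listing 5 isn't reproduced in this text. As a rough sketch, an ErrorHandler class could look like the following, assuming we simply want to show the message together with its position in the document. The use of the locator's lineNumber and columnNumber in the message is an addition for illustration.

```vb
'--- Class module: ErrorHandler (a sketch, not the original Listing 5) ---
Option Explicit
Implements IVBSAXErrorHandler

Private Sub IVBSAXErrorHandler_error(ByVal oLocator As MSXML2.IVBSAXLocator, _
        strErrorMessage As String, ByVal nErrorCode As Long)
    ' Declared only because VB6 requires every interface member
End Sub

Private Sub IVBSAXErrorHandler_fatalError(ByVal oLocator As MSXML2.IVBSAXLocator, _
        strErrorMessage As String, ByVal nErrorCode As Long)
    MsgBox "Line " & oLocator.lineNumber & ", column " & _
           oLocator.columnNumber & ": " & strErrorMessage
End Sub

Private Sub IVBSAXErrorHandler_ignorableWarning(ByVal oLocator As MSXML2.IVBSAXLocator, _
        strErrorMessage As String, ByVal nErrorCode As Long)
    ' Declared only because VB6 requires every interface member
End Sub
```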
We also need to make modifications to the default form (see Listing 6; modifications are shown in bold).
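Listing 6 isn't reproduced in this text, but the change to the form probably amounts to creating an ErrorHandler object and assigning it to the reader's errorHandler property. Here's a sketch under that assumption; the variable names handler and errHandler, the file path, and the On Error line are illustrative.

```vb
Private Sub Form_Load()
    Dim SAXreader As New SAXXMLReader
    Dim handler As New contentHandler
    Dim errHandler As New ErrorHandler   ' the ErrorHandler class module

    Set SAXreader.contentHandler = handler
    Set SAXreader.errorHandler = errHandler   ' route parse errors to our class

    ' Assumption: a failed parse also raises a run-time error at this call,
    ' which we ignore because fatalError has already reported the problem
    On Error Resume Next
    SAXreader.parseURL "c:\inetpub\Books.xml"
End Sub
```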
The IVBSAXErrorHandler interface supports the following methods:
- error: Receives notification of a recoverable error
- fatalError: Receives notification of a non-recoverable error
- ignorableWarning: Receives notification of a warning
However, the current SAX2 implementation doesn't support the error() and ignorableWarning() methods. To see how errors are handled, let's cause a deliberate error on our XML document (see Listing 7).
We've omitted the "/" in the end tag of the <Author> element. When the application runs, the error message shown in Figure 9 is displayed.
The dialog box in Figure 9 is displayed by the handler for the fatalError event:
Private Sub IVBSAXErrorHandler_fatalError(ByVal oLocator As MSXML2.IVBSAXLocator, _
        strErrorMessage As String, ByVal nErrorCode As Long)
    MsgBox strErrorMessage
End Sub
When to Use SAX
Now that you've seen how SAX works, let's see what it's great for:
- SAX is suitable for large documents: While DOM requires the whole XML document to be loaded into main memory, SAX does not. This makes SAX particularly suitable for handling large XML documents, as only a portion of the document is handled at any one time. This cuts down on the memory requirements.
- SAX is good for retrieving small amounts of information: Using our previous application as an example, imagine you now have an XML document that contains a few hundred titles. If you want to simply extract the price of a title, it's clearly more efficient to use SAX than DOM as there's no need to wait for the whole XML document to be loaded in memory.
- SAX allows processing to be aborted when necessary: When the price of a title has been found, the processing of the XML document can be stopped immediately. This is a nice feature of SAX.
As in the real world, there's no free lunch: SAX has its share of limitations. Most notably:
- SAX doesn't allow direct access to document elements: As SAX processes an XML document sequentially, it doesn't allow access to elements that have been processed earlier. This is a severe limitation compared to DOM, which allows nodes to be accessed directly.
- Complex searches aren't easy to implement: As illustrated earlier, SAX offers little support for finding out the context of a particular section of the document. Developers have to implement their own data structures to maintain that information.
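As an illustration of the kind of data structure this involves, here's a minimal sketch that keeps a stack of open element names in a VB6 Collection, so that each characters event can tell where in the document it occurred. The names colPath and CurrentPath are hypothetical, the path /Books/Book/Title follows the structure of Books.xml, and the other interface members are omitted for brevity.

```vb
'--- In the contentHandler class module (sketch; other members omitted) ---
Private colPath As New Collection   ' stack of currently open element names

Private Sub IVBSAXContentHandler_startElement(strNamespaceURI As String, _
        strLocalName As String, strQName As String, _
        ByVal oAttributes As MSXML2.IVBSAXAttributes)
    colPath.Add strLocalName            ' push on the way in
End Sub

Private Sub IVBSAXContentHandler_endElement(strNamespaceURI As String, _
        strLocalName As String, strQName As String)
    colPath.Remove colPath.Count        ' pop on the way out
End Sub

' Builds a path such as /Books/Book/Title from the stack
Private Function CurrentPath() As String
    Dim v As Variant
    Dim s As String
    For Each v In colPath
        s = s & "/" & v
    Next v
    CurrentPath = s
End Function

Private Sub IVBSAXContentHandler_characters(strChars As String)
    If CurrentPath() = "/Books/Book/Title" Then
        Debug.Print "Title text: " & strChars
    End If
End Sub
```

Even this small amount of bookkeeping is something DOM gives you for free, which is the trade-off described above.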
Conclusion
In this article we've taken a quick look at how you can use the SAX interfaces in MSXML 3.0. Developers now have two primary ways of manipulating XML documents: DOM and SAX. Deciding which one to use for the task at hand is not always easy, and only real-world deployment will reveal which is more feasible. Have fun with SAX!
About Wei Meng Lee
Wei Meng is an author and developer specializing in XML and Web technologies. He is the co-author of XML Programming using the Microsoft XML Parser (Apress) and the series editor for Syngress' .NET Developer Series.