This Tech Tip reprinted with permission by java.sun.com
Most Java developers who work with XML are familiar with the Simple API for XML (SAX) and the Document Object Model (DOM) libraries. SAX is an event-based API, which means that a programmer typically registers a number of listeners with the parser, and when a specific XML grammar construct is reached (for example, an element or an attribute), the corresponding listener method is called. DOM, on the other hand, has a tree-based architecture: it scans in the entire document and builds an object tree with a node for each grammar construct it encounters. A programmer can then access and modify the object tree after the scanning is complete.
Both of these approaches have their drawbacks. Event-based APIs that make use of listeners are generally harder to work with because they're driven by the parser rather than by your code. Tree-based APIs can consume an inordinate amount of memory in comparison to the size of the document being scanned. However, there is now a third API available for Java developers to scan XML: the Streaming API for XML, or StAX.
What is the SJSXP?
The Sun Java Streaming XML Parser (SJSXP) is a high-speed implementation of StAX. BEA Systems, working in conjunction with Sun Microsystems, Inc., as well as XML guru James Clark, Stefan Haustein and Aleksandr Slominski (the XmlPull developers), and others in the Java Community Process, developed StAX under JSR 173. StAX is a parser-independent Java API based on a set of common interfaces.
The SJSXP is included with version 1.5 of the Java Web Services Developer Pack. The first thing that you're likely to notice about SJSXP is that it is based on a streaming API, which does not need to read an entire document before a developer can access any of the nodes. It also does not adhere to the principle of starting the parser and allowing the parser to "push" data to the event listener methods. Instead, SJSXP implements a "pull" method, where the parser maintains a pointer of sorts to the currently-scanned location in the document--this is often called a cursor. You simply ask the parser for the node that the cursor currently points to.
Using SJSXP to Parse XML Documents
Reading in XML documents with the SJSXP is fairly easy. Most of the work is done through an object that implements the XMLStreamReader interface. This interface represents a cursor that's moved across an XML document from beginning to end. A few things to keep in mind: the cursor always points to a single item, such as an element start-tag, a processing instruction, or a DTD declaration. Also, the cursor always moves forward (not backward), and you cannot perform any "look aheads" to see what's upcoming in the document. You can obtain an XMLStreamReader to read in XML from a file by asking an XMLInputFactory to create one over the file's input stream.
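The original listing for this step is not reproduced here, so what follows is only a minimal sketch of one way to obtain the reader with the standard javax.xml.stream API. The class name ReaderSetupSketch and its method names are made up for illustration.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class; not part of SJSXP or of the tech tip's sample archive.
final class ReaderSetupSketch {

    // Wraps any InputStream in a cursor-style XMLStreamReader.
    static XMLStreamReader open(InputStream in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        return factory.createXMLStreamReader(in);
    }

    // Convenience overload for a file path. Closing the reader does not
    // close the underlying stream, so the caller stays responsible for it.
    static XMLStreamReader openFile(String path)
            throws IOException, XMLStreamException {
        return open(new FileInputStream(path));
    }
}
```

A freshly created reader has not yet advanced past the start of the document, so the first call you make is normally hasNext() or next().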
You can then iterate through the XML file with a loop built on two methods. The hasNext() method in XMLStreamReader checks to see if there is another item available in the XML file. If there is one, you can use the next() method to advance the cursor to the next item. The next() method returns an integer code that indicates the type of grammatical construct (the item) that it encountered.
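The hasNext()/next() loop described above can be sketched as follows. This is an illustration rather than the tip's original listing: the class CursorLoopSketch is hypothetical, and it reads from a String instead of a file so that it is self-contained.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class name, used only for this illustration.
final class CursorLoopSketch {

    // Walks the whole document and records the integer code of every item.
    static List<Integer> eventCodes(String xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<Integer> codes = new ArrayList<>();
        try {
            while (reader.hasNext()) {          // is another item available?
                int eventType = reader.next();  // advance the cursor
                codes.add(eventType);
            }
        } finally {
            reader.close();
        }
        return codes;
    }
}
```

Because the cursor only moves forward, anything you need from an item has to be read before the next call to next().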
There are a number of get methods in XMLStreamReader that you can use to obtain the contents of the XML item that the cursor is pointing to. The first method is
public int getEventType()
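The method returns an integer code that identifies the type of item the parser found under the cursor. It's the same code returned by the next() method. The items are identified by one of the following constants, which are defined in the XMLStreamConstants interface: START_ELEMENT, END_ELEMENT, PROCESSING_INSTRUCTION, CHARACTERS, COMMENT, SPACE, START_DOCUMENT, END_DOCUMENT, ENTITY_REFERENCE, ATTRIBUTE, DTD, CDATA, NAMESPACE, NOTATION_DECLARATION, and ENTITY_DECLARATION.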
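When logging or debugging, it can help to turn the integer code back into a readable name. The sketch below does that with a plain switch over the XMLStreamConstants values. EventNames and describe() are hypothetical names, not part of the StAX API.

```java
import javax.xml.stream.XMLStreamConstants;

// Hypothetical utility; the constants themselves come from the StAX API.
final class EventNames {

    // Maps an event code from next() or getEventType() to its constant name.
    static String describe(int eventType) {
        switch (eventType) {
            case XMLStreamConstants.START_ELEMENT:          return "START_ELEMENT";
            case XMLStreamConstants.END_ELEMENT:            return "END_ELEMENT";
            case XMLStreamConstants.PROCESSING_INSTRUCTION: return "PROCESSING_INSTRUCTION";
            case XMLStreamConstants.CHARACTERS:             return "CHARACTERS";
            case XMLStreamConstants.COMMENT:                return "COMMENT";
            case XMLStreamConstants.SPACE:                  return "SPACE";
            case XMLStreamConstants.START_DOCUMENT:         return "START_DOCUMENT";
            case XMLStreamConstants.END_DOCUMENT:           return "END_DOCUMENT";
            case XMLStreamConstants.ENTITY_REFERENCE:       return "ENTITY_REFERENCE";
            case XMLStreamConstants.ATTRIBUTE:              return "ATTRIBUTE";
            case XMLStreamConstants.DTD:                    return "DTD";
            case XMLStreamConstants.CDATA:                  return "CDATA";
            case XMLStreamConstants.NAMESPACE:              return "NAMESPACE";
            case XMLStreamConstants.NOTATION_DECLARATION:   return "NOTATION_DECLARATION";
            case XMLStreamConstants.ENTITY_DECLARATION:     return "ENTITY_DECLARATION";
            default:                                        return "UNKNOWN(" + eventType + ")";
        }
    }
}
```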
If the item has a name, you can use the getName() and getLocalName() methods to obtain the name. The latter yields the raw name, without any extra information (for example, the name of the element without its namespace prefix). If you want to identify the namespace of the current item, you can use the getNamespaceURI() method.
If there is any accompanying text, such as the text in a DTD declaration or text inside an element, you can use the getText() and getElementText() methods to obtain it (the latter is used solely for elements).
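To make the name and text accessors concrete, here is a small sketch that reads the first element of a document and reports its prefix, local name, namespace URI, and text. The class NameAndTextSketch and the pipe-separated result format are assumptions made for the example. Note that getElementText() is only suitable for elements that contain text and no child elements.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class; it only exercises accessors defined by XMLStreamReader.
final class NameAndTextSketch {

    // Returns "prefix|localName|namespaceURI|text" for the first element,
    // or null if the document contains no element.
    static String describeFirstElement(String xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    QName name = reader.getName();          // qualified name
                    String local = reader.getLocalName();   // name without prefix
                    String ns = reader.getNamespaceURI();   // namespace of the element
                    // Reads the text content and leaves the cursor on the end tag.
                    String text = reader.getElementText();
                    return name.getPrefix() + "|" + local + "|" + ns + "|" + text;
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }
}
```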
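If an element has attributes associated with it, you can use the getAttributeCount() method to obtain the number of attributes the current element has. You can then retrieve information on each of them using the index-based accessors, such as getAttributeName(), getAttributeLocalName(), and getAttributeValue(), each of which takes the attribute's index.
If you know the local name of the attribute and the namespace URI of the element, you can also obtain the attribute value directly by calling getAttributeValue() with the namespace URI and the local name.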
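Both styles of attribute access can be sketched briefly. The class AttributeSketch and its helper methods are hypothetical. The example passes null as the namespace URI, which per the StAX API documentation means the namespace is not checked.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class name, for illustration only.
final class AttributeSketch {

    // Creates a reader and moves the cursor to the first start tag.
    private static XMLStreamReader atFirstElement(String xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                break;
            }
        }
        return reader;
    }

    // Index-based access: visit every attribute of the first element.
    static Map<String, String> firstElementAttributes(String xml) throws XMLStreamException {
        XMLStreamReader reader = atFirstElement(xml);
        Map<String, String> attributes = new LinkedHashMap<>();
        for (int i = 0; i < reader.getAttributeCount(); i++) {
            attributes.put(reader.getAttributeLocalName(i), reader.getAttributeValue(i));
        }
        reader.close();
        return attributes;
    }

    // Name-based access: look up one attribute without iterating.
    static String attributeValue(String xml, String localName) throws XMLStreamException {
        XMLStreamReader reader = atFirstElement(xml);
        String value = reader.getAttributeValue(null, localName);
        reader.close();
        return value;
    }
}
```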
As you might have guessed, not all of the accessor methods are applicable in every state. For example, if you are currently processing a DTD, you cannot call methods that apply only to elements, such as getElementText(). If you do so, you will either receive an exception stating that the parser has identified a conflicting event type, or the method itself will return null.
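One defensive pattern, shown here only as a sketch, is to check the cursor's state before calling an accessor that is valid in just a few states. StateGuardSketch and localNameOrNull() are hypothetical names. isStartElement() and isEndElement() are standard XMLStreamReader methods.

```java
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper illustrating a state check before an accessor call.
final class StateGuardSketch {

    // Returns the local name only when the cursor is on a start or end tag;
    // in any other state (DTD, comment, text, and so on) it returns null
    // instead of calling an accessor that does not apply.
    static String localNameOrNull(XMLStreamReader reader) {
        if (reader.isStartElement() || reader.isEndElement()) {
            return reader.getLocalName();
        }
        return null;
    }
}
```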
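You can turn on a number of parser properties by using the setProperty() method of the XMLInputFactory class. For example, setting the XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES property to Boolean.TRUE specifies that entity references encountered by the parser will be replaced.
To prevent the parser from supporting external entities, set the XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES property to Boolean.FALSE.
To make the parser namespace aware, set the XMLInputFactory.IS_NAMESPACE_AWARE property to Boolean.TRUE.
Note that the current version of SJSXP will accept a request to enable validation through the XMLInputFactory.IS_VALIDATING property, but the parser is currently non-validating.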
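The property settings described above might be combined as in the following sketch. The class FactoryPropertiesSketch is hypothetical; the property constants are the ones defined on XMLInputFactory. The validation property is deliberately left alone, since the text notes that the parser is non-validating and other StAX implementations may reject it.

```java
import javax.xml.stream.XMLInputFactory;

// Hypothetical class; shows setProperty() with standard StAX property names.
final class FactoryPropertiesSketch {

    static XMLInputFactory configuredFactory() {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Replace entity references with their replacement text.
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.TRUE);
        // Do not resolve external entities.
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        // Turn on namespace processing.
        factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, Boolean.TRUE);
        // IS_VALIDATING is intentionally not set in this sketch.
        return factory;
    }
}
```

Properties are set on the factory, so they apply to every reader that the factory creates afterward.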
If any of these properties are enabled, you can use the setXMLReporter() method of the XMLInputFactory class to handle any errors faced by the parser. The easiest way to determine exactly what type of error the parser encountered is to pass an anonymous inner class that implements the XMLReporter interface to the setXMLReporter() method.
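An anonymous XMLReporter might look like the sketch below. The class ReporterSketch and the idea of collecting messages in a caller-supplied list are assumptions for the example. The original tip may simply have printed the values.

```java
import java.util.List;
import javax.xml.stream.Location;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLReporter;
import javax.xml.stream.XMLStreamException;

// Hypothetical class; registers an anonymous XMLReporter on a factory.
final class ReporterSketch {

    // Returns a factory whose reporter appends each reported problem to sink.
    static XMLInputFactory factoryWithReporter(final List<String> sink) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setXMLReporter(new XMLReporter() {
            @Override
            public void report(String message, String errorType,
                               Object relatedInformation, Location location)
                    throws XMLStreamException {
                // The location may be absent, so guard against null.
                String where = (location == null)
                        ? "unknown location"
                        : "line " + location.getLineNumber();
                sink.add(errorType + ": " + message + " (" + where + ")");
            }
        });
        return factory;
    }
}
```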
Using SJSXP to Write XML Documents
Writing XML output is easy with SJSXP. In this case, you can use the XMLStreamWriter interface instead of the XMLStreamReader interface. The XMLStreamWriter interface provides direct methods to write elements, attributes, comments, text, and all the other parts of an XML document. You obtain an XMLStreamWriter from an XMLOutputFactory and then call its write methods in document order.
When you finish writing out each of the elements, you need to flush and close the writer.
The preceding code will output the following XML (formatted here with line breaks for easier reading):
<!--all elements here are explicitly in the HTML namespace-->
<?xml version="1.0" encoding="utf-8"?>
Java information is
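As a separate and much smaller illustration of the writer interface (it is not the listing that produced the output shown above), the following sketch writes a one-element document to a string. The class WriterSketch, the note element, and its lang attribute are all made up for the example.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical class; the element and attribute names are invented.
final class WriterSketch {

    static String writeNote() throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter writer =
                XMLOutputFactory.newInstance().createXMLStreamWriter(out);

        writer.writeStartDocument("utf-8", "1.0");   // encoding, then version
        writer.writeComment("a hypothetical document");
        writer.writeStartElement("note");
        writer.writeAttribute("lang", "en");
        writer.writeCharacters("Java information");
        writer.writeEndElement();
        writer.writeEndDocument();

        // Flush and close once every element has been written.
        writer.flush();
        writer.close();
        return out.toString();
    }
}
```

To write to a file instead, you could pass a FileWriter or FileOutputStream to createXMLStreamWriter() in place of the StringWriter.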
Filtering XML Documents
You can create a filter for an incoming XML document if you want to examine only certain item types as you scan through it. To do so, create a class that implements the StreamFilter interface. This interface consists of only one method, accept(), that accepts an XMLStreamReader object and returns a primitive boolean. A typical implementation of accept() inspects the item under the cursor and returns true only for the items you want to keep.
You then create a filtered reader by calling the createFilteredReader() method of the XMLInputFactory, passing in both the original XML stream reader and the StreamFilter implementation.
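A filter that keeps only element start and end tags could be sketched as follows. ElementOnlyFilter and its two static helpers are hypothetical. countAccepted() applies the filter by hand to an unfiltered reader so that the effect of accept() is easy to see, and filtered() shows the createFilteredReader() call itself.

```java
import java.io.StringReader;
import javax.xml.stream.StreamFilter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical filter that accepts only start and end tags.
final class ElementOnlyFilter implements StreamFilter {

    @Override
    public boolean accept(XMLStreamReader reader) {
        return reader.isStartElement() || reader.isEndElement();
    }

    // Wraps a reader over xml in a filtered reader that uses this filter.
    static XMLStreamReader filtered(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader original =
                factory.createXMLStreamReader(new StringReader(xml));
        return factory.createFilteredReader(original, new ElementOnlyFilter());
    }

    // Counts how many items of an unfiltered scan the filter would accept.
    static int countAccepted(String xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        StreamFilter filter = new ElementOnlyFilter();
        int accepted = 0;
        while (reader.hasNext()) {
            reader.next();
            if (filter.accept(reader)) {
                accepted++;
            }
        }
        reader.close();
        return accepted;
    }
}
```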
For more information on the SJSXP, see the Sun Java Streaming XML Parser release notes.
Running the Sample Code for the Sun Java Streaming XML Parser
- Download the sample archive (ttfeb2005sjsxp.jar) for this tech tip.
- Download and install Java WSDP 1.5 from the Java Web Services Developer Pack Downloads page.
- Change to the directory where you downloaded the sample archive. Uncompress the JAR file for the sample archive as follows:
jar xvf ttfeb2005sjsxp.jar
- Set your classpath to include the sjsxp.jar and jsr173_api.jar files, which are located in the sjsxp/lib directory of the Java WSDP 1.5 installation.
- Compile and run the sample executable that parses an XML document. In response, you should see an entry similar to the following for each XML item:
Event Type (Code=11): DTD
Without a Name
With Text: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0
- Compile and run the SJSXPOutput executable. The output will be sent to a file named XMLOutputFile and will consist of the elements shown in the output example above.
Copyright (c) 2004-2005 Sun Microsystems, Inc.
All Rights Reserved.