Practical XML Parsing With Java and StaxMate

Java has no shortage of XML libraries and APIs: common ones like DOM, SAX, StAX, and JAXB, plus more esoteric ones like XOM, VTD-XML, Castor, etc. Of the more low-level XML tools (as opposed to data binding or other high-level functionality), the most common ones are DOM, SAX, and StAX, and this article explains how to use StAX effectively with StaxMate.

Between DOM, SAX, and StAX, why would one use StAX? DOM loads the entire document into memory and constructs a tree. It’s easy to navigate the tree however you wish, but keeping the whole document resident in RAM is impractical for larger documents. SAX is a streaming parser, so it doesn’t have DOM’s memory usage problem, but its callback-driven style is awkward for many XML parsing tasks. StAX is a newer, pull-style streaming API that is more convenient than SAX while delivering competitive performance.
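
To make the comparison concrete, here is a minimal sketch of pull parsing with the plain StAX API that ships with the JDK (javax.xml.stream), with no StaxMate involved. The class name StaxPullDemo and the tiny inline document are made up for illustration and are not part of the sample project.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class; uses whichever StAX implementation the JDK selects.
class StaxPullDemo {
    private static final XMLInputFactory FACTORY = XMLInputFactory.newFactory();

    /** Returns the local names of all start tags, in document order. */
    static List<String> elementNames(String xml) throws XMLStreamException {
        List<String> names = new ArrayList<>();
        XMLStreamReader reader = FACTORY.createXMLStreamReader(new StringReader(xml));
        try {
            // The caller pulls events one at a time, so no tree is ever built.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(elementNames("<a><b/><c>text</c></a>"));
    }
}
```

Even in this tiny case the code has to inspect every event type itself; keeping track of where you are in the hierarchy is left entirely to the caller.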

Though StAX is easier to use than SAX, it could be better, which is where StaxMate fits in. StaxMate is a library that uses a StAX parser under the hood to get closer to the goal of DOM-like ease of use.

Contrived Example

There’s a sample project on GitHub. We’ll walk through the XML and the code step by step to show what’s going on. Try running the unit tests (with mvn clean install) and make sure everything passes.

The XML this code parses describes animals and vegetables and the various ways in which one may eat them. The XML is rather strange, but this is intentional so that different types of parsing tasks can be demonstrated.

Initialization

private static final SMInputFactory FACTORY = new SMInputFactory(new WstxInputFactory());

SMInputFactory is the starting point for using StaxMate. We only ever want to have one SMInputFactory (and one StAX XMLInputFactory) since they are somewhat expensive to create, are threadsafe, and are intended to be re-used. Here, we’re using the Woodstox implementation of StAX. StAX, like SAX or JMS, is just a specification, so there are multiple implementations. The JDK comes with one built in (SJSXP) but the Woodstox implementation is superior.
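
The same create-once, share-everywhere pattern applies to a bare StAX factory. The following is a small sketch using only the JDK’s javax.xml.stream API; StaxFactoryHolder is a hypothetical name, and enabling IS_COALESCING is just an example of a setting you might want to apply before the factory is shared. Whether a given factory is safe to share between threads depends on the implementation that gets selected, so check its documentation.

```java
import javax.xml.stream.XMLInputFactory;

// Hypothetical holder class illustrating "configure once, then reuse".
class StaxFactoryHolder {
    // Created and configured exactly once, when the class is initialized.
    private static final XMLInputFactory FACTORY = createFactory();

    private static XMLInputFactory createFactory() {
        // newFactory() looks up whichever StAX implementation is available.
        XMLInputFactory factory = XMLInputFactory.newFactory();
        // Do all configuration here, before any other code can see the factory.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        return factory;
    }

    static XMLInputFactory factory() {
        return FACTORY;
    }

    public static void main(String[] args) {
        // The printed class name depends on which implementation is on the classpath.
        System.out.println(factory().getClass().getName());
    }
}
```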

Starting to parse

Here’s the first bit of XML (see more):

<?xml version="1.0"?>
<deliciousFoods>
  <animals>
    <animal name="Pig">
      <meat>
        <name>Prosciutto</name>
      </meat>
      <meat>
        <name>Speck</name>
      </meat>
    </animal>
    ...

We’ve got a deliciousFoods root element and a list of animal elements inside a container animals element. The animal name is specified as an attribute, while the meat name is specified as a child element; the difference is there simply so that we can see how StaxMate handles both.

The top level of the parsing code (from here):

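public Food parse(InputStream xml) throws XMLStreamException {
    Food food = new Food();

    // Cursor over the document's root element; positioned before it until advanced
    SMHierarchicCursor rootC = FACTORY.rootElementCursor(xml);

    try {
        // Move onto the root element (deliciousFoods)
        rootC.advance();

        // Only exposes the direct children of the root element
        SMInputCursor rootChildCursor = rootC.childElementCursor();

        while (rootChildCursor.getNext() != null) {
            handleRootChildElement(food, rootChildCursor);
        }
    } finally {
        // Close the underlying stream reader and the input it reads from
        rootC.getStreamReader().closeCompletely();
    }

    return food;
}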

First, we make a StaxMate cursor for the root element. Since StaxMate is, after all, based on StAX, it’s still a fundamentally streaming-style API, but, as you’ll see (especially if you’ve ever used SAX or StAX directly), it provides some nice helpers for common tasks.

Cursors start out before the first element, just like an iterator is before the first element in a collection until you call next() the first time. So, we advance() the root cursor, and now it should be at the root element (deliciousFoods).

The childElementCursor() call creates a cursor that is filtered to only expose the start of the first level of child elements, so in this case, the first thing it produces will be the start of the animals tag. We then loop over every child element of deliciousFoods, filling in the Food object we created at the top as we go. The result of getNext() doesn’t need to be inspected in our usage of it, since we know (because it’s a child element cursor) that the event is always START_ELEMENT. All we care about is that it’s not null. Once it’s null, the cursor has read all of its input.
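
To see what the child element cursor saves you from, here is a rough sketch of the same filtering done by hand with the plain JDK StAX API: track the nesting depth and keep only the start tags one level below the root. This illustrates the idea only and is not how StaxMate is implemented; ChildElementDemo and rootChildNames are made-up names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class: hand-rolled "direct children only" filtering.
class ChildElementDemo {
    private static final XMLInputFactory FACTORY = XMLInputFactory.newFactory();

    /** Returns the local names of the direct children of the root element. */
    static List<String> rootChildNames(String xml) throws XMLStreamException {
        List<String> names = new ArrayList<>();
        XMLStreamReader reader = FACTORY.createXMLStreamReader(new StringReader(xml));
        try {
            int depth = 0; // 0 = outside the root, 1 = inside the root, ...
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    if (depth == 2) {
                        // A start tag exactly one level below the root
                        names.add(reader.getLocalName());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    depth--;
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<deliciousFoods><animals><animal name=\"Pig\"/></animals>"
                + "<vegetables/></deliciousFoods>";
        System.out.println(rootChildNames(xml));
    }
}
```

With StaxMate the depth bookkeeping disappears: each cursor is already scoped to one level of the hierarchy.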

Children of the root element

private void handleRootChildElement(Food food, SMInputCursor rootChildCursor) throws XMLStreamException {

    switch (rootChildCursor.getLocalName()) {
        case "animals":
            handleAnimals(food.getAnimals(), rootChildCursor.childElementCursor());
            break;
        case "vegetables":
            handleVegetables(food.getVegetables(), rootChildCursor.childElementCursor());
            break;
    }
}

In this case there are only two child elements, and we know what order they’re in, but to avoid being needlessly tied to a specific ordering of XML tags, we generally use a switch on the tag name (a handy Java 7 feature!) to make the code more flexible.

In each case, we simply get a child element cursor (which would iterate over each animal or vegetable tag) and proceed.

private void handleAnimals(List<Animal> animals, SMInputCursor animalsCursor) throws XMLStreamException {
    while (animalsCursor.getNext() != null) {
        animals.add(extractAnimal(animalsCursor));
    }
}

Here we use getNext() in a loop again to iterate over each animal and add each animal as it’s parsed to the list of animals.

Animals

private Animal extractAnimal(SMInputCursor animalsCursor) throws XMLStreamException {
    Animal a = new Animal();
    a.setName(animalsCursor.getAttrValue("name"));

    SMInputCursor meatsCursor = animalsCursor.childElementCursor();

    while (meatsCursor.getNext() != null) {
        Meat m = new Meat();
        SMInputCursor nameCursor = meatsCursor.childElementCursor().advance();
        m.setName(nameCursor.getElemStringValue());
        a.getMeats().add(m);
    }

    return a;
}

As a reminder, here’s the XML in question:

<animal name="Cow">
  <meat>
    <name>Clod</name>
  </meat>
  ...

Finally, the data we care about! We’re not worrying too much about making our business model (Animal in this case) immutable, but we can at least make methods like setName() package-private so that the code that uses the result of our parsing can’t mess with the data.

Setting the name from the attribute is simple since the cursor is on the animal element already. Getting meat names is a bit more complicated since there’s an annoying extra level of XML hierarchy. The meat element isn’t doing anything other than holding the name element, so in a perfect world the XML wouldn’t look like this, but unfortunately this sort of XML is frequently encountered in the wild, hence its presence in our contrived example.

What we can do is create a child element cursor to iterate the children of animal (which will all be meat tags), and for each meat tag, we create another child element cursor. This last cursor will only traverse one element, the name element. Note that we still need to advance() that cursor so that we can get the element value of the name tag.

Another option (which I did not use in the sample code) is to call collectDescendantText() on the meatsCursor. If you’re sure that only one child element (name in this case) exists, this aggregates the text contents of the child nodes without creating a child element cursor.
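
For intuition about what collecting descendant text involves, here is a sketch of the same idea with the plain JDK StAX API: starting on a start tag, append every piece of character data until the matching end tag. This is an analogue written for illustration, not StaxMate’s implementation; DescendantTextDemo, descendantText, and meatNames are made-up names. Note that in this sketch the whitespace used for indentation is collected too, which is why the result is trimmed.

```java
import static javax.xml.stream.XMLStreamConstants.CDATA;
import static javax.xml.stream.XMLStreamConstants.CHARACTERS;
import static javax.xml.stream.XMLStreamConstants.END_ELEMENT;
import static javax.xml.stream.XMLStreamConstants.START_ELEMENT;

import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class: gather all text nested inside the current element.
class DescendantTextDemo {
    private static final XMLInputFactory FACTORY = XMLInputFactory.newFactory();

    /** Reader must be on a START_ELEMENT; it ends up on the matching END_ELEMENT. */
    static String descendantText(XMLStreamReader reader) throws XMLStreamException {
        if (reader.getEventType() != START_ELEMENT) {
            throw new IllegalStateException("reader is not on a start tag");
        }
        StringBuilder text = new StringBuilder();
        int depth = 1;
        while (depth > 0) {
            switch (reader.next()) {
                case START_ELEMENT: depth++; break;
                case END_ELEMENT: depth--; break;
                case CHARACTERS:
                case CDATA: text.append(reader.getText()); break;
                default: break; // comments, processing instructions, etc.
            }
        }
        return text.toString();
    }

    /** Returns the trimmed text found inside each meat element. */
    static List<String> meatNames(String xml) throws XMLStreamException {
        List<String> names = new ArrayList<>();
        XMLStreamReader reader = FACTORY.createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == START_ELEMENT && "meat".equals(reader.getLocalName())) {
                    names.add(descendantText(reader).trim());
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }
}
```

The shortcut only makes sense when the wrapper element really has a single text-bearing child; with more children, their text would be concatenated together.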

Vegetables

Animals are done; on to the vegetables.

The XML:

<vegetables>
  <vegetable>
    <name>Brussels sprouts</name>
    <preparations>
      <preparation>Sauteed</preparation>
      <preparation>Roasted</preparation>
      <preparation>Steamed</preparation>
    </preparations>
  </vegetable>

The top-level loop is just like the one for animals:

private void handleVegetables(List<Vegetable> vegetables, SMInputCursor vegetablesCursor) throws XMLStreamException {
    while (vegetablesCursor.getNext() != null) {
        vegetables.add(extractVegetable(vegetablesCursor));
    }
}

The structure of each vegetable is different from the animals. The name is a child tag, and the ways to eat it are in a container element (preparations) and each preparation is just text in a node instead of being wrapped in a name or other extra node.

private Vegetable extractVegetable(SMInputCursor vegetableCursor) throws XMLStreamException {
    Vegetable v = new Vegetable();

    SMInputCursor vegChildCursor = vegetableCursor.childElementCursor();

    while (vegChildCursor.getNext() != null) {
        switch (vegChildCursor.getLocalName()) {
            case "name":
                v.setName(vegChildCursor.getElemStringValue());
                break;
            case "preparations":
                SMInputCursor preparationCursor = vegChildCursor.childElementCursor();
                while (preparationCursor.getNext() != null) {
                    v.getPreparations().add(preparationCursor.getElemStringValue());
                }
                break;
        }
    }

    return v;
}

Since name and preparations are at the same level, we need another switch. In the preparations case, we can use getElemStringValue() directly on the child element cursor without the extra layer of cursors that was used in the animal code.

Conclusion

For a document this small, DOM would have been perfectly acceptable. Still, this code isn’t much different from what the equivalent DOM code would look like, and it is actually a little faster. There’s a very crude benchmark which, at least on my laptop, shows that running the StAX code above takes about 57 μs to parse the file, while simply loading the DOM Document takes 74 μs. In other words, parsing the document and populating the desired data model with StAX is faster than the parsing step alone with DOM, even for a document this small. I wouldn’t read too much into this, though: it’s a very crude benchmark. Little was done other than making sure to warm up the JIT, so it suffers from all sorts of problems due to on-stack replacement of the for loops, etc. Try using VisualVM’s Visual GC or jconsole while running the benchmark; the two parsers have somewhat different allocation patterns.
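
For readers who want to try a comparison of their own, here is a minimal sketch of a warm-up-then-measure loop using only JDK classes. It is not the benchmark from the sample project: CrudeXmlBenchmark, the inline document, and the iteration counts are all made up, it compares the JDK’s built-in DOM and StAX parsers rather than Woodstox and StaxMate, and it has the same weaknesses described above. A harness such as JMH is the better choice for numbers you intend to rely on.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

// Hypothetical, deliberately crude timing loop.
class CrudeXmlBenchmark {
    /** A unit of work that may throw, so parser calls need no wrapping. */
    interface Task {
        void run() throws Exception;
    }

    /** Runs the task warmup times unmeasured, then returns the mean time per run in microseconds. */
    static double averageMicros(int warmup, int iterations, Task task) throws Exception {
        for (int i = 0; i < warmup; i++) {
            task.run(); // give the JIT a chance to compile the hot paths
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / 1_000.0 / iterations;
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<animals><animal name=\"Pig\"><meat><name>Speck</name></meat></animal></animals>"
                .getBytes(StandardCharsets.UTF_8);
        DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
        XMLInputFactory staxFactory = XMLInputFactory.newFactory();

        double domMicros = averageMicros(20_000, 20_000,
                () -> domFactory.newDocumentBuilder().parse(new ByteArrayInputStream(xml)));
        double staxMicros = averageMicros(20_000, 20_000, () -> {
            XMLStreamReader reader = staxFactory.createXMLStreamReader(new ByteArrayInputStream(xml));
            while (reader.hasNext()) {
                reader.next(); // walk every event without building a model
            }
            reader.close();
        });
        System.out.printf("DOM: %.1f us, StAX: %.1f us%n", domMicros, staxMicros);
    }
}
```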

Posted by Marshall Pierce
