Andy's Software Engineering Corner: Conveniently Processing Large XML Files with Java

Friday, January 6, 2012

Conveniently Processing Large XML Files with Java

When processing XML data I find it most convenient to load the whole document using a DOM parser and fire some XPath-queries against the result. However, since we're building a multi-tenant eCommerce plattform we regularly have to handle large XML files, with file sizes above 1 GB. You certainly don't want to load such a beast into the heap of a production server, since it easily grows up to 3GB+ as DOM representation.

So what to do? Well, SAX to the rescue! Processing a large XML file using a SAX parser still requires constant (low) memory, since it only invokes callback for detected XML tokens. But, on the other hand, parsing complex XML really becomes a mess.

To resolve this problem we need to have a closer look at our XML input data. Most of the time, at least in our cases, you don't need the whole DOM at once. Say your importing product informations, it sufficient to look at one product at a time. Example:

<nodes>

    <node>

        <name>Node 1</name>

        <price>100</price>

    </node>

    <node>

        <name>Node 2</name>

        <price>23</price>

    </node>

    <node>

        <name>Node 3</name>

        <price>12.4</price>

        <resources>

            <resource type="test">Hello 1</resource>

            <resource type="test1">Hello 2</resource>

        </resources>

    </node>

</nodes>

When processing Node 1, we don't need access to any attribute of Node 2 or three, respectively when processing Node 2, we don't need access to Node 1 or 3, and so on. So what we want is a partial DOM, in our example for every <node>.

What we've therefore built is a SAX parser, for which you can specify in which XML elements you are interested. Once such an element starts, we record the whole sub-tree. When this completes we notify a handler which then can run XPath expressions against this partial DOM. After that, the DOM is released and the SAX parser continues.

Here is a shortened example of how you could parse the XML above - one "<node>" at a time:

   XMLReader r = new XMLReader();

   r.addHandler("node", new NodeHandler() {

     @Override
     public void process(StructuredNode node) {
       System.out.println(node.queryString("name"));
       System.out.println(node.queryValue("price").asDouble(0d));
     }
   });

   r.parse(new FileInputStream("src/examples/test.xml"));

The full example, along with the implementation is open source (MIT-License) and available here:
https://github.com/andyHa/scireumOpen/tree/master/src/com/scireum/open/xml
https://github.com/andyHa/scireumOpen/blob/master/src/examples/ExampleXML.java

We successfully handle up to five parallel imports of 1GB+ XML files in our production system, without measurable heap growth. (Instead of using a FileInputStream, we use JAVAs ZIP capabilities and directly open and process ZIP versions of the XML file. This shrinks those monsters down to 20-50MB and makes uploads etc. much easier.)

This post is the first part of the my series "Enterprisy Java" - We share our hints and tricks how to overcome the obstacles when trying to build several multi tenant web applications out of a set of common modules.