Friday, January 6, 2012

Conveniently Processing Large XML Files with Java

When processing XML data, I find it most convenient to load the whole document using a DOM parser and fire some XPath queries against the result. However, since we're building a multi-tenant eCommerce platform, we regularly have to handle large XML files, with file sizes above 1 GB. You certainly don't want to load such a beast into the heap of a production server, since it easily grows to 3 GB or more as a DOM representation.

So what to do? Well, SAX to the rescue! Processing a large XML file using a SAX parser requires only constant (low) memory, since it merely invokes callbacks for the XML tokens it detects. On the other hand, parsing complex XML this way really becomes a mess, because you have to keep track of where you are in the document yourself.
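To illustrate both points, here is a minimal sketch using the SAX API that ships with the JDK. It only extracts the text of every <name> element from a small inline document, yet it already needs a field to remember whether the parser is currently inside such an element. The class names PlainSaxExample and NameCollector are made up for this illustration and are not part of our library.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class PlainSaxExample {

    /** Collects the text of every <name> element, tracking the parser state by hand. */
    static class NameCollector extends DefaultHandler {
        final List<String> names = new ArrayList<>();
        private StringBuilder text; // non-null only while inside a <name> element

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("name".equals(qName)) {
                text = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // SAX may deliver one text node in several chunks, so append.
            if (text != null) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("name".equals(qName)) {
                names.add(text.toString());
                text = null;
            }
        }
    }

    static List<String> names(String xml) throws Exception {
        NameCollector handler = new NameCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<nodes><node><name>Node 1</name><price>100</price></node>"
                + "<node><name>Node 2</name><price>23</price></node></nodes>";
        System.out.println(names(xml));
    }
}
```

Memory use stays flat because nothing but the current text buffer is retained. Every additional element or attribute you care about, however, means more state fields and more branches in the three callbacks.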

To resolve this problem, we need to take a closer look at our XML input data. Most of the time, at least in our cases, you don't need the whole DOM at once. Say you're importing product information: it is sufficient to look at one product at a time. Example:

<nodes>
    <node>
        <name>Node 1</name>
        <price>100</price>
    </node>
    <node>
        <name>Node 2</name>
        <price>23</price>
    </node>
    <node>
        <name>Node 3</name>
        <price>12.4</price>
        <resources>
            <resource type="test">Hello 1</resource>
            <resource type="test1">Hello 2</resource>
        </resources>
    </node>
</nodes>

When processing Node 1, we don't need access to any attribute of Node 2 or Node 3. Likewise, when processing Node 2, we don't need access to Node 1 or Node 3, and so on. So what we want is a partial DOM, in our example one for every <node>.

What we've therefore built is a SAX parser for which you can specify which XML elements you are interested in. Once such an element starts, we record the whole sub-tree. When it completes, we notify a handler, which can then run XPath expressions against this partial DOM. After that, the DOM is released and the SAX parser continues.
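The idea can be sketched with nothing but JDK classes. The following is a simplified illustration of the technique, not the actual implementation linked further down: Recorder and SubtreeHandler are hypothetical names, namespaces are ignored, and only one element name can be registered.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class PartialDomSketch {

    /** Hypothetical callback, invoked once per completed sub-tree. */
    interface SubtreeHandler {
        void process(Element subtree) throws Exception;
    }

    /** Records each element called targetName as a small DOM and passes it to the handler. */
    static class Recorder extends DefaultHandler {
        private final String targetName;
        private final DocumentBuilder builder;
        private final SubtreeHandler handler;
        private Document doc; // non-null only while a sub-tree is being recorded
        private Node current; // node that receives the next child

        Recorder(String targetName, DocumentBuilder builder, SubtreeHandler handler) {
            this.targetName = targetName;
            this.builder = builder;
            this.handler = handler;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (doc == null) {
                if (!targetName.equals(qName)) {
                    return; // outside any element of interest: keep nothing
                }
                doc = builder.newDocument();
                current = doc;
            }
            Element element = doc.createElement(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                element.setAttribute(atts.getQName(i), atts.getValue(i));
            }
            current.appendChild(element);
            current = element;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (doc != null) {
                current.appendChild(doc.createTextNode(new String(ch, start, length)));
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (doc == null) {
                return;
            }
            current = current.getParentNode();
            if (current == doc) { // the recorded sub-tree is complete
                try {
                    handler.process(doc.getDocumentElement());
                } catch (Exception e) {
                    throw new SAXException(e);
                }
                doc = null; // release the partial DOM before parsing continues
                current = null;
            }
        }
    }

    /** Returns "name=price" for every <node>, holding one sub-tree in memory at a time. */
    static List<String> namesAndPrices(String xml) throws Exception {
        List<String> result = new ArrayList<>();
        XPath xpath = XPathFactory.newInstance().newXPath();
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Recorder recorder = new Recorder("node", builder,
                node -> result.add(xpath.evaluate("name", node) + "=" + xpath.evaluate("price", node)));
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), recorder);
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<nodes><node><name>Node 1</name><price>100</price></node>"
                + "<node><name>Node 2</name><price>23</price></node></nodes>";
        System.out.println(namesAndPrices(xml));
    }
}
```

The peak memory of this sketch depends on the largest single <node> sub-tree rather than on the size of the whole file, which is the property we are after.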

Here is a shortened example of how you could parse the XML above - one "<node>" at a time:

   XMLReader r = new XMLReader();

   r.addHandler("node", new NodeHandler() {

     @Override
     public void process(StructuredNode node) {
       System.out.println(node.queryString("name"));
       System.out.println(node.queryValue("price").asDouble(0d));
     }
   });

   r.parse(new FileInputStream("src/examples/test.xml"));

The full example, along with the implementation, is open source (MIT License) and available here:
https://github.com/andyHa/scireumOpen/tree/master/src/com/scireum/open/xml
https://github.com/andyHa/scireumOpen/blob/master/src/examples/ExampleXML.java

We successfully handle up to five parallel imports of 1 GB+ XML files in our production system, without measurable heap growth. (Instead of using a FileInputStream, we use Java's ZIP capabilities and directly open and process zipped versions of the XML files. This shrinks those monsters down to 20-50 MB and makes uploads etc. much easier.)
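One way to read a zipped XML file without unpacking it first is java.util.zip.ZipInputStream. The sketch below assumes the XML is the first entry of the archive. The helper names openFirstEntry, zip and countNodes are made up for this example, and the archive is built in memory only to keep the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class ZippedXmlExample {

    /** Positions the stream at the first entry of a ZIP archive so a parser can read it. */
    static InputStream openFirstEntry(InputStream zipped) throws IOException {
        ZipInputStream zip = new ZipInputStream(zipped);
        ZipEntry entry = zip.getNextEntry();
        if (entry == null) {
            zip.close();
            throw new IOException("ZIP archive contains no entries");
        }
        // Reads now return the decompressed bytes of that entry and end when the entry ends.
        return zip;
    }

    /** Demo helper: builds a ZIP archive with a single entry in memory. */
    static byte[] zip(String entryName, String content) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.putNextEntry(new ZipEntry(entryName));
            out.write(content.getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return bytes.toByteArray();
    }

    /** Counts <node> elements by streaming the zipped XML through a SAX parser. */
    static int countNodes(byte[] zippedXml) throws Exception {
        final int[] count = {0};
        DefaultHandler counter = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("node".equals(qName)) {
                    count[0]++;
                }
            }
        };
        try (InputStream in = openFirstEntry(new ByteArrayInputStream(zippedXml))) {
            SAXParserFactory.newInstance().newSAXParser().parse(in, counter);
        }
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        byte[] archive = zip("test.xml", "<nodes><node/><node/></nodes>");
        System.out.println(countNodes(archive));
    }
}
```

Since the data is decompressed on the fly, neither the archive nor the unpacked XML has to fit into memory. For a real import you would wrap a FileInputStream instead of the in-memory byte array.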


This post is the first part of my series "Enterprisy Java", in which we share our hints and tricks on how to overcome the obstacles when trying to build several multi-tenant web applications out of a set of common modules.

6 comments:

  1. Nice little bit of code, thanks for sharing. Apache Jersey has similar capabilities for stream parsing JSON and building a tree of just the bits you are interested in but I've always been surprised why there isn't this type of support in a core Java XML library.

  2. Hi Andy,
    That's a nice approach indeed. I would also recommend having a look at our approach, which was specifically designed to handle very large XML input using SAX. The product allows you to specify (runtime config subscription) which types you want to process and sends these as JavaBeans to the processor component. It also allows you to specify how to handle 1..many relationships. So, for instance, say I'd like to process the element and I'd also want to process each of the elements, I could do that by detaching the elements from their parent. Well, have a look here: http://xml2java.net or here: http://xml2java.net/documents/XMLParserTechnologyForProcessingHugeXMLfiles.pdf

    Kind regards,
    Lolke Dijkstra

  3. Hi, how do I get the values in one-time loading? The handler is not working in the order.

    Replies
    1. Sorry, I don't get the question. The handler behaves pretty much like a SAX parser. Therefore you get all values in the order of the document.

      regards Andy

  4. Not SAX, it is so 1990s, vtd-xml is the answer

    Replies
    1. Nope not to this question...

      The first sentence on the website says:
      The world's most memory-efficient (1.3x~1.5x the size of an XML document) random-access XML parser.

      Having to deal with XML files of 1 GB, that's still too much, and the SAX/partial DOM approach is better for this purpose.
