Friday, January 6, 2012

Conveniently Processing Large XML Files with Java

When processing XML data, I find it most convenient to load the whole document using a DOM parser and fire some XPath queries against the result. However, since we're building a multi-tenant eCommerce platform, we regularly have to handle large XML files, with sizes above 1 GB. You certainly don't want to load such a beast into the heap of a production server, since it easily grows to 3 GB or more as a DOM representation.

So what to do? Well, SAX to the rescue! Processing a large XML file with a SAX parser requires only constant (low) memory, since it merely invokes callbacks for the XML tokens it detects. On the other hand, parsing complex XML this way quickly becomes a mess.
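To make that trade-off concrete, here is a minimal sketch that uses only the JDK's SAX API. The class SaxNameCollector and its method collectNames are made up for this illustration and are not part of any library. Even for something as simple as collecting the text of every <name> element, the handler has to track its own state across callbacks:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class, not part of any library.
class SaxNameCollector extends DefaultHandler {
    private final List<String> names = new ArrayList<>();
    private StringBuilder text; // non-null only while inside a <name> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("name".equals(qName)) {
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver one text node in several chunks, so append.
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("name".equals(qName)) {
            names.add(text.toString());
            text = null;
        }
    }

    static List<String> collectNames(InputStream in) throws Exception {
        SaxNameCollector handler = new SaxNameCollector();
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return handler.names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<nodes><node><name>Node 1</name><price>100</price></node>"
                + "<node><name>Node 2</name><price>23</price></node></nodes>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(collectNames(in)); // prints [Node 1, Node 2]
    }
}
```

Every additional field, attribute or nesting level needs more flags and buffers of this kind, which is where the mess comes from.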

To resolve this problem, we need to take a closer look at our XML input data. Most of the time, at least in our cases, you don't need the whole DOM at once. Say you're importing product information: it is sufficient to look at one product at a time. Example:

<nodes>
    <node>
        <name>Node 1</name>
        <price>100</price>
    </node>
    <node>
        <name>Node 2</name>
        <price>23</price>
    </node>
    <node>
        <name>Node 3</name>
        <price>12.4</price>
        <resources>
            <resource type="test">Hello 1</resource>
            <resource type="test1">Hello 2</resource>
        </resources>
    </node>
</nodes>

When processing Node 1, we don't need access to any attribute of Node 2 or Node 3. Likewise, when processing Node 2, we don't need access to Node 1 or 3, and so on. So what we want is a partial DOM, in our example one for every <node>.

What we've therefore built is a SAX parser for which you can specify which XML elements you are interested in. Once such an element starts, we record the whole sub-tree. When it completes, we notify a handler, which can then run XPath expressions against this partial DOM. After that, the DOM is released and the SAX parser continues.
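The idea can be sketched with nothing but JDK classes. The following is not the implementation linked further down, just a rough approximation under simplifying assumptions (no namespaces, one element name of interest); the class PartialDomSketch and the method namesAndPrices are hypothetical names chosen for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; these names are NOT the API of the library described in this post.
class PartialDomSketch extends DefaultHandler {
    private final String target;
    private final Consumer<Element> callback;
    private final DocumentBuilder builder;
    private final Deque<Element> open = new ArrayDeque<>(); // elements currently being recorded

    PartialDomSketch(String target, Consumer<Element> callback) throws Exception {
        this.target = target;
        this.callback = callback;
        this.builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (open.isEmpty() && !target.equals(qName)) {
            return; // outside any element of interest: record nothing
        }
        // Each sub-tree gets its own Document, so it can be garbage collected afterwards.
        Document doc = open.isEmpty() ? builder.newDocument() : open.peek().getOwnerDocument();
        Element element = doc.createElement(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            element.setAttribute(atts.getQName(i), atts.getValue(i));
        }
        if (open.isEmpty()) {
            doc.appendChild(element);
        } else {
            open.peek().appendChild(element);
        }
        open.push(element);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!open.isEmpty()) {
            Element current = open.peek();
            current.appendChild(current.getOwnerDocument().createTextNode(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (open.isEmpty()) {
            return;
        }
        Element finished = open.pop();
        if (open.isEmpty()) {
            callback.accept(finished); // sub-tree complete: hand it over, keep no reference
        }
    }

    // Example use: one "name=price" string per <node>, queried via XPath.
    static List<String> namesAndPrices(InputStream in) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        List<String> result = new ArrayList<>();
        PartialDomSketch handler = new PartialDomSketch("node", node -> {
            try {
                result.add(xpath.evaluate("name", node) + "=" + xpath.evaluate("price", node));
            } catch (XPathExpressionException e) {
                throw new IllegalStateException(e);
            }
        });
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<nodes><node><name>Node 1</name><price>100</price></node></nodes>";
        System.out.println(namesAndPrices(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))));
    }
}
```

At any point in time only one sub-tree is held in memory, so heap usage depends on the size of the largest <node>, not on the size of the file.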

Here is a shortened example of how you could parse the XML above - one "<node>" at a time:

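   XMLReader r = new XMLReader();

   // Register a handler that is called once for every completed <node> sub-tree
   r.addHandler("node", new NodeHandler() {

     @Override
     public void process(StructuredNode node) {
       // XPath queries are evaluated against the partial DOM of this one <node>
       System.out.println(node.queryString("name"));
       System.out.println(node.queryValue("price").asDouble(0d));
     }
   });

   // The input is streamed; the file is never loaded as a whole
   r.parse(new FileInputStream("src/examples/test.xml"));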

The full example, along with the implementation, is open source (MIT License) and available here:
https://github.com/andyHa/scireumOpen/tree/master/src/com/scireum/open/xml
https://github.com/andyHa/scireumOpen/blob/master/src/examples/ExampleXML.java

We successfully handle up to five parallel imports of 1 GB+ XML files in our production system without measurable heap growth. (Instead of using a FileInputStream, we use Java's ZIP capabilities to directly open and process zipped versions of the XML files. This shrinks those monsters down to 20-50 MB and makes uploads etc. much easier.)
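How the ZIP handling is wired up is not shown in this post. One way to do it with the JDK alone, assuming the archive contains the XML file as its first entry, is to wrap the upload in a java.util.zip.ZipInputStream and hand that stream to the parser. The class ZippedXmlSketch and its methods are hypothetical names for this sketch; counting elements merely stands in for the real processing.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, assuming the XML file is the first entry of the archive.
class ZippedXmlSketch {

    // Returns a stream over the first entry, decompressed on the fly while reading.
    static InputStream openFirstEntry(InputStream zipped) throws IOException {
        ZipInputStream zip = new ZipInputStream(zipped);
        ZipEntry entry = zip.getNextEntry();
        if (entry == null) {
            throw new IOException("ZIP archive contains no entries");
        }
        return zip; // reading reports end-of-stream at the end of this entry
    }

    // Counts elements with the given name, as a stand-in for real processing.
    static int countElements(InputStream xml, String name) throws Exception {
        int[] count = {0};
        SAXParserFactory.newInstance().newSAXParser().parse(xml, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (name.equals(qName)) {
                    count[0]++;
                }
            }
        });
        return count[0];
    }

    // Helper for the demo: builds an in-memory archive with a single entry.
    static byte[] zip(String entryName, String content) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.putNextEntry(new ZipEntry(entryName));
            out.write(content.getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] archive = zip("test.xml", "<nodes><node/><node/></nodes>");
        InputStream xml = openFirstEntry(new ByteArrayInputStream(archive));
        System.out.println(countElements(xml, "node"));
    }
}
```

Since the result is a plain InputStream, it can be passed wherever a FileInputStream would otherwise be used, and the uncompressed file never has to exist on disk or in memory.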


This post is the first part of my series "Enterprisy Java", in which we share our hints and tricks for overcoming the obstacles that arise when building several multi-tenant web applications out of a set of common modules.

68 comments:

  1. Nice little bit of code, thanks for sharing. Apache Jersey has similar capabilities for stream parsing JSON and building a tree of just the bits you are interested in but I've always been surprised why there isn't this type of support in a core Java XML library.

  2. Hi Andy,
    That's a nice approach indeed. I would also recommend having a look at our approach, which was specifically designed to handle very large XML input using SAX. The product allows you to specify (runtime config subscription) which types you want to process and sends these as JavaBeans to the processor component. It also allows you to specify how to handle 1..many relationships. So, for instance, say I'd like to process the element and I'd also want to process each of the elements, I could do that by detaching the elements from their parent. Well, have a look here: http://xml2java.net or here: http://xml2java.net/documents/XMLParserTechnologyForProcessingHugeXMLfiles.pdf

    Kind regards,
    Lolke Dijkstra

  3. Hi, how do I get the values in a single load? The handler is not working in order.

    Replies
    1. Sorry, I don't get the question. The handler behaves pretty much like a SAX parser. Therefore, you get all values in document order.

      regards Andy

  4. Not SAX, it is so 1990s, vtd-xml is the answer

    Replies
    1. Nope not to this question...

      The first sentence on the website says:
      The world's most memory-efficient (1.3x~1.5x the size of an XML document) random-access XML parser.

      Having to deal with 1 GB XML files, that is still too much, and the SAX/partial DOM approach is better for this purpose.
