Friday, January 6, 2012

Conveniently Processing Large XML Files with Java

When processing XML data, I find it most convenient to load the whole document using a DOM parser and fire some XPath queries against the result. However, since we're building a multi-tenant eCommerce platform, we regularly have to handle large XML files, with file sizes above 1 GB. You certainly don't want to load such a beast into the heap of a production server, since it easily grows to 3 GB or more as a DOM representation.

So what to do? Well, SAX to the rescue! Processing a large XML file using a SAX parser requires only constant (low) memory, since it merely invokes callbacks for detected XML tokens. On the other hand, parsing complex XML this way really becomes a mess.
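To illustrate why plain SAX gets messy, here is a minimal sketch using only the JDK's SAX API. It assumes a flat list of products, each a `node` element with a `name` and a `price` child; the class and method names (`PlainSaxExample`, `ProductHandler`, `parse`) are made up for this illustration. Note how all state has to be tracked by hand, even for this trivial structure:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class PlainSaxExample {

    /** Collects "name=price" strings; every piece of state is tracked manually. */
    static class ProductHandler extends DefaultHandler {
        final List<String> products = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private String name;
        private String price;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            text.setLength(0); // start collecting the text of the new element
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // may be called several times per element
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("name".equals(qName)) {
                name = text.toString().trim();
            } else if ("price".equals(qName)) {
                price = text.toString().trim();
            } else if ("node".equals(qName)) {
                products.add(name + "=" + price); // one product is complete
                name = null;
                price = null;
            }
        }
    }

    static List<String> parse(String xml) throws Exception {
        ProductHandler handler = new ProductHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.products;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<nodes><node><name>Node 1</name><price>100</price></node>"
                + "<node><name>Node 2</name><price>23</price></node></nodes>";
        System.out.println(parse(xml)); // prints [Node 1=100, Node 2=23]
    }
}
```

Every additional nesting level or optional element adds more fields and more branches to `endElement`, which is where the mess comes from.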

To resolve this problem, we need to take a closer look at our XML input data. Most of the time, at least in our cases, you don't need the whole DOM at once. Say you're importing product information: it is sufficient to look at one product at a time. Example:

<nodes>
    <node>
        <name>Node 1</name>
        <price>100</price>
    </node>
    <node>
        <name>Node 2</name>
        <price>23</price>
    </node>
    <node>
        <name>Node 3</name>
        <price>12.4</price>
        <resources>
            <resource type="test">Hello 1</resource>
            <resource type="test1">Hello 2</resource>
        </resources>
    </node>
</nodes>

When processing Node 1, we don't need access to any attribute of Node 2 or Node 3; likewise, when processing Node 2, we don't need access to Node 1 or Node 3, and so on. So what we want is a partial DOM, in our example one for every <node>.

What we've therefore built is a SAX parser for which you can specify which XML elements you are interested in. Once such an element starts, we record the whole sub-tree. When the element ends, we notify a handler, which can then run XPath expressions against this partial DOM. After that, the DOM is released and the SAX parser continues.
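The idea can be sketched with JDK classes only. This is an illustration written for this post's description, not the scireumOpen implementation; the names `PartialDomSketch`, `SubTreeHandler` and `parse` are hypothetical. The handler ignores everything until the element of interest starts, copies the SAX events into DOM nodes, and hands the finished sub-tree to a callback:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class PartialDomSketch {

    /** Records the sub-tree of every element named {@code target} as a small DOM. */
    static class SubTreeHandler extends DefaultHandler {
        private final String target;
        private final Consumer<Element> callback;
        private final Document doc;
        private Node current; // null while we are outside a recorded sub-tree

        SubTreeHandler(String target, Consumer<Element> callback) throws Exception {
            this.target = target;
            this.callback = callback;
            this.doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (current == null && !target.equals(qName)) {
                return; // not inside an element we care about
            }
            Element element = doc.createElement(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                element.setAttribute(atts.getQName(i), atts.getValue(i));
            }
            if (current != null) {
                current.appendChild(element);
            }
            current = element;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (current != null) {
                current.appendChild(doc.createTextNode(new String(ch, start, length)));
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (current == null) {
                return;
            }
            Node parent = current.getParentNode();
            if (parent == null) {
                // Sub-tree complete: attach it as document root for the callback, then release it.
                doc.appendChild(current);
                try {
                    callback.accept((Element) current);
                } finally {
                    doc.removeChild(current);
                }
            }
            current = parent;
        }
    }

    /** Returns one "name costs price" line per node element. */
    static List<String> parse(String xml) throws Exception {
        List<String> result = new ArrayList<>();
        XPath xpath = XPathFactory.newInstance().newXPath();
        SubTreeHandler handler = new SubTreeHandler("node", node -> {
            try {
                result.add(xpath.evaluate("name", node) + " costs " + xpath.evaluate("price", node));
            } catch (XPathExpressionException e) {
                throw new IllegalStateException(e);
            }
        });
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return result;
    }
}
```

Only one sub-tree is referenced by the handler at any time, so memory use depends on the size of the largest recorded element rather than on the size of the file. The sketch ignores namespaces and assumes the parser is not namespace-aware, which is the default for `SAXParserFactory`.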

Here is a shortened example of how you could parse the XML above - one "<node>" at a time:

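   XMLReader r = new XMLReader();

   // Register a handler that is called once for every complete <node> sub-tree
   r.addHandler("node", new NodeHandler() {

     @Override
     public void process(StructuredNode node) {
       // XPath queries are evaluated relative to the current <node>
       System.out.println(node.queryString("name"));
       System.out.println(node.queryValue("price").asDouble(0d));
     }
   });

   // Stream the file; only one <node> is held in memory at a time
   r.parse(new FileInputStream("src/examples/test.xml"));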

The full example, along with the implementation, is open source (MIT License) and available here:
https://github.com/andyHa/scireumOpen/tree/master/src/com/scireum/open/xml
https://github.com/andyHa/scireumOpen/blob/master/src/examples/ExampleXML.java

We successfully handle up to five parallel imports of 1 GB+ XML files in our production system, without measurable heap growth. (Instead of using a FileInputStream, we use Java's ZIP capabilities and directly open and process ZIP versions of the XML files. This shrinks those monsters down to 20-50 MB and makes uploads etc. much easier.)
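Reading the XML straight out of a ZIP archive could look roughly like the following sketch. It uses `java.util.zip.ZipInputStream` from the JDK and a plain SAX handler; the class `ZippedXmlSketch` and its helper methods are hypothetical, and the sketch assumes the XML is the first entry of the archive:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ZippedXmlSketch {

    /** Counts the elements named "node" in the first entry of a ZIP stream. */
    static int countNodes(InputStream zipped) throws Exception {
        int[] count = {0};
        try (ZipInputStream zip = new ZipInputStream(zipped)) {
            if (zip.getNextEntry() == null) {
                throw new IllegalArgumentException("empty ZIP archive");
            }
            // The stream now delivers the uncompressed bytes of the entry,
            // so it can be handed to the parser like any other InputStream.
            SAXParserFactory.newInstance().newSAXParser().parse(zip, new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes atts) {
                    if ("node".equals(qName)) {
                        count[0]++;
                    }
                }
            });
        }
        return count[0];
    }

    /** Helper for the demo: builds a ZIP archive with a single entry in memory. */
    static byte[] zip(String entryName, String content) throws Exception {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer)) {
            out.putNextEntry(new ZipEntry(entryName));
            out.write(content.getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] archive = zip("test.xml", "<nodes><node/><node/><node/></nodes>");
        System.out.println(countNodes(new ByteArrayInputStream(archive))); // prints 3
    }
}
```

In a real import, the in-memory archive would be replaced by a `FileInputStream` on the uploaded ZIP file; the data is decompressed on the fly and never has to be unpacked to disk.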


This post is the first part of my series "Enterprisy Java", in which we share our hints and tricks for overcoming the obstacles when trying to build several multi-tenant web applications out of a set of common modules.

127 comments:

  1. Nice little bit of code, thanks for sharing. Apache Jersey has similar capabilities for stream parsing JSON and building a tree of just the bits you are interested in but I've always been surprised why there isn't this type of support in a core Java XML library.

  2. Hi Andy,
    That's a nice approach indeed. I would also recommend having a look at our approach, which was specifically designed to handle very large XML input using SAX. The product allows you to specify (runtime config subscription) which types you want to process and sends these as JavaBeans to the processor component. It also allows you to specify how to handle 1..many relationships. So, for instance, say I'd like to process the element and I'd also want to process each of the elements, I could do that by detaching the elements from their parent. Well, have a look here: http://xml2java.net or here: http://xml2java.net/documents/XMLParserTechnologyForProcessingHugeXMLfiles.pdf

    Kind regards,
    Lolke Dijkstra

  3. Hi, how do I get the values in one-time loading? The handler is not working in order.

    Replies
    1. Sorry, I don't get the question. The handler behaves pretty much like a SAX parser. Therefore, you get all values in the order of the document.

      regards Andy

  4. Not SAX, it is so 1990s, vtd-xml is the answer

    Replies
    1. Nope not to this question...

      The first sentence on the website says:
      The world's most memory-efficient (1.3x~1.5x the size of an XML document) random-access XML parser.

      When you have to deal with 1 GB XML files, that's still too much, and the SAX/partial DOM approach is better for this purpose.
