Which Perl module should I use to pull information from an XML file?

brian_d_foy
Posted Aug 20 2010 10:11 AM

I previously asked Which Perl XML module should I use?, but that question was too general, so the answers weren't that great in helping anyone select which module they should use. I'll ask some more specific questions.

Which Perl module should I use to pull information from an XML file? Is your suggestion good for large files? Does it take up a lot of memory? Tell me why you like it, but also tell me a little about when you wouldn't use it.

Poll: Which Perl module should I use to pull information from an XML file? (8 members have cast votes)

  1. XML::Compile (1 vote, 12.50%)

  2. XML::DT (1 vote, 12.50%)

  3. XML::Easy (0 votes)

  4. XML::LibXML (3 votes, 37.50%)

  5. XML::Parser (0 votes)

  6. XML::Pastor (0 votes)

  7. XML::Rabbit (0 votes)

  8. XML::SAX (1 vote, 12.50%)

  9. XML::Simple (2 votes, 25.00%)

  10. XML::Toolkit (0 votes)

  11. XML::Twig (0 votes)

  12. XML::LibXSLT (0 votes)

  13. Other (0 votes)


4 Replies

ruoso
Posted Aug 20 2010 12:00 PM

I'll reply just here, but the answer is valid for all three questions.

If you have an XML Schema, XML::Compile will provide a very fast interface (it compiles each XSD parser/writer to a closure) that generates or reads the XML using XML::LibXML, giving you a very user-friendly data structure to handle.
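To make that concrete, here is a minimal sketch of the compile-then-call pattern. The inline schema, the `urn:demo` namespace, and the `order` element are all made up for the example; real code would point `XML::Compile::Schema->new` at an XSD file instead.

```perl
use strict;
use warnings;
use XML::Compile::Schema;

# Hypothetical inline schema; normally you would pass a .xsd filename.
my $xsd = <<'XSD';
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:tns="urn:demo" targetNamespace="urn:demo"
        elementFormDefault="qualified">
  <element name="order">
    <complexType>
      <sequence>
        <element name="id"   type="int"/>
        <element name="item" type="string" maxOccurs="unbounded"/>
      </sequence>
    </complexType>
  </element>
</schema>
XSD

my $schema = XML::Compile::Schema->new($xsd);

# compile() is done once; the returned closure is the fast part.
my $read_order = $schema->compile(READER => '{urn:demo}order');

my $data = $read_order->(<<'XML');
<order xmlns="urn:demo"><id>42</id><item>apple</item><item>pear</item></order>
XML

print $data->{id}, "\n";        # plain hashref, shaped by the schema
print "@{$data->{item}}\n";     # repeated elements become an arrayref
```

The payoff is that the schema, not your parsing code, decides the shape of the data: a `maxOccurs="unbounded"` element always comes back as an arrayref, so there is no XML::Simple-style guessing about one element versus many.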
Tyler Riddle
Posted Aug 21 2010 10:54 AM

There is an important option missing: XML::CompactTree (and especially XML::CompactTree::XS), created by the maintainer of XML::LibXML. It works around the slowness caused by repeated XS-to-Perl context switches by staying inside XS and building a data structure through the XML::LibXML::Reader interface, giving very high processing throughput. It's the fastest module I could find on CPAN; I ran a comprehensive benchmark using the English Wikipedia as a dataset, though I did not test all of the modules listed here.

Both XML::LibXML::Reader and XML::CompactTree provide pull-based interfaces that keep memory usage low, though the size of the tree returned by XML::CompactTree will vary with the size of the subtree of the document you are reading. These interfaces are poor choices if simplicity is your goal, but they are good high-performance options when you need them.

XML::TreePuller (my creation, the next-generation unmarshalling engine for MediaWiki::DumpFile) wraps XML::CompactTree and puts a simple-to-use API on it without sacrificing all of the performance gains that XML::CompactTree::XS gives you (though there is a performance penalty, no doubt). For instance, when processing the Simple English Wikipedia dump files, the MediaWiki::DumpFile suite measured the following XML throughput (throughput is highly dependent on the markup ratio of the document as well; when parsing the standard English Wikipedia the numbers listed here roughly double):

MediaWiki-DumpFile-FastPages: 26.16 MiB/sec
MediaWiki-DumpFile-Pages: 8.32 MiB/sec
Parse-MediaWikiDump: 3.2 MiB/sec

-FastPages is a very simple parsing implementation built on top of XML::LibXML::Reader; it cannot parse the complete XML document. Instead it pulls out just page titles and contents. By supporting only those two fields I can use the feature of the reader interface that lets me seek to a specific element in the document. There is a minimum of Perl-to-XS transitions because no events are generated until the element is reached; the parser does not return character-data events, elements, or the like. Large chunks of the document are skipped entirely and, more importantly, the parts of the document that are handled are limited to the extremely low-markup-ratio article text and title. The result is exceptional XML parsing performance that can surpass the speed of most single disks.
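The seek-to-element trick described above can be sketched with XML::LibXML::Reader directly. The miniature dump below is hypothetical; the point is that `nextElement` jumps straight to the named element without surfacing anything in between to Perl space.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical miniature dump: only <title> and <text> matter;
# the <junk> elements should never generate Perl-side events.
my $xml = <<'XML';
<mediawiki>
  <page><title>Foo</title><junk>ignored</junk><text>foo body</text></page>
  <page><title>Bar</title><junk>ignored</junk><text>bar body</text></page>
</mediawiki>
XML

my $reader = XML::LibXML::Reader->new(string => $xml);

my @pages;
# nextElement() seeks to the next element with that name; everything
# between the current position and the match is skipped inside libxml2.
while ($reader->nextElement('title')) {
    my $title = $reader->readInnerXml;
    $reader->nextElement('text');
    push @pages, [ $title, $reader->readInnerXml ];
}

printf "%s: %s\n", @$_ for @pages;
```

A real dump-file parser would read from a file (`location => $path`) and handle namespaces, but the control flow is the same: a small loop that pulls exactly the two fields it wants.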

-Pages is an object-oriented recursive-descent parser implemented on top of XML::TreePuller; it's a lot slower, but it supports all data in the dump files, and multiple versions of dump files as well. XML::TreePuller allows -Pages to be very domain-specific; that is, it is mostly logic specific to MediaWiki dump files rather than actual XML processing. The XML::TreePuller interface is much easier to use than the older-generation unmarshaller in Parse::MediaWikiDump and is much faster.

Parse-MediaWikiDump is the deprecated MediaWiki dump file parsing suite, listed as an example of how fast XML::Parser goes. When a character handler and element handlers are used in XML::Parser, the parser tops out around 5 MiB/sec; with careful crafting I was able to implement a configurable XML event engine with fairly low overhead. Ultimately I decided to abandon push-style parsing methods for MediaWiki::DumpFile due to the complexity of making an interruptible parser on top of them. I found the control logic when using a pull parser much easier to conceptualize and implement.

I don't have a strict comparison between the throughput of XML::LibXML::Reader and XML::CompactTree::XS; I need to augment my benchmark suite to cover that, as well as the modules listed above that I haven't tested yet.

If you have to parse more than a few gigabytes, then XML throughput can become extremely important; the Wikipedia dump files go into the terabyte range (yes, of XML, and it's all in one single file), so throughput is critical.

Another way to put this: if the question is "Which Perl module should I use to pull information from an XML file?" and the answer is XML::Twig, and your file is 50 gigs with a high markup ratio, I hope you are prepared to wait the 13 hours it takes XML::Twig to chug through the document at approximately 1 MiB/sec. If the question is "Which Perl module should I use to get reading data quickly without having to worry too much about performance?" then XML::Twig is a great answer. The simplistic voting mechanism lacks the resolution of the actual problem space here.
bagelmuncher
Posted Sep 02 2010 12:44 PM

One to add? How about XML::SimpleObject? It works great and makes extracting data a snap.
YomiK
Posted Sep 04 2010 08:13 PM

I have a pretty simple application that parses a ~200k custom generated XML file, reading pretty much all of it into some hashes for the rest of the program to use.

My first implementation was 80 or so lines of regex code that read the data. It's fast, and portable in the sense that no modules are needed (irrelevant for me, but nice when shipping to people who haven't a clue about anything that isn't on the Windows desktop). The downside is that it's brittle with respect to changes in the XML data or the file format.

Next I used XML::Simple. It works, but it gets to be a chore coming up with the right set of arguments (GroupTags, ContentKey, ForceArray, SuppressEmpty, etc.) to get it to create the right data structure, which then needs to be massaged afterwards. Not such an issue if your DTD is static, but mine changed during development. It also had some performance issues on my ActiveState machine (about 5x slower than Strawberry Perl or Linux). Overall it seemed like a lot of work. Perhaps my data layout, my way of thinking about the problem, and XML::Simple just didn't mesh smoothly.
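For readers who haven't hit this, here is a small sketch of why those arguments matter. The config snippet is hypothetical; the issue is that without `ForceArray`, a single `<server>` element collapses to a plain hash while two or more become an arrayref, so downstream code can't rely on the shape.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

# Hypothetical config with a single repeatable element.
my $xml = <<'XML';
<config>
  <server name="alpha"><port>80</port></server>
</config>
XML

my $data = XMLin($xml,
    ForceArray => ['server'],  # always an arrayref, even for one element
    KeyAttr    => [],          # don't fold elements into a hash keyed on "name"
);

print $data->{server}[0]{name}, "\n";   # alpha
print $data->{server}[0]{port}, "\n";   # 80
```

Every element whose count or attributes can vary needs its own entry in these option lists, which is exactly the per-DTD fiddling described above.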

After reading the praises about XML::Twig from the author, I thought I'd try it out. I think it just begs for a larger file where one is pulling out a fraction of the content. That is, it's solving a problem that my app doesn't have, so I didn't see any benefit while adding lots of scaffold code to pull out the fields.
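For context, the scaffolding in question looks roughly like this. The `<item>`/`<name>` structure is invented for the example; the key XML::Twig feature is that handlers fire per element and `purge` frees what has already been processed, which is what pays off on files too big for one in-memory tree.

```perl
use strict;
use warnings;
use XML::Twig;

my @names;
# twig_handlers fire as each matching element finishes parsing.
my $twig = XML::Twig->new(
    twig_handlers => {
        item => sub {
            my ($t, $elt) = @_;
            push @names, $elt->first_child_text('name');
            $t->purge;   # release everything parsed so far, keeping memory flat
        },
    },
);
$twig->parse('<list><item><name>a</name></item><item><name>b</name></item></list>');

print "@names\n";
```

On a small file read in full, this handler plumbing is pure overhead compared to just loading the document, which matches the experience described above.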

Lastly, I used XML::LibXML. Wow -- I wish I'd found that first. It's fast. It's easy. The quantity of scaffold code is almost non-existent. It's even smaller than my regex parser subroutine. Using map { $hash{$_->getAttribute(...)} = ... } findnodes(xpath) statements is braindead simple both to write and to read. I highly recommend it.
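The map-over-findnodes pattern mentioned above can be shown in a few lines. The `<hosts>` document and attribute names are invented for the illustration:

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical document standing in for the ~200k config file.
my $doc = XML::LibXML->load_xml(string => <<'XML');
<hosts>
  <host name="web1" ip="10.0.0.1"/>
  <host name="db1"  ip="10.0.0.2"/>
</hosts>
XML

# One map over an XPath result builds the whole lookup hash.
my %ip_for = map { $_->getAttribute('name') => $_->getAttribute('ip') }
             $doc->findnodes('//host');

print $ip_for{db1}, "\n";   # 10.0.0.2
```

Because the selection logic lives in the XPath expression rather than in traversal code, a change to the file layout usually means editing one string, not rewriting a parser.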