Process XML data with XML::Twig

People often reach for regular expressions to extract and rearrange information in XML documents. Those patterns usually work only for the limited test cases their authors had in mind, and they are little time bombs waiting to go off when the data or the format changes even slightly. The bomb often explodes after the original programmer has disappeared.

It’s much easier to deal with XML with a proper XML tool, such as XML::Twig. It uses expat behind the scenes, so it handles all of the format and structural details for you and lets you focus on the higher-level concepts. There are two ways that XML::Twig can process data: either by parsing it completely into a tree structure and then processing it, or by parsing and processing it at the same time. Which one you use depends on the size of your data and what you need to do. In this Item, you have tiny data, so you’ll parse it completely before processing it. You’ll see the other method in a later Item.
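
As a rough preview of the parse-and-process-at-once style, here is a minimal sketch that uses XML::Twig’s twig_handlers option. The input string is a made-up stand-in for real data, and the handler only collects revision attributes; treat it as an illustration of the idea rather than the approach the later Item takes:

```perl
use strict;
use warnings;
use XML::Twig;

my @revisions;

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time a commit element has been completely parsed.
        commit => sub {
            my( $t, $commit ) = @_;
            push @revisions, $commit->att( 'revision' );
            $t->purge;    # release the parts of the tree parsed so far
        },
    },
);

# Invented sample input, not real svn output.
$twig->parse(
    q{<info><entry><commit revision="36"/></entry>}
  . q{<entry><commit revision="37"/></entry></info>}
);

print "Saw revisions: @revisions\n";
```

Because the handler runs as soon as each element is complete, the program never needs to hold the whole document in memory.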

Start with some XML output from Subversion’s svn command-line tool, just because it’s an easy source for XML text. You can add the --xml option to several of its subcommands, such as info:

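$ svn info --xml
<?xml version="1.0"?>
<info>
<entry kind="dir" path="." revision="37">
<url>https://svn.example.com/svn/trunk</url>
<repository>
<root>https://svn.example.com/svn</root>
<uuid>83cfdd44-1cdd-11df-9fc2-77fecae625a6</uuid>
</repository>
<wc-info>
<schedule>normal</schedule>
</wc-info>
<commit revision="37">
<author>brian.d.foy</author>
<date>2010-02-20T00:13:20.439142Z</date>
</commit>
</entry>
</info>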

Say that you want to extract the commit information. You could write a simple regular expression to extract it from exactly the format in that example output, but what happens if the Subversion developers change the order of the elements? Boom! Your bomb explodes!
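
To make the risk concrete, here is a small sketch of such a pattern. The regular expression and the two variant inputs are invented for illustration (the extra kind attribute on commit is hypothetical), but every variant is a legitimate way to write the same element:

```perl
use strict;
use warnings;

# Hypothetical pattern tuned to one exact layout of the commit tag.
sub revision_by_regex {
    my( $xml ) = @_;
    my( $revision ) = $xml =~ /<commit\s+revision="(\d+)">/;
    return $revision;    # undef when the pattern does not match
}

# Three spellings of the same information.
my %samples = (
    original      => q{<commit revision="37"><author>brian.d.foy</author></commit>},
    single_quotes => q{<commit revision='37'><author>brian.d.foy</author></commit>},
    extra_att     => q{<commit kind="dir" revision="37"><author>brian.d.foy</author></commit>},
);

foreach my $name ( sort keys %samples ) {
    my $revision = revision_by_regex( $samples{$name} );
    printf "%-13s %s\n", $name, defined $revision ? $revision : 'no match';
}
```

Only the original layout yields a revision. The other two report no match, even though an XML parser treats all three the same way.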

XML::Twig makes this very simple to implement and even easier to maintain. First, you create a twig object and tell it what to parse:

use XML::Twig;

# run from a directory under svn control
my $xml = `svn info --xml`; 

my $twig = XML::Twig->new;

$twig->parse( $xml );

Once parsed, the twig is a tree structure of the XML data and you can do various things with it. To start your XML processing, you need to get one of the elements from the twig. There are many ways to go about this, but in this case the easiest thing to do is to get the commit element with first_elt. Save that in $commit because that’s the object you’ll interact with to extract information:

use XML::Twig;

# run from a directory under svn control
my $xml = `svn info --xml`; 

my $twig = XML::Twig->new;

$twig->parse( $xml );

my $commit   = $twig->first_elt( 'commit' );

You can then use att to extract the attribute named revision from the $commit element:

use XML::Twig;

# run from a directory under svn control
my $xml = `svn info --xml`; 

my $twig = XML::Twig->new;

$twig->parse( $xml );

my $commit   = $twig->first_elt( 'commit' );

my $revision = $commit->att( 'revision' );

print "Revision $revision\n";

The twig doesn’t care about attribute order, position on the line, or anything else. It knows how to get the right information, so the output now shows just the revision number:

Revision 37
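
If you want to check that claim without a Subversion working copy, a sketch like the following feeds the same steps three differently formatted strings. The sample strings and the revision_by_twig helper are invented for this illustration, including the extra kind attribute on commit:

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: the same parse, first_elt, and att steps as above.
sub revision_by_twig {
    my( $xml ) = @_;

    my $twig = XML::Twig->new;
    $twig->parse( $xml );

    my $commit = $twig->first_elt( 'commit' );
    return defined $commit ? $commit->att( 'revision' ) : undef;
}

# The same data with different quoting, attribute order, and line breaks.
my @samples = (
    qq{<entry><commit revision="37"><author>brian.d.foy</author></commit></entry>},
    qq{<entry><commit revision='37'><author>brian.d.foy</author></commit></entry>},
    qq{<entry><commit\n   kind="dir"\n   revision="37"><author>brian.d.foy</author></commit></entry>},
);

print "Revision ", revision_by_twig( $_ ), "\n" for @samples;
```

Each sample produces the same Revision line because the parser, not your code, deals with the quoting and layout.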

Expand this a bit to pull out more information. If you want the commit date and the author’s name, that’s easy too. You can use first_child_text to extract the text of the child elements with the names that you specify:

use XML::Twig;

# run from a directory under svn control
my $xml = `svn info --xml`;

my $twig = XML::Twig->new;

$twig->parse( $xml );

my $commit   = $twig->first_elt( 'commit' );

my $revision = $commit->att( 'revision' );
my $author   = $commit->first_child_text( 'author' );
my $date     = $commit->first_child_text( 'date' );

print <<"HERE";
Revision: $revision
Author: $author
Date: $date
HERE

XML::Twig has many methods to work with elements. In the previous example, you extracted some information and left the XML data as it was. You can also transform the data so it’s different at the end. If you just want the commit information, you can throw out everything else. Parse the XML data in the same way and get the commit element, but once you have it, use set_root to make commit the new top-level element:

use XML::Twig;

# run from a directory under svn control
my $xml = `svn info --xml`;

my $twig = XML::Twig->new( pretty_print => 'nice' );

$twig->parse( $xml );

my $commit = $twig->first_elt( 'commit' );
$twig->set_root( $commit );

$twig->print;

At the end of your twig processing, you use the print method to show the new XML structure, which is now just the element that you selected:

<?xml version="1.0"?>
<commit revision="37">
  <author>brian.d.foy</author>
  <date>2010-02-20T00:13:20.439142Z</date>
</commit>

Now you want to remove the attribute named revision and make it an element instead. You use del_att to remove the attribute and then insert_new_elt to create the new element under commit:

use XML::Twig;

my $xml = `svn info --xml`;

my $twig = XML::Twig->new( pretty_print => 'nice' );

$twig->parse( $xml );

my $commit   = $twig->first_elt( 'commit' );
$twig->set_root( $commit );

my $revision = $commit->att( 'revision' );
$commit->del_att( 'revision' );
$commit->insert_new_elt( revision => $revision );

$twig->print;

Now the output has an extra element for revision:

<?xml version="1.0"?>
<commit>
  <revision>37</revision>
  <author>brian.d.foy</author>
  <date>2010-02-20T00:13:20.439142Z</date>
</commit>
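
To try the same transformation without running svn, you could wrap the steps in a subroutine and feed it a literal string. This is only a sketch: the commit_as_document name and the trimmed sample document are made up for the example, and it uses sprint to return the XML as a string instead of printing it:

```perl
use strict;
use warnings;
use XML::Twig;

# Stand-in for the svn output, trimmed to the parts this example touches.
my $xml = <<'XML';
<info>
<entry revision="37">
<commit revision="37">
<author>brian.d.foy</author>
<date>2010-02-20T00:13:20.439142Z</date>
</commit>
</entry>
</info>
XML

# Hypothetical helper that returns the transformed document as a string.
sub commit_as_document {
    my( $source ) = @_;

    my $twig = XML::Twig->new( pretty_print => 'nice' );
    $twig->parse( $source );

    my $commit = $twig->first_elt( 'commit' );
    $twig->set_root( $commit );

    # Move the revision from an attribute to a child element.
    my $revision = $commit->att( 'revision' );
    $commit->del_att( 'revision' );
    $commit->insert_new_elt( revision => $revision );

    return $twig->sprint;
}

print commit_as_document( $xml );
```

Returning a string makes the result easy to write to a file or to check in a test.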

That’s all there is to it. Not only does XML::Twig handle the task correctly, but it’s also a lot easier to program and to understand than a regular expression. XML::Twig has many, many more methods that allow you to interact with elements in many different ways to get just the information you need or change the parts that you want.
