Modify XML data with XML::Twig

If you need to deal with XML, first, we’re very sorry. Maybe you did something wrong in a previous life, such as munging XML with regular expressions. If you do better in this life, perhaps you won’t have to deal with XML in the next one. Doing better might mean using XML::Twig, a powerful package for walking an XML tree, each part of which is a twig. For the rest of this Item, I’ll just call the module Twig.

There are two ways you can interact with your XML data using Twig. You can let Twig modify the data as it parses it, or you can parse it and modify it afterward. If you have very large data, you might want to keep very little of it in memory, so you modify it as soon as you can and unload (or flush) that part of the data as you move on to the next part. If the data are small, or you need to know all of the data before you make a change, you might want to parse them completely before you start munging.

The basic Twig program, no matter which way you want to change the data, creates a new Twig object, parses the data, and flushes the output:

use XML::Twig;

my $twig = XML::Twig->new( ... );

$twig->parse( *FILE_HANDLE );

...;

$twig->flush;

Based on what you set up in the object, the parse portion can do various things, including adding to and pruning from the XML tree, renaming tags and attributes, replacing values, and almost anything else that you can program (which is just about anything). It also lets you set up non-parsing and non-munging options.

Suppose you have this very simple XML document, which has vague and generic tags. Since the data are names of cats, you want to change the item tags to cat tags:

<?xml version="1.0"?>
<root>
	<item>Buster</item>
	<item>Mimi</item>
	<item>Roscoe</item>
	<item>Ginger</item>
	<item>Ella</item>
</root>

Start small and build up what you want. The easiest Twig program does nothing. In this case, you use the basic structure and take the data from standard input:

use XML::Twig;

my $twig = XML::Twig->new( 
	);

$twig->parse( *STDIN );
$twig->flush;

Twig processes the input, and since you didn’t tell Twig to do anything, it doesn’t. When you flush, you see the same XML data, although with the insignificant whitespace removed:

<?xml version="1.0"?>
<root><item>Buster</item><item>Mimi</item><item>Roscoe</item><item>Ginger</item><item>Ella</item></root>

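That’s a bit annoying. Figuring out how to fix this is also a bit annoying because there are so many ways that you can configure Twig. In this case, you might want to enable pretty-printing and use the indented format. This program, like the rest of the programs in this Item, reads from the DATA filehandle, so the XML document has to go after an __END__ (or __DATA__) line at the bottom of the program file: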

use XML::Twig;

my $twig = XML::Twig->new( 
	pretty_print => 'indented',
	);

$twig->parse( *DATA );
$twig->flush;

Now the output is easier to read. It’s still slightly different from the input, since Twig indents with two spaces where the original used tabs:

<?xml version="1.0"?>
<root>
  <item>Buster</item>
  <item>Mimi</item>
  <item>Roscoe</item>
  <item>Ginger</item>
  <item>Ella</item>
</root>

Munge as you parse

Now that you know the basic structure of a Twig program, you can move on to the real task, changing those item tags to cat tags instead. In this section, you’ll transform those tags as you parse them.

The transformation happens in handlers, which you attach to tags. The new method takes a twig_handlers key whose value is a hash reference:

use XML::Twig;

my $twig = XML::Twig->new( 
	pretty_print  => 'indented',
	twig_handlers => {
		...   # handlers go here
		},
	);

$twig->parse( *DATA );
$twig->flush;

Each key in the twig_handlers hash is the name of the tag you want to handle, and the value is a reference to a subroutine that gets that part of the data in $_. Anything you do affects only that part of the data. Twig provides many (many!) methods for accessing and munging data, and in this case, you use the set_tag method to change the name of the tag:

use XML::Twig;

my $twig = XML::Twig->new( 
	pretty_print => 'indented',
	twig_handlers => {
		item => sub { $_->set_tag( 'cat' ) },
		},
	);

$twig->parse( *DATA );
$twig->flush;

Now each item tag is a cat tag:

<?xml version="1.0"?>
<root>
  <cat>Buster</cat>
  <cat>Mimi</cat>
  <cat>Roscoe</cat>
  <cat>Ginger</cat>
  <cat>Ella</cat>
</root>
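
A handler doesn’t have to rely on $_. Twig also passes the twig and the current element as arguments, which can read better in longer handlers. This is a minimal sketch: rename_items is a made-up helper name, and it parses a short inline string instead of the DATA filehandle so that it stands alone. It returns the result with sprint rather than printing it:

```perl
use strict;
use warnings;
use XML::Twig;

# rename_items is a hypothetical helper name, not part of XML::Twig
sub rename_items {
	my( $xml ) = @_;

	my $twig = XML::Twig->new(
		twig_handlers => {
			# the handler gets the twig and the element in @_
			item => sub {
				my( $t, $elt ) = @_;
				$elt->set_tag( 'cat' );
				},
			},
		);

	$twig->parse( $xml );   # parse takes a string as well as a filehandle
	return $twig->sprint;   # serialize to a string instead of printing
	}

print rename_items( '<root><item>Buster</item><item>Mimi</item></root>' ), "\n";
```

Named arguments help most when a handler also needs the twig itself, not just the element.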

You don’t like the root name either, so you can change that to animals with another handler:

use XML::Twig;

my $twig = XML::Twig->new( 
	pretty_print => 'indented',
	twig_handlers => {
		item => sub { $_->set_tag( 'cat' ) },
		root => sub { $_->set_tag( 'animals' ) },
		},
	);

$twig->parse( *DATA );
$twig->flush;

Now the data aren’t so generic:

<?xml version="1.0"?>
<animals>
  <cat>Buster</cat>
  <cat>Mimi</cat>
  <cat>Roscoe</cat>
  <cat>Ginger</cat>
  <cat>Ella</cat>
</animals>

Great. That was easy. Go one step further, though. Suppose you have a hash of microchip numbers, keyed by each animal’s name, and you want to add each number to your data as an attribute of cat, but only if the cat has an entry in the hash. You can get the data between the opening and closing tag with the text method, then use set_att to add attributes and values:

use XML::Twig;

my %microchips = qw(
	Buster  123456
	Mimi    369120
	);

my $twig = XML::Twig->new( 
	pretty_print => 'indented',
	twig_handlers => {
		root => sub { $_->set_tag( 'animals' ) },
		item => sub { 
			$_->set_tag( 'cat' );
			my $cat = $_->text;
			$_->set_att( microchip => $microchips{$cat} )
				if exists $microchips{$cat};
			},
		},
	);

$twig->parse( *DATA );
$twig->flush;

The data are a bit more fancy now that you’ve combined the microchip data:

<?xml version="1.0"?>
<animals>
  <cat microchip="123456">Buster</cat>
  <cat microchip="369120">Mimi</cat>
  <cat>Roscoe</cat>
  <cat>Ginger</cat>
  <cat>Ella</cat>
</animals>

Finally, you want to remove the cats who have passed away (sadly, rest in peace). You can prune parts of the tree with delete, which removes that element:

use XML::Twig;

my %microchips = qw(
	Buster  123456
	Mimi    369120
	);

my %deceased = map { $_, 1 } qw(Roscoe);

my $twig = XML::Twig->new( 
	pretty_print => 'indented',
	twig_handlers => {
		root => sub { $_->set_tag( 'animals' ) },
		item => sub { 
			$_->set_tag( 'cat' );
			my $cat = $_->text;
			if( exists $deceased{$cat} ) {
				$_->delete;   # prune this element from the tree
				return;       # nothing more to do for a deleted element
				}
			$_->set_att( microchip => $microchips{$cat} )
				if exists $microchips{$cat};
			},
		},
	);

$twig->parse( *DATA );
$twig->flush;

Now the record for Roscoe is missing since you removed it:

<?xml version="1.0"?>
<animals>
  <cat microchip="123456">Buster</cat>
  <cat microchip="369120">Mimi</cat>
  <cat>Ginger</cat>
  <cat>Ella</cat>
</animals>

You’re still not satisfied, though. You want to make each cat’s name live in a new name tag that’s a child of cat. To do that, you can set the text of cat to the empty string, then insert a new element called name whose text is the cat’s name:

use XML::Twig;

my %microchips = qw(
	Buster  123456
	Mimi    369120
	);

my %deceased = map { $_, 1 } qw(Roscoe);

my $twig = XML::Twig->new( 
	pretty_print => 'indented',
	twig_handlers => {
		root => sub { $_->set_tag( 'animals' ) },
		item => sub { 
			$_->set_tag( 'cat' );
			my $cat = $_->text;
			if( exists $deceased{$cat} ) {
				$_->delete;   # prune this element from the tree
				return;       # don't add children to a deleted element
				}
			$_->set_att( microchip => $microchips{$cat} )
				if exists $microchips{$cat};
			$_->set_text( '' );
			$_->insert_new_elt( 'name', $cat );
			},
		},
	);

$twig->parse( *DATA );
$twig->flush;

Now your data look a bit more interesting:

<?xml version="1.0"?>
<animals>
  <cat microchip="123456"><name>Buster</name></cat>
  <cat microchip="369120"><name>Mimi</name></cat>
  <cat><name>Ginger</name></cat>
  <cat><name>Ella</name></cat>
</animals>
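
If your data are large, you can flush inside the handler so that Twig prints and releases each cat as soon as you’ve finished with it, instead of holding the whole tree until the end. Here’s a sketch of that idea. The stream_cats name is made up for this example, and it leaves root alone, since the opening root tag goes out with the first flush, long before a handler for root would run:

```perl
use strict;
use warnings;
use XML::Twig;

# stream_cats is a hypothetical name; $source is a string of XML
# or an open filehandle, either of which parse accepts
sub stream_cats {
	my( $source ) = @_;

	my $twig = XML::Twig->new(
		twig_handlers => {
			item => sub {
				my( $t, $elt ) = @_;
				$elt->set_tag( 'cat' );
				$t->flush;   # print what you've parsed so far, then free it
				},
			},
		);

	$twig->parse( $source );
	$twig->flush;   # print the rest, including the closing root tag
	}

stream_cats( '<root><item>Buster</item><item>Mimi</item></root>' );
```

For a really large document you would pass an open filehandle rather than a string. The final flush after parse still matters, because it prints whatever the handlers left behind.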

Work with the whole tree at once

There’s another way that you can do this. Instead of defining handlers and modifying the tree as you parse it, you can parse the entire data first and modify it afterward. You start with the same basic structure, but with an addition. After you parse the data, you get the tip of the tree with the root method. That serves as the starting point for your work:

use XML::Twig;

my %microchips = qw(
	Buster  123456
	Mimi    369120
	);

my %deceased = map { $_, 1 } qw(Roscoe);

my $twig = XML::Twig->new(
	pretty_print => 'indented',
	);
$twig->parse( *DATA );

my $root = $twig->root;

$twig->flush;

The first thing you do is change the root element name, just as you did before, this time by calling the set_name method on the root element yourself:

use XML::Twig;

my %microchips = qw(
	Buster  123456
	Mimi    369120
	);

my %deceased = map { $_, 1 } qw(Roscoe);

my $twig = XML::Twig->new(
	pretty_print => 'indented',
	);
$twig->parse( *DATA );

my $root = $twig->root;
$root->set_name( 'animals' );

$twig->flush;

Now you want to move on to the item tags. You have to climb the tree yourself, though. Once you have one element, you can get its children, and the children method gets you to the next level of tags. In this program, you make the same changes that your handler made, but after you’ve parsed the entire XML structure:

use XML::Twig;

my %microchips = qw(
	Buster  123456
	Mimi    369120
	);

my %deceased = map { $_, 1 } qw(Roscoe);

my $twig = XML::Twig->new(
	pretty_print => 'indented',
	);
$twig->parse( *DATA );

my $root = $twig->root;
$root->set_name( 'animals' );

foreach my $item ( $root->children( 'item' ) ) {
	$item->set_name( 'cat' );
	my $cat = $item->text;
	if( exists $deceased{$cat} ) {
		$item->delete;   # prune deceased cats, as the handler did
		next;
		}
	$item->set_text( '' );
	$item->insert_new_elt( 'name', $cat );
	$item->set_att( microchip => $microchips{$cat} )
		if exists $microchips{$cat};
	}

$twig->flush;
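
Parsing everything first pays off when a change depends on the whole document. As a small illustration, this sketch records how many cats there are as an attribute of the root element, something you can know only after you’ve seen every item. The count attribute and the count_cats name are invented for this example, and the inline string stands in for the DATA section:

```perl
use strict;
use warnings;
use XML::Twig;

# count_cats is a hypothetical helper; the count attribute is invented
sub count_cats {
	my( $xml ) = @_;

	my $twig = XML::Twig->new;
	$twig->parse( $xml );

	my $root  = $twig->root;
	my @items = $root->children( 'item' );

	# you know the total only after you've seen every item
	$root->set_att( count => scalar @items );
	$_->set_tag( 'cat' ) for @items;

	return $twig->sprint;   # serialize to a string instead of printing
	}

print count_cats( '<root><item>Buster</item><item>Mimi</item></root>' ), "\n";
```

A handler that flushes as it goes couldn’t do this, since the opening root tag is printed before the last item arrives.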

There is a lot more that you can do with Twig, which provides a long list of methods to access various parts of the tree and another long list of methods to modify them in interesting ways. Some of these might show up as future Items.

Things to remember

  • Modify XML with XML::Twig, not regular expressions.
  • You can modify the XML data as you parse it with Twig handlers.
  • You can modify the XML data after you parse it by climbing the tree yourself.