Normalize your Perl source

Perl has had Unicode support since Perl 5.6, which means that most Perl tutorials have been bending the truth a bit when they tell you that a Perl identifier, the name that you give to variables, starts with [A-Za-z_] and continues with [0-9A-Za-z_]. With Unicode support, you have many more characters available to you, but moving outside the ASCII range has some problems. You can’t always tell what a variable name is just by looking at it (and this is a design bug in Perl: RT 96814). For instance, you don’t really don’t know what this variable is:

use utf8;

my $résumé = 'http://www.example.com/resume.html';

If you wanted to use that variable later in your program, what would you type? It seems simple, but Unicode has two ways to represent the é glyph. It has the composed version, (U+00E9 ????? s???? ?????? ? ???? ?????), and the decomposed version of two characters, (U+0065 ????? s???? ?????? ?) and (U+0301 ????????? ????? ??????). Depending on your editor setup, you might not get the thing that you think that you typed, even.

None of this would be a problem is Perl normalized the variable names for you. Every time that you typed é, no matter how you created that glyph, you get the same representation in the source code. However, as of Perl 5.14, Perl does not do this for you. So, it’s a problem.

Consider how the next programmer knows what your variable name is? How many variables are in this script, and do you get a warning? What is the output of this simple program?

use utf8;
use 5.010;

my $é = 'abc';
my $é = '123';

$é = 'XYZ';

say "One char = ", $é;
say "Two char = ", $é;

Now, how about this program?

use utf8;
use 5.010;

my $é = 'abc';
my $é = '123';

$é = 'XYZ';

say "One char = ", $é;
say "Two char = ", $é;

There are two possible programs because there are two possible variables at line 7. You can’t tell just by looking at the source in your editor. Depending on which variable gets the XYZ assignment, you get different outputs:

One char = XYZ
Two char = abc
One char = 123
Two char = XYZ

There’s danger in this Item since you are reading it on the web and various things might have happened to the text as it made its way through databases and web servers and web browsers, any of which may have changed the source. Here’s the program that generates the two possible programs, depending on what time it is:

use 5.010;
use utf8;
use charnames qw(:full);

my $var = time % 2 ?
	"e\N{COMBINING ACUTE ACCENT}"
	:
	"\N{LATIN SMALL LETTER E WITH ACUTE}"; 

binmode STDOUT, ':encoding(UTF-8)';
print <<"PERL";
use utf8;
use 5.010;

my \$e\N{COMBINING ACUTE ACCENT} = 'abc';
my \$\N{LATIN SMALL LETTER E WITH ACUTE} = '123';

\$$var = 'XYZ';

say "One char = ", \$\N{LATIN SMALL LETTER E WITH ACUTE};
say "Two char = ", \$e\N{COMBINING ACUTE ACCENT};
PERL

The source is encoded as UTF-8, but it's unnormalized, meaning that the different ways to represent the same glyph show up in different forms. If someone uses the form that you didn't, they actually use a different variable. Cutting and pasting may not even be safe because that process might normalize it one way or the other. Your editor may normalize it for you (but leaving other parts alone). You need the program to use the same normalization.

The simplest thing is making your editor handle it for you automatically, but if you can't do that, you might have to do it manually.

To change the normalization of a file, you can use the programs that come with Unicode::Tussle:

$ nfc program.pl > program-nfc.pl
$ nfd program.pl > program-nfc.pl

You could also make some Perl one-liners (in bash, in this case):

alias nfc="perl5.14.1 -MUnicode::Normalize -CS -ne 'print NFC(\$_)'"
alias nfd="perl5.14.1 -MUnicode::Normalize -CS -ne 'print NFD(\$_)'"

Beware, though. If you have parts of your file that need to be a particular normalization form, normalizing the entire file might change that. If you expect a string to be in NFD, perhaps to test a Unicode feature, changing the normalization will cause problems:

my $nfd_test_string = 'résumé'; # should be NFD.

However, if it's actually important for you to have a string in a particular form, you should enforce that explicitly instead of relying on the way you (or someone else) dealt with the file. You can force the normalization with Unicode::Normalize's subroutines:

use Unicode::Normalize qw(NFD);
my $nfd_test_string = NFD( 'résumé' ); # should be NFD.

Ideally, you'd handle this as part of your build process from your distribution directory so you don't have to think about it, but it's actually not simple to do that. There are two modules involved: ExtUtils::Install and ExtUtils::Manifest. The first copies files into blib in preparation for testing and installation. The second copies files listed in MANIFEST to a distribution directory. You want to be able to have the right version in both cases, but if you don't have normalized files to start you have some work to do. That's a bit beyond the scope of this Item (and a much longer discussion) that I might cover later.

Things to remember

  • Perl doesn't normalize variable names. It's a bug.
  • Normalize your Perl source one way or the other.
  • If you depend on a particular normalization in a string, force it explicitly.

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit

Intercept warnings with a __WARN__ handler

Perl defines two internal pseudo-signals that you can trap. There’s one for die, which I covered in and eventually told you not to use. There’s also one for warn that’s quite safe to use when you need to intercept warnings.

To catch a warning, you set a signal handler for the __WARN__ pseudo-signal. The underscores around the name distinguish it from the external signals, such as INT and USR1. The value can be the name of a subroutine or a reference to a subroutine:

$SIG{__WARN__} = 'some_sub';
$SIG{__WARN__} = \&some_sub;
$SIG{__WARN__} = sub { ... };

Replacing the default behavior is a good use for a __WARN__ handler. The cluck subroutine from Carp turns the warning message into a backtrace. If you want that for all warnings, you set it up as early as possible:

BEGIN { $SIG{__WARN__} = \&Carp::cluck; }

You don’t need to change all warnings for the entire program, though. If you need to track down the code that triggers the warning, you probably want to limit your replacement behavior to the code you’re investigating:

{
local $SIG{__WARN__} = \&Carp::cluck;
...;
}

Let’s have more fun, though.

Something more fun

You can get more fancy though, because you can do almost anything you like. Here’s a little program that issues several warnings, which at first you won’t intercept. This is a nonsense program that only exists to generate warnings, some of which you may have never seen before. Note the use of Perl 5.12 for the completely legal .... Since you never call chomp, the runtime never gets a chance to make those fatal errors even though they compile just fine:

use warnings;
use v5.12;

sub chomp { ... };
*chomp = sub { ... };
chomp( $ARGV[0] );
my $sum;
exec 'Buster';
print (STDOUT), 1, 2, 3;
print $a;
accept( SOCKET, GENERIC );
connect( SOCKET, 'Mimi' );
chmod 777, 'Mimi';
open FOO, '|Buster|';
close FOO;

say 'At the end!';

The warnings are legion:

Ambiguous call resolved as CORE::chomp(), qualify as such or use & at warnings line 8.
print (...) interpreted as function at warnings line 14.
Useless use of a constant (2) in void context at warnings line 14.
Useless use of a constant (3) in void context at warnings line 14.
Statement unlikely to be reached at warnings line 14.
        (Maybe you meant system() when you said exec()?)
Name "main::a" used only once: possible typo at warnings line 15.
Name "main::GENERIC" used only once: possible typo at warnings line 17.
Subroutine main::chomp redefined at warnings line 6.
Use of uninitialized value $ARGV[0] in scalar chomp at warnings line 8.
Can't exec "Buster": No such file or directory at warnings line 11.
Use of uninitialized value $_ in print at warnings line 14.
Use of uninitialized value $a in print at warnings line 15.
accept() on unopened socket GENERIC at warnings line 17.
connect() on unopened socket SOCKET at warnings line 18.
Can't open bidirectional pipe at warnings line 20.
Can't exec "Buster": No such file or directory at warnings line 20.
At the end!

Suppose that you wanted to count those warnings, though? You could set up a handler. Since many of those warnings are from the compile-phase, you have to set up the handler at compile time by using a BEGIN block:

use strict;
use warnings;
use v5.12;

BEGIN {
	$SIG{__WARN__} = sub {
		state $count = 0;
		printf '[%04d] %s', $count++, @_;
		};
	}

sub chomp { ... };
*chomp = sub { ... };
chomp( $ARGV[0] );
my $sum;
exec 'Buster';
print (STDOUT), 1, 2, 3;
print $a;
accept( SOCKET, GENERIC );
connect( SOCKET, 'Mimi' );
chmod 777, 'Mimi';
open FOO, '|Buster|';
close FOO;

say 'At the end!';

Now you see each warning has a number:

[0000] Ambiguous call resolved as CORE::chomp(), qualify as such or use & at warnings line 14.
[0001] print (...) interpreted as function at warnings line 17.
[0002] Useless use of a constant (2) in void context at warnings line 17.
[0003] Useless use of a constant (3) in void context at warnings line 17.
[0004] Statement unlikely to be reached at warnings line 17.
[0005]  (Maybe you meant system() when you said exec()?)
[0006] Name "main::a" used only once: possible typo at warnings line 18.
[0007] Name "main::GENERIC" used only once: possible typo at warnings line 19.
[0008] Subroutine main::chomp redefined at warnings line 13.
[0009] Use of uninitialized value $ARGV[0] in scalar chomp at warnings line 14.
[0010] Can't exec "Buster": No such file or directory at warnings line 16.
[0011] Use of uninitialized value $_ in print at warnings line 17.
[0012] Use of uninitialized value $a in print at warnings line 18.
[0013] accept() on unopened socket GENERIC at warnings line 19.
[0014] connect() on unopened socket SOCKET at warnings line 20.
[0015] Can't open bidirectional pipe at warnings line 22.
[0016] Can't exec "Buster": No such file or directory at warnings line 22.
At the end!

That’s interesting, but it can be even more interesting. Can you label the ones that are from the compile phase? You can check the phase with the ${^GLOBAL_PHASE} variable added to Perl 5.14:

use v5.14;

BEGIN {
	$SIG{__WARN__} = sub {
		state $count = 0;
		printf '[%04d] %s - %s', $count++, ${^GLOBAL_PHASE}, @_;
		};
	}

# ... rest of program

Now the output shows the phase too:

[0000] START - Ambiguous call resolved as CORE::chomp(), qualify as such or use & at warnings line 14.
[0001] START - print (...) interpreted as function at warnings line 17.
[0002] START - Useless use of a constant (2) in void context at warnings line 17.
[0003] START - Useless use of a constant (3) in void context at warnings line 17.
[0004] START - Statement unlikely to be reached at warnings line 17.
[0005] START -  (Maybe you meant system() when you said exec()?)
[0006] START - Name "main::a" used only once: possible typo at warnings line 18.
[0007] START - Name "main::GENERIC" used only once: possible typo at warnings line 19.
[0008] RUN - Subroutine main::chomp redefined at warnings line 13.
[0009] RUN - Use of uninitialized value $ARGV[0] in scalar chomp at warnings line 14.
[0010] RUN - Can't exec "Buster": No such file or directory at warnings line 16.
[0011] RUN - Use of uninitialized value $_ in print at warnings line 17.
[0012] RUN - Use of uninitialized value $a in print at warnings line 18.
[0013] RUN - accept() on unopened socket GENERIC at warnings line 19.
[0014] RUN - connect() on unopened socket SOCKET at warnings line 20.
[0015] RUN - Can't open bidirectional pipe at warnings line 22.
[0016] RUN - Can't exec "Buster": No such file or directory at warnings line 22.
At the end!

Now each phase has its own warning counter:

START-0000  Ambiguous call resolved as CORE::chomp(), qualify as such or use & at warnings line 15.
START-0001  print (...) interpreted as function at warnings line 18.
START-0002  Useless use of a constant (2) in void context at warnings line 18.
START-0003  Useless use of a constant (3) in void context at warnings line 18.
START-0004  Statement unlikely to be reached at warnings line 18.
START-0005      (Maybe you meant system() when you said exec()?)
START-0006  Name "main::a" used only once: possible typo at warnings line 19.
START-0007  Name "main::GENERIC" used only once: possible typo at warnings line 20.
RUN-0000  Subroutine main::chomp redefined at warnings line 14.
RUN-0001  Use of uninitialized value $ARGV[0] in scalar chomp at warnings line 15.
RUN-0002  Can't exec "Buster": No such file or directory at warnings line 17.
RUN-0003  Use of uninitialized value $_ in print at warnings line 18.
RUN-0004  Use of uninitialized value $a in print at warnings line 19.
RUN-0005  accept() on unopened socket GENERIC at warnings line 20.
RUN-0006  connect() on unopened socket SOCKET at warnings line 21.
RUN-0007  Can't open bidirectional pipe at warnings line 23.
RUN-0008  Can't exec "Buster": No such file or directory at warnings line 23.
At the end!

This leads to a deliciously evil plan: what if you can stop your program from running if it had more warnings than it did on the last run? The Test::Perl::Critic::Progressive module that already does something similar for Perl::Critic. Inside this __WARN__, you can use a die to stop the program:

use strict;
use warnings;
use v5.12;

BEGIN {
	my $file = "$0.warn";
	my $count = {};

	$SIG{__WARN__} = sub { # refactor when you figure it out
		state $previous_counts = do {
			unless( -e $file ) { my $hash = {} }
			else {
				local @ARGV = $file;
				my $hash;
				while( <> ) {
					chomp;
					my( $phase, $count ) = split;
					$hash->{$phase} = $count;
					}
				$hash;
				}
			};

		$count->{${^GLOBAL_PHASE}}++;

		die "Too many warnings in ${^GLOBAL_PHASE}\n"
			if $count->{${^GLOBAL_PHASE}} >
				( $previous_counts->{${^GLOBAL_PHASE}} // 0 ); #/

		printf '%s-%04d  %s',
			${^GLOBAL_PHASE}, $count->{${^GLOBAL_PHASE}}, @_;

		};

	END { # inside a BEGIN!
		open my $f, '>', $file;
		while( my( $k, $v ) = each %$count ) {
			say $f "$k $v";
			}
		}
	}

sub chomp { ... };
*chomp = sub { ... };

# chomp( @ARGV );  # uncomment for another warning

say 'At the end!';

When you run this, the program stops when it encounters more errors that it did before:

$ perl5.14.1 warnings
START-0001  Ambiguous call resolved as CORE::chomp(), qualify as such or use & at warnings line 46.
Too many warnings in RUN

Things to remember

  • You can intercept warnings with $SIG{__WARN__}
  • Set up $SIG{__WARN__} in a BEGIN to intercept warnings right away

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit

Know the difference between utf8 and UTF-8

Perl actually has two encodings that get the letters u, t, f, and 8. One will happily let you do bad things, and the other will let you do bad things but with a warning that you can make fatal.

There’s an encoding layer with the name :utf8 and there’s the encoding name UTF-8 that you use with :encoding:

binmode $fh, ':utf8';
binmode $fh, ':encoding(UTF-8)';

You can even use the non-hyphen version with :encoding:

binmode $fh, ':encoding(UTF8)';

These aren’t the same thing. The :utf8 layer comes from Perl 5.6, the first version of Perl that had even rudimentary Unicode support. It encodes any characters in the range from 0 to 0xFFFF_FFFF. That is, it allows for a 32-bit encoding space. You have no problem with this code:

use 5.014;
use strict;
use warnings;

my $string = "invalid -> \x{11000}";
my $output;

{
open my $string_fh, '>:utf8', \$output;
print $string_fh $string;
}

{
open my $string_fh, '<:raw', \$output;
my @values = map { sprintf '%X', ord } split //, readline( $string_fh );
say join ' ', @values;
}

This code writes to a string filehandle using the loose utf8 encoding and opens another read filehandle using the raw filehandle so you can see the bytes without any processing. The output shows the bytes in the output. The F0 91 80 80 represents the invalid character:

69 6E 76 61 6C 69 64 20 2D 3E 20 F0 91 80 80

Going the other way, reading in the file with the same encoding, doesn't cause any problems either.

use 5.014;
use strict;
use warnings;

my $string = "invalid -> \x{11000}";
my $output;

{
open my $string_fh, '>:utf8', \$output;
print $string_fh $string;
}

{
open my $string_fh, '<:utf8', \$output;
my @values = map { sprintf '%X', ord } split //, readline( $string_fh );
say join ' ', @values;
}

When you use the same layer to read the data, you get the same characters you started. Instead of F0 91 80 80 you get 11000:

69 6E 76 61 6C 69 64 20 2D 3E 20 11000

However, the Universal Character Set highest valid code number is 0x10FFFF, and even some of the characters inside that range aren't valid in UTF-8, such as the surrogates in the range 0xD800–DFFF, which you use to encode characters in the supplementary plane in UTF-16. If none of that makes sense, just remember that UTF-16 comes from the time when we thought the UCS would be a 16-bit encoding space and that two bytes would be enough for everyone (and how often has that not be true in history?). The "characters" in the surrogate range aren't characters. They are an ugly hack to let an ancient 16-bit system deal with a 21-bit system. You shouldn't be able to successfully read those characters.

use 5.014;
use strict;
use warnings;

my $string = "invalid -> \x{D800}";
my $output;

{
open my $string_fh, '>:utf8', \$output;
print $string_fh $string;
}

{
open my $string_fh, '<:utf8', \$output;
my @values = map { sprintf '%X', ord } split //, readline( $string_fh );
say join ' ', @values;
}

This code at least emits a warning:

Unicode surrogate U+D800 is illegal in UTF-8 at invalide.pl line 9.
69 6E 76 61 6C 69 64 20 2D 3E 20 D800

You only get this warning if you turn on warnings in Perls 5.10 and 5.12, but you get it even without warnings in Perl 5.14. But, it still works.

Try any of this with the actual UTF-8 encoding though, and odd things ensue:

use 5.010;
use strict;
use warnings;

my $string = "invalid -> \x{D800}";
my $output;

{
open my $string_fh, '>:utf8', \$output;
print $string_fh $string;
}

{
open my $string_fh, '<:encoding(UTF-8)', \$output;
my @values = map { sprintf '%X', ord } split //, readline( $string_fh );
say join ' ', @values;
}

The output gives two different warnings, and some odd output:

Unicode surrogate U+D800 is illegal in UTF-8 at invalide.pl line 10.
utf8 "\xD800" does not map to Unicode at invalide.pl line 15.
69 6E 76 61 6C 69 64 20 2D 3E 20 5C 78 7B 44 38 30 30 7D

That output is much longer than the previous output. Now you get 5C 78 7B 44 38 30 30 7D. If you know your code points, you'll recognize that as the literal characters \x{D800}.

You can convince yourself that this happens by creating the encoded string directly:

use 5.010;
use strict;
use warnings;

my $string = pack 'C*', map { hex } split /\s/,
	'69 6E 76 61 6C 69 64 20 2D 3E 20 ED A0 80';
say $string;

open my $string_fh, '<:encoding(UTF-8)', \$string;
my $read = readline( $string_fh );
say $read;
my @values = map { sprintf '%X', ord } split //, $read;
say join ' ', @values;

You get the same output, still with a warning:

invalid -> í ?
utf8 "\xD800" does not map to Unicode at invalide.pl line 10.
invalid -> \x{D800}
69 6E 76 61 6C 69 64 20 2D 3E 20 5C 78 7B 44 38 30 30 7D

This is a problem. The data you get aren't the data that are in the file. Writing the data with UTF-8 doesn't give a warning either:

use 5.010;
use strict;
use warnings;

my $string = "invalid -> \x{D800}";
my $output;

{
open my $string_fh, '>:encoding(UTF-8)', \$output;
print $string_fh $string;
}

{
open my $string_fh, '<:raw', \$output;
my @values = map { sprintf '%X', ord } split //, readline( $string_fh );
say join ' ', @values;
}

The output is:

"\x{d800}" does not map to utf8 at invalide.pl line 9.
69 6E 76 61 6C 69 64 20 2D 3E 20 5C 78 7B 44 38 30 30 7D

Huh? Perl will happily write the data, changing it on the way out. That's no good. Why is this happening?

There are several ways that Perl can deal with bad data as it encodes. That's not to say any of them are how Perl should deal with those data, but that's not the point. In this case, the Encode module is using its internal perlqq mode. When it finds an invalid character, it turns it into its code number and puts \x{} around it. If you were using the Encode module directly, you have control over those invalid characters.

use 5.010;
use strict;
use warnings;

use Encode qw(encode :fallbacks);

my $string = "invalid -> \x{D800}";

$string = encode( 'UTF-8', $string, FB_PERLQQ ); # what you already have

say 'The string is now[ ', $string, ']';

The output is what you got before (but without a warning because its handling is explicit):

The string is now[ invalid -> \x{D800}]

The other constants give different results:

Constant Effect String
FB_PERLQQ Replace with XML entity Convert to \x{NNNN}
FB_XMLCREF Replace with XML entity Convert to &#xdddd;
FB_HTMLCREF Replace with HTML entity Convert to &#dddddd;
FB_DEFAULT Replace with the substitution character Convert to �
FB_CROAK Die
FB_QUIET Stop encoding, with no warning
FB_WARN Stop encoding, with a warning

You probably don't want to handle everything at that level in most cases, though. If you have invalid data, you need to fix that before it gets out to the world. You have the warning though. That means that you can make that operation fatal without going through Encode:

use warnings qw(FATAL utf8);

Things to remember

  • The :utf8 encoding, and variations on it without a hyphen, is Perl's looser encoding.
  • Using UTF-8, in any case and with either a hyphen or underscore, is the strict, valid encoding and gives a warning for invalid sequences.
  • Only use the :encoding(UTF-8) and make its warnings fatal.

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit

Know the difference between character strings and UTF-8 strings

Normally, you shouldn’t have to care about a string’s encoding. Indeed, the abstract string has no encoding. It exists as an idea without a representation and it’s not until you want to put it on disk, send it down a pipe, or otherwise force it to exist as electrical pulses, magnetic pole orientation, and so on. All stored data, even ASCII, has an encoding. Until you force it to have a bit pattern to live in the tangible world, you shouldn’t have to worry about anything like an encoding.

An abstract character string is one where Perl can recognize each grapheme cluster as a unit, and there is no encoding involved at the user level. Perl has to store these, but you don’t (shouldn’t) play with the string at that level.

A UTF-8–encoded string is one where the octets in the string are the same as in the UTF-8 representation. Perl sees a string of octets and cannot recognize grapheme clusters.

Consider this example. In

use v5.14;
use utf8;

# # # Abstract character string
my $char_string = 'Büster';

say "Length of char string is ", length $char_string; #6
say join " ", map { sprintf '%X', ord } split //, $char_string;

# # # UTF-8–encoded octet string
open my $fh, '>:utf8', \my $utf8_string;
print $fh $char_string;
close $fh;

say "Length of utf8 string is ", length $utf8_string; # 7
say join " ", map { sprintf '%X', ord } split //, $utf8_string;

The output shows that the same are two are different things because one is a string of characters and one a string of octets:

use v5.14;
use utf8;

# # # Abstract character string
my $char_string = 'Büster';

say "Length of char string is ", length $char_string; #6
say join " ", map { sprintf '%X', ord } split //, $char_string;

# # # UTF-8–encoded octet string
open my $fh, '>:utf8', \my $utf8_string;
print $fh $char_string;
close $fh;

say "Length of utf8 string is ", length $utf8_string; # 7
say join " ", map { sprintf '%X', ord } split //, $utf8_string;

The output shows the difference. In the character string, the ü shows up as the single character with code number 0xFC. In the UTF-8 version, the code number 0xFC is represented as 0xC3 0xBC. Since this is just a string of octets, Perl thinks that this version is one character longer:

Length of char string is 6
42 FC 73 74 65 72
Length of utf8 string is 7
42 C3 BC 73 74 65 72

For most of your programming, you shouldn’t have to care about encoding. You want to have character data with no representation and operate on abstract characters. You don’t care at all about the encoding and how many bytes a character turns into. That’s merely a storage issue. Virtually no one can tell you, off the top of their heads, what the UTF-8 representation of a string is because no one thinks in UTF-8. No one wants to do that during string manipulation, either.

The problem is that some interfaces want the encoded data instead of the abstract character string. These modules usually expect that you’re giving it data directly from another source without turning it into a Perl string. If you need to review these concepts, check out the “Unicode” chapter in Effective Perl Programming.

Consider the JSON module’s decode function expects a UTF-8–encoded string, thinking you’re going to take it directly from an HTTP response. This item is not about using this module correctly, but it’s a convenient example for the general idea.

This works just fine because the value in $json_data is a UTF-8–encoded string instead of a abstract character string:

use JSON;
use LWP::Simple qw(get);

my $json_data = get( 'http://www.example.com/data.json' );

my $perl_hash = decode_json( $json_data );

The decode_json doesn’t expect you to do anything with the data that you get from the website before you give it to decode_json, who’s job it is to both decode the data and to convert the data from JSON to Perl. It’s documented this way. Instead of making you decode it in the response, it uses the data just as you would get it in the message body of the HTTP response.

If you are doing extra processing, however, you can get in trouble. For instance, the HTTP::Response object can decode the message body for you, turning UTF-8 data into an abstract character string. If you call decoded_content and pass the result to decode_json, it fails:

use Encode;
use JSON;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

my $response = $ua->get( 'http://www252.pair.com/~comdog/for/data.json' );

my $content = $response->content;
print "Length content is ", length $content, "\n";

my $decoded_content = $response->decoded_content;
print "Length decoded content is ", length $decoded_content, "\n";

# this is fine
my $perl_hash = decode_json( $content );

# this is not fine
my $decoded_hash = decode_json( $decoded_content );

If you have your input string as an abstract character string, the decode method might fail. If it’s all characters in the ASCII range, it doesn’t matter because the UTF-8 representation is the same as the ASCII representation:

use utf8;
use JSON;

my $json_data = q( { "cat" : "Buster" } );

my $perl_hash = decode_json( $json_data );

Give it something outside the ASCII range, and things go wrong:

use utf8;
use JSON;

my $json_data = qq( { "cat" : "Büster" } );

my $perl_hash = decode_json( $json_data );

The error says it has a malformed UTF-8 character. In an abstract character string, the ü is 0xFC, which isn’t a valid UTF-8 sequence:

malformed UTF-8 character in JSON string, at character offset 13 (before "\x{33d25ca2} } ") at string.pl line 6.

In this case, you need to turn your abstract character string into a UTF-8–encoded string, just like it would look as if you had stored it in a file. You can encode it (going from the abstract character string to the UTF-8 version) with the Encode module (Item 75. Convert octet strings to character strings.):

use utf8;
use Encode qw(encode_utf8);
use JSON;

my $json_data = qq( { "cat" : "Büster" } );
$json_data = encode_utf8( $json_data );

my $perl_hash = decode_json( $json_data );

You can also print to a scalar reference, using the encoding that you need (Item 54. Open filehandles to and from strings):

use utf8;
use Encode qw(encode_utf8);
use JSON;

my $json_data = qq( { "cat" : "Büster" } );
open my $fh, '>:utf8', \my $utf8_string;
print $fh $json_data;

my $perl_hash = decode_json( $utf8_string );

If you already have the text in a file and need it un-decoded, you can read it with the :raw layer so perl does not decode it (possibly with default layers set far away):

use Encode qw(encode_utf8);
use JSON qw(decode);

open my $fh, '<:raw', $file;
my $json_data = do { local $/; <$fh> };

my $perl_hash = decode_json( $utf8_string );

Doing it differently in JSON

You don’t have to use JSON‘s decode_json function. Using the object interface, you can tell the decoder what you’re giving it. If you want to give it a UTF-8–encoded string, you tell it to expect UTF-8:

use JSON;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

my $response = $ua->get( 'http://www.example.com/data.json' );

my $content = $response->content;

my $perl_hash = JSON->new->utf8->decode( $content );

If you want to give it character data, you don’t tell the object to expect UTF-8:

use JSON;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

my $response = $ua->get( 'http://www.example.com/data.json' );

my $content = $response->decoded_content;

my $perl_hash = JSON->new->decode( $decoded_content ); # no ->utf8

Things to remember

  • Character string have no encoding, and Perl can recognize its grapheme clusters
  • An encoded string is a series of octets that Perl doesn’t recognize as grapheme clusters
  • Check your interface to see which one you should use

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit

Use a Task distribution to specify groups of modules

Create Task distributions

A Task distribution is like a normal Perl distribution in structure, but it doesn’t actually provide any code. It lists as pre-requisites all of the modules or distributions that you want to install so you can use a conventiional CPAN tool to install all of the dependencies. A Task is slightly different from the older way, a Bundle, but for most people and uses, a Task might be a better way.

You can create your Task::* distribution with your favorite module tool (see Item 80. Don’t start distributions by hand). Once you have the basic structure, you edit the module file to document what your task does and you edit the build file to list the dependencies.

Suppose that you wanted to make one to install all of the modules mentioned in Effective Perl Programming (and we have in Task::EffectivePerlProgramming).

To do this yourself, you first create the stub distribution:

$ module-starter --module=Task::EffectivePerlProgramming
Class is Module::Starter
Created Task-EffectivePerlProgramming
Created Task-EffectivePerlProgramming/lib/Task
Created Task-EffectivePerlProgramming/lib/Task/EffectivePerlProgramming.pm
Created Task-EffectivePerlProgramming/t
Created Task-EffectivePerlProgramming/t/pod-coverage.t
Created Task-EffectivePerlProgramming/t/pod.t
Created Task-EffectivePerlProgramming/t/manifest.t
Created Task-EffectivePerlProgramming/t/boilerplate.t
Created Task-EffectivePerlProgramming/t/00-load.t
Created Task-EffectivePerlProgramming/ignore.txt
Created Task-EffectivePerlProgramming/Changes
Created Task-EffectivePerlProgramming/Build.PL
Created Task-EffectivePerlProgramming/README
Created Task-EffectivePerlProgramming/MANIFEST
Created starter directories and files

Next, you edit the .pm file to document your Task::* and to set a version number (the date is a good choice). Since this is a distribution like any other Perl distribution, when you install the Task, not only will the CPAN tool handle the dependencies, but it will also install the Task::* module, in this case Task::EffectivePerlProgramming. By adding a version to your distribution, you can update it later and use your CPAN tool to update your dependencies.

Here’s an extract of lib/Task/EffectivePerlProgramming.pm file:

package Task::EffectivePerlProgramming;

our $VERSION = '20100714';

=head1 NAME

Task::EffectivePerlProgramming - All of the modules mentioned in Effective Perl Programming, 2nd Edition

=head1 SYNOPSIS

This is just a Task module to install dependencies. There's no code to use
or run.

=head1 DESCRIPTION

These are the modules we used in the book:

=over 4

=item * Apache::DBI

=item * Apache::Perldoc

=item * etc, etc

=back

=head1 LICENSE AND COPYRIGHT

Copyright 2010 brian d foy.

You can distribute this module under the same terms as Perl itself.

=cut

1;

In the Build.PL, you then list each of the modules that you want to install as a dependency. If you don’t care which version of the dependency you want to install, you can use '0' as the version since any version should be equal to or greater than that:

use strict;
use warnings;
use Module::Build;

my $builder = Module::Build->new(
    module_name         => 'Task::EffectivePerlProgramming',
    license             => 'perl',
    dist_author         => q{brian d foy },
    dist_version_from   => 'lib/Task/EffectivePerlProgramming.pm',
    requires            => {
		'Apache::DBI'     => '0',
		'Apache::Perldoc' => '0',
		'App::Ack'        => '0',
		# etc, etc,
		},
    build_requires => {
        'Test::More' => 0,
    	},
    add_to_cleanup     => [ 'Task-EffectivePerlProgramming-*' ],
    create_makefile_pl => 'traditional',
	);

$builder->create_build_script();

That’s it; you’re done! Well, don’t forget to run the tests to check your Pod for proper formatting. Once you’re ready, you can create your distribution:

$ perl Build.PL
$ ./Build test
...
$ ./Build dist
Creating Makefile.PL
Deleting META.yml
Creating META.yml
Creating Task-EffectivePerlProgramming-20100714
Creating Task-EffectivePerlProgramming-20100714.tar.gz
Deleting Task-EffectivePerlProgramming-20100714

You can upload your distribution to CPAN, put it in your private module repository, email it to your coworkers, or pass it around however you like.
If it’s on CPAN, you can install it as you would any other module:

$ cpan Task::EffectivePerlProgramming

Installing the task from a local file

You don’t have to upload your Task distribution to CPAN to use it. You can install the prerequisites of a distribution as long as you have that distribution. You could

In the cpan command-line tool, you can install the current directory (just the dot), including its dependencies:

$ cd Task-EffectivePerlProgramming
$ cpan .

The cpanminus command-line tool can install just the dependencies without installing the module inself (but that’s not necessarily a good thing):

$ cd Task-EffectivePerlProgramming
$ cpanm --installdeps .

Since this Task used Module::Build, you can use the installdeps build target, although you have an extra step now:

$ cd Task-EffectivePerlProgramming
$ perl Build.PL
$ Build installdeps

Bundles are still good

Task distributions take the place of CPAN.pm’s auto-bundles, mostly, but miss a key feature of Bundles, which can install modules given a local file path instead of just a repository path.

You can generate an autobundle with CPAN.pm. The output lists the currently installed versions, the latest version in CPAN, and the relative path in CPAN-like archive. At the end, it writes the actual auto-bundle file:

$ cpan -a
...
Net::POP3                      2.29      2.29  GBARR/libnet-1.22.tar.gz
Net::SMTP                      2.31      2.31  GBARR/libnet-1.22.tar.gz
Net::SSL                       2.84      2.85  NANIS/Crypt-SSLeay-0.58.tar.gz
...
warnings                       1.06      1.11  JESSE/perl-5.13.9.tar.gz
warnings::register             1.01      1.02  JESSE/perl-5.13.9.tar.gz

Wrote bundle file
    /Users/Buster/.cpan/Bundle/Snapshot_2011_08_16_00.pm

The bundle file has some metadata and a list of packages and versions:

package Bundle::Snapshot_2011_08_16_00;

$VERSION = '0.01';

1;

__END__

=head1 NAME

Bundle::Snapshot_2011_08_16_00 - Snapshot of installation on Buster on Tue Aug 16 15:05:40 2011

=head1 SYNOPSIS

perl -MCPAN -e 'install Bundle::Snapshot_2011_08_16_00'

=head1 CONTENTS

Algorithm::Diff 1.1902

Apache::DBI 1.08

Apache::Perldoc 1.11

Apache::Test 1.31

App::Cache 0.36

App::Cmd 0.301

App::Module::Lister 0.13

App::cpanminus 1.0004

AppConfig 1.66

This auto-bundle is a list of everything you have installed, but it doesn’t have to be. You can list any packages and versions that you like. It’s completely up to you to decide what should be part of the task.

Putting this anywhere in your @INC (such as the current working directory) allows you to use the cpan tool to install it:


$ cd /Users/Buster/.cpan
$ cpan Bundle::Snapshot_2011_08_16_00

CPAN.pm handles Bundles specially to handle this.

Since the auto-bundle is a single file (although in a directory named Bundle), it might be easier for you to pass around. Either way, you have a way to tell people everything that should install.

Some interesting Tasks

Some Task distributions cover big topics, such as:

There are several Task::BeLike:: distributions that list people’s favorite tools and modules. If you want to do the same sorts of things as these people, you might want to install their BeLike distributions:

Things to remember

  • Collect project dependencies in a Task distribution
  • Install the Task distribution like any other module distribution, or directly from the local directory
  • Several tasks already exist on CPAN.

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit

Group tests by their task with Test::More’s subtest()

In the earlier Item, Understand the Test Anywhere Protocal (TAP), you saw the very basics of that simple, line-oriented test report. You ran a single test and it output a single line to denote the status of the test, and possibly some diagnostic information. The TAP, however, didn’t organize any of the tests for you.

On the grand scale, you can separate tests for different parts of a system into separate files. However, even in those separate test files, you probably want to group various tests together as a logical unit.

Consider this test script, which I extracted from the HTML-SimpleLinkExtor-1.23 distribution, one of my CPAN offerings. I’m not proud of this test file, but it is what it is, and I’ll improve it:

use File::Spec;
use Test::More 'no_plan';

use_ok( "HTML::SimpleLinkExtor" );
ok( defined &HTML::SimpleLinkExtor::schemes, "schemes() is defined" );

my $file = 't/example2.html';
ok( -e $file, "Example file is there" );

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 

{
my $p = HTML::SimpleLinkExtor->new;
ok( ref $p, "Made parser object" );
isa_ok( $p, 'HTML::SimpleLinkExtor' );
can_ok( $p, 'schemes' );

$p->parse_file( $file );

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
{
my @links = $p->schemes( 'http' );
my $links = $p->schemes( 'http' );

is( $links, 7, "Got the right number of HTTP links" );
is( scalar @links, $links, "Found the right number of links" );
}

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
{
my @links = $p->schemes( 'https' );
my $links = $p->schemes( 'https' );

is( $links, 2, "Got the right number of HTTPS links" );
is( scalar @links, $links, "Found the right number of links" );
}

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
{
my @links = $p->schemes( 'ftp' );
my $links = $p->schemes( 'ftp' );

is( $links, 1, "Got the right number of FTP links" );
is( scalar @links, $links, "Found the right number of links" );
}

}

There’s a few things you’ll notice about this test. First, there are some tests that are really meta-tests to check various things before the real tests start. Before you test a module, you want to check that it loads and perhaps that the method you want to test is there. You might be surprised how many times I’ve wasted time looking for the problem with a method when I had one name in the module and a different name in the tests, never noticing the dissonance. As such, I test for the right method name now. This test also needs an input file, so I ensure that it exists before I start:

use_ok( "HTML::SimpleLinkExtor" );
ok( defined &HTML::SimpleLinkExtor::schemes, "schemes() is defined" );

my $file = 't/example2.html';
ok( -e $file, "Example file is there" );

This particular test file exercises the schemes method that returns links for the give protocol. Before I can test that, I need the object, so there are some tests for that:

my $p = HTML::SimpleLinkExtor->new;
ok( ref $p, "Made parser object" );
isa_ok( $p, 'HTML::SimpleLinkExtor' );
can_ok( $p, 'schemes' );

Notice that these tests are at the start of a scope created by a naked block. That parser in $p only exists in that block. After I call parse, I have a series of naked blocks to define scopes for tests that I logically group. In the TAP output, I can tell what’s going on by the labels:

$ perl5.14.1 -Mblib t/schemes.t
ok 1 - use HTML::SimpleLinkExtor;
ok 2 - schemes() is defined
ok 3 - Example file is there
ok 4 - Made parser object
ok 5 - The object isa HTML::SimpleLinkExtor
ok 6 - HTML::SimpleLinkExtor->can('schemes')
ok 7 - Got the right number of HTTP links
ok 8 - Found the right number of links
ok 9 - Got the right number of HTTPS links
ok 10 - Found the right number of links
...
ok 28 - Found the right number of links
1..28

However, this doesn’t really group the tests. I have to separate themselves because I choose the labels.

Version 0.94 of Test::More adds support for nested TAP, which I did not cover in Understand the Test Anywhere Protocal. This enhancement to TAP allows me to group many tests as part of one larger test. The subtest function in Test::More handles it for me. The subtest takes a label and a code reference. In this case, you use an anonymous subroutine that runs additional tests:

use Test::More;

subtest 'Roscoe tests' => sub {
	use_ok( 'CGI' );
	pass( 'First post' );
	};

subtest 'Buster tests' => sub {
	use_ok( 'CGI' );
	pass( 'First post' );
	fail( 'Oops' );
	};

subtest 'Mimi tests' => sub {
	can_ok( 'CGI', 'head' );
	pass( 'First post' );
	};

done_testing();

Notice the semicolon after each subtest statement. That anonymous subroutine definition is an expression, so you need a statement separator, just like you need for a do block.

Nested TAP works by indented each level of nested tests. If all of the nested tests pass, it prints an outdented line that is the summary for that group. The overall test passes if all of its subtests pass, and fails if any of its subtests fail:

$ perl5.14.1 nested.t
    ok 1 - use CGI;
    ok 2 - First post
    1..2
ok 1 - Roscoe tests
    ok 1 - use CGI;
    ok 2 - First post
    not ok 3 - Oops
    #   Failed test 'Oops'
    #   at nested.pl line 11.
    1..3
    # Looks like you failed 1 test of 3.
not ok 2 - Buster tests
#   Failed test 'Buster tests'
#   at nested.pl line 12.
    ok 1 - CGI->can('head')
    ok 2 - First post
    1..2
ok 3 - Mimi tests
1..3
# Looks like you failed 1 test of 3.

Notice that each subtest group gets its own plan, test count, and indented diagnostics. The overall summary only counts the subtest groups. The summary for each subtest group shows up after the subtests (they have to run first, after all).

Going back to my HTML::SimpleLinkExtor test, I can reorganize the test into groups within subtest groups that label them by their logical task. I have basic sanity checks to ensure I have everything I need to test (“Sanity check”). I have the basic setup to start the tests (“Create object”). Once I have all of those, I have tests grouped by the scheme I want to check (“HTTP” and so on). There are three levels of nested TAP here:

use File::Spec;
use Test::More 0.96;

my $file = 't/example2.html';

subtest 'Sanity check' => sub {
	use_ok( "HTML::SimpleLinkExtor" );
	ok( defined &HTML::SimpleLinkExtor::schemes, "schemes() is defined" );

	ok( -e $file, "Example file is there" );
	};

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

subtest 'No base' => sub {
	my $p;

	subtest 'Create object' => sub {
		$p = HTML::SimpleLinkExtor->new;
		ok( ref $p, "Made parser object" );
		isa_ok( $p, 'HTML::SimpleLinkExtor' );
		can_ok( $p, 'schemes' );
		$p->parse_file( $file );
		};

	subtest 'All links' => sub {
		my @links = $p->links;

		is( scalar @links, 26, "Found the right number of links" );
		};

	# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
	subtest 'HTTP' => sub {
		my @links = $p->schemes( 'http' );
		my $links = $p->schemes( 'http' );

		is( $links, 7, "Got the right number of HTTP links" );
		is( scalar @links, $links, "Found the right number of links" );
		};

	...
	};

If something fails in that test, the indented diagnostics and subtests label help me figure out what logical task failed. In this case, testing the “No base” case for the “All links” test had a problem even though the absolute amount of output is much larger:

$ make test
/Users/brian/bin/perls/perl5.14.1 "-MTest::Manifest" "-e" "run_t_manifest(0, 'blib/lib', 'blib/arch',  )"
t/compile.t ......... ok
t/pod.t ............. skipped: Test::Pod 1.00 required for testing POD
t/pod_coverage.t .... skipped: Test::Pod::Coverage required for testing POD
t/parse.t ........... ok
t/tags.t ............ ok
t/schemes.t ......... 1/?
        #   Failed test 'Found the right number of links'
        #   at t/schemes.t line 30.
        #          got: '26'
        #     expected: '27'
        # Looks like you failed 1 test of 1.

    #   Failed test 'All links'
    #   at t/schemes.t line 31.
    # Looks like you failed 1 test of 7.

#   Failed test 'No base'
#   at t/schemes.t line 78.
# Looks like you failed 1 test of 4.
t/schemes.t ......... Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/4 subtests

Since I recently had to update HTML::SimpleLinkExtor and its tests were almost already grouped into subtests, I converted them to use subtest. If you look at HTML-SimpleLinkExtor-1.24 or greater, you’ll see more examples of subtest.

Things to remember

  • Use subtest to group related tests by task
  • Nested TAP works by successive indenting of the TAP stream
  • Each nested section has its own plan and test count

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit

Turn off autovivification when you don’t want it

Autovivification, although a great feature, might bite you when you don’t expect it. I explained this feature in Understand autovivification, but I didn’t tell you that there’s a way to control it and even turn it off completely.

The autovivification pragma, which you can get from CPAN, lets you decide how autovivification works, or doesn’t work. As you saw in Understand autovivification, there are several ways that autovivification comes into play, specifically when you dereference an undefined value to:

  • fetch a value
  • store a value
  • test an undefined value with exists
  • delete an element

The pragma lets you control each of these independently.

The simplest use is simply unimport the module with no. This gets a bit tricky because this pragma is more interesting for what you disallow rather than enable:

no autovivification;

With no arguments, this turns off autovivification for fetching a value, or using delete or exists. This does not turn off autovivification for storing a value. You’re more likely to make mistakes with those first three, and most likely to get what you want with the last one.

None of these work, in that they don’t create any more of the data structure:

no autovivification;

my $cats;
delete $cats{'Buster'};
my $foo = $cats{'Buster'}; 

if( exists $cats{'Buster'} ) { 1 }

use Data::Dumper;
print Dumper( $cats );

You don’t get an error or a warning, but nothing happens to $cats, which is still undefined at the end. The pragma just leaves the undefined value as it is and your program continues. In those cases, you still get the answers that expect. When the element isn’t there, you still get undef when you try to access it. It stills returns false when you try exists and the element isn’t there after a exists. Your program continues without a warning.

If you want to know when you’ve skipped at possible autovivification, you can also tell autovivification to warn you, or if you’re really paranoid, kill your program.

Including the warn option gives you warnings. As with any import list, when you specify something you override all defaults. If you still want those defaults, you have to list them explicitly:

no autovivification qw(fetch delete exists warn);

use Data::Dumper;

my $cats;

delete $cats->{Buster};

print Dumper( $cats );

Now you get a warning:

Reference was vivified at auto.pl line 7.
$VAR1 = undef;

If you want to stop the program, you use strict instead of warn:

no autovivification qw(fetch delete exists strict);

use Data::Dumper;

my $cats;

delete $cats->{Buster};

print Dumper( $cats );

Your program stops at the point that Perl tries to autovivify something and gives you an file and line where autovivification stopped your program:

Reference vivification forbidden at auto.pl line 7.

When you store a value, however, autovivification doesn’t get in your way, even if you create a deep one:

no autovivification;

use Data::Dumper;

my $cats;

$cats->{Buster}{count} = 9;

print Dumper( $cats );

You still autovivified the structure:

$VAR1 = {
          'Buster' => {
                        'count' => 9
                      }
        };

If you want to turn off autovification for storing a variable, you have to explicitly disallow it. As before, you have to provide the other default options that you’d like to keep:

no autovivification qw(fetch delete exists store strict);

use Data::Dumper;

my $cats;

$cats->{Buster}++;

print Dumper( $cats );

Now, when you try to autovivify a hash reference to store a value in the Buster key, your program stops (because you are also using strict:

Reference vivification forbidden at auto.pl line 7.

Lexical scope

You can limit the effect of autovivification pragma to part of your program by using it in only in the scope where you need it. There are two ways that you can do this, depending on what you need.

First, you can turn off all autovivification for the entire file (which is itself a scope, see Know what creates a scope), but re-enable it for the bits where you need it:

no autovivification qw(fetch delete exists store strict);

# can't do it in here

{
use autovivification qw(store);

...; # allow some of it in here
}

# can't do it in here, either

Second, you can leave Perl’s normal autovivification features in place for the entire file, but turn it off in the sensitive parts of the program, just as in Item 100. Use lexical warnings to selectively turn on or off complaints (but in this case, engaging safety features):

...; # program with possible autovivification

{
# can't do it in here
no autovivification qw(fetch delete exists store strict);

}

...; # program with possible autovivification

Things to remember

  • The autovivification pragma lets you disable Perl’s automatic data structure creation
  • The defaults do not prevent you from assigning to an autovivified reference
  • You can enable warnings or errors to catch autovivification events

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit

Modify XML data with XML::Twig

If you need to deal with XML, first, we’re very sorry. Maybe you did something wrong if a previous life, such as munging XML with regular expressions. If you do better in this life, perhaps you won’t have to deal with XML in the next one. That right thing might be using XML::Twig, a powerful package for walking an XML tree, each part of which is a twig. For the rest of this Item, I’ll just call the module Twig.

There are two ways you can interact with your XML data using Twig. You can let Twig modify the data as it parses it, or you can parse it and modify afterward. If you have very large data, you might want to keep very little of it in memory, so you modify it as soon as you can and unload (or flush) that part of the data as you move onto the next part. If the data are small, or you need to know all of that data before you make a change, you might want to parse them completely before you start munging.

The basic Twig program, no matter which way you want to change the data, creates a new Twig object, parses the data, and flushes the output:

use XML::Twig;

my $twig = XML::Twig->new( ... );

$twig->parse( *FILE_HANDLE );

...;

$twig->flush;

Based on what you set up in the object, the parse portion can do various things, including adding to and pruning from the XML tree, renaming tags and attributes, replacing values, and almost anything else that you can program (which is just about anything). It also lets you set up non-parsing and non-munging options.

Suppose you have this very simple XML document, which has vague and generic tags. Since the data are names of cats, you want to change the tags to tags:

<?xml version="1.0"?>
<root>
	<item>Buster</item>
	<item>Mimi</item>
	<item>Roscoe</item>
	<item>Ginger</item>
	<item>Ella</item>
</root>

Start small and build up what you want. The easiest Twig problem does nothing. In this case, you use the basic structure and take the data from standard input.

use XML::Twig;

my $twig = XML::Twig->new(
	);

$twig->parse( *STDIN );
$twig->flush;

Twig processes the input, and since you didn’t tell Twig to do anything, it doesn’t. When you flush, you see the same XML data, although with the insignificant whitespace removed:

<?xml version="1.0"?>
<root><item>Buster</item><item>Mimi</item><item>Roscoe</item><item>Ginger</item><item>Ella</item></root>

That’s a bit annoying. Figuring out how to fix this is also a bit annoying because there are so many ways that you can configure Twig. In this case, you might want to enable pretty-printing and use the indented format:

use XML::Twig;

my $twig = XML::Twig->new(
	pretty_print => 'indented',
	);

$twig->parse( *DATA );
$twig->flush;

Now the output comes back a bit nicer, although still slightly different, but looks less of a mess:

<?xml version="1.0"?>
<root>
  <item>Buster</item>
  <item>Mimi</item>
  <item>Roscoe</item>
  <item>Ginger</item>
  <item>Ella</item>
</root>

Munge as you parse

Now that you know the basic structure of a Twig program, you can move on to the real task, changing those item tags to cat tags instead. In this section, you’ll transform those tags as you parse them.

You can set handlers to transform the data happens in handlers which you can attach to tags. The new takes a twig_handlers key that has a hash reference as an argument:

use XML::Twig;

my $twig = XML::Twig->new(
	pretty_print  => 'indented',
	twig_handlers => {
		...   # handlers go here
		},
	);

$twig->parse( *DATA );
$twig->flush;

Each key in the twig_handlers hash is the tag name you want to handle and the value is a reference to a subroutine that gets that part of the data in $_. Anything you do only affects that part of the data. Twig provides many (many!) methods for access and munging data, and in this case, you use the set_tag method to change the name of the tag:

use XML::Twig;

my $twig = XML::Twig->new(
	pretty_print => 'indented',
	twig_handlers => {
		item => sub { $_->set_tag( 'cat' ) },
		},
	);

$twig->parse( *DATA );
$twig->flush;
<?xml version="1.0"?>
<root>
  <cat>Buster</cat>
  <cat>Mimi</cat>
  <cat>Roscoe</cat>
  <cat>Ginger</cat>
  <cat>Ella</cat>
</root>

You don’t like the root name either, so you can change that to animals with another handler:

use XML::Twig;

my $twig = XML::Twig->new(
	pretty_print => 'indented',
	twig_handlers => {
		item => sub { $_->set_tag( 'cat' ) },
		root => sub { $_->set_tag( 'animals' ) },
		},
	);

$twig->parse( *DATA );
$twig->flush;

Now the data aren’t so generic:

<?xml version="1.0"?>
<animals>
  <cat>Buster</cat>
  <cat>Mimi</cat>
  <cat>Roscoe</cat>
  <cat>Ginger</cat>
  <cat>Ella</cat>
</animals>

Great. That was easy. Go one step further though. Suppose you have a hash of microchip numbers for each animal and you want to add that to your data as an attribute of cat, but only if the cat has an entry in the hash. You can get the data between the opening and closing tag with the text method then use set_att to add attributes and values:

use XML::Twig;

my %microchips = qw(
	Buster  123456
	Mimi    369120
	);

my $twig = XML::Twig->new(
	pretty_print => 'indented',
	twig_handlers => {
		root => sub { $_->set_tag( 'animals' ) },
		item => sub {
			$_->set_tag( 'cat' );
			my $cat = $_->text;
			$_->set_att( microchip => $microchips{$cat} )
				if exists $microchips{$cat};
			},
		},
	);

$twig->parse( *DATA );
$twig->flush;

The data are a bit more fancy now that you’ve combined the microchip data:

<?xml version="1.0"?>
<animals>
  <cat microchip="123456">Buster</cat>
  <cat microchip="369120">Mimi</cat>
  <cat>Roscoe</cat>
  <cat>Ginger</cat>
  <cat>Ella</cat>
</animals>

Finally, you want to remove the cats who have passed away (sadly, rest in peace). You can prune parts of the the tree with delete, which removes that element:

use XML::Twig;

my %microchips = qw(
	Buster  123456
	Mimi    369120
	);

my %deceased = map { $_, 1 } qw(Roscoe);

my $twig = XML::Twig->new(
	pretty_print => 'indented',
	twig_handlers => {
		root => sub { $_->set_tag( 'animals' ) },
		item => sub {
			$_->set_tag( 'cat' );
			my $cat = $_->text;
			$_->delete if exists $deceased{$cat};
			$_->set_att( microchip => $microchips{$cat} )
				if exists $microchips{$cat};
			},
		},
	);

$twig->parse( *DATA );
$twig->flush;

Now the record for Roscoe is missing since you removed it:

<?xml version="1.0"?>
<animals>
  <cat microchip="123456">Buster</cat>
  <cat microchip="369120">Mimi</cat>
  <cat>Ginger</cat>
  <cat>Ella</cat>
</animals>

You’re still not satisfied, though. You want to make each cat’s name live in a new name tag that’s a child of cat. To do that, you can set the text of cat to the empty string, then insert a new element called name whose text is the cat’s name:

use XML::Twig;

my %microchips = qw(
	Buster  123456
	Mimi    369120
	);

my %deceased = map { $_, 1 } qw(Roscoe);

my $twig = XML::Twig->new(
	pretty_print => 'indented',
	twig_handlers => {
		root => sub { $_->set_tag( 'animals' ) },
		item => sub {
			$_->set_tag( 'cat' );
			my $cat = $_->text;
			$_->delete if exists $deceased{$cat};
			$_->set_att( microchip => $microchips{$cat} )
				if exists $microchips{$cat};
			$_->set_text( '' );
			$_->insert_new_elt( 'name', $cat );
			},
		},
	);

$twig->parse( *DATA );
$twig->flush;

Now your data look a bit more interesting:

<?xml version="1.0"?>
<animals>
  <cat microchip="123456"><name>Buster</name></cat>
  <cat microchip="369120"><name>Mimi</name></cat>
  <cat><name>Ginger</name></cat>
  <cat><name>Ella</name></cat>
</animals>

Work with the whole tree at once

There’s another way that you can do this. Instead of defining handlers and modifying the tree as you parse it, you can parse the entire data first and modify afterward. You start with the same basic structure, but with an addition. After you create the twig, you get the tip of the tree with root method. That serves as the starting point for your work:

use XML::Twig;

my %microchips = qw(
	Buster  123456
	Mimi    369120
	);

my %deceased = map { $_, 1 } qw(Roscoe);

my $twig = XML::Twig->new(
	pretty_print => 'indented',
	);
$twig->parse( *DATA );

my $root = $twig->root;

$twig->flush;

The first thing you do is change the root element name, just as you did before with the set_name attribute:

use XML::Twig;

my %microchips = qw(
	Buster  123456
	Mimi    369120
	);

my %deceased = map { $_, 1 } qw(Roscoe);

my $twig = XML::Twig->new(
	pretty_print => 'indented',
	);
$twig->parse( *DATA );

my $root = $twig->root;
$root->set_name( 'animals' );

$twig->flush;

Now you want to move on to the item tags. You have to climb the tree yourself, though. Once you have one element, you can get its children. in this program, you do the same thing, but after you parse the entire XML structure. The children method gets you to the next level of tags:

use XML::Twig;

my %microchips = qw(
	Buster  123456
	Mimi    369120
	);

my %deceased = map { $_, 1 } qw(Roscoe);

my $twig = XML::Twig->new(
	pretty_print => 'indented',
	);
$twig->parse( *DATA );

my $root = $twig->root;
$root->set_name( 'animals' );

foreach my $item ( $root->children( 'item' ) ) {
	$item->set_name( 'cat' );
	my $cat = $item->text;
	$item->set_text( '' );
	$item->insert_new_elt( 'name', $cat );
	$item->set_att( microchip => $microchips{$cat} )
		if exists $microchips{$cat};
	}	

$twig->flush;

There is a lot more that you can do with Twig, which provides a long list of methods to access various parts of the tree and another long list of methods to modify them in interesting ways. Some of these might show up as future Items.

Things to remember

  • Modify XML with XML::Twig, not regular expressions.
  • You can modify the XML data as you parse it with Twig handlers.
  • You can modify the XML data after you parse it by climbing the tree yourself.

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit

Some special Unicode shell aliases to normalize strings

If you are playing with Unicode, you’re probably going to want to convert to the various normalization forms. There are some programs to do this in the Unicode::Tussle distribution, but you can also create some one-liners to do this as well (Item 120. Use Perl one-liners to create mini programs).

If you want to read and write lines, you can use the -n switch to wrap a while loop around your tiny program. In this case, those tiny programs just call a normalization function from Unicode::Normalize. Here are the bash aliases:

alias nfc="perl5.14.1 -MUnicode::Normalize -CS -ne 'print NFC(\$_)'"
alias nfd="perl5.14.1 -MUnicode::Normalize -CS -ne 'print NFD(\$_)'"
alias nfkd="perl5.14.1 -MUnicode::Normalize -CS -ne 'print NFKC(\$_)'"

You can run these as if they were programs with those names. Here you convert those ligature characters, ? (U+FB01) and ? (U+FB02), to their compatible, two-character forms fi and fl as it reads standard input:

$ nfkd
Let's ?nd that ?ying squirrel!
Let's find that flying squirrel!

If you wanted to do it with command line arguments as strings instead of files, it’s a couple small changes. You can add the A flag to the -C switch to interpret the command-line arguments as UTF-8 (unless you want to decode it yourself), and use say to add the newline in the output:

alias nfc="perl5.14.1 -MUnicode::Normalize -CSA -E 'say NFC( qq(@ARGV) )'"
alias nfd="perl5.14.1 -MUnicode::Normalize -CSA -E 'say NFD( qq(@ARGV) )'"
alias nfkd="perl5.14.1 -MUnicode::Normalize -CSA -E 'say NFKC( qq(@ARGV) )'"

The output decomposes the ligatures just as before:

nfkd "Let's ?nd that ?ying squirrel."
Let's find that flying squirrel.

You can read more about these program features in Item 73. Tell Perl which encoding to use and Item 77. Work with graphemes instead of characters.

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit

Fix Test::Builder’s Unicode issue

The perl interpreter is getting much better with its Unicode support, but that doesn’t mean everything just works because most of the code you probably are about is in modules, which might not have kept up. Some of this becomes apparent when you give another module some Unicode strings for it to output.

For instance, the latest release of Test::Builder module has a known issue with filehandle layers (Google Code or Github). Write a short Test::More program and use some wide characters (those with code numbers above 255). Since your source code has these literal wide characters, you need to utf8 to let perl know that it should interpret your source as UTF-8 instead of Latin-1 (Item 72. Use Unicode in your source code):

use Test::More; # 0.98 or less

use utf8;

ok( 1, 'My ? is melting!' );
ok( 1, 'My box has a ?!' );

done_testing;

When you run this, using Test::More 0.98 or lower, you get wide character warnings:

Wide character in print at /usr/local/perls/perl-5.14.1/lib/5.14.1/Test/Builder.pm line 1759.
ok 1 - My ? is melting!
Wide character in print at /usr/local/perls/perl-5.14.1/lib/5.14.1/Test/Builder.pm line 1759.
ok 2 - My ? is melting!
1..2

This is a problem only with the Test::Builder that comes with Test-Simple-0.98 or lower; Test-Simple-2.00 and later, a big rewrite of most of the underpinnings, solves this problem.

You might try various things to turn off those warnings, but you can’t use the warnings pragma because that only affects the state of warnings in the lexical scope, but the warning comes from another file:

use utf8;
use open IO => ':encoding(UTF-8)';

use Test::More; # 0.98 or less

no warnings; # won't work

ok( 1, "My ? is melting!" );
ok( 1, 'My box has a ?!' );

done_testing;

If warnings can’t do the job, the old school Perlers might reach for $^W, the global state of warnings, but that doesn’t work either (and it’s poor practice except in the most extreme cases):

use utf8;
use open IO => ':encoding(UTF-8)';

use Test::More; # 0.98 or less

$^W = 0; # won't work either

ok( 1, "My ? is melting!" );
ok( 1, 'My box has a ?!' );

done_testing;

You might try to make turn on the utf8 on all filehandles (Item 73. Tell Perl which encoding to use):

use Test::More; # 0.98 or less

use utf8;
use open IO => ':encoding(UTF-8)';

ok( 1, "My ? is melting!" );
ok( 1, 'My box has a ?!' );

done_testing;

That appears not to work. Maybe you think that it’s an ordering problem because you load Test::More first. That’s a common thing to do in test scripts because that’s the purpose of the code. So you switch the order, but it still doesn’t work:

use utf8;
use open IO => ':encoding(UTF-8)';

use Test::More; # 0.98 or less

ok( 1, "My ? is melting!" );
ok( 1, 'My box has a ?!' );

done_testing;

You can back up a step back. The problem disappears if you set the filehandles with the -C command-line switch, using the S to set the standard filehandles to UTF-8. However, this only works with at least Perl 5.14, so this won’t work with Perl 5.10 or Perl 5.12. Now you get no warnings:

$ perl5.14.1 snowman.t
Wide character in print at /usr/local/perls/perl-5.14.1/lib/5.14.1/Test/Builder.pm line 1759.
ok 1 - My ? is melting!
Wide character in print at /usr/local/perls/perl-5.14.1/lib/5.14.1/Test/Builder.pm line 1759.
ok 2 - My box has a ?!
1..2

$ perl5.14.1 -CS snowman.t
ok 1 - My ? is melting!
ok 2 - My box has a ?!
1..2

You can also set the PERL_UNICODE environment variable to the empty string, which has the same effect as -CSDL. This is not one of the options that we showed in Item 73. Again, this works only in Perl 5.14:

$ env PERL_UNICODE='' perl5.14.1 snowman.t
ok 1 - My ? is melting!
ok 2 - My box has a ?!
1..2

That is, if perl sets up the filehandles right away, you don’t have a problem. If you wait until the program has started, you’re out of luck. And, if you aren’t using Perl 5.14, you’re out of luck either way.

Fortunately, there’s a very easy fix because Test::Builder, the workhorse behind Test::More is set up in a way that lets you easily fix these sorts of problem (Item 55. Make flexible output and Hide low-level details behind an interface). You can access the builder object that Test::More uses so you can affect them:

use utf8;
use open IO => ':encoding(UTF-8)';

use Test::More; # 0.98 or less

foreach my $method ( qw(output failure_output) ) {
	binmode Test::More->builder->$method(), ':encoding(UTF-8)';
	}

ok( 1, "My ? is melting!" );
ok( 1, 'My box has a ?!' );

done_testing;

Now you get no warnings across the two supported Perl versions and the latest unsupported version:

$ perl5.10.1 snowman.t
ok 1 - My ? is melting!
ok 2 - My box has a ?!
1..2

$ perl5.12.2 snowman.t
ok 1 - My ? is melting!
ok 2 - My box has a ?!
1..2

$ perl5.14.1 snowman.t
ok 1 - My ? is melting!
ok 2 - My box has a ?!
1..2

Once Test::Builder 2.0 is a stable release and most people are using it, you might not have to play these games. However, you can’t always completely control which versions other people use, so you might have to play so version games (similar to Item 83. Limit your distributions to the right platforms):

use utf8;
use open IO => ':encoding(UTF-8)';

use Test::More;

if( Test::Builder->VERSION < 2 ) {
	foreach my $method ( qw(output failure_output) ) {
		binmode Test::More->builder->$method(), ':encoding(UTF-8)';
		}
	}

ok( 1, "My ? is melting!" );
ok( 1, 'My box has a ?!' );

done_testing;

That’s not pretty, but it gets the job done. Fortunately, Test::More gives you a way to do that. For other modules, you might have to play more extreme games.

Things to remember

  • Test::Builder 0.98 and lower has a problem with Perl’s IO layers.
  • Test::Builder 2.0 already fixes this issue, but there isn’t a stable release yet.
  • Access Test::More‘s builder object and set the filehandle layers that you need.

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit