Use /gc and \G in matches to separate alternations in separate, smaller patterns

Perl keeps track of the last position in a string where it had a successful global match (using the /g flag). You can access this position with the pos operator. With Perl 5.10, you can use the /p switch to get the per-match variable ${^MATCH} instead of the performance-dampening $&.

Know the difference between regex and match operator flags

The match and substitution operators, as well as regex quoting with qr//, use flags to signal certain behavior of the match or interpretation of the pattern. The flags that change the interpretation of the pattern are listed in the documentation for qr// in perlop (and maybe in other places in earlier versions of the documentation).

Process XML data with XML::Twig

People often reach for regular expressions to extract and rearrange information in XML documents. Those usually only work for the limited test cases people specifically target, but are really little time-bombs waiting to go off when the data or the format changes even slightly. The bomb often explodes after the original programmer has disappeared.

Make links to per-version tools

In Item 110: Compile and install your own perls, we showed you how to compile and install several versions of perl so that they don’t conflict with each other and you can use them simultaneously. Since they don’t install their programs, they are left in their $prefix/bin directories. With several perls, each of which has its own module directories, using tools such as cpan and perldoc can get confusing. Which version of those tools are you using and which perl are they trying to use?

Avoid accidentally creating methods from module exports

Perl’s object system is fuzzy. Methods are really just subroutines and classes are just packages, which means that any subroutine in a package is also a method in that class. Your class might have subroutines that you’ve never even noticed, so you end up with methods that you didn’t want in your interface.

Know how Perl handles scientific notation in string to number conversions.

A recent question on Stack Overflow asked about the difference between the same floating-point numbers being stored in scientific notation and written out. Why does 0.76178 come out differently than 7.6178E-01? When Perl stores them, they can come out as slightly different numbers. This is related to the perlfaq answer to Why am I getting long decimals (eg, 19.9499999999999) instead of the numbers I should be getting (eg, 19.95)?, but a bit more involved. You’ll see how to skip the whole mess at the end, but be patient.

Effective Perl Programming is in Rough Cuts

Effective Perl Programming is now in Rough Cuts (although the book is done and in production). If you already have a Safari Books Online account you should be able to see it in full. If you don’t have an account, convince your employer to buy one for you and your coworkers. You can see some of the book without an account though.

Although you can leave comments for us directly in Rough Cuts, you can also just tell us directly.

Watch out for disappearing strings when you decode

In the Effective Perl class I gave at Frozen Perl last week, I got a question I didn’t have the quick answer to. What happens to the strings when Encode's decode function only partially decodes the string?

The default behavior for decode always decodes the entire string, although it uses a substitution character (U+FFFD, which may look like ? on the screen) anywhere that it finds an error in the encoding.
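As a minimal sketch of that default behavior, you can call decode with no third argument on a string that has one bad octet in the middle. The variable names here are only for illustration, and the 'utf8' encoding name is the same one used in the rest of this post:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Nine octets; the \xCC in the middle is not valid UTF-8 here
my $octets = "\x41\x42\x43\x61\xCC\x61\x41\x42\x43";

# No third argument, so decode uses its default handling
my $decoded = decode( 'utf8', $octets );

# Show each decoded character as a hex code point
my $hex = join ':', map { sprintf '%X', ord } split //, $decoded;
print "$hex\n";
```

You should see FFFD in the position where the bad octet was, with every character around it decoded normally.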

You can change how decode handles problems by supplying a third argument to it, using one of the constants FB_DEFAULT, FB_CROAK, FB_WARN, or FB_QUIET. The FB_DEFAULT uses the substitution character and the FB_CROAK just dies. It’s the other two that are interesting. They stop decoding, either with a warning or without one. Try it yourself:

use 5.010;
use strict;
use warnings;
use Encode qw(decode :fallbacks);

binmode STDOUT, ":utf8";

foreach my $fallback ( qw( FB_DEFAULT FB_CROAK FB_WARN FB_QUIET ) )
	{
	my $fallback_value =  do { no strict 'refs'; &{"$fallback"} };
	
	my $octets  = do { use bytes; "\x41\x42\x43\x61\xCC\x61\x41\x42\x43" };
	my $decoded = eval { decode( 'utf8', $octets, $fallback_value ) };
	say "$fallback: ", show_chars( $decoded ), " [$octets]";
	}


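sub show_chars {
	# Report each character's code point in hex, or note an undefined value.
	# No "use bytes" here, so split // and ord see characters, not octets.
	defined $_[0] ?
		join( ':', map { sprintf "%X", ord } split //, $_[0] )
			:
		'undefined';
	}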

The string you’re using is "\x41\x42\x43\x61\xCC\x61\x41\x42\x43". It’s “ABCa.aABC” where that "\xCC" in the middle is an error. It’s the start of a two-octet UTF-8 sequence (the kind that encodes a combining character), but it doesn’t have a valid continuation octet following it. When you print it, it looks a bit odd (ABCaÃŒaABC) because Perl is treating it as bytes since you specified use bytes; in the scope where you created it.

The output shows the fallback type, the characters (in hex separated by colons), and in the square brackets, the value of $octets after the operation:

FB_DEFAULT: 41:42:43:61:FFFD:61:41:42:43 [ABCaÃŒaABC]
FB_CROAK: undefined [ABCaÃŒaABC]
FB_WARN: 41:42:43:61 [ÃŒaABC]
FB_QUIET: 41:42:43:61 [ÃŒaABC]
utf8 "\xCC" does not map to Unicode at ...

In the FB_DEFAULT case, the \xCC turned into the substitution character, U+FFFD. Notice that the split // worked on characters, so the substitution character shows up as a single four-digit hex value instead of as separate octets.

In the FB_CROAK case, the decode dies, the return value is undef, and $octets stays the same. decode doesn’t mess with the argument at all.

Both FB_WARN and FB_QUIET do the same thing, although FB_WARN whines about it. They each tell decode to handle as much of the string as it can. When it finds an error, it returns what it had so far (represented by 41:42:43:61, which is ABCa). However, it also removes that part from the input string, leaving only the part of the string from the error onward. This gives you a chance to examine the string where decode left off so you can decide what to do on your own. You might remove the offending octets and start the processing again.
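Here’s one way that restart might look, as a rough sketch and not the only reasonable policy. The decode_in_steps name is made up for this example, and the choice to drop exactly one leftover octet and put U+FFFD in its place is an assumption; you might prefer to log the problem or give up:

```perl
use strict;
use warnings;
use Encode qw(decode :fallbacks);

# Hypothetical helper: decode what you can, and each time decode stops,
# swap the first leftover octet for U+FFFD and try again.
sub decode_in_steps {
	my( $octets ) = @_;   # a copy, so the caller's string is untouched
	my $result = '';

	while( length $octets ) {
		# FB_QUIET returns the good prefix and leaves the rest in $octets
		$result .= decode( 'utf8', $octets, FB_QUIET );
		last unless length $octets;

		# This policy is an assumption; you could log or die here instead
		substr( $octets, 0, 1, '' );
		$result .= "\x{FFFD}";
		}

	return $result;
	}

my $decoded = decode_in_steps( "\x41\x42\x43\x61\xCC\x61\x41\x42\x43" );
my $hex = join ':', map { sprintf '%X', ord } split //, $decoded;
print "$hex\n";
```

Each pass through the loop either finishes the string or removes at least one octet, so the loop always ends. Since the subroutine copies its argument into a lexical variable, you can also pass it a string literal without running into the read-only problem.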

It’s documented that decode changes its input, but not right next to the main documentation for that function. You have to read the “Handling Malformed Data” section later in the Encode docs.

You might notice the problem if you try to decode a string literal:

use Encode qw(decode :fallbacks);
my $decoded = decode( 'utf8', "\x61\xCC\x61", FB_WARN );

You get the error about modifying a read-only value:

Modification of a read-only value attempted ...

If you don’t want decode to mess with your argument, you can use a bitmask to adjust the fallback value. decode looks for the LEAVE_SRC bit to be set (and it only matters for FB_WARN and FB_QUIET), so just OR it in:

use Encode qw(decode :fallbacks LEAVE_SRC);
my $decoded = decode( 'utf8', "\x61\xCC\x61", FB_WARN | LEAVE_SRC );

If you want to keep the original octet sequence, save a copy before you pass it to decode.
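A small sketch of the copy approach, with variable names chosen only for this example. decode chews on the copy, so you can compare what is left over against the untouched original:

```perl
use strict;
use warnings;
use Encode qw(decode :fallbacks);

my $octets = "\x61\xCC\x61";

# Work on a copy so the original octets survive the decode
my $copy    = $octets;
my $decoded = decode( 'utf8', $copy, FB_QUIET );

printf "decoded %d character(s), %d octet(s) left over, original has %d octet(s)\n",
	length( $decoded ), length( $copy ), length( $octets );
```

The copy ends up holding only the part from the error onward, while $octets still has everything you started with.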