Use lookarounds to split to avoid special cases

There are some regular expression tricks that can help you deal with balanced delimiters in a string. The split command takes a pattern, removes the parts of a string that match that pattern, and gives you a list of the parts of the string between those separators. Said another way, split works when the parts you don’t need are between the values.

Single character separators are easy:

use v5.10;

my @letters = split /:/, 'a:b:c:d:e';
say "@letters";

The list comes out just as you expect:

a b c d e

Even multiple or variable width patterns are fine:

use v5.10;

my @cats = split /\s+/, 'Buster
	Mimi     Roscoe';
say "@cats";

The list comes out just as you expect:

Buster Mimi Roscoe

It gets more tricky when you have balanced delimiters, when there’s something that marks the start and the end of a value. The problem is that there is something in front of the first element and something after the last element. You can’t split on the pattern of characters between the values because you don’t remove everything:

use v5.10;

my @cats = split /></, '<Buster><Mimi><Roscoe>';
say "@cats";

The first and last delimiter characters are still attached to their values:

<Buster Mimi Roscoe>

You might be tempted to live with that and process those values after the split:

use v5.10;

my @cats = split /></, '<Buster><Mimi><Roscoe>';
$cats[0] =~ s/<//;
$cats[-1] =~ s/>//;
say "@cats";

Some people might be satisfied with that, and it does work, but it’s much better to remove the special cases. If you limit yourself to matching just the character that you want to remove, you’re a bit limited. One problem is the empty leading field that you get if you try to match the first delimiter character:

use v5.10;

my @cats = split /><|\A<|>\z/, '<Buster><Mimi><Roscoe>';

say "@cats";

There’s a space at the beginning of the output because there’s an empty leading field, but the list at least doesn’t have any of the delimiter characters:

 Buster Mimi Roscoe

To fix this, you still need to handle the leading field, perhaps by shifting it off. Again, this works, even if it’s unsightly:

use v5.10;

my @cats = split /><|\A<|>\z/, '<Buster><Mimi><Roscoe>';
shift @cats;

say "@cats";

The special processing isn’t as bad, but you have to remember to handle that one element.
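The asymmetry between the two ends of the string is easier to see if you count the fields. This is a small sketch, not part of the original example: by default split keeps the empty leading field but drops empty trailing fields, and a negative limit keeps both. The variable names are only for illustration.

```perl
use v5.10;
use strict;
use warnings;

my $string  = '<Buster><Mimi><Roscoe>';
my $pattern = qr/><|\A<|>\z/;

# Default behavior: the empty leading field stays, the empty
# trailing field is dropped
my @default = split $pattern, $string;

# A negative limit keeps the empty trailing field too
my @keep_all = split $pattern, $string, -1;

say scalar @default;     # 4: '', Buster, Mimi, Roscoe
say scalar @keep_all;    # 5: '', Buster, Mimi, Roscoe, ''

# One more special case: filter out the empty fields afterward
my @cats = grep { length } @default;
say "@cats";             # Buster Mimi Roscoe
```

Filtering with grep works here only because no real value is empty, which is one more thing you would have to remember.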

Instead of matching characters, you can use lookarounds to split on the middle of the balanced delimiter by using a zero-width assertion. The lookarounds match a condition in the string but do not consume any characters. These are conditions in the string, not characters to match.

If you use a lookbehind next to a lookahead, you can split on the position in the string where both conditions match. You want to match in the middle of a >< so the > ends up with the preceding element and the < stays with the succeeding element.

The positive lookbehind has the general form (?<=PATTERN). That pattern, which must be fixed-width, must match before the position. In this case, you want to match a > before the position, so the assertion is (?<=>).

The positive lookahead is almost the same thing, with the form (?=PATTERN). You want to match a < after the position, so your assertion is (?=<).

Putting them together, the lookbehind next to the lookahead, splits the values:

use v5.10;

my @cats = split /(?<=>)(?=<)/, '<Buster><Mimi><Roscoe>';

say "@cats";

The output list still has the delimiter characters, but now each element needs the same processing, so there are no special cases:

<Buster> <Mimi> <Roscoe>

Once you have the values in their own elements, you can remove the delimiters:

use v5.14;

my @cats =
	map { s/\A<|>\z//rg }    # return the modified value
	split /(?<=>)(?=<)/,
	'<Buster><Mimi><Roscoe>';

say "@cats";

That might seem a bit silly, but we’re only using a simple example to illustrate the point.
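If you do this often, you might wrap the two steps in a subroutine. The following is only a sketch under some assumptions: split_balanced is a hypothetical name, and it expects single-character opening and closing delimiters that never appear inside the values.

```perl
use v5.14;
use strict;
use warnings;

# Hypothetical helper: split a run of balanced, single-character
# delimited values, then strip the delimiters from every element
sub split_balanced {
    my( $string, $open, $close ) = @_;

    # Escape the delimiters so characters such as [ are taken literally
    my( $o, $c ) = map { quotemeta($_) } $open, $close;

    return
        map { s/\A$o|$c\z//gr }       # same processing for every element
        split /(?<=$c)(?=$o)/, $string;
}

say join '|', split_balanced( '<Buster><Mimi><Roscoe>', '<', '>' );
say join '|', split_balanced( '[a][b][c]', '[', ']' );
```

The first call prints Buster|Mimi|Roscoe and the second prints a|b|c. Neither the pattern nor the cleanup treats any element specially.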

Consider a slightly more complicated case, where the fields are quoted, but then separated by commas. Unless you’re learning to reinvent the wheel (a valid exercise to sharpen your skills), you should probably use a module (Item 115. Don’t use regular expressions for comma-separated values). For this example, you’ll do it yourself:

use v5.10;

my @cats =
	split /(?<="),(?=")/,
	'"Buster","Mimi","Roscoe"';

say "@cats";

This removes the commas, as long as they are between quotes. However, you leave the quotes in place so you don't treat the first and last values specially:

"Buster" "Mimi" "Roscoe"

To get rid of the quotes, you process each item in the same way:

use v5.14;

my @cats =
	map { s/\A"|"\z//rg }       # return the modified value
	split /(?<="),(?=")/,
	'"Buster","Mimi","Roscoe"';

say "@cats";

You might try to construct a more complicated regular expression to also remove the quotes, but that's going to be harder to read and maintain than doing it in two simple steps.

Things to remember

  • You don't have to remove delimiters in one step
  • You can use a lookbehind next to a lookahead to specify a position in a string


Understand why you probably don’t need prototypes

You should understand how Perl’s prototypes work, not so you’ll use them but so you won’t be tempted to use them. Although prototypes can solve some problems, they don’t solve the problems most people want.

Some languages, such as C, have function prototypes. You tell your function how many arguments it has and what sort they are, as well as what type of thing it returns:

char* some_function( int start, int length );

Java has a method signature:

public class SomeClass {
   public String mySubstr( String aString, int i, int j ) {
     ...
   }
 }

Aside from syntax-warping modules, such as Devel::Declare, or source filters, such as Filter::Simple, Perl doesn’t have that as a fundamental feature. Its subroutines and methods take lists and return lists (or a single item).

Perl does have prototypes as a compile-time aid, documented in perlsub. They aren’t there to ensure you give a subroutine particular sorts of arguments but to help the compiler figure out what you typed and how you want it to interpret it. Perl doesn’t require you to surround your arguments in parentheses, so prototypes give you a way to tell the compiler where the arguments start and end. Consider these examples, which you’ll understand by the end of this Item:

use utf8;

my $value = π +1;
my @array = ( sin π*5/2, cos π, 1, 2, 3 );

A lesser part of that includes the specification of Perl core types (scalar, array, hash, subroutine, or globs). Method calls, which require the parentheses to surround their arguments, completely ignore prototypes because Perl doesn’t need any help to parse them.
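You can check for yourself that method calls ignore prototypes. This is a minimal sketch, and Local::Cat and its count method are hypothetical names made up for the illustration:

```perl
use v5.14;
use strict;
use warnings;

package Local::Cat {                  # hypothetical class for illustration
    sub new       { bless {}, shift }
    sub count ($) { scalar @_ }       # prototype claims exactly one argument
}

my $cat = Local::Cat->new;

# The method call ignores the ($) prototype: the invocant and all
# three arguments arrive in @_, and perl does not complain
say $cat->count( 'Buster', 'Mimi', 'Roscoe' );    # 4
```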

You (optionally) specify the prototype after the subroutine name. The simplest prototype is the empty prototype, meaning the subroutine takes no arguments:

use utf8;

sub TRUE  () { 1 }
sub FALSE () { 0 }
sub π     () { 3.1415926 }

The simplest prototype

The empty prototype tells perl not to consume any arguments when it sees that function name. How does perl know how to interpret this?

use utf8;

say π +1;

Since π is a subroutine, it can take arguments. Since you wrote it without parentheses, perl needs a hint to parse that. It could be two forms, each with possibly different answers:

use utf8;

say π( +1 );
say π() + 1;

The empty prototype tells perl to parse it as π() + 1. This makes the empty prototype a way that you can declare constants.
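To see the two parses side by side, you can compare a subroutine with an empty prototype to one with no prototype at all. This sketch uses the ASCII names PI and first_arg, both invented here so the example doesn’t depend on the source encoding:

```perl
use v5.10;
use strict;
use warnings;

sub PI ()     { 3.1415926 }    # empty prototype: takes no arguments
sub first_arg { $_[0] }        # no prototype: takes whatever follows

my $with_prototype    = PI +1;           # parsed as PI() + 1
my $without_prototype = first_arg +1;    # parsed as first_arg( +1 )

say $with_prototype;       # 4.1415926
say $without_prototype;    # 1
```

The source text after each name is identical, yet perl parses it differently because of what it already knows about each subroutine.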

This means, however, that perl needs to know about the prototype before it parses that bit of code. Here that works because the complete definition, prototype included, appears before the call:

use utf8;

sub π () { 3.1415926 }
say π +1;   # 4.1415926

You don’t need to define the subroutine ahead of time, but you have to declare its prototype to get the behavior that you expect:

use utf8;

sub π ();
say π +1;   # 4.1415926

BEGIN {
*π = sub { 3.1415926 }
}

This isn’t a recommendation to write code like this, but it illustrates the point. Even with warnings turned off, you’ll still get a warning about the mismatch in prototypes:

use v5.10;
use utf8;

sub π ();
say π +1;   # 4.1415926

BEGIN {
*π = sub { 3.1415926 }
}

You can make the prototypes in the forward declaration and the full definition match:

use v5.10;
use utf8;

sub π ();
say π +1;   # 4.1415926

BEGIN {
	*π = sub (){ 3.1415926 }
	}

The prototype matters only for the calls to the subroutines after its definition. The prototype doesn’t matter for subroutine calls before its definition. However, to use the subroutine as a bareword, you still have to have a forward declaration:

use v5.10;
use utf8;

sub π;
say π +1;   # 3.1415926

sub π ();
say π +1;   # 4.1415926

BEGIN {
	*π = sub (){ 3.1415926 }
	}

Not only that, you can change the prototype as perl parses your program:

use v5.10;
use utf8;

sub π;
say π +1;

sub π ();
say π +1;   # 4.1415926

sub π ($$);
say π +1;

BEGIN {
	*π = sub (){ 3.1415926 }
	}

The sub π ($$) tells perl to expect two arguments for the subsequent calls. Since you don’t give it enough arguments, perl stops compiling:

Prototype mismatch: sub main::π () vs ($$) at proto.pl line 10.
Not enough arguments for main::π at proto.pl line 11, near "1;"
BEGIN not safe after errors--compilation aborted at proto.pl line 15.

More than zero arguments

To take one or more arguments, you just specify that number of items in the prototype. So far, you’ve only seen scalar arguments, which you specify as a $ in the prototype:

sub twofer    ($$);    # exactly two arguments
sub hat_trick ($$$);   # exactly three arguments

This does not mean that the subroutine gets that number of arguments. It does not mean that it takes two scalar variables as arguments. Perl parses each argument in scalar context:

use v5.10;

sub twofer ($$) { say "@_" };

my @array = qw( Buster Mimi Roscoe );
twofer @array, 2;

my %hash = map { $_ => 1 } 'a' .. 'z';
twofer %hash, 2;

The @array is a single argument, the first one, and is taken in scalar context, giving the number of elements in it. The %hash is treated in the same way, providing the mostly useless “hash statistics” scalar value (starting with Perl 5.26, a hash in scalar context gives its key count instead, so you’d see 26 there):

3 2
19/32 2

Putting a \ in front of a prototype character specifies that the argument is a named variable. Instead of the value, you get a reference to the value:

use v5.10;

sub twofer (\$) { say "@_" };

my $scalar = 'Buster';
twofer $scalar;

The output shows the reference:

SCALAR(0x10082e548)

If you try to give it a non-variable, perl complains at compile-time:

Type of arg 1 to main::twofer must be scalar (not constant item) at proto.pl line 6, near "2;"
Execution of proto.pl aborted due to compilation errors.
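The reference that a \$ prototype delivers lets a subroutine change the caller’s variable, much like some built-ins do. Here is a minimal sketch; trim_in_place is a hypothetical helper invented for this illustration:

```perl
use v5.10;
use strict;
use warnings;

# Hypothetical helper: because of the \$ prototype, @_ holds a
# reference to the caller's variable rather than a copy of its value
sub trim_in_place (\$) {
    my( $ref ) = @_;
    $$ref =~ s/\A\s+|\s+\z//g;    # strip leading and trailing whitespace
    return;
}

my $name = "  Buster \n";
trim_in_place $name;              # no explicit backslash needed
say "[$name]";                    # [Buster]
```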

A non-backslashed @ in a prototype specifies a list and forces list context on the rest of the arguments. It does not require an array argument:

use v5.10;

sub twofer (@) { say "@_" };

my @array = qw( Buster Mimi Roscoe );
twofer @array, 2;

my %hash = map { $_ => 1 } 'a' .. 'z';
twofer %hash, 2;

The output shows the normal list flattening behavior you expect from a Perl subroutine call. Notice that it does not care about the number or type of arguments:

Buster Mimi Roscoe 2
w 1 r 1 a 1 x 1 d 1 j 1 y 1 u 1 k 1 h 1 g 1 f 1 t 1 i 1 e 1 n 1 v 1 m 1 s 1 l 1 c 1 p 1 q 1 b 1 z 1 o 1 2

If you wanted to keep the array together, you would put a backslash in front of the @ to make it \@. The argument must be a named array, and not an anonymous array or a reference to an array. Even though the argument is an array, the value in @_ will be a reference to that array:

use v5.10;

sub twofer (\@$) { say "@_" };

my @array = qw( Buster Mimi Roscoe );
twofer @array, 2;

The output shows two arguments:

ARRAY(0x100827810) 2

You can’t then sneak in a scalar variable or a hash variable:

use v5.10;

sub twofer (\@$) { say "@_" };

my @array = qw( Buster Mimi Roscoe );
twofer @array, 2;

my %hash = map { $_ => 1 } 'a' .. 'z';
twofer %hash, 2;

perl catches that at compile-time:

Type of arg 1 to main::twofer must be array (not private hash) at proto.pl line 9, near "2;"
Execution of proto.pl aborted due to compilation errors.

If you want to take more than one type of argument at a particular position, you can specify the possible types in brackets. To take either an array or a hash, you use [@%]:

use v5.10;

sub twofer (\[@%]$) { say "@_" };

my @array = qw( Buster Mimi Roscoe );
twofer @array, 2;

my %hash = map { $_ => 1 } 'a' .. 'z';
twofer %hash, 2;

Now the subroutine takes either:

ARRAY(0x100827810) 2
HASH(0x10082e0c8) 2

If you want to take two separate arrays, you can use \@ twice:

use v5.10;

sub twofer (\@\@) { say "@_" };

my @array1 = qw( Buster Mimi Roscoe );
my @array2 = qw( Ginger Ellie );
twofer @array1, @array2;

You get one reference for each array:

ARRAY(0x100827810) ARRAY(0x10082dff0)

Even though you can specify the variable type with the backslashed form, you can’t specify anything about the values that they hold, or limits to the number of elements they contain.
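Since the backslashed prototypes hand you references, the subroutine has to dereference them itself. This is a small sketch with a hypothetical count_both subroutine; it also shows that calling with a leading & bypasses the prototype, so you then have to pass the references yourself:

```perl
use v5.10;
use strict;
use warnings;

# Hypothetical subroutine: each \@ becomes an array reference in @_
sub count_both (\@\@) {
    my( $first, $second ) = @_;
    return ( scalar @$first, scalar @$second );
}

my @cats = qw( Buster Mimi Roscoe );
my @dogs = qw( Ginger Ellie );

my( $n_cats, $n_dogs ) = count_both @cats, @dogs;
say "$n_cats cats, $n_dogs dogs";    # 3 cats, 2 dogs

# The & form ignores the prototype, so you take the references explicitly
my( $c, $d ) = &count_both( \@cats, \@dogs );
say "$c cats, $d dogs";              # 3 cats, 2 dogs
```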

You can also specify prototypes for subroutines and globs, which we’ll cover in a separate Item since you can have a lot more fun with those.

Optional arguments

So far, you’ve used prototypes that specify an exact number of elements. If you want to specify optional arguments, you can divide the mandatory and optional prototype characters with a semicolon. If you wanted to take at least two but possibly three arguments, you’d use the prototype ($$;$):

use v5.10;

sub hat_trick ($$;$) { say "@_" };

hat_trick 'Buster', 'Mimi';
hat_trick 'Buster', 'Mimi', 'Roscoe';

Both of those work just fine, but if you try to give it four arguments, you get a compilation error telling you that there are “Too many arguments”.

If you want a minimum number of arguments, but no maximum, you can use the @ as the optional argument. All of these are fine:

use v5.10;

sub hat_trick ($$;@) { say "@_" };

hat_trick 'Buster', 'Mimi';
hat_trick 'Buster', 'Mimi', 'Roscoe';
hat_trick 'Buster', 'Mimi', 'Roscoe', 'Ginger';
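An optional argument usually goes together with a default value, and the prototype built-in lets you ask perl what prototype a subroutine has. In this sketch, greet is a hypothetical subroutine made up for the example:

```perl
use v5.10;
use strict;
use warnings;

# Hypothetical subroutine: one mandatory and one optional scalar
sub greet ($;$) {
    my( $name, $greeting ) = @_;
    $greeting //= 'Hello';          # default when the optional argument is missing
    return "$greeting, $name";
}

say greet( 'Buster' );              # Hello, Buster
say greet( 'Mimi', 'Goodbye' );     # Goodbye, Mimi

# The prototype built-in reports the prototype as a string
say prototype( \&greet );           # $;$
```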

Here are some interesting prototypes from List::MoreUtils:

# from List::MoreUtils
sub each_array (\@;\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@)

sub natatime ($@)

sub mesh (\@\@;\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@)

A final warning

Subroutine prototypes exist chiefly so perl can parse calls to your subroutines just like it would its built-ins, without parentheses. They can set the context for the arguments or the variable types, but they can’t specify the sorts of values. Prototypes aren’t the tool that you want if checking the values of your arguments is your goal. It’s also easier to just use parentheses to mark your argument list.

Things to remember

  • Prototypes are not function signatures
  • Non-backslashed prototype characters enforce a context, not a type
  • Backslashed prototype characters enforce a variable type
  • You can specify optional arguments after a semicolon in the prototype


Return error objects instead of throwing exceptions

Programmers generally consider two types of error communication: the “modern” and shiny exception throwing, and the old and decrepit return values. When they consider these, they choose one and forsake the other. One is good, and the other is bad. Programmers won’t agree on which is which though.

The return value technique comes from older languages that had no other convenient way to do it:

my $result = some_function( @args );
if( $result ) {
	...     # handle error
	}
...         # continue with program

This has a problem because the values that you want to return for normal operation get mixed up with those that you want to use to signal the error. You could add some buffer argument to fill in the error, but that’s really annoying, especially since Perl doesn’t have function signatures:

my $error;
my $value = some_function( \$error, @args );
if( $error ) {
	...     # handle error
	}
...         # continue with program

To get around this, some languages created a third path for error messages. An exception signals a problem, jumps out of the current code context, and hopes that someone handles it (Item 101: Use die to generate an exception). In Perl, you can use an eval to trap this then inspect the $@ variable for the error (Know the two different forms of eval):

my $value = eval { failing_sub( @args ) };
if( my $error = $@ ) {
	...
	}
...         # continue with program

Actually handling an error from an eval is tricky, so some people recommend using something like Try::Tiny, even though behind its interface it’s doing essentially the same thing:

use Try::Tiny;

try { failing_sub( @args ) }
	catch { ... }    # handle error
	finally { ... }; # fail over

This still isn’t much better, even if it does its best to handle the trickiness of $@. When you compare it to the other examples you’ve seen so far in this Item, you can’t really tell the difference at the syntax level. You call something, then add code to check a value. Exceptions, as an ideal, might have merit. If your language started with them as a core concept (which Perl did not), you probably have the flexibility you need to use them effectively. A language with exception handlers can both handle the error and pick up at the point of the error. In Perl, once you get the exception, you don’t have any way to get back to the spot where you threw the exception. Essentially, you just have a fancy return value. Put a bit more strongly, you have a crude goto that doesn’t even preserve its context.
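A common way to sidestep the trickiness of $@ is to have the eval block return a true value and test that instead. This sketch also shows that execution continues after the eval, not after the die. The failing_sub here is only a stand-in that always fails:

```perl
use v5.10;
use strict;
use warnings;

# Stand-in for a subroutine that throws an exception
sub failing_sub {
    say 'before die';
    die "Something went wrong\n";
    say 'never reached';            # nothing resumes here after the die
}

my( $value, $error );

# The block returns 1 only if it reaches its last statement
my $ok = eval { $value = failing_sub(); 1 };

unless( $ok ) {
    $error = $@ || 'Unknown error';
    chomp $error;
    say "Caught: $error";
}

say 'The program continues after the eval';
```

Testing the return value of eval means you don’t depend on $@ being true to notice a failure.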

Most people have failed to consider something Perly instead. The return value and buffer examples are hold-overs from C-like thinking, and the exceptions are object-oriented envy. Or, more correctly, envy for particular implementations of an object-oriented concept.

Since Perl does not have subroutine signatures, you don’t get to declare what you will give to a subroutine or what you get back from a subroutine. You pass it a list, and you get back a list (even if that is one or no items). In a C-like language, you’d return only one kind of thing, which made the return result a problem. You could return a particular type of struct, but that struct has to be the same type for success and failure. You might be able to force that to work, but then every subroutine returns the same struct and you have to translate that into the right values to pass to other routines. Ugh.

Perl doesn’t care what you return, so why not return an error object in case of an error, and anything else otherwise? The fundamental feature of an object is identity: an object knows what it is. Unless you get an error object, everything worked. When you get the result, you could look for objects of the right type:

use Scalar::Util qw(blessed);

my @results = some_function( @args );
if( blessed($results[0]) && $results[0]->isa( 'Local::MyError' ) && $results[0]->is_error ) {
	...     # handle error
	}
...         # continue with program

It’s easier, though, just to assume it’s an object and call the is_error method. If it’s not an object, the method call fails inside the eval, which then returns false, so you don’t need to inspect the error that it trapped. This also lets you use any object that has the is_error interface:

my @results = some_function( @args );
if( eval{ $results[0]->is_error } ) {
	...     # handle error
	}
...         # continue with program

That error object could get fancy, too, with a given-when (although with a for, as in Use for() instead of given()):

my @results = some_function( @args );
if( eval{ $results[0]->is_error } ) {
	for ( $results[0]->type ) {
		when( 'output' )       { ... }
		when( 'no_database' )  { ... }
		when( 'bad_request' )  { ... }
		default                { ... }
		}
	}

You haven’t seen anything about the error object though, mostly because it doesn’t matter. Indeed, this particular interface doesn’t matter. You don’t need anything fancy. The error class just carries some data around. It doesn’t do anything with the data and it doesn’t interrupt your flow control:

package Local::MyError {

	sub new {
		my( $class, $type, $message ) = @_;

		bless {
			message => $message,
			type    => $type,
			caller  => [ caller(1) ],
			}, $class;
		}

	sub is_error { 1 }
	sub type     { $_[0]->{type} }
	}

When you need to communicate a failure, you return the error object:

sub some_function {
	...;
	open my $fh, '>', $filename or return Local::MyError->new( ... );
	...;
	}
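Putting the pieces together, here is a minimal runnable sketch of the technique. The safe_divide subroutine and the message accessor are hypothetical additions for the example, and the class is trimmed down from the one above:

```perl
use v5.14;
use strict;
use warnings;

package Local::MyError {
    sub new {
        my( $class, $type, $message ) = @_;
        return bless { type => $type, message => $message }, $class;
    }

    sub is_error { 1 }
    sub type     { $_[0]{type} }
    sub message  { $_[0]{message} }    # accessor added for this sketch
}

# Hypothetical function: returns a number on success and an
# error object on failure
sub safe_divide {
    my( $numerator, $denominator ) = @_;

    return Local::MyError->new( 'bad_request', 'division by zero' )
        if $denominator == 0;

    return $numerator / $denominator;
}

for my $pair ( [ 10, 2 ], [ 1, 0 ] ) {
    my( $result ) = safe_divide( @$pair );

    if( eval { $result->is_error } ) {    # false for a plain number
        say 'Error (', $result->type, '): ', $result->message;
    }
    else {
        say "Result: $result";
    }
}
```

The first pass prints Result: 5 and the second prints Error (bad_request): division by zero. The caller never has to trap an exception.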

Such a class isn’t limited to this particular technique either, so you can get more use out of it. If you still want to use exceptions, you can use the error object with die:

sub some_function {
	...;
	open my $fh, '>', $filename or die Local::MyError->new( ... );
	...;
	}

my @results = eval { some_function( @args ) };
my $error = $@;    # save $@ before the next eval resets it
if( eval { $error->is_error } ) {
	for ( $error->type ) {
		when( 'output' )       { ... }
		when( 'no_database' )  { ... }
		when( 'bad_request' )  { ... }
		default                { ... }
		}
	}

This is very similar to the example in the autodie documentation:

eval {
	use autodie;

	open(my $fh, '<', $some_file);

	my @records = <$fh>;

	# Do things with @records...

	close($fh);

};

given ($@) {
	when (undef)   { say "No error";                    }
	when ('open')  { say "Error from open";             }
	when (':io')   { say "Non-open, IO error.";         }
	when (':all')  { say "All other autodie errors."    }
	default        { say "Not an autodie error at all." }
}

Things to remember

  • Exceptions aren’t that different than return values
  • You can’t resume execution after throwing an exception
  • You can return an error object to signal failure


A Chinese translation of Effective Perl Programming

I mentioned a long time ago that a Chinese translation of Effective Perl Programming was in the works, and apparently it’s done. Someone sent me a copy of the Chinese version of the book. I can’t tell you who did it (if it’s you, let me know) and I don’t know where you can buy it (if you know, let me know). Also, I don’t know what I want to do with the copy that I have. I don’t read Chinese, so I can’t really read the book to see how well it translates, and I don’t want to keep the book as a trophy. Does someone else want the book? Is there a Chinese Perl event that would like to give it away as a prize? Josh and I will sign it and send it along.


Use lookarounds to eliminate special cases in split

The split built-in takes a string and turns it into a list, discarding the separators that you specify as a pattern. This is easy when the separator is simple, but seems hard if the separator gets more tricky.

For a simple example, you can split an entry from /etc/passwd (although the getpw* functions will do that for you):

root:*:0:0:System Administrator:/var/root:/bin/sh

The colons separate the fields, so you split on a colon:

my @fields = split /:/, $passwd_line;

That works just fine because the separator is a single character, that character is the same between each field, and the separator character doesn’t appear in any of the data.

A slightly more tricky example has a character from the separator also show up in the data. Consider comma-separated values that also allow a comma in the data. If you really have to do this, you would use a module (Item 115. Don’t use regular expressions for comma-separated values). However, this is a good task to illustrate some of the tricks in this Item. You might see these data stored in many ways. You are likely to see all the fields quoted if any one of them has the comma:

"Buster","Roscoe, Cat","Mimi"

You can split on ",", which separates all the fields:

my $string = q("Buster","Roscoe, Cat","Mimi");

my @fields = split /","/, $string;

$" = "\n";
print "@fields\n";

However, the first and last fields have remnants of the quoting:

"Buster
Roscoe, Cat
Mimi"

In this case, the simple split failed because it only removes text between the fields and doesn’t care at all about text at the beginning of the string or the end of the string.

You might think that you can make special cases to handle the beginning and end of the string bits. Creating special cases is almost always what you want to avoid: they make the code more complicated and they make you think about more than you really need to think about. Still, you can do that with alternations in the pattern:

my $string = q("Buster","Roscoe, Cat","Mimi");

my @fields = split /\A"|","|"\z/, $string;

$" = "\n";
print "@fields\n";

And it doesn’t work. The split keeps empty leading fields, so you get an extra field at the start:


Buster
Roscoe, Cat
Mimi

You could handle that by removing the first element, but that’s more duct tape and spit over the other kludge. Not only do you have two special cases in the pattern, but you have a special case in the output.

You don’t have to remove the quotes right away though. You can reduce all the special cases by not matching the quote characters in the split pattern. You can use a lookaround to find the commas surrounded by quotes:

my $string = q("Buster","Roscoe, Cat","Mimi");

my @fields = split /(?<="),(?=")/, $string;

$" = "\n";
print "@fields\n";

The positive lookbehind, (?<=...), is a zero-width assertion. It matches a pattern that exists (hence positive) but doesn't consume the characters it matches. You already know about other zero-width assertions, such as \b and ^. The lookbehind merely asserts a condition on the text before the current position. The positive lookahead, (?=...), is the same thing, but looks forward from that position.
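You can see that the lookarounds consume nothing by looking at the offsets where a match starts and ends. In this sketch, match_spans is a hypothetical helper that reports each match as start-end, using the built-in @- and @+ arrays:

```perl
use v5.10;
use strict;
use warnings;

# Hypothetical helper: report the start and end offset of every match
sub match_spans {
    my( $string, $pattern ) = @_;

    my @spans;
    while( $string =~ /$pattern/g ) {
        push @spans, "$-[0]-$+[0]";    # @- and @+ hold the match offsets
    }

    return @spans;
}

# Only the comma is consumed: the match is one character wide
say join ' ', match_spans( '"a","b"', qr/(?<="),(?=")/ );    # 3-4

# Lookarounds alone: the match has no width at all
say join ' ', match_spans( '<a><b>', qr/(?<=>)(?=<)/ );      # 3-3
```

In the first pattern the quotes on either side of the comma are required but are not part of the match, which is why split leaves them in the fields.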

Now all of the fields retain their quotes because the lookarounds do not consume the characters they match, even though they assert those characters must be there:

"Buster"
"Roscoe, Cat"
"Mimi"

You can easily strip off the quotes, handling every element returned by split in the same way:

use v5.14;
my $string = q("Buster","Roscoe, Cat","Mimi");

my @fields =
	map { s/\A"|"\z//gr }
	split /(?<="),(?=")/, $string;

$" = "\n";
print "@fields\n";

The pattern has no special cases, and the output from split has no special cases. Eliminating special cases reduces the number of things you have to remember and reduces the likelihood that you'll mess up one of the cases.

Buster
Roscoe, Cat
Mimi

What if the separator were even more complex, with a literal quote mark inside the data? If you can do that, you can imagine a quote character next to a comma in the field:

"Buster","Roscoe "","" Cat","Mimi"

Now you want to split on a comma with quotes around it, but only if it doesn't have two consecutive quotes on either side. You can combine the positive lookarounds with negative lookarounds. The negative versions act the same, but assert that the condition cannot match, just like a \B asserts that the position is not a word boundary:

use v5.14;
my $string = q("Buster","Roscoe "","" Cat","Mimi");

my @fields =
	map { s/"(?=")//gr }
	map { s/\A"|"\z//gr }
	split /(?<!"")(?<="),(?=")(?!"")/, $string;

$" = "\n";
print "@fields\n";

In processing the "", you use another positive lookahead to unescape the doubled double quote character:

Buster
Roscoe "," Cat
Mimi

As a final example, instead of quoted fields, you might see the non-separator comma as an escaped character:

Buster,Roscoe\, Cat,Mimi

In this case, you only want to split on a comma that does not have an escape character before it. You can't use a positive lookbehind because you don't want to match characters before the comma. Instead, you want a negative lookbehind because you want to assert that there are characters that can't appear before the comma. Instead of a =, you use a !:

use v5.14;
my $string = q(Buster,Roscoe\\, Cat,Mimi);

my @fields =
	map { s/\\(?=,)//gr }
	split /(?<!\\),/, $string;

$" = "\n";
print "@fields\n";

Again, you use another positive lookahead, (?=,), in the s/// so your substitution pattern does not match the character that you don't want to replace. Otherwise, you'd have to type the comma twice:

s/\\,/,/gr

You can go even further with these examples, creating much more ugly and complex examples with additional levels of quoting. This should naturally lead you to believe that regular expressions aren't the best tool for this (or at least a single regular expression).
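For comparison, here is what handing the simple quoted case to a module looks like. This sketch swaps in Text::ParseWords only because it comes with Perl; it is not a CSV parser, so don't expect it to treat a doubled quote as an escaped quote the way CSV does. For real comma-separated data, a dedicated CSV module is still the better choice.

```perl
use v5.10;
use strict;
use warnings;

use Text::ParseWords qw(parse_line);

my $string = q("Buster","Roscoe, Cat","Mimi");

# parse_line( DELIMITER, KEEP, LINE ): a false KEEP strips the quotes
my @fields = parse_line( ',', 0, $string );

say for @fields;
```

That prints Buster, then Roscoe, Cat, then Mimi, each on its own line, without any pattern of your own.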

Things to remember

  • If you really have to parse comma-separated values, use a module instead of writing your own patterns
  • Lookarounds assert a condition in the string without consuming any characters
  • The positive lookarounds assert their patterns must match
  • The negative lookarounds assert their pattern must not match
  • Use the lookarounds to eliminate special cases in complex split patterns


Enchant closures for better debugging output

When you’re using code references heavily, you’re going to have a problem figuring out which one of them is having a problem. You define them in possibly several and far-flung parts of your program, but when it comes to using them, you don’t know which one you are using. You can’t really print its value like you would for a scalar, making it more difficult for you to debug things. You can dereference a scalar or an array to see what it is, but you can’t dereference a code reference without making it do something.

Consider this code, which defines a code reference along with other references (also see Item 59. Compare reference types to prototypes). You can print the values in most reference types to see what they are, but you can’t do that directly with the code reference:

use v5.10;

my @array = ( \'xyz', [qw(a b c)], sub { say 'Buster' } );

foreach ( @array ) {
	say "$_";
	when( ref eq ref \ ''   ) { say "Scalar $$_" }
	when( ref eq ref []     ) { say "Array @$_" }
	when( ref eq ref sub {} ) { say "Sub ???" }
	}

When you dereference a value, nothing happens (aside from any tie magic). When you dereference a subroutine, you run its code with whatever arguments you give to it. If the subroutine needs arguments, which arguments would you use if you wanted to see what the subroutine would do?

There’s a clever way around this, first noted by Randal Schwartz in his Perlmonks post Track the filename/line number of an anonymous coderef. He proposed a new subroutine, main::Sub, to use in place of the sub keyword:

BEGIN {
  package MagicalCoderef;

  use overload '""' => sub {
    require B;

    my $ref = shift;
    my $gv = B::svref_2object($ref)->GV;
    sprintf "%s:%d", $gv->FILE, $gv->LINE;
  };

  sub main::Sub (&) {
    return bless shift, __PACKAGE__;
  }
}

This technique actually made the code reference an object so he could overload stringification.

my $s = Sub { say +shift };
print "$s\n";

This stringified the code reference as the filename and line number where you created it:

/Users/Buster/Desktop/magic_coderef.pl:19

You can go farther than this, though, and make this a bit more useful. In response to Randal’s post, I suggested turning the idea inside-out. Instead of using Sub, I exposed the object creation. That way, you don’t have to worry about which package Sub might be in:

use v5.14;

package MagicalCodeRef 0.90 {
    use overload '""' => sub
        {
        require B;

        my $ref = shift;
        my $gv = B::svref_2object($ref)->GV;
        sprintf "%s:%d", $gv->FILE, $gv->LINE;
        };

    sub enchant { bless $_[1], $_[0] }
    }

You can apply this magic to code references that you already have to get the same result (you could also do this with Sub, but it looks odd):

my $s = MagicalCodeRef->enchant( sub { say +shift } );
print "$s\n";
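Blessing the code reference doesn’t stop it from being a code reference. This sketch, which uses a hypothetical Local::LabeledCode package modeled on the class above, shows that you can still call the enchanted subroutine as usual and stringify it when you want to know where it came from:

```perl
use v5.14;
use strict;
use warnings;

package Local::LabeledCode {          # hypothetical name for this sketch
    use overload '""' => sub {
        require B;

        my $gv = B::svref_2object( $_[0] )->GV;
        sprintf '%s:%d', $gv->FILE, $gv->LINE;
    };

    sub enchant { bless $_[1], $_[0] }
}

my $doubler = Local::LabeledCode->enchant( sub { 2 * shift } );

say $doubler->( 21 );    # 42: still an ordinary code reference when called
say "$doubler";          # the file and line that B reports
```

Only stringification is overloaded, so the rest of your program can keep treating the value as a plain code reference.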

Still, that’s not good enough. You know where you created the subroutine, but that might not be enough information for you. You can use even more B magic. The B::Deparse module can decompile code to show you what perl thinks it is (we used this briefly in Item 7. Know which values are false and test them accordingly).

use v5.14;

package MagicalCodeRef 1.00 {
    use overload '""' => sub
        {
        require B;

        my $ref = shift;
        my $gv = B::svref_2object($ref)->GV;

        require B::Deparse;
        my $deparse = B::Deparse->new;
        my $code = $deparse->coderef2text($ref);

        my $string = sprintf "---code ref---\n%s:%d\n%s\n---",
        $gv->FILE, $gv->LINE, $code;

        };

    sub enchant { bless $_[1], $_[0] }
    }

With the same bit of code, you get additional output:

my $s = MagicalCodeRef->enchant( sub { say +shift } );
print "$s\n";

The output shows everything that perl thinks it needs to reproduce that behavior, including some pragma settings and compiler hints:

---code ref---
/Users/brian/Desktop/magic:25
{
    use strict 'refs';
    BEGIN {
        $^H{'feature_unicode'} = q(1);
        $^H{'feature_say'} = q(1);
        $^H{'feature_state'} = q(1);
        $^H{'feature_switch'} = q(1);
    }
    print shift();
}
---

The code doesn’t look the same as the code reference you initially created, but at least you have an idea what the code reference does.

If the code reference is a closure, you might also need to know which variables it closed over and what their values are. You can get these from the PadWalker module (which doesn’t come with Perl so you’ll need to get it from CPAN):

use v5.14;

package MagicalCodeRef 1.01 {
    use overload '""' => sub
        {
        require B;

        my $ref = shift;
        my $gv = B::svref_2object($ref)->GV;

        require B::Deparse;
        my $deparse = B::Deparse->new;
        my $code = $deparse->coderef2text($ref);

        require PadWalker;
        my $hash = PadWalker::closed_over( $ref );

        require Data::Dumper;
        local $Data::Dumper::Terse = 1;
        my $string = sprintf "---code ref---\n%s:%d\n%s\n---\n%s---",
        $gv->FILE, $gv->LINE,
        $code,
        Data::Dumper::Dumper( $hash );

        };

    sub enchant { bless $_[1], $_[0] }
    }

Give this new version of MagicalCodeRef a closure:

my $sub = do {
	my( $x, $y ) = qw( Buster Mimi );

	sub { print "$x $y @_" }
	};

my $s = MagicalCodeRef->enchant( $sub );
say $s;

Now you see which variables in the code reference refer to lexical variables that are out of scope instead of package or special variables. Only the lexical variables show up in the Dumper output:

---code ref---
/Users/brian/Desktop/magic:35
{
    use strict 'refs';
    BEGIN {
        $^H{'feature_unicode'} = q(1);
        $^H{'feature_say'} = q(1);
        $^H{'feature_state'} = q(1);
        $^H{'feature_switch'} = q(1);
    }
    print "$x $y @_";
}
---
{
  '$y' => \'Mimi',
  '$x' => \'Buster'
}
---

Things to remember

  • You can bless code references and overload their stringification to output what you like
  • The B module can tell you the filename and line number where you created the closure
  • The B::Deparse module can decompile a code reference
  • The PadWalker module can give you the closed-over
    lexical variables and their values.


Normalize your Perl source

Perl has had Unicode support since Perl 5.6, which means that most Perl tutorials have been bending the truth a bit when they tell you that a Perl identifier, the name that you give to variables, starts with [A-Za-z_] and continues with [0-9A-Za-z_]. With Unicode support, you have many more characters available to you, but moving outside the ASCII range has some problems. You can’t always tell what a variable name is just by looking at it (and this is a design bug in Perl: RT 96814). For instance, you don’t really know what this variable is:

use utf8;

my $résumé = 'http://www.example.com/resume.html';

If you wanted to use that variable later in your program, what would you type? It seems simple, but Unicode has two ways to represent the é glyph. It has the composed version, (U+00E9 LATIN SMALL LETTER E WITH ACUTE), and the decomposed version of two characters, (U+0065 LATIN SMALL LETTER E) and (U+0301 COMBINING ACUTE ACCENT). Depending on your editor setup, you might not even get the thing that you think you typed.

None of this would be a problem if Perl normalized the variable names for you. Every time that you typed é, no matter how you created that glyph, you would get the same representation in the source code. However, as of Perl 5.14, Perl does not do this for you. So, it’s a problem.

Consider how the next programmer knows what your variable name is. How many variables are in this script, and do you get a warning? What is the output of this simple program?

use utf8;
use 5.010;

my $é = 'abc';
my $é = '123';

$é = 'XYZ';

say "One char = ", $é;
say "Two char = ", $é;

Now, how about this program?

use utf8;
use 5.010;

my $é = 'abc';
my $é = '123';

$é = 'XYZ';

say "One char = ", $é;
say "Two char = ", $é;

There are two possible programs because there are two possible variables at line 7. You can’t tell just by looking at the source in your editor. Depending on which variable gets the XYZ assignment, you get different outputs:

One char = XYZ
Two char = abc
One char = 123
Two char = XYZ

There’s danger in this Item since you are reading it on the web and various things might have happened to the text as it made its way through databases and web servers and web browsers, any of which may have changed the source. Here’s the program that generates the two possible programs, depending on what time it is:

use 5.010;
use utf8;
use charnames qw(:full);

my $var = time % 2 ?
	"e\N{COMBINING ACUTE ACCENT}"
	:
	"\N{LATIN SMALL LETTER E WITH ACUTE}"; 

binmode STDOUT, ':encoding(UTF-8)';
print <<"PERL";
use utf8;
use 5.010;

my \$e\N{COMBINING ACUTE ACCENT} = 'abc';
my \$\N{LATIN SMALL LETTER E WITH ACUTE} = '123';

\$$var = 'XYZ';

say "One char = ", \$\N{LATIN SMALL LETTER E WITH ACUTE};
say "Two char = ", \$e\N{COMBINING ACUTE ACCENT};
PERL

The source is encoded as UTF-8, but it's unnormalized, meaning that the different ways to represent the same glyph show up in different forms. If someone uses the form that you didn't, they actually use a different variable. Cutting and pasting may not even be safe because that process might normalize it one way or the other. Your editor may normalize it for you (but leaving other parts alone). You need the program to use the same normalization.
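To see the two forms side by side without trusting what your editor or browser did to the text, you can spell the characters with escapes. This is a small sketch using the Unicode::Normalize module that comes with Perl; the variable names are only for illustration:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# The same word, written with escapes so the form is unambiguous
my $composed   = "r\x{E9}sum\x{E9}";        # U+00E9, twice
my $decomposed = "re\x{301}sume\x{301}";    # e followed by U+0301, twice

printf "composed:   %d characters\n", length $composed;
printf "decomposed: %d characters\n", length $decomposed;

# The raw strings differ, but they agree once both use the same form
my $equal_raw = $composed eq $decomposed           ? 'yes' : 'no';
my $equal_nfc = NFC($composed) eq NFC($decomposed) ? 'yes' : 'no';
my $equal_nfd = NFD($composed) eq NFD($decomposed) ? 'yes' : 'no';

print "equal as typed:  $equal_raw\n";
print "equal after NFC: $equal_nfc\n";
print "equal after NFD: $equal_nfd\n";
```

The strings have six and eight characters and compare as different until you normalize both the same way, which is exactly what happens to the two variable names.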

The simplest thing is making your editor handle it for you automatically, but if you can't do that, you might have to do it manually.

To change the normalization of a file, you can use the programs that come with Unicode::Tussle:

$ nfc program.pl > program-nfc.pl
$ nfd program.pl > program-nfd.pl

You could also make some Perl one-liners (in bash, in this case):

alias nfc="perl5.14.1 -MUnicode::Normalize -CS -ne 'print NFC(\$_)'"
alias nfd="perl5.14.1 -MUnicode::Normalize -CS -ne 'print NFD(\$_)'"
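Before you rewrite a file, you might want to know which lines would change. Here is a rough sketch of such a check; the helper name lines_not_in_nfc is made up for this example, and it works on a list of already-decoded lines rather than on a file so that it stays self-contained:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical helper: return the 1-based numbers of the lines
# that would change if you normalized them to NFC.
sub lines_not_in_nfc {
    my @lines = @_;
    return grep { $lines[ $_ - 1 ] ne NFC( $lines[ $_ - 1 ] ) }
        1 .. scalar @lines;
}

my @source = (
    "my \$r\x{E9}sum\x{E9} = 1;\n",        # already composed
    "my \$re\x{301}sume\x{301} = 2;\n",    # decomposed
    "print qq(plain ASCII\\n);\n",         # nothing to normalize
);

my @suspect = lines_not_in_nfc(@source);
print "Lines to normalize: @suspect\n";
```

Only the second line is reported, since it is the only one that NFC would alter.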

Beware, though. If you have parts of your file that need to be a particular normalization form, normalizing the entire file might change that. If you expect a string to be in NFD, perhaps to test a Unicode feature, changing the normalization will cause problems:

my $nfd_test_string = 'résumé'; # should be NFD.

However, if it's actually important for you to have a string in a particular form, you should enforce that explicitly instead of relying on the way you (or someone else) dealt with the file. You can force the normalization with Unicode::Normalize's subroutines:

use Unicode::Normalize qw(NFD);
my $nfd_test_string = NFD( 'résumé' ); # should be NFD.

Ideally, you'd handle this as part of your build process from your distribution directory so you don't have to think about it, but it's actually not simple to do that. There are two modules involved: ExtUtils::Install and ExtUtils::Manifest. The first copies files into blib in preparation for testing and installation. The second copies files listed in MANIFEST to a distribution directory. You want to be able to have the right version in both cases, but if you don't have normalized files to start you have some work to do. That's a bit beyond the scope of this Item (and a much longer discussion) that I might cover later.

Things to remember

  • Perl doesn't normalize variable names. It's a bug.
  • Normalize your Perl source one way or the other.
  • If you depend on a particular normalization in a string, force it explicitly.


Intercept warnings with a __WARN__ handler

Perl defines two internal pseudo-signals that you can trap. There’s one for die, which I covered elsewhere and eventually told you not to use. There’s also one for warn that’s quite safe to use when you need to intercept warnings.

To catch a warning, you set a signal handler for the __WARN__ pseudo-signal. The underscores around the name distinguish it from the external signals, such as INT and USR1. The value can be the name of a subroutine or a reference to a subroutine:

$SIG{__WARN__} = 'some_sub';
$SIG{__WARN__} = \&some_sub;
$SIG{__WARN__} = sub { ... };

Replacing the default behavior is a good use for a __WARN__ handler. The cluck subroutine from Carp turns the warning message into a backtrace. If you want that for all warnings, you set it up as early as possible:

BEGIN { $SIG{__WARN__} = \&Carp::cluck; }

You don’t need to change all warnings for the entire program, though. If you need to track down the code that triggers the warning, you probably want to limit your replacement behavior to the code you’re investigating:

{
local $SIG{__WARN__} = \&Carp::cluck;
...;
}
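The handler doesn’t have to print anything. As a minimal sketch, here is a localized handler that collects the messages in an array (the name @caught is only for this example) so you can inspect them afterward:

```perl
use strict;
use warnings;

my @caught;    # hypothetical name for the collected messages

{
    # The handler receives the warning message as its first argument
    local $SIG{__WARN__} = sub { push @caught, $_[0] };

    warn "first warning\n";

    my $undef;
    my $text = "value: $undef";    # an "uninitialized" warning, caught too
}

print "Caught ", scalar @caught, " warnings\n";
print " * $_" for @caught;
```

Because of the local, the previous handler (or the default behavior) comes back as soon as the block ends.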

Let’s have more fun, though.

Something more fun

You can get fancier, though, because you can do almost anything you like. Here’s a little program that issues several warnings, which at first you won’t intercept. This is a nonsense program that only exists to generate warnings, some of which you may have never seen before. Note the use of Perl 5.12 for the completely legal .... Since you never call chomp, the runtime never gets a chance to make those fatal errors even though they compile just fine:

use warnings;
use v5.12;

sub chomp { ... };
*chomp = sub { ... };
chomp( $ARGV[0] );
my $sum;
exec 'Buster';
print (STDOUT), 1, 2, 3;
print $a;
accept( SOCKET, GENERIC );
connect( SOCKET, 'Mimi' );
chmod 777, 'Mimi';
open FOO, '|Buster|';
close FOO;

say 'At the end!';

The warnings are legion:

Ambiguous call resolved as CORE::chomp(), qualify as such or use & at warnings line 8.
print (...) interpreted as function at warnings line 14.
Useless use of a constant (2) in void context at warnings line 14.
Useless use of a constant (3) in void context at warnings line 14.
Statement unlikely to be reached at warnings line 14.
        (Maybe you meant system() when you said exec()?)
Name "main::a" used only once: possible typo at warnings line 15.
Name "main::GENERIC" used only once: possible typo at warnings line 17.
Subroutine main::chomp redefined at warnings line 6.
Use of uninitialized value $ARGV[0] in scalar chomp at warnings line 8.
Can't exec "Buster": No such file or directory at warnings line 11.
Use of uninitialized value $_ in print at warnings line 14.
Use of uninitialized value $a in print at warnings line 15.
accept() on unopened socket GENERIC at warnings line 17.
connect() on unopened socket SOCKET at warnings line 18.
Can't open bidirectional pipe at warnings line 20.
Can't exec "Buster": No such file or directory at warnings line 20.
At the end!

Suppose that you wanted to count those warnings, though? You could set up a handler. Since many of those warnings are from the compile-phase, you have to set up the handler at compile time by using a BEGIN block:

use strict;
use warnings;
use v5.12;

BEGIN {
	$SIG{__WARN__} = sub {
		state $count = 0;
		printf '[%04d] %s', $count++, @_;
		};
	}

sub chomp { ... };
*chomp = sub { ... };
chomp( $ARGV[0] );
my $sum;
exec 'Buster';
print (STDOUT), 1, 2, 3;
print $a;
accept( SOCKET, GENERIC );
connect( SOCKET, 'Mimi' );
chmod 777, 'Mimi';
open FOO, '|Buster|';
close FOO;

say 'At the end!';

Now you see each warning has a number:

[0000] Ambiguous call resolved as CORE::chomp(), qualify as such or use & at warnings line 14.
[0001] print (...) interpreted as function at warnings line 17.
[0002] Useless use of a constant (2) in void context at warnings line 17.
[0003] Useless use of a constant (3) in void context at warnings line 17.
[0004] Statement unlikely to be reached at warnings line 17.
[0005]  (Maybe you meant system() when you said exec()?)
[0006] Name "main::a" used only once: possible typo at warnings line 18.
[0007] Name "main::GENERIC" used only once: possible typo at warnings line 19.
[0008] Subroutine main::chomp redefined at warnings line 13.
[0009] Use of uninitialized value $ARGV[0] in scalar chomp at warnings line 14.
[0010] Can't exec "Buster": No such file or directory at warnings line 16.
[0011] Use of uninitialized value $_ in print at warnings line 17.
[0012] Use of uninitialized value $a in print at warnings line 18.
[0013] accept() on unopened socket GENERIC at warnings line 19.
[0014] connect() on unopened socket SOCKET at warnings line 20.
[0015] Can't open bidirectional pipe at warnings line 22.
[0016] Can't exec "Buster": No such file or directory at warnings line 22.
At the end!

That’s interesting, but it can be even more interesting. Can you label the ones that are from the compile phase? You can check the phase with the ${^GLOBAL_PHASE} variable added to Perl 5.14:

use v5.14;

BEGIN {
	$SIG{__WARN__} = sub {
		state $count = 0;
		printf '[%04d] %s - %s', $count++, ${^GLOBAL_PHASE}, @_;
		};
	}

# ... rest of program

Now the output shows the phase too:

[0000] START - Ambiguous call resolved as CORE::chomp(), qualify as such or use & at warnings line 14.
[0001] START - print (...) interpreted as function at warnings line 17.
[0002] START - Useless use of a constant (2) in void context at warnings line 17.
[0003] START - Useless use of a constant (3) in void context at warnings line 17.
[0004] START - Statement unlikely to be reached at warnings line 17.
[0005] START -  (Maybe you meant system() when you said exec()?)
[0006] START - Name "main::a" used only once: possible typo at warnings line 18.
[0007] START - Name "main::GENERIC" used only once: possible typo at warnings line 19.
[0008] RUN - Subroutine main::chomp redefined at warnings line 13.
[0009] RUN - Use of uninitialized value $ARGV[0] in scalar chomp at warnings line 14.
[0010] RUN - Can't exec "Buster": No such file or directory at warnings line 16.
[0011] RUN - Use of uninitialized value $_ in print at warnings line 17.
[0012] RUN - Use of uninitialized value $a in print at warnings line 18.
[0013] RUN - accept() on unopened socket GENERIC at warnings line 19.
[0014] RUN - connect() on unopened socket SOCKET at warnings line 20.
[0015] RUN - Can't open bidirectional pipe at warnings line 22.
[0016] RUN - Can't exec "Buster": No such file or directory at warnings line 22.
At the end!
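You can go one step further and keep a separate count for each phase. This is only a sketch of one way to do that: the hash name %warning_count is an assumption for this example, and the two statements at the end exist just to trigger one compile-time and one run-time warning:

```perl
use v5.14;
use warnings;

our %warning_count;    # hypothetical name: phase => warnings seen so far

BEGIN {
    $SIG{__WARN__} = sub {
        my $phase = ${^GLOBAL_PHASE};

        # Post-increment, so each phase starts counting at zero
        printf '%s-%04d  %s', $phase, $warning_count{$phase}++, $_[0];
    };
}

my $masked = 1;
my $masked = 2;                 # compile-time: masks the earlier "my"

warn "a run-time warning\n";    # run-time
```

When you run this as a program, the masking warning shows up under START and the explicit warn under RUN.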

If you keep a separate count for each phase, each phase has its own warning counter:

START-0000  Ambiguous call resolved as CORE::chomp(), qualify as such or use & at warnings line 15.
START-0001  print (...) interpreted as function at warnings line 18.
START-0002  Useless use of a constant (2) in void context at warnings line 18.
START-0003  Useless use of a constant (3) in void context at warnings line 18.
START-0004  Statement unlikely to be reached at warnings line 18.
START-0005      (Maybe you meant system() when you said exec()?)
START-0006  Name "main::a" used only once: possible typo at warnings line 19.
START-0007  Name "main::GENERIC" used only once: possible typo at warnings line 20.
RUN-0000  Subroutine main::chomp redefined at warnings line 14.
RUN-0001  Use of uninitialized value $ARGV[0] in scalar chomp at warnings line 15.
RUN-0002  Can't exec "Buster": No such file or directory at warnings line 17.
RUN-0003  Use of uninitialized value $_ in print at warnings line 18.
RUN-0004  Use of uninitialized value $a in print at warnings line 19.
RUN-0005  accept() on unopened socket GENERIC at warnings line 20.
RUN-0006  connect() on unopened socket SOCKET at warnings line 21.
RUN-0007  Can't open bidirectional pipe at warnings line 23.
RUN-0008  Can't exec "Buster": No such file or directory at warnings line 23.
At the end!

This leads to a deliciously evil plan: what if you could stop your program from running if it had more warnings than it did on the last run? The Test::Perl::Critic::Progressive module already does something similar for Perl::Critic. Inside this __WARN__ handler, you can use a die to stop the program:

use strict;
use warnings;
use v5.12;

BEGIN {
	my $file = "$0.warn";
	my $count = {};

	$SIG{__WARN__} = sub { # refactor when you figure it out
		state $previous_counts = do {
			unless( -e $file ) { my $hash = {} }
			else {
				local @ARGV = $file;
				my $hash;
				while( <> ) {
					chomp;
					my( $phase, $count ) = split;
					$hash->{$phase} = $count;
					}
				$hash;
				}
			};

		$count->{${^GLOBAL_PHASE}}++;

		die "Too many warnings in ${^GLOBAL_PHASE}\n"
			if $count->{${^GLOBAL_PHASE}} >
				( $previous_counts->{${^GLOBAL_PHASE}} // 0 ); #/

		printf '%s-%04d  %s',
			${^GLOBAL_PHASE}, $count->{${^GLOBAL_PHASE}}, @_;

		};

	END { # inside a BEGIN!
		open my $f, '>', $file;
		while( my( $k, $v ) = each %$count ) {
			say $f "$k $v";
			}
		}
	}

sub chomp { ... };
*chomp = sub { ... };

# chomp( @ARGV );  # uncomment for another warning

say 'At the end!';

When you run this, the program stops when it encounters more warnings than it did before:

$ perl5.14.1 warnings
START-0001  Ambiguous call resolved as CORE::chomp(), qualify as such or use & at warnings line 46.
Too many warnings in RUN
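The same idea works on a smaller scale. Here is a minimal sketch, with made-up names and messages, that promotes any warning inside one block to an exception and traps it with eval:

```perl
use strict;
use warnings;

my $error;    # hypothetical name for the trapped message

{
    # Inside this block, every warning becomes a die
    local $SIG{__WARN__} = sub { die "Promoted warning: $_[0]" };

    my $ok = eval {
        warn "disk is almost full\n";
        print "never reached\n";
        1;
    };
    $error = $@ unless $ok;
}

print $error;
```

The die inside the handler propagates out of the warn call, so the eval ends early and $@ holds the promoted message.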

Things to remember

  • You can intercept warnings with $SIG{__WARN__}
  • Set up $SIG{__WARN__} in a BEGIN to intercept warnings right away


Know the difference between utf8 and UTF-8

Perl actually has two encodings that get the letters u, t, f, and 8. One will happily let you do bad things, and the other will let you do bad things but with a warning that you can make fatal.

There’s an encoding layer with the name :utf8 and there’s the encoding name UTF-8 that you use with :encoding:

binmode $fh, ':utf8';
binmode $fh, ':encoding(UTF-8)';

You can even use the non-hyphen version with :encoding:

binmode $fh, ':encoding(UTF8)';

These aren’t the same thing. The :utf8 layer comes from Perl 5.6, the first version of Perl that had even rudimentary Unicode support. It encodes any characters in the range from 0 to 0xFFFF_FFFF. That is, it allows for a 32-bit encoding space. You have no problem with this code:

use 5.014;
use strict;
use warnings;

my $string = "invalid -> \x{110000}";
my $output;

{
open my $string_fh, '>:utf8', \$output;
print $string_fh $string;
}

{
open my $string_fh, '<:raw', \$output;
my @values = map { sprintf '%X', ord } split //, readline( $string_fh );
say join ' ', @values;
}

This code writes to a string filehandle using the loose utf8 encoding and opens another read filehandle using the raw layer so you can see the bytes without any processing. The output shows the bytes in the output. The F4 90 80 80 represents the invalid character:

69 6E 76 61 6C 69 64 20 2D 3E 20 F4 90 80 80

Going the other way, reading in the file with the same encoding, doesn't cause any problems either.

use 5.014;
use strict;
use warnings;

my $string = "invalid -> \x{110000}";
my $output;

{
open my $string_fh, '>:utf8', \$output;
print $string_fh $string;
}

{
open my $string_fh, '<:utf8', \$output;
my @values = map { sprintf '%X', ord } split //, readline( $string_fh );
say join ' ', @values;
}

When you use the same layer to read the data, you get the same characters you started with. Instead of F4 90 80 80 you get 110000:

69 6E 76 61 6C 69 64 20 2D 3E 20 110000

However, the highest valid code number in the Universal Character Set is 0x10FFFF, and even some of the characters inside that range aren't valid in UTF-8, such as the surrogates in the range 0xD800-0xDFFF, which you use to encode characters in the supplementary planes in UTF-16. If none of that makes sense, just remember that UTF-16 comes from the time when we thought the UCS would be a 16-bit encoding space and that two bytes would be enough for everyone (and how often has that not been true in history?). The "characters" in the surrogate range aren't characters. They are an ugly hack to let an ancient 16-bit system deal with a 21-bit system. You shouldn't be able to successfully read those characters.

use 5.014;
use strict;
use warnings;

my $string = "invalid -> \x{D800}";
my $output;

{
open my $string_fh, '>:utf8', \$output;
print $string_fh $string;
}

{
open my $string_fh, '<:utf8', \$output;
my @values = map { sprintf '%X', ord } split //, readline( $string_fh );
say join ' ', @values;
}

This code at least emits a warning:

Unicode surrogate U+D800 is illegal in UTF-8 at invalide.pl line 9.
69 6E 76 61 6C 69 64 20 2D 3E 20 D800

You only get this warning if you turn on warnings in Perls 5.10 and 5.12, but you get it even without warnings in Perl 5.14. But, it still works.

Try any of this with the actual UTF-8 encoding though, and odd things ensue:

use 5.010;
use strict;
use warnings;

my $string = "invalid -> \x{D800}";
my $output;

{
open my $string_fh, '>:utf8', \$output;
print $string_fh $string;
}

{
open my $string_fh, '<:encoding(UTF-8)', \$output;
my @values = map { sprintf '%X', ord } split //, readline( $string_fh );
say join ' ', @values;
}

The output gives two different warnings, and some odd output:

Unicode surrogate U+D800 is illegal in UTF-8 at invalide.pl line 10.
utf8 "\xD800" does not map to Unicode at invalide.pl line 15.
69 6E 76 61 6C 69 64 20 2D 3E 20 5C 78 7B 44 38 30 30 7D

That output is much longer than the previous output. Now you get 5C 78 7B 44 38 30 30 7D. If you know your code points, you'll recognize that as the literal characters \x{D800}.

You can convince yourself that this happens by creating the encoded string directly:

use 5.010;
use strict;
use warnings;

my $string = pack 'C*', map { hex } split /\s/,
	'69 6E 76 61 6C 69 64 20 2D 3E 20 ED A0 80';
say $string;

open my $string_fh, '<:encoding(UTF-8)', \$string;
my $read = readline( $string_fh );
say $read;
my @values = map { sprintf '%X', ord } split //, $read;
say join ' ', @values;

You get the same output, still with a warning:

invalid -> ̆Ä
utf8 "\xD800" does not map to Unicode at invalide.pl line 10.
invalid -> \x{D800}
69 6E 76 61 6C 69 64 20 2D 3E 20 5C 78 7B 44 38 30 30 7D

This is a problem. The data you get aren't the data that are in the file. Writing the data with UTF-8 doesn't preserve it either:

use 5.010;
use strict;
use warnings;

my $string = "invalid -> \x{D800}";
my $output;

{
open my $string_fh, '>:encoding(UTF-8)', \$output;
print $string_fh $string;
}

{
open my $string_fh, '<:raw', \$output;
my @values = map { sprintf '%X', ord } split //, readline( $string_fh );
say join ' ', @values;
}

The output is:

"\x{d800}" does not map to utf8 at invalide.pl line 9.
69 6E 76 61 6C 69 64 20 2D 3E 20 5C 78 7B 44 38 30 30 7D

Huh? Perl will happily write the data, changing it on the way out. That's no good. Why is this happening?

There are several ways that Perl can deal with bad data as it encodes. That's not to say any of them are how Perl should deal with those data, but that's not the point. In this case, the Encode module is using its internal perlqq mode. When it finds an invalid character, it turns it into its code number and puts \x{} around it. If you were using the Encode module directly, you would have control over those invalid characters.

use 5.010;
use strict;
use warnings;

use Encode qw(encode :fallbacks);

my $string = "invalid -> \x{D800}";

$string = encode( 'UTF-8', $string, FB_PERLQQ ); # what you already have

say 'The string is now[ ', $string, ']';

The output is what you got before (but without a warning because its handling is explicit):

The string is now[ invalid -> \x{D800}]

The other constants give different results:

Constant     Effect                                   String
FB_PERLQQ    Replace with Perl escape                 Convert to \x{NNNN}
FB_XMLCREF   Replace with XML entity                  Convert to &#xdddd;
FB_HTMLCREF  Replace with HTML entity                 Convert to &#dddddd;
FB_DEFAULT   Replace with the substitution character  Convert to �
FB_CROAK     Die
FB_QUIET     Stop encoding, with no warning
FB_WARN      Stop encoding, with a warning
You probably don't want to handle everything at that level in most cases, though. If you have invalid data, you need to fix that before it gets out to the world. You have the warning though. That means that you can make that operation fatal without going through Encode:

use warnings qw(FATAL utf8);
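If you do go through Encode, FB_CROAK gives you a similar fail-fast behavior for a single call. This sketch contrasts the lax utf8 encoding with the strict UTF-8 encoding on a lone surrogate; the variable names are only for illustration:

```perl
use strict;
use warnings;
no warnings 'utf8';    # keep the surrogate itself quiet in this sketch

use Encode qw(encode :fallbacks);

my $string = "\x{D800}";    # a lone surrogate, not a valid character

# The lax "utf8" encoding writes the surrogate out as three octets
my $lax     = encode( 'utf8', $string );
my $lax_hex = join ' ', map { sprintf '%02X', ord } split //, $lax;
print "lax utf8: $lax_hex\n";

# The strict "UTF-8" encoding with FB_CROAK refuses it.
# Work on a copy because encode may modify its argument when you
# give it a CHECK value.
my $copy = $string;
my $died = eval { encode( 'UTF-8', $copy, FB_CROAK ); 1 } ? 0 : 1;
print "strict UTF-8 died: ", ( $died ? 'yes' : 'no' ), "\n";
```

The lax encoding produces ED A0 80, the same octets that the pack example used earlier, while the strict encoding throws an exception that you can handle.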

Things to remember

  • The :utf8 encoding, and variations on it without a hyphen, is Perl's looser encoding.
  • Using UTF-8, in any case and with either a hyphen or underscore, is the strict, valid encoding and gives a warning for invalid sequences.
  • Only use the :encoding(UTF-8) and make its warnings fatal.


Know the difference between character strings and UTF-8 strings

Normally, you shouldn’t have to care about a string’s encoding. Indeed, the abstract string has no encoding. It exists as an idea without a representation, and it’s not until you want to put it on disk, send it down a pipe, or otherwise force it to exist as electrical pulses, magnetic pole orientation, and so on, that it needs one. All stored data, even ASCII, has an encoding. Until you force it to have a bit pattern to live in the tangible world, you shouldn’t have to worry about anything like an encoding.

An abstract character string is one where Perl can recognize each grapheme cluster as a unit, and there is no encoding involved at the user level. Perl has to store these, but you don’t (shouldn’t) play with the string at that level.

A UTF-8–encoded string is one where the octets in the string are the same as in the UTF-8 representation. Perl sees a string of octets and cannot recognize grapheme clusters.

Consider this example:

use v5.14;
use utf8;

# # # Abstract character string
my $char_string = 'Büster';

say "Length of char string is ", length $char_string; #6
say join " ", map { sprintf '%X', ord } split //, $char_string;

# # # UTF-8–encoded octet string
open my $fh, '>:utf8', \my $utf8_string;
print $fh $char_string;
close $fh;

say "Length of utf8 string is ", length $utf8_string; # 7
say join " ", map { sprintf '%X', ord } split //, $utf8_string;

The output shows that the two are different things because one is a string of characters and one is a string of octets. In the character string, the ü shows up as the single character with code number 0xFC. In the UTF-8 version, the code number 0xFC is represented as 0xC3 0xBC. Since this is just a string of octets, Perl thinks that this version is one character longer:

Length of char string is 6
42 FC 73 74 65 72
Length of utf8 string is 7
42 C3 BC 73 74 65 72
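You can move between the two kinds of strings with the Encode module instead of a string filehandle. This is a small sketch with illustrative variable names; it spells the ü as an escape so the example doesn’t depend on how the source file is saved:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $char_string = "B\x{FC}ster";                      # six characters
my $octets      = encode( 'UTF-8', $char_string );    # seven octets
my $round_trip  = decode( 'UTF-8', $octets );         # six characters again

printf "characters: %d\n", length $char_string;
printf "octets:     %d\n", length $octets;
printf "round trip: %s\n",
    $round_trip eq $char_string ? 'same string' : 'different string';
```

Encoding adds the extra octet for the ü, and decoding takes you back to the six-character string you started with.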

For most of your programming, you shouldn’t have to care about encoding. You want to have character data with no representation and operate on abstract characters. You don’t care at all about the encoding and how many bytes a character turns into. That’s merely a storage issue. Virtually no one can tell you, off the top of their heads, what the UTF-8 representation of a string is because no one thinks in UTF-8. No one wants to do that during string manipulation, either.

The problem is that some interfaces want the encoded data instead of the abstract character string. These modules usually expect that you’re giving them data directly from another source without turning it into a Perl string. If you need to review these concepts, check out the “Unicode” chapter in Effective Perl Programming.

Consider the JSON module’s decode_json function, which expects a UTF-8–encoded string, thinking you’re going to take it directly from an HTTP response. This item is not about using this module correctly, but it’s a convenient example for the general idea.

This works just fine because the value in $json_data is a UTF-8–encoded string instead of an abstract character string:

use JSON;
use LWP::Simple qw(get);

my $json_data = get( 'http://www.example.com/data.json' );

my $perl_hash = decode_json( $json_data );

The decode_json function doesn’t expect you to do anything with the data that you get from the website before you give it to decode_json, whose job it is both to decode the data and to convert the data from JSON to Perl. It’s documented this way. Instead of making you decode the response, it uses the data just as you would get it in the message body of the HTTP response.

If you are doing extra processing, however, you can get in trouble. For instance, the HTTP::Response object can decode the message body for you, turning UTF-8 data into an abstract character string. If you call decoded_content and pass the result to decode_json, it fails:

use Encode;
use JSON;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

my $response = $ua->get( 'http://www252.pair.com/~comdog/for/data.json' );

my $content = $response->content;
print "Length content is ", length $content, "\n";

my $decoded_content = $response->decoded_content;
print "Length decoded content is ", length $decoded_content, "\n";

# this is fine
my $perl_hash = decode_json( $content );

# this is not fine
my $decoded_hash = decode_json( $decoded_content );

If you have your input string as an abstract character string, the decode_json function might fail. If it’s all characters in the ASCII range, it doesn’t matter because the UTF-8 representation is the same as the ASCII representation:

use utf8;
use JSON;

my $json_data = q( { "cat" : "Buster" } );

my $perl_hash = decode_json( $json_data );

Give it something outside the ASCII range, and things go wrong:

use utf8;
use JSON;

my $json_data = qq( { "cat" : "Büster" } );

my $perl_hash = decode_json( $json_data );

The error says it has a malformed UTF-8 character. In an abstract character string, the ü is 0xFC, which isn’t a valid UTF-8 sequence:

malformed UTF-8 character in JSON string, at character offset 13 (before "\x{33d25ca2} } ") at string.pl line 6.

In this case, you need to turn your abstract character string into a UTF-8–encoded string, just as it would look if you had stored it in a file. You can encode it (going from the abstract character string to the UTF-8 version) with the Encode module (Item 75. Convert octet strings to character strings.):

use utf8;
use Encode qw(encode_utf8);
use JSON;

my $json_data = qq( { "cat" : "Büster" } );
$json_data = encode_utf8( $json_data );

my $perl_hash = decode_json( $json_data );

You can also print to a scalar reference, using the encoding that you need (Item 54. Open filehandles to and from strings):

use utf8;
use JSON;

my $json_data = qq( { "cat" : "Büster" } );
open my $fh, '>:utf8', \my $utf8_string;
print $fh $json_data;
close $fh; # make sure everything is in $utf8_string

my $perl_hash = decode_json( $utf8_string );

If you already have the text in a file and need it un-decoded, you can read it with the :raw layer so perl does not decode it (which it might otherwise do through default layers set far away):

use JSON qw(decode_json);

my $file = 'data.json'; # hypothetical filename for this example

open my $fh, '<:raw', $file or die "Could not open $file: $!";
my $json_data = do { local $/; <$fh> };

my $perl_hash = decode_json( $json_data );

Doing it differently in JSON

You don’t have to use JSON’s decode_json function. Using the object interface, you can tell the decoder what you’re giving it. If you want to give it a UTF-8–encoded string, you tell it to expect UTF-8:

use JSON;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

my $response = $ua->get( 'http://www.example.com/data.json' );

my $content = $response->content;

my $perl_hash = JSON->new->utf8->decode( $content );

If you want to give it character data, you don’t tell the object to expect UTF-8:

use JSON;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

my $response = $ua->get( 'http://www.example.com/data.json' );

my $decoded_content = $response->decoded_content;

my $perl_hash = JSON->new->decode( $decoded_content ); # no ->utf8
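You can try both paths without a network connection. This sketch uses JSON::PP, which comes with Perl 5.14 and later and provides the same object methods used here; swapping it in for JSON is an assumption made to keep the example self-contained, and the variable names are only for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);
use JSON::PP;

my $chars  = qq({ "cat" : "B\x{FC}ster" });    # character string
my $octets = encode( 'UTF-8', $chars );        # UTF-8 octets

# Octets go through ->utf8, characters go through the plain decoder
my $from_octets = JSON::PP->new->utf8->decode($octets);
my $from_chars  = JSON::PP->new->decode($chars);

printf "%s\n",
    $from_octets->{cat} eq $from_chars->{cat} ? 'same' : 'different';
printf "%d characters\n", length $from_chars->{cat};
```

Both decoders give you the same six-character name as long as you tell each one the truth about what you are handing it.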

Things to remember

  • A character string has no encoding, and Perl can recognize its grapheme clusters
  • An encoded string is a series of octets that Perl doesn’t recognize as grapheme clusters
  • Check your interface to see which one you should use
