Don’t use auto-dereferencing with each or keys

Perl 5.14 added an auto-dereferencing features to the hash and array operators, and I wrote about those in Use array references with the array operators. I’ve never particularly liked that feature, but I don’t have to like everything. Additionally, Perl 5.12 expanded the job of keys and values to also work on arrays.

chromatic has explicated a problem with each, which is both an array and hash operator. He details it in Inadvertent Inconsistencies: each in Perl 5.12 and Inadvertent Inconsistencies: each versus Autoderef. In short, if you use it with a reference, Perl doesn’t know until it actually executes the each if it’s going to use it’s array or hash behavior (and in some cases, blow up with either). However, as the programmer, I probably know which behavior I want:

while( my( $index, $value ) = each $ref ) { my $elem = $other_array->[$index]; } # I want array behavior
while( my( $key, $value ) = each $ref ) { ... } # I want hash behavior

The problem isn’t when it blows up, which is easy to catch (it blows up). If you get the wrong sort of reference, you’ll get nonsensical indices or keys. If you have an array reference, you’ll get numbers with the first return value. If you have a hash reference, you’ll get strings. If you get strings but treat them as array indices, you’ll likely always get array index 0, unless the key is a number. You might even get an odd index. If the key is 123Buster, you’ll get array index 123 due to Perl’s numification. Going the other way, using an array reference when you expected a hash, you’ll have to find keys that are whole numbers.

Effective programs reduce ambiguity in their code, but this new feature increases it. It’s easy to fix; you dereference them yourself. If you have the wrong reference type, you’ll find out right away:

while( my( $index, $value ) = each @$ref ) { my $elem = $other_array->[$index]; } # I want array behavior
while( my( $key, $value ) = each %$ref ) { ... } # I want hash behavior

If you really wanted to keep the auto-dereferencing feature, you could check the reference type before you use it, but what’s the point of saving a character with the auto-dereferencing if you have to wrap the whole thing in a guard condition?

if( ref $ref eq ref [] ) {
    while( my( $index, $value ) = each @$ref ) { ... }
    }

Now keys has the same problem. You can use that either with an array or a hash, but at some point you’re probably going to have to know what sort of reference you have so you can use the key to dereference it. At that point, you effectively declare what sort of reference it should have been. If you have the wrong sort of reference, your script dies:

my $ref = [ ... ];
foreach my $key ( keys $ref ) {
    my $elem = $ref->{$index}; # Big error!
    }

This problem is the unintended consequence of letting the other array and hash operators take a scalar variable as an argument and letting the parser automatically add the bits to dereference. David Golden wanted more magic syntax and the patch wasn’t so tough. To get the nicer syntax in some cases you end up dealing with more special cases. I noted this at the time David proposed it, but his enthusiasm for the interesting parts of the problem steamrolled over the bad parts.

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit

Look up Unicode properties with an inversion map

Perl comes with extracts of the Unicode character data, but it hasn’t been easy to look up all of the information Perl knows about a character. Perl v5.15.7 adds a way to created an inverted map based on the property that you want to access.

The Unicode::UCD module gives you access to some of the information about a character:

use Unicode::UCD 'charinfo';
use charnames qw(:full);
use Data::Dumper;

my $charinfo   = charinfo(
	ord( "\N{SMILING CAT FACE WITH OPEN MOUTH}" )
	);
print Dumper( $charinfo );

The output has many of the properties, but not all of them:

$VAR1 = {
		  'digit' => '',
		  'bidi' => 'ON',
		  'category' => 'So',
		  'code' => '1F63A',
		  'script' => 'Common',
		  'combining' => 0,
		  'upper' => '',
		  'name' => 'SMILING CAT FACE WITH OPEN MOUTH',
		  'unicode10' => '',
		  'decomposition' => '',
		  'comment' => '',
		  'mirrored' => 'N',
		  'lower' => '',
		  'numeric' => '',
		  'decimal' => '',
		  'title' => '',
		  'block' => 'Emoticons'
		};

This doesn’t include the Age of the character, that is, when the character was added to Unicode. This might seem like a silly thing to know, but it came in handy typesetting Programming Perl. We had problems with some characters but we couldn’t see a pattern until we looked at the age of all the problem characters. Any character added after Unicode 4.0 didn’t typeset correctly. It took some annoying work to get the age by scanning through each age until that property matched:

#!/Users/brian/bin/perls/perl5.15.7

use v5.10;
use utf8;

use List::Util qw(first);

my @chars =  ( 'a', '→', '⣽', "\N{SMILING CAT FACE WITH OPEN MOUTH}" );

my @ages = qw( 1.1 2.1 2.0 3.0 3.1 3.2 4.0 4.1 5.0 5.1 5.2 6.0 );

foreach my $char ( @chars ) {
	my $age = first { $char =~ /\p{Age=$_}/ } @ages;
	say "Age: $age";
	}

It works, but it’s an unsatisifying kludge:

a Age: 1.1
→ Age: 1.1
⣽ Age: 3.0
😺 Age: 6.0

Now, Unicode::UCD has a prop_invmap to create an index based on a property you choose and a _search_invlist to return the offset in the map:

#!/Users/brian/bin/perls/perl5.15.7

use 5.15.7;
use utf8;

use charnames qw(:full);
use List::Util qw(first);
use Unicode::UCD;

my @chars =  ( 'a', '→', '⣽', "\N{SMILING CAT FACE WITH OPEN MOUTH}" );

my @ages = qw( 1.1 2.1 2.0 3.0 3.1 3.2 4.0 4.1 5.0 5.1 5.2 6.0 );

foreach my $char ( @chars ) {
	my $age = age_of_char( $char );
	say "$char Age: $age";
	}

sub age_of_char {
	my( $char ) = @_;
	# create the inverted list, once
	# can only initialize as scalar
	state $inv = _make_age_inverted_list();

	my $i = Unicode::UCD::_search_invlist($inv->[0], ord $char);
	return $inv->[1][$i];
	}

# create the inverted list, once
sub _make_age_inverted_list {
	state( $list, $map, $format, $default, $init );
	unless( $init++ ) {
		($list, $map, $format, $default) = Unicode::UCD::prop_invmap("Age");
		$format eq "s" || die "wrong format $format";
		}
	return [ $list, $map ];
	}

That looks like a lot of work, but most of it happens once to setup the inversion map.

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit

Fold cases properly

You might think that you know how to compare strings regardless of case, and you’re probably wrong. After you read this Item, you’ll be able to do it correctly and without doing any more work than you were doing before. Perl handles all the details for you.

If you grew up in the ASCII world, case insensitivity is a difference of literally one bit, so changing case is setting or unsetting a bit in the octet that represents that character.

If you’ve read the Perl FAQ, you may have seen this quip:

“Perl” is the name of the language. Only the “P” is capitalized. The name of the interpreter (the program which runs the Perl script) is “perl” with a lowercase “p”.

When Larry Wall was asked what the difference between “Perl” and “perl”, he said “One bit”. It’s literally a difference of flipping one bit in the ASCII representation. That’s as complicated as ASCII case folding gets.

The capital letter P has the ordinal value 0b1010000. The small letter p, which shows up later in the ASCII sequence, has the ordinal value 0b1110000. This makes it extremely easy to write routines to change between upper and lower cases:

use v5.10;

say "  U L";
say "-----";

foreach my $char ( qw(p P a b c A B C) ) {
	my $lower = chr( ord($char) | 0b0100000 );
	my $upper = chr( ord($char) & 0b1011111 );

	say "$char $upper $lower";
	}

The output shows what you’d expect for the upper and lower cases:

  U L
-----
p P p
P P p
a A a
b B b
c C c
A A a
B B b
C C c

Since bit flipping is easy to do, it’s very easy for even primitive computers to quickly change case (assuming that you’re not so primitive as to not have two cases). But, this only works if you restrict the output to the ASCII letters. If you want to handle non-letters, you have to do a bit more work to ensure that you don’t shift them into other characters:

use v5.10;

say "  U L";
say "-----";

foreach my $char ( qw(p P a b c A B C # !) ) {
	my $upper = uppercase( $char );
	my $lower = lowercase( $char );

	say "$char $upper $lower";
	}

 sub lowercase {
 	my $_ = shift;
  	my $ord = ord();

 	return $_ unless $ord >= 0x41 and $ord <= 0x5A;
	return chr( $ord ^ 0b100000 );
	}

 sub uppercase {
 	my $_ = shift;
 	my $ord = ord();

 	return $_ unless $ord >= 0x61 and $ord <= 0x7A;
	return chr( $ord ^ 0b100000 );
	}

Now the non-letters stay the same character:

  U L
-----
p P p
P P p
a A a
b B b
c C c
A A a
B B b
C C c
# # #
! ! !

This almost works for Latin-* encodings too. When you move out of the ASCII sequence into Unicode, you don't have this luxury, and it's not merely a representational issue.

If you were infected with ASCII early, you've grown up thinking that you can go back and forth between upper and lower cases and always get the same result. Outside of ASCII, that's not necessarily true. Consider the word "Reichwaldstraße", a common street name in Germany. The "straße" has the special character ß (U+00DF ʟᴀᴛɪɴ ꜱᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ꜱʜᴀʀᴘ ꜱ). which is a ligature of a long s, the fancy ſ (U+017F ʟᴀᴛɪɴ ꜱᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ʟᴏɴɢ ꜱ) that you may have seen in historical documents, and the familiar short s. Put them together, ſs, and move them close enough and you can see how you would end up with ß once you connect the hanging portion of the long s with the top of the short s. The UCS has an uppercase version (U+1E9E ʟᴀᴛɪɴ ᴄᴀᴘᴛɪᴀʟ ʟᴇᴛᴛᴇʀ ꜱʜᴀʀᴘ ꜱ), although no one uses it aside from saying that no one uses it. U+1E9E lowercases to U+00DF, but U+00DF has no single character uppercase version; it's the two characters SS. The lowercase of SS, however, is ss:

use utf8;

my $string = "Reichwaldstraße";

my $upper = uc( $string );
my $lower = lc( $upper  );

print <<"HERE";
Started with: $string
Upper:        $upper
Lower:        $lower
HERE

The output shows that you don't get back to the original:

Started with: Reichwaldstraße
Upper:        REICHWALDSTRASSE
Lower:        reichwaldstrasse

There's another s that causes problems: the Greek sigma, which comes in two lowercase forms. One appears in the middle of words and the other appears at the end, as in όσος, where σ and ς represent the same thing, just in different forms mandated by their position:

use utf8;

my $char = "όσος";

my $upper = uc( $char );
my $lower = lc( $upper );

print <<"HERE";
Started with: $char
Upper:        $upper
Lower:        $lower
HERE

Again, the lowercase version at the end is different than what you started with:

Started with: όσος
Upper:        ΌΣΟΣ
Lower:        όσοσ

This means that you can't merely use lc to normalize text for case insensitive comparison. These won't compare correctly:

lc( "Reichwaldstraße" ) eq lc( "REICHWALDSTRASSE" );  # Nope!
lc( 'όσος' ) eq lc( 'ΌΣΟΣ' );                         # Nope!

You might object that these are different strings and that they shouldn't be the same, but where did these strings start? Perhaps that REICHWALDSTRASSE was not originally all uppercase, but changed by some stupid filters between you and the original information (and with a name like mine, I know about stupid casing filters). That's part of the ASCII infection.

So, lc is the wrong way. Sadly, we do this incorrectly in Learning Perl, when we show this subroutine we want to sort:

sub case_insensitive { "\L$a" cmp "\L$b" }

The Unicode specification solves this with its case folding rules. In short, it folds characters with different case forms into a common form. There's not a rule for this; they do it by exhaustion, specifying the common form for each fold. The common form is defined in the Unicode Character Database, which the Perl developers have digested into the files you find in the unicore/ directory in your Perl library. Here's a few lines from unicore/CaseFolding.txt:

0050; C; 0070; # LATIN CAPITAL LETTER P
0051; C; 0071; # LATIN CAPITAL LETTER Q
0052; C; 0072; # LATIN CAPITAL LETTER R
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
03A3; C; 03C3; # GREEK CAPITAL LETTER SIGMA
03C2; C; 03C3; # GREEK SMALL LETTER FINAL SIGMA
FB00; F; 0066 0066; # LATIN SMALL LIGATURE FF
FB01; F; 0066 0069; # LATIN SMALL LIGATURE FI
FB02; F; 0066 006C; # LATIN SMALL LIGATURE FL
FB03; F; 0066 0066 0069; # LATIN SMALL LIGATURE FFI
FB04; F; 0066 0066 006C; # LATIN SMALL LIGATURE FFL

The first column is the code number of the original character, the second is the type of folding (explained in the data file and coming up later), and the third column are the code numbers that form the common, folded ("equivalent") version. Essentially, it's a big hash. Notice that some of the folded versions are multiple characters. You're not going to get that with bit fiddling.

Case folding takes the character in the first column and turns them into the characters in the third column, then takes the result and does it again until there are no more folds possible. It keeps doing that until there is nothing to replace. Characters that don't have an entry in this file fold into themselves. You case fold to compare strings, not to normalize strings for storage or other uses. Case folding makes case insensitive comparisons very fast, but it also loses information that you can't recover. You can read the exact rules in Section 5.18, "Case mapping", of the Unicode Standard.

To see how that works, try that with Reichwaldstraße and όσος. All characters except two stay the same, and two use the mapping from unicore/CaseFolding.txt:

  • Reichwaldstraße → reichwaldstrasse
  • REICHWALDSTRASSE → reichwaldstrasse
  • όσος → ΌΣΟΣ
  • ΌΣΟΣ → όσοσ

To implement these operations, Perl v5.16 adds the fc built-in function. Instead of lc, use that:

use v5.15.8;  # until we get v5.16  XXX feature
fc( "Reichwaldstraße" ) eq fc( "REICHWALDSTRASSE" );  # Yep!
fc( 'όσος' ) eq fc( 'ΌΣΟΣ' );                         # Yep!

If you don't have v5.16, you can use the fc front the Unicode::CaseFold module on CPAN.

If you wanted to do this inside a double-quoted string, you can use the \F case shift operator (but be aware of the things we noted in Understand the order of operations in double quoted contexts). Our Learning Perl example could change to:

sub case_insensitive { "\F$a" cmp "\F$b" }

More complicated folds

Looking back at the extract of unicore/CaseFolding.txt, you might remember that I skipped over the second column, the mapping status. Those letters stand for different folding rules:

  • C: common case folding
  • F: full case folding (strings may grow in length)
  • S: simple case folding (map to single characters)
  • T: special case for uppercase I and dotted uppercase I

The "T" status stands in for folds that the general rules can't handle, mostly some characters from Turkish and similar languages.

So far, Perl's fc only handles the "F" status for full case folding. It doesn't handle the special folding you'll find in unicore/SpecialCasing.txt that has the oddball situations, such as multiple source characters folding onto other multiple characters. If you want to handle those, you're on your own, although the Unicode::Casing module on CPAN might help.

Many of the folding rules depend on the source language, so you'll probably want to pay special attention if you are using that language or completely ignore them if you are not.

Besides that, the Universal Character Set gives people much more of a chance to mess up. Suppose that you want to write "β-carotene", that thing you get from carrots. That first character is β (U+03B2 ɢʀᴇᴇᴋ ꜱᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ʙᴇᴛᴀ). Some people might think it looks like ß (U+00DF ʟᴀᴛɪɴ ꜱᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ꜱʜᴀʀᴘ ꜱ), and that's good enough for them. No amount of case folding is going to let you know that someone used an incorrect character. But, this is also one of the benefits of Unicode: characters know what they are.

Another correct way

There's another correct way to check strings regardless of case. You can use the /i flag on the match operator. The Unicode-aware Perl regex engine handles the rest:

use utf8;
use v5.15.7;

use Set::CrossProduct;

my $string = "Reichwaldstraße";

my $upper = uc( $string );
my $lower = lc( $upper  );

my $sets = Set::CrossProduct->new(
	[
	[ $string, $upper, $lower ],
	[ $string, $upper, $lower ],
	]
	);

foreach my $tuple ( $sets->combinations ) {
	my( $l, $r ) = @$tuple;
	next if $l eq $r;

	say "lc($r) eq lc($l)  ? ", lc($r) eq lc($l) ? "matched" : "failed";
	say "fc($r) eq fc($l)  ? ", fc($r) eq fc($l) ? "matched" : "failed";
	say "$r =~ m/$l/i      ? ", $l =~ m/$r/i ? "matched" : "failed";

	say;
	}

In the output, you can see that lc sometimes fails, but that the fc and m//i always works:

lc(REICHWALDSTRASSE) eq lc(Reichwaldstraße)  → failed
fc(REICHWALDSTRASSE) eq fc(Reichwaldstraße)  → matched
REICHWALDSTRASSE =~ m/Reichwaldstraße/i      → matched

lc(reichwaldstrasse) eq lc(Reichwaldstraße)  → failed
fc(reichwaldstrasse) eq fc(Reichwaldstraße)  → matched
reichwaldstrasse =~ m/Reichwaldstraße/i      → matched

lc(Reichwaldstraße) eq lc(REICHWALDSTRASSE)  → failed
fc(Reichwaldstraße) eq fc(REICHWALDSTRASSE)  → matched
Reichwaldstraße =~ m/REICHWALDSTRASSE/i      → matched

lc(reichwaldstrasse) eq lc(REICHWALDSTRASSE)  → matched
fc(reichwaldstrasse) eq fc(REICHWALDSTRASSE)  → matched
reichwaldstrasse =~ m/REICHWALDSTRASSE/i      → matched

lc(Reichwaldstraße) eq lc(reichwaldstrasse)  → failed
fc(Reichwaldstraße) eq fc(reichwaldstrasse)  → matched
Reichwaldstraße =~ m/reichwaldstrasse/i      → matched

lc(REICHWALDSTRASSE) eq lc(reichwaldstrasse)  → matched
fc(REICHWALDSTRASSE) eq fc(reichwaldstrasse)  → matched
REICHWALDSTRASSE =~ m/reichwaldstrasse/i      → matched

The match operator isn't useful for sort though, since you can only tell if the strings are the same.

Things to remember

  • Case-folding is more complicated than merely lowercasing.
  • The fc does proper case folding according to the Unicode standard.
  • The \F case fold operator does full case folding in double-quoted contexts.

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit

Use __SUB__ to get a reference to the current subroutine

What if you want to write a recursive subroutine but you don’t know the name of the current subroutine? Since Perl is a dynamic language and code references are first class objects, you might not know the name of the code reference, if it even has a name. Perl 5.16 introduces __SUB__ as a special sequence to return a reference to the current subroutine. You could almost do the same thing without the new feature, but each of those have drawbacks you might want to avoid.

Although __SUB__ looks like __FILE__, __LINE__, and __PACKAGE__, each of which are compile-time directives, the __SUB__ happens at run time so you can use it with subroutines you define later.

First, consider how you’d try to do this without the __SUB__ feature. You could declare a variable to hold a subroutine reference then in a later statement define the subroutine. Since you’ve already declared the variable, you can use it in the definition. Perl won’t de-reference it until you actually run the subroutine, so it doesn’t matter that it’s not a reference yet:

use v5.10;

my $sub;

$sub = sub {
	state $count = 10;
	say $count;
	return if --$count < 0;
	$sub->();
	};

$sub->();

Your output is a countdown:

10
9
8
7
6
5
4
3
2
1
0

To do that, there are two requirements: the code reference must be stored in a variable, and the variable must already be defined. That’s not always convenient. Not only that, your anonymous subroutine contains a reference to itself, so you’d either have to play games with weak references or just let the reference live forever. Neither of those are attractive.

Rafaël Garcia-Suarez solved these problems by creating Sub::Current to give you a ROUTINE function that returns a reference to the current subroutine, even if it is a named subroutine:

use v5.10;
use Sub::Current;

sub countdown {
	state $count = 10;
	say $count;
	return if --$count < 0;
	ROUTINE->();
	};

countdown();

You might want to define these code references as a single statement, even you don’t need to. This is useful for inline subroutines where you want to define the code reference in the parameter list:

use v5.10;
use Sub::Current;

sub run { $_[0]->() };

run( sub {
		state $count = 10;
		say $count;
		return if --$count < 0;
		ROUTINE->();
		}
	);

You may want to define the subroutine in one statement as a return value:

use v5.10;
use Sub::Current;

sub factory {
	my $start = shift;
	sub {
		state $count = $start;
		say $count;
		return if --$count < 0;
		ROUTINE->();
		}
	};

factory(4)->();

Using this module has the disadvantage of a CPAN dependency, although a very light one because it’s self contained. There’s another module, Devel::Caller, from Richard Clamp that can can get a code reference from any level in the call stack, including the current level:

use v5.10;
use Devel::Caller qw(caller_cv);

sub factory {
	my $start = shift;
	sub {
		state $count = $start;
		say $count;
		return if --$count < 0;
		caller_cv(0)->();
		}
	};

factory(7)->();

Perl 5.16 lets you do the same thing without the CPAN module:

use v5.15.6;  # until v5.16 is released

sub factory {
	my $start = shift;
	sub {
		state $count = $start;
		say $count;
		return if --$count < 0;
		__SUB__->();
		}
	};

As with many new features added since Perl v5.10, you can enable __SUB__ with a use VERSION statement,
as you see in the previous example, or with the feature pragma and the current_sub import:

use feature qw(say state current_sub);

sub factory {
	my $start = shift;
	sub {
		state $count = $start;
		say $count;
		return if --$count < 0;
		__SUB__->();
		}
	};

factory(7)->();

Things to remember

  • Perl v5.16 provides the __SUB__ directive to return a reference to the currently running subroutine
  • Import this new feature by requiring the Perl version or through
    the feature pragma

  • Prior to Perl v5.16, you can do this the same thing with Sub::Current

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit

Understand the order of operations in double quoted contexts

Perl’s powerful string manipulation tools include case-shifting operators that change the parts of a double-quoted string. There are many other things that happen in a double-quoted string too, so you need to know where these operators fit in with each other.

A double-quoted string has three features:

  • Variable interpolation
  • Escaped and logical characters
  • Case shift operators

You might have missed this because the documentation doesn’t emphasize it. There is a single sentence in perlop, but in relation to the regular expression operators and the \Q:

For double-quoted strings, the quoting from \Q is applied after interpolation and escapes are processed.

If you don’t pay attention to the order of these operations, you’ll get results that you might not expect. The problem is that the order of operations isn’t the same in all double-quoted contexts.

In strings, the order of operations is the same as listed earlier:

  • Variable interpolation
  • Escaped and logical characters
  • Case shift operators

Variable interpolation

You already know about variable interpolation. This is one of Perl’s greatest features, and the one I miss the most when I have to use a different language:

my $cat = 'Roscoe';
my $string = "Buster $cat Mimi";

In a double quoted context, Perl substitutes the value of $cat. You end up with Buster Roscoe Mimi.

Case-shift operators

The case-shift operators change parts of a double-quoted string. Although we call them “case shift”, not all of them change the case.

Operator Effect Function equivalent
\U Uppercase everything following uc
\u Uppercase the next character ucfirst
\L Lowercase everything following lc
\l Lowercase the next character lcfirst
\F (v5.16) Lowercase everything following fc
\Q Quote metacharacters quotemeta
\E Stop whatever you were doing

The \F and fc are new for the yet unreleased Perl v5.16. Those will show up in a different Item. Notice there’s no \f for a fcfirst. That double-quoted sequence already means “form feed”, the instruction to printers to stop the current page and start a new page.

Look at some examples using these in a double-quoted string:

% perl -e 'print "\ubuster\n"'
Buster
% perl -e 'print "\LBUSTER\n"'
buster
% perl -e 'print "\Ubuster\n"'
BUSTER
% perl -e 'print "\Ubus\Eter\n"'
BUSter
% perl -e 'print "\LBUST\EER\n"'
bustER
% perl -e 'print "\QP*rl\n"'
P\*rl\

That last one is a bit odd. It looks like it ends with a \. It doesn’t really end like that because there’s a newline that \Q quoted:

% perl -e 'print "\QP*rl\n"' | hexdump -C
00000000  50 5c 2a 72 6c 5c 0a                 |P\*rl\.|
00000007

Perl handled the “\n” before it handled the \Q, but the meta-character quoter thinks the newline is a special character so it escapes it. An escaped newline is just a newline, though.

Now, combine these with variable interpolation. Perl handles the variables first then does the case shifting:

use 5.14.1;

my $cat = 'Buster';

say "Roscoe $cat Mimi";
say "Roscoe \U$cat Mimi";
say "Roscoe \U$cat\E Mimi";

The results are probably not surprising. The first line is just interpolation, the second line uppercases everything from \U to the end, and the third line uppercases only the parts between the \U and the \E:

Roscoe Buster Mimi
Roscoe BUSTER MIMI
Roscoe BUSTER Mimi

If the case shift happens after interpolation, you might think that you could interpolate a case shift:

use 5.14.1;

my $cat = '\UBuster'; # no case shift in a single quote!

say "Roscoe $cat Mimi";

That doesn’t work though. The intended case shift operator shows up as literal characters because Perl doesn’t do double processing:

Roscoe \UBuster Mimi

A \U inside the string doesn’t bother the escaped characters because Perl has already processed those:

use 5.14.1;

my $cat = 'Buster';

say "Roscoe \U$cat\a\n Mimi";

The “\n” is still a newline and the “\a” is still the bell, and everything after the \U is uppercased (if it has an uppercase equivalent).

That seems simple enough. It’s variable interpolation followed by character escapes followed by case shifting. But this is Perl, so it can’t be that easy.

Regular expression double quoting

The regular expression operators (qr, m//, and s///) handle the double quote operations differently. From perlop:

For the pattern of regex operators (qr//, m// and s///), the quoting from \Q is applied after interpolation is processed, but before escapes are processed.

Now the order is of operations is:

  • Variable interpolation
  • Case shift operators
  • Escaped and logical characters

You can see this when you print the stringified forms of the patterns:

% perl -le 'print qr/\Q\n/'
(?-xism:\\n)
% perl -le 'print qr/\U\n/'
(?-xism:\N)

You probably expect all of these to match, but not all of them do:

% perl -le 'print "\n" =~ qr/\n/ ? "Yes" : "No"'
Yes
% perl -le 'print "\n" =~ qr/\Q\n/ ? "Yes" : "No"'
No
% perl -le 'print "\n" =~ qr/\U\n/ ? "Yes" : "No"'
No
% perl -le 'print "\n" =~ qr/\l\n/ ? "Yes" : "No"'
Yes
% perl -le 'print "\n" =~ qr/\L\n/ ? "Yes" : "No"'
Yes

The last two times are curious. The \l and \L leave the n as a lowercase n so in the last step, the \n is still a newline. Those two tests still match.

This means that you can construct a string and a pattern with the same sequence of characters, but they might not match:

% perl -le 'print "\Q\n" =~ qr/\Q\n/ ? "Yes" : "No"'
No
% perl -le 'print "\U\n" =~ qr/\U\n/ ? "Yes" : "No"'
No
% perl -le 'print "\L\n" =~ qr/\L\n/ ? "Yes" : "No"'
Yes

It’s even worse. What does the \N mean? It depends on the Perl version:

% perl5.10.1 -le 'print "\n" =~ qr/\N/ ? "Yes" : "No"'
Missing braces on \N{} in regex; marked by <-- HERE in m/\N <-- HERE / at -e line 1.
% perl5.12.1 -le 'print "\n" =~ qr/\N/ ? "Yes" : "No"'
No
% perl5.14.1 -le 'print "\n" =~ qr/\N/ ? "Yes" : "No"'
No

Perl v5.12 added \N as "not a newline" to replace the . no matter which default regex switches you have. That's why Perl v5.10 thinks you have an incomplete \N{CHARNAME}. The others match a newline because the case shift happens in the middle of the process:

% perl5.10.1 -le 'print "\n" =~ qr/\L\N/ ? "Yes" : "No"'
Yes
% perl5.14.1 -le 'print "\n" =~ qr/\L\N/ ? "Yes" : "No"'
Yes
% perl5.8.9 -le 'print "\n" =~ qr/\L\N/ ? "Yes" : "No"'
Missing braces on \N{} at -e line 1, near "\L"
Execution of -e aborted due to compilation errors.

With the \N{CHARNAME} syntax, you can match characters by their name in the Universal Character Set. Here you match an uppercase A:

% perl -Mcharnames=:full -le 'print "A" =~ qr/\N{LATIN CAPITAL LETTER A}/ ? "Yes" : "No"'
Yes

If you put a \L in front of that, you might think it would match the lowercase version of the named letter. There's no such luck because the \L affects the pattern before the \N{CHARNAME}:

% perl -Mcharnames=:full -le 'print "A" =~ qr/\L\N{LATIN CAPITAL LETTER A}/ ? "Yes" : "No"'
No
% perl -Mcharnames=:full -le 'print "a" =~ qr/\L\N{LATIN CAPITAL LETTER A}/ ? "Yes" : "No"'
No

The \N turns into a newline and the braces are now for a quantifier with a non-number in it:

% perl -Mcharnames=:full -le 'print qr/\L\N{LATIN CAPITAL LETTER A}/'
(?-xism:\n{u+41})

You might think that this would match anything since that should probably turn into \n{0} just like the values in the array index turn into integers. The perlre section on "Quantifiers" don't say what should happen, but if it's not a number, the braces become literals. Here's a simple demonstration that those braces are literals:

% perl -le 'print "\n{a}" =~ qr/\n{a}/ ? "Yes" : "No"'
Yes

Here's the pattern you created before, and that you want to match now:

% perl -Mcharnames=:full -le 'print qr/\L\N{LATIN CAPITAL LETTER A}/'
(?^u:\n{u+41})

It doesn't match a lowercase a:

$ perl -Mcharnames=:full -le 'print "a" =~ qr/\L\N{LATIN CAPITAL LETTER A}/ ? "Yes" : "No"'
No

It doesn't match a newline either. The pattern in \n{u+41} and that's not a quantifier. There are some characters after the \n, so the target string doesn't have enough characters to match:

% perl -Mcharnames=:full -le 'print "\n" =~ qr/\L\N{LATIN CAPITAL LETTER A}/ ? "Yes" : "No"'
No

Using the regular expression text doesn't work either, which you might miss on the first pass:

% perl -Mcharnames=:full -le 'print "\n{u+41}" =~ qr/\L\N{LATIN CAPITAL LETTER A}/ ? "Yes" : "No"'
No

Of course! That + is a quantifier, so it isn't a literal character that should show up in the string. So this works:

% perl -Mcharnames=:full -le 'print "\n{u41}" =~ qr/\L\N{LATIN CAPITAL LETTER A}/ ? "Yes" : "No"'
Yes

This works too because you can have one or more of u:

% perl -Mcharnames=:full -le 'print "\n{uuuuu41}" =~ qr/\L\N{LATIN CAPITAL LETTER A}/ ? "Yes" : "No"'
Yes

If you don't want the \L to extend into character name sequence, you can use the \E to limit its effect:

% perl -Mcharnames=:full -le 'print "bar" =~ qr/\LB\E\N{LATIN SMALL LETTER A}r/ ? "Yes" : "No"'
Yes

Things to remember

  • The double quote string constructor handles variable interpolation, special characters, and case shift operators in that order.
  • The regular expression operators handles variable interpolation, case shift operators, and special characters in that order.
  • Double-quoted interpolation in a match operator happens before regular expression compilation.
  • The min-max quantifier is only a quantifier if you give it numbers. Otherwise, it's literal characters.

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit

The Effective Perler in 2012 and beyond

Two years ago, Josh McAdams and I started The Effective Perler as an extension of the second edition of Effective Perl Programming. Since then, roughly once a week, we added one meaty Item a week. Last month, we published our 100th Item. With the 120 Items in the book, that’s a lot of items.

I have a new plan for 2012 onward. It’s much harder to find topics now and it takes much longer to research and write them. I’ve exhausted all of the advice I have and all of the easy topics. When I think I have a good idea, I now know to search everything else I’ve already written. More than a couple of times I thought I had the next week’s idea, but it was already in the book or on the website.

The search for content has another problem: I don’t want to add Items for anything that’s already been written—not just by me, but by anyone. I don’t want to repeat content unless I have a different take on it and I can illuminate something new.

I’m not going to do weekly big Items anymore. I’ll try one a month, I think. I’ll see how it goes.

There’s still roomer for shorter content, such as the short demonstrations of new features, and ideas that don’t have have 1,000 words in them. There are also many interesting, although esoteric, features that I would probably never recommend for production code.

This doesn’t mean that I’m going to write less, though. If I do less for The Effective Perler, I can do more somewhere else. There another edition of Intermediate Perl that needs attention, the Learning Perl website, or a few other things.

Many people have asked for Items about specific modules, but that’s not really the idea of The Effective Perler. We want to teach people about core Perl and thinking in Perl. Modules are essentially all the same—they give you an interface and you do what the interface tells you to do. For the most part, they are just subroutines or method calls. There’s not much interesting there. You already know how to do that. The much more interesting advice is researching modules, but we put that in the book already.

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit

Define grammars in regular expressions

[ This is the 100th Item we've shared with you in the two years this blog has been around. We deserve a holiday and we're taking it, so read us next year! Happy Holidays.]

Perl 5.10 added rudimentary grammar support in its regular expressions. You could define many subpatterns directly in your pattern, use them to define larger subpatterns, and, finally, when you have everything in place, let Perl do the work.

There are other ways, some more powerful, that let you do the same thing. This Item is not about those, however, but you can read about Regex::Grammars, Parse::RecDescent on your own. Also, you’re not going to get much of a recommendation of which one you should use for your task. We don’t know your situation.

To understand this new syntax, you have to study it from the ground up. It’s not simple, and the terse documentation in perlre doesn’t do much to help.

Referencing a subpattern

The first part you need is the ability to call a named part of the pattern (to label a subpattern, see Item 31. Use named captures to label matches). To re-match a labeled subpattern, you use:

(?&NAME)

You can use that syntax to rerun a subpattern later:

use v5.10;

my $pattern = qr/
	(?<cat>Buster|Mimi)
	\s+
	(?&cat)
	/x;

foreach ( 'Buster Mimi', 'Mimi Buster', 'Buster', 'Buster Buster' ) {
	say "$_ ", m/$pattern/p ? "matched" : 'nope!';
	}

The labeled subpattern has an alternation where either cat name can match. When you reference it again, you re-run the alternation and you can match either cat name again:

Buster Mimi matched
Mimi Buster matched
Buster nope!
Buster Buster matched

This is not that same thing as matching the same text a labeled capture group already matched. That’s the \k<NAME>:

\k<NAME>

This pattern is a different beast. Whichever cat name matches first also has to match second:

use v5.10;

my $pattern = qr/
	(?<cat>Buster|Mimi)
	\s+
	\k<cat>
	/x;

foreach ( 'Buster Mimi', 'Mimi Buster', 'Buster', 'Buster Buster' ) {
	say "$_ ", m/$pattern/p ? "matched" : 'nope!';
	}

Now only one of the strings matches because only one string repeats a cat’s name:

Buster Mimi nope!
Mimi Buster nope!
Buster nope!
Buster Buster matched

Although you won’t see it here, the (?&NAME) syntax is the trick to matching a recursive pattern since the reference can appear inside the pattern it references.

Conditional match

The second building block you need starts with a conditional submatch:

(?(condition)yes-pattern|no-pattern)

That condition can be many things, most of which won’t appear in this Item. Although you see the | character, but this isn’t an alteration. It’s like an alternation because the | separates distinct subpatterns, but unlike an alternation because this will only ever try one of the subpatterns and you only get two subpatterns.

The simplest condition is just an ordinal number, which is true only if that capture group matched. Here’s a pattern that has two capture groups:

use v5.10;

my $pattern = qr/
	(?:           # parens for grouping
		(B)     # $1
		|         # alternation
		(M)     # $2
	)
	(?(1)uster|imi) # conditional match
	/x;

foreach ( qw(Mimi Buster Muster Bimi Roscoe) ) {
	say "$_ ", m/$pattern/p ? "matched ${^MATCH}" : 'nope!';
	}

In this pattern, if the (B) matches, the conditional uses uster from the conditional. Otherwise, it uses imi. However, the only thing that can match besides a (B) is the other part of the alteration, the (M). The output shows that only Mimi or Buster matches:

Mimi matched Mimi
Buster matched Buster
Muster nope!
Bimi nope!
Roscoe nope!

You get the same results if you use (2) as the condition and re-arrange the order of the patterns:

my $pattern = qr/
	(?:(B)|(M))
	(?(2)imi|uster)
	/x;

Putting it together

The condition can also be the literal (DEFINE). In that case, Perl only allows a yes-branch. And, as its condition implies, it merely defines the patterns and does not execute them.

This means that you can create and label the subpatterns that you need, but not actually assert that any of them match the string. The definitions are just there. This pattern defines and labels three subpatterns then uses none of them:

use v5.10;

my $pattern = qr/
	(?(DEFINE)
		(?<cat>Buster)
		(?<dog>Addie)
		(?<bird>Poppy)
	)
	Mimi
	/x;

foreach ( 'Buster Mimi', 'Roscoe', 'Buster', 'Mimi' ) {
	say "$_ ", m/$pattern/ ? "matched" : 'nope!';
	}

It’s as if the DEFINE bit is not even there:

Buster Mimi matched
Roscoe nope!
Buster nope!
Mimi matched

Outside the (DEFINE), you can reference any of the subpatterns that you created:

use v5.10;

my $pattern = qr/
	(?(DEFINE)
		(?<cat>Buster)
		(?<dog>Addie)
		(?<bird>Poppy)
	)
	(?&cat)
	/x;

foreach ( 'Buster Mimi', 'Roscoe', 'Buster', 'Mimi' ) {
	say "$_ ", m/$pattern/ ? "matched" : 'nope!';
	}

Now Buster matches because you reference that defined subpattern:

Buster Mimi matched
Roscoe nope!
Buster matched
Mimi nope!

Now it’s time for the grammar. Inside the (DEFINE), you can reference subpatterns you haven’t defined yet, and your subpatterns can get arbitrarily complex:

use v5.10;

my $pattern = qr/
	(?(DEFINE)
		(?<male> Buster | Roscoe )
		(?<female> Mimi | Juliet )
		(?<cat> (?&male) | (?&female) )
		(?<dog>Addie)
		(?<bird>Poppy)
	)
	(?&cat)
	/x;

foreach ( 'Addie', 'Roscoe', 'Buster', 'Mimi' ) {
	say "$_ ", m/$pattern/ ? "matched" : 'nope!';
	}

Even though the cat names are in two different subpatterns, the cat subpattern unifies them so all the cat names match:

Addie nope!
Roscoe matched
Buster matched
Mimi matched

You should now be able to understand this regular expression from Tom Christainsen (appearing Stackoverflow). You might have to pick it apart, but you know how all the parts fit together to match the Internet Message Format defined in RFC 5322:

$rfc5322 = qr{

   (?(DEFINE)

     (?<address>         (?&mailbox) | (?&group))
     (?<mailbox>         (?&name_addr) | (?&addr_spec))
     (?<name_addr>       (?&display_name)? (?&angle_addr))
     (?<angle_addr>      (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
     (?<group>           (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ; (?&CFWS)?)
     (?<display_name>    (?&phrase))
     (?<mailbox_list>    (?&mailbox) (?: , (?&mailbox))*)

     (?<addr_spec>       (?&local_part) \@ (?&domain))
     (?<local_part>      (?&dot_atom) | (?&quoted_string))
     (?<domain>          (?&dot_atom) | (?&domain_literal))
     (?<domain_literal>  (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
                                   \] (?&CFWS)?)
     (?<dcontent>        (?&dtext) | (?&quoted_pair))
     (?<dtext>           (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])

     (?<atext>           (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+-/=?^_`{|}~])
     (?<atom>            (?&CFWS)? (?&atext)+ (?&CFWS)?)
     (?<dot_atom>        (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
     (?<dot_atom_text>   (?&atext)+ (?: \. (?&atext)+)*)

     (?<text>            [\x01-\x09\x0b\x0c\x0e-\x7f])
     (?<quoted_pair>     \\ (?&text))

     (?<qtext>           (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
     (?<qcontent>        (?&qtext) | (?&quoted_pair))
     (?<quoted_string>   (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
                          (?&FWS)? (?&DQUOTE) (?&CFWS)?)

     (?<word>            (?&atom) | (?&quoted_string))
     (?<phrase>          (?&word)+)

     # Folding white space
     (?<FWS>             (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
     (?<ctext>           (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
     (?<ccontent>        (?&ctext) | (?&quoted_pair) | (?&comment))
     (?<comment>         \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
     (?<CFWS>            (?: (?&FWS)? (?&comment))*
                         (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))

     # No whitespace control
     (?<NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])

     (?<ALPHA>           [A-Za-z])
     (?<DIGIT>           [0-9])
     (?<CRLF>            \x0d \x0a)
     (?<DQUOTE>          ")
     (?<WSP>             [\x20\x09])
   )

   (?&address)

}x;

If that’s not clever enough for you, try Tom’s use of (DEFINE) to properly parse HTML.

Things to remember

  • You can reference a named subpattern with (?&NAME)
  • You can choose a subpattern with a condition (?(condition)yes-pattern|no-pattern)
  • You can define and label subpatterns for later use with (DEFINE)

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit

Create your own dualvars

Perl’s basic data type is the scalar, which takes its name from the mathematical term for “single item”. However, the scalar is really two things. You probably know that a scalar can be either a number or a string, or a number that looks the same as its string, or a string that can be a number. What you probably don’t know is that a scalar can be two separate and unrelated values at the same time, making it a dualvar.

You’re already using a dualvar without knowing it. The $! variable, which holds the value of the last system error, is most often used in its string form:

open my $fh, '>', $filename or die "Error: $!";

If something goes wrong with that open, you’ll get an error such as these:

No such file or directory
Permission denied

Both of those error messages have numbers associated with them. You can output their numeric value

open my $fh, '>', $filename or die $! + 0;

Now the errors are numbers, which correspond to the errno value for the system call:

2
13

These numbers are keys in the %!, and the value for that key is true if that was the last system error.

Perl, on its own, doesn’t give you a way to create this sort of variable yourself. When you assign a new value to a scalar, whether string or number, Perl clears the previous values it has. When you use a number as a string, Perl converts it to a string, and the same the other way around.

You can watch this with Devel::Peek. Here’s a program that sets a string value:

use Devel::Peek;

my $value = 'abc';

Dump( $value );

In the scalar record, the POK flag is set, indicating the variable has a string form, and the PV slot has a value (see perlguts for more details):

SV = PV(0x100801070) at 0x100827810
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x100202870 "abc"\0
  CUR = 3
  LEN = 16

If you set a numeric value, the flags are different:

use Devel::Peek;

my $value = 137;

Dump( $value );

Now there’s an IOK flag, and one of the numeric slots, in this case the IV, is set:

SV = IV(0x100827800) at 0x100827810
  REFCNT = 1
  FLAGS = (PADMY,IOK,pIOK)
  IV = 137

However, if you take the numeric value and use it in a string context, Perl also creates a string version. Now the scalar has both the IOK and POK flags, and both the IV and PV slots have values:

137
SV = PVIV(0x100809208) at 0x100827840
  REFCNT = 1
  FLAGS = (PADMY,IOK,POK,pIOK,pPOK)
  IV = 137
  PV = 0x100202870 "137"\0
  CUR = 3
  LEN = 16

If you change the variable, some of those flags disappear, even though the values don’t necessarily disappear:

use v5.10;
use Devel::Peek;

my $value = 137;

$value = 'Buster';

Dump( $value );

Now the scalar’s value should be just Buster, but the IV slot still has the old 137. However, the IOK flag is gone:

SV = PVIV(0x100809208) at 0x100827840
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  IV = 137
  PV = 0x100202870 "Buster"\0
  CUR = 6
  LEN = 16

When you set the new value, and Perl hadn’t used it in a numeric context yet, there was no need to go through the work to translate the string value to the number value. Instead, it just unset the flag that denotes its okay to use the value as a number. You have to use it as a number to again:

use v5.10;
use Devel::Peek;

my $value = 137;

$value = 'Buster';

say $value + 0;

Dump( $value );

Now Perl converts it to a number and sets the numeric flags again:

0
SV = PVNV(0x100801e30) at 0x100827840
  REFCNT = 1
  FLAGS = (PADMY,POK,pIOK,pNOK,pPOK)
  IV = 0
  NV = 0
  PV = 0x100202870 "Buster"\0
  CUR = 6
  LEN = 16

That’s how it works if you go through the Perl interface, but if you play it a scalar through the XS interface, you can set whatever flags and values that you like. That’s exactly what Scalar::Util‘s dualvar does that for you:

use v5.10;
use Devel::Peek;
use Scalar::Util qw(dualvar);

my $value = dualvar 137, 'Buster';

Dump( $value );

say "$value";
say $value + 0;

Now you have a scalar which has unrelated numeric and string values, and the flags for both values are set:

SV = PVNV(0x100802010) at 0x1008277c8
  REFCNT = 1
  FLAGS = (PADMY,IOK,POK,pIOK,pPOK)
  IV = 137
  NV = 0
  PV = 0x100202870 "Buster"\0
  CUR = 6
  LEN = 16
Buster
137

Things to remember

  • Scalars can have both numeric and string values at the same time
  • Those two values can be unrelated
  • You can create your own dualvar with Scalar::Util

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit

Make disposable web servers for testing

If you project depends on a interaction with a web server, especially a remote one, you have some challenges with testing that portion. Even if you can get it working for you, when you distribute your code, someone else might not be able to reach your server for testing. Instead of relying on an external server, you can use a local server that you write especially for your test suite.

This problem has a couple tricky parts. To run a test server from your own test suite, your server needs to bind to a port that’s not already in use. When it has a port, it needs to communicate that to your test script.

Another problem, which you won’t consider for this Item, involves the configurability of your program so you can change the hostname and port during the test. This Item assumes you’ve taken are of that bit.

The Test::Fake::HTTPD module can create a web server directly from your test script. This example creates a web server that returns the same JSON response for every request:

use Test::More;
use Test::Fake::HTTPD;

use Mojo::UserAgent;

my $httpd = run_http_server {
	my $request = shift;

	return [
		200,
		[ 'Content-Type' => 'application/json' ],
		[ '{ "cat": "Buster" }' ]
		];
	};

ok( defined $httpd, 'Got a web server' );

diag( sprintf "You can connect to your server at %s.\n", $httpd->host_port );

my $response = Mojo::UserAgent->new->get(
	$httpd->endpoint
	)->res;

diag( $response->to_string );
is( $response->json->{cat}, 'Buster', 'Cat is Buster' );

done_testing();

The test output shows the a message telling you the host and port, as well as the response:

ok 1 - Got a web server
# You can connect to your server at 127.0.0.1:50602.
# HTTP/1.1 200 OK
# Content-Type: application/json
# Date: Wed, 07 Dec 2011 08:44:40 GMT
# Content-Length: 19
# Server: libwww-perl-daemon/6.00
#
# { "cat": "Buster" }
ok 2 - Cat is Buster
1..2

When the test script ends (or the web server variable goes out of scope, so there’s nothing for you to cleanup. It’s a disposable web server.

In your web server, you can do anything that you like. It doesn’t have to implement everything, or even close to everything, that the production server does. It just has to return responses that you can use in your tests. That means that you get to control not only the success, but also the failures.

That Server line gives you a hint about what created the $request object—it’s LWP behind the scenes, so it’s HTTP::Request. You can get the requested path with the uri method and decide what to do:

use strict;
use warnings;

use Test::More;
use Test::Fake::HTTPD;

use Mojo::UserAgent;
use URI;

my $httpd = run_http_server {
	my $request = shift;

	my $uri = $request->uri;

	return do {
		if( $uri->path eq '/' ) {
			[
				200,
				[ 'Content-Type' => 'text/plain' ],
				[ "Ask about our cats!" ],
			]
			}
		elsif( $uri->path eq '/cats' ) {
			[
				200,
				[ 'Content-Type' => 'application/json' ],
				[ '{ "cat": "Buster" }' ],
			]
			}
		elsif( $uri->path eq '/dogs' ) {
			[
				408,
				[ 'Content-Type' => 'text/plain' ],
				[ "We don't walk dogs" ],
			]
			}
		else {
			[
				404,
				[ 'Content-Type' => 'text/plain' ],
				[ "Not Found" ],
			]
			}
		}
	};

ok( defined $httpd, 'Got a web server' );
diag( sprintf "You can connect to your server at %s.\n", $httpd->host_port );

my $uri = URI->new( $httpd->endpoint );
isa_ok( $uri, 'URI' );

subtest '/' => sub {
	plan tests => 3;

	my $this = $uri->clone;
	isa_ok( $this, 'URI' );
	$this->path( '/' );

	my $response = Mojo::UserAgent->new->get( $this )->res;

	is( $response->headers->content_type,
		'text/plain', 'Top level is plain text' );
	like( $response->body , qr/cats/, 'Top level has cats' );
	};

subtest '/cats' => sub {
	plan tests => 2;

	my $this = $uri->clone;
	isa_ok( $this, 'URI' );
	$this->path( '/cats' );

	my $response = Mojo::UserAgent->new->get( $this )->res;

	is( $response->headers->content_type,
		'application/json', '/cats returns JSON' );
	};

subtest '/not_there' => sub {
	plan tests => 2;

	my $this = $uri->clone;
	isa_ok( $this, 'URI' );
	$this->path( '/not_there' );

	my $response = Mojo::UserAgent->new->get( $this )->res;

	is( $response->code,
		'404', '/not_there returns 404' );
	};

done_testing();

With subtests organized around accesses to paths, the TAP isn’t so hard to read (although you probably won’t have to look at it yourself):

ok 1 - Got a web server
# You can connect to your server at 127.0.0.1:50498.
ok 2 - The object isa URI
    1..3
    ok 1 - The object isa URI
    ok 2 - Top level is plain text
    ok 3 - Top level has cats
ok 3 - /
    1..2
    ok 1 - The object isa URI
    ok 2 - /cats returns JSON
ok 4 - /cats
    1..2
    ok 1 - The object isa URI
    ok 2 - /not_there returns 404
ok 5 - /not_there
1..5

If you need to use the same test webserver in more than one test script, you move it into its own file.

use strict;
use warnings;

use Test::More;

use Mojo::UserAgent;
use URI;

require 'server.pl';
my $httpd = get_http_server();

ok( defined $httpd, 'Got a web server' );
diag( sprintf "You can connect to your server at %s.\n", $httpd->host_port );

my $uri = URI->new( $httpd->endpoint );
isa_ok( $uri, 'URI' );

...

The server.pl file wraps the call to run_http_server. But, if you are going to do that, you might as well skip the convenience method and set up your own object. You can change the timeout value, for instance:

use Test::Fake::HTTPD;

sub get_http_server {
	my $httpd = Test::Fake::HTTPD->new(
		timeout => 30,
		);

	$httpd->run( sub {
		my $request = shift;
		...;
		} );

	$httpd;
	}

Different test scripts each get their own test web server. If you’re running several tests scripts in parallel, you’ll start several servers at the same time, which not be that kind to your system or to the other people using it. If you don’t like that, you could set up a single server at the start of your test run, share it with all tests, and shut down everything at the end, although you won’t see that in this Item.

Things to remember

  • Test web interactions locally
  • Use Test::Fake::HTTPD to create cheap, disposable web servers

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit

Know split’s special cases

Perl’s split has some special cases and some perhaps surprising cases. The empty pattern, zero width match, the special argument ' ', and the /^/ act differently than you might expect from the general rule.

The empty pattern, //

The empty pattern is a special case that’s designed to give you a list of characters. This pattern specifically has nothing in it and is different than a pattern that matches an empty string (that’s next). For this, split returns a list of characters:

use utf8;

use Data::Printer;
my @characters = split //, 'Büster';

p( @characters );

The output shows a list of characters:

[
    [0] "B",
    [1] "ü",
    [2] "s",
    [3] "t",
    [4] "e",
    [5] "r"
]

This is specifically characters, not grapheme clusters. Depending on the normalization of your source code or input, you can get different results:

use utf8;

use Data::Printer;
use Unicode::Normalize qw(NFD);

my @characters = split //, NFD( 'Büster' );

p( @characters );

Now the grapheme cluster ü is actually two characters, the u (U+0075 ʟᴀᴛɪɴ ꜱᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ᴜ) and the ¨ (U+0308 ᴄᴏᴍʙɪɴɪɴɢ ᴅɪᴀᴇʀᴇꜱɪꜱ), instead of the single ü (U+00FC ʟᴀᴛɪɴ ꜱᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ᴜ ᴡɪᴛʜ ᴅɪᴀᴇʀᴇꜱɪꜱ):

[
    [0] "B",
    [1] "u",
    [2] "¨",
    [3] "s",
    [4] "t",
    [5] "e",
    [6] "r"
]

You can review grapheme clusters in Treat Unicode strings as grapheme clusters.

Matching the empty string

Successfully matching no characters isn’t really a special case, but people are sometimes surprised about it because it seems special. Your split pattern might match an empty string. This is different from the empty pattern because you actually have a pattern, even if it might match zero characters. This is also distinct from a pattern that doesn’t match:

use v5.10;

my $_ = 'Mimi';

say "Matched empty pattern" if //;
say "Matched optional whitespace" if /\s*/;
say "Matched zero width assertion" if /(?=\w+)/;
say "How did Buster match?" if /Buster/;

The first three of these patterns match successfully but matches zero characters, while the fourth fails:

Matched empty pattern
Matched optional whitespace
Matched zero width assertion

It’s easy to construct a pattern that will match zero characters even though it matches successfully. The ? (zero or one) and * (zero or more) quantifiers do that quite nicely. Zero width assertions, such as the boundaries and lookarounds, do that too. If the pattern can match zero characters successfully, Perl splits into characters:

use Data::Printer;
my @characters = split /\s*/, 'Buster';

p( @characters );
[
    [0] "B",
    [1] "u",
    [2] "s",
    [3] "t",
    [4] "e",
    [5] "r"
]

The pattern doesn’t have to match zero characters for all separators.

use Data::Printer;
my @characters = split /\s*/, 'Buster and Mimi';

p( @characters );

Notice that there are no spaces in @characters, since split matched those as separator characters:

[
    [0]  "B",
    [1]  "u",
    [2]  "s",
    [3]  "t",
    [4]  "e",
    [5]  "r",
    [6]  "a",
    [7]  "n",
    [8]  "d",
    [9]  "M",
    [10] "i",
    [11] "m",
    [12] "i"
]

The single space, ‘ ‘

The single space in quotes, single or double, is a special case. It splits on whitespace, but unlike the pattern that is a single space, the one in quotes discards empty leading fields:

use Data::Printer;
my @characters = split ' ', '  Buster and Mimi';

p( @characters );

You get just the non-whitespace with no empty fields:

[
    [0] "Buster",
    [1] "and",
    [2] "Mimi"
]

This behavior comes from awk:

#!/usr/bin/awk -f
BEGIN {
    string="  Buster Mimi Roscoe";
    search=" ";
    n=split(string,array," ");
    print("[");
    for (i=1;i<=n;i++) {
        printf("    [%d] \"%s\"\n",i,array[i]);
    }
    print("]");
    exit;
}

You end up with almost the same input, although the indices are one greater:

[
    [1] "Buster"
    [2] "Mimi"
    [3] "Roscoe"
]

Back in Perl, if you tried that with the normal match operator delimiters, you get a different result:

use Data::Printer;
my @characters = split / /, '  Buster and Mimi';

p( @characters );

This time you kept the empty leading fields:

[
    [0] "",
    [1] "",
    [2] "Buster",
    [3] "and",
    [4] "Mimi"
]

If you include the m in front of the quotes though, you lose the special magic:

use Data::Printer;
my @characters = split m' ', '  Buster and Mimi';

p( @characters );

The empty leading fields are back:

[
    [0] "",
    [1] "",
    [2] "Buster",
    [3] "and",
    [4] "Mimi"
]

Splitting lines

The special pattern of just the beginning-of-line anchor, even without the /m flag, breaks a multi-line string into lines:

use Data::Printer;

my $string = <<'HERE';
Line one
Line two
Line three
HERE

my @lines = split /^/, $string;

p( @lines );

Even without the /m you get separate lines:

[
    [0] "Line one
",
    [1] "Line two
",
    [2] "Line three
"
]

This only works if the pattern is exactly /^/. If you put anything else in the pattern, you don't get the special behavior, even if it's a zero width match:

...; # same as before

my @lines = split /^(?=Line)/, $string;  # Oops

p( @lines );

Now there's only one field:

[
    [0] "Line one
Line two
Line three
Line four
"
]

Things to remember

  • The empty pattern // splits on characters, but not grapheme clusters
  • A zero width successful match splits on characters too
  • The single space in quotes splits on whitespace and discards leading empty fields
  • The ^ anchor by itself splits into lines, even without the /m

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit