Understand the order of operations in double quoted contexts

Perl’s powerful string manipulation tools include case-shifting operators that change the parts of a double-quoted string. There are many other things that happen in a double-quoted string too, so you need to know where these operators fit in with each other.

A double-quoted string has three features:

  • Variable interpolation
  • Escaped and logical characters
  • Case shift operators

You might have missed this because the documentation doesn’t emphasize it. There is a single sentence in perlop, but in relation to the regular expression operators and the \Q:

For double-quoted strings, the quoting from \Q is applied after interpolation and escapes are processed.

If you don’t pay attention to the order of these operations, you’ll get results that you might not expect. The problem is that the order of operations isn’t the same in all double-quoted contexts.

In strings, the order of operations is the same as listed earlier:

  • Variable interpolation
  • Escaped and logical characters
  • Case shift operators

Variable interpolation

You already know about variable interpolation. This is one of Perl’s greatest features, and the one I miss the most when I have to use a different language:

my $cat = 'Roscoe';
my $string = "Buster $cat Mimi";

In a double quoted context, Perl substitutes the value of $cat. You end up with Buster Roscoe Mimi.

Case-shift operators

The case-shift operators change parts of a double-quoted string. Although we call them “case shift”, not all of them change the case.

Operator Effect Function equivalent
\U Uppercase everything following uc
\u Uppercase the next character ucfirst
\L Lowercase everything following lc
\l Lowercase the next character lcfirst
\F (v5.16) Lowercase everything following fc
\Q Quote metacharacters quotemeta
\E Stop whatever you were doing

The \F and fc are new for the yet unreleased Perl v5.16. Those will show up in a different Item. Notice there’s no \f for a fcfirst. That double-quoted sequence already means “form feed”, the instruction to printers to stop the current page and start a new page.

Look at some examples using these in a double-quoted string:

% perl -e 'print "\ubuster\n"'
Buster
% perl -e 'print "\LBUSTER\n"'
buster
% perl -e 'print "\Ubuster\n"'
BUSTER
% perl -e 'print "\Ubus\Eter\n"'
BUSter
% perl -e 'print "\LBUST\EER\n"'
bustER
% perl -e 'print "\QP*rl\n"'
P\*rl\

That last one is a bit odd. It looks like it ends with a \. It doesn’t really end like that because there’s a newline that \Q quoted:

% perl -e 'print "\QP*rl\n"' | hexdump -C
00000000  50 5c 2a 72 6c 5c 0a                 |P\*rl\.|
00000007

Perl handled the “\n” before it handled the \Q, but the meta-character quoter thinks the newline is a special character so it escapes it. An escaped newline is just a newline, though.

Now, combine these with variable interpolation. Perl handles the variables first then does the case shifting:

use 5.14.1;

my $cat = 'Buster';

say "Roscoe $cat Mimi";
say "Roscoe \U$cat Mimi";
say "Roscoe \U$cat\E Mimi";

The results are probably not surprising. The first line is just interpolation, the second line uppercases everything from \U to the end, and the third line uppercases only the parts between the \U and the \E:

Roscoe Buster Mimi
Roscoe BUSTER MIMI
Roscoe BUSTER Mimi

If the case shift happens after interpolation, you might think that you could interpolate a case shift:

use 5.14.1;

my $cat = '\UBuster'; # no case shift in a single quote!

say "Roscoe $cat Mimi";

That doesn’t work though. The intended case shift operator shows up as literal characters because Perl doesn’t do double processing:

Roscoe \UBuster Mimi

A \U inside the string doesn’t bother the escaped characters because Perl has already processed those:

use 5.14.1;

my $cat = 'Buster';

say "Roscoe \U$cat\a\n Mimi";

The “\n” is still a newline and the “\a” is still the bell, and everything after the \U is uppercased (if it has an uppercase equivalent).

That seems simple enough. It’s variable interpolation followed by character escapes followed by case shifting. But this is Perl, so it can’t be that easy.

Regular expression double quoting

The regular expression operators (qr, m//, and s///) handle the double quote operations differently. From perlop:

For the pattern of regex operators (qr//, m// and s///), the quoting from \Q is applied after interpolation is processed, but before escapes are processed.

Now the order is of operations is:

  • Variable interpolation
  • Case shift operators
  • Escaped and logical characters

You can see this when you print the stringified forms of the patterns:

% perl -le 'print qr/\Q\n/'
(?-xism:\\n)
% perl -le 'print qr/\U\n/'
(?-xism:\N)

You probably expect all of these to match, but not all of them do:

% perl -le 'print "\n" =~ qr/\n/ ? "Yes" : "No"'
Yes
% perl -le 'print "\n" =~ qr/\Q\n/ ? "Yes" : "No"'
No
% perl -le 'print "\n" =~ qr/\U\n/ ? "Yes" : "No"'
No
% perl -le 'print "\n" =~ qr/\l\n/ ? "Yes" : "No"'
Yes
% perl -le 'print "\n" =~ qr/\L\n/ ? "Yes" : "No"'
Yes

The last two times are curious. The \l and \L leave the n as a lowercase n so in the last step, the \n is still a newline. Those two tests still match.

This means that you can construct a string and a pattern with the same sequence of characters, but they might not match:

% perl -le 'print "\Q\n" =~ qr/\Q\n/ ? "Yes" : "No"'
No
% perl -le 'print "\U\n" =~ qr/\U\n/ ? "Yes" : "No"'
No
% perl -le 'print "\L\n" =~ qr/\L\n/ ? "Yes" : "No"'
Yes

It’s even worse. What does the \N mean? It depends on the Perl version:

% perl5.10.1 -le 'print "\n" =~ qr/\N/ ? "Yes" : "No"'
Missing braces on \N{} in regex; marked by <-- HERE in m/\N <-- HERE / at -e line 1.
% perl5.12.1 -le 'print "\n" =~ qr/\N/ ? "Yes" : "No"'
No
% perl5.14.1 -le 'print "\n" =~ qr/\N/ ? "Yes" : "No"'
No

Perl v5.12 added \N as "not a newline" to replace the . no matter which default regex switches you have. That's why Perl v5.10 thinks you have an incomplete \N{CHARNAME}. The others match a newline because the case shift happens in the middle of the process:

% perl5.10.1 -le 'print "\n" =~ qr/\L\N/ ? "Yes" : "No"'
Yes
% perl5.14.1 -le 'print "\n" =~ qr/\L\N/ ? "Yes" : "No"'
Yes
% perl5.8.9 -le 'print "\n" =~ qr/\L\N/ ? "Yes" : "No"'
Missing braces on \N{} at -e line 1, near "\L"
Execution of -e aborted due to compilation errors.

With the \N{CHARNAME} syntax, you can match characters by their name in the Universal Character Set. Here you match an uppercase A:

% perl -Mcharnames=:full -le 'print "A" =~ qr/\N{LATIN CAPITAL LETTER A}/ ? "Yes" : "No"'
Yes

If you put a \L in front of that, you might think it would match the lowercase version of the named letter. There's no such luck because the \L affects the pattern before the \N{CHARNAME}:

% perl -Mcharnames=:full -le 'print "A" =~ qr/\L\N{LATIN CAPITAL LETTER A}/ ? "Yes" : "No"'
No
% perl -Mcharnames=:full -le 'print "a" =~ qr/\L\N{LATIN CAPITAL LETTER A}/ ? "Yes" : "No"'
No

The \N turns into a newline and the braces are now for a quantifier with a non-number in it:

% perl -Mcharnames=:full -le 'print qr/\L\N{LATIN CAPITAL LETTER A}/'
(?-xism:\n{u+41})

You might think that this would match anything since that should probably turn into \n{0} just like the values in the array index turn into integers. The perlre section on "Quantifiers" don't say what should happen, but if it's not a number, the braces become literals. Here's a simple demonstration that those braces are literals:

% perl -le 'print "\n{a}" =~ qr/\n{a}/ ? "Yes" : "No"'
Yes

Here's the pattern you created before, and that you want to match now:

% perl -Mcharnames=:full -le 'print qr/\L\N{LATIN CAPITAL LETTER A}/'
(?^u:\n{u+41})

It doesn't match a lowercase a:

$ perl -Mcharnames=:full -le 'print "a" =~ qr/\L\N{LATIN CAPITAL LETTER A}/ ? "Yes" : "No"'
No

It doesn't match a newline either. The pattern in \n{u+41} and that's not a quantifier. There are some characters after the \n, so the target string doesn't have enough characters to match:

% perl -Mcharnames=:full -le 'print "\n" =~ qr/\L\N{LATIN CAPITAL LETTER A}/ ? "Yes" : "No"'
No

Using the regular expression text doesn't work either, which you might miss on the first pass:

% perl -Mcharnames=:full -le 'print "\n{u+41}" =~ qr/\L\N{LATIN CAPITAL LETTER A}/ ? "Yes" : "No"'
No

Of course! That + is a quantifier, so it isn't a literal character that should show up in the string. So this works:

% perl -Mcharnames=:full -le 'print "\n{u41}" =~ qr/\L\N{LATIN CAPITAL LETTER A}/ ? "Yes" : "No"'
Yes

This works too because you can have one or more of u:

% perl -Mcharnames=:full -le 'print "\n{uuuuu41}" =~ qr/\L\N{LATIN CAPITAL LETTER A}/ ? "Yes" : "No"'
Yes

If you don't want the \L to extend into character name sequence, you can use the \E to limit its effect:

% perl -Mcharnames=:full -le 'print "bar" =~ qr/\LB\E\N{LATIN SMALL LETTER A}r/ ? "Yes" : "No"'
Yes

Things to remember

  • The double quote string constructor handles variable interpolation, special characters, and case shift operators in that order.
  • The regular expression operators handles variable interpolation, case shift operators, and special characters in that order.
  • Double-quoted interpolation in a match operator happens before regular expression compilation.
  • The min-max quantifier is only a quantifier if you give it numbers. Otherwise, it's literal characters.

The Effective Perler in 2012 and beyond

Two years ago, Josh McAdams and I started The Effective Perler as an extension of the second edition of Effective Perl Programming. Since then, roughly once a week, we added one meaty Item a week. Last month, we published our 100th Item. With the 120 Items in the book, that’s a lot of items.

I have a new plan for 2012 onward. It’s much harder to find topics now and it takes much longer to research and write them. I’ve exhausted all of the advice I have and all of the easy topics. When I think I have a good idea, I now know to search everything else I’ve already written. More than a couple of times I thought I had the next week’s idea, but it was already in the book or on the website.

The search for content has another problem: I don’t want to add Items for anything that’s already been written—not just by me, but by anyone. I don’t want to repeat content unless I have a different take on it and I can illuminate something new.

I’m not going to do weekly big Items anymore. I’ll try one a month, I think. I’ll see how it goes.

There’s still roomer for shorter content, such as the short demonstrations of new features, and ideas that don’t have have 1,000 words in them. There are also many interesting, although esoteric, features that I would probably never recommend for production code.

This doesn’t mean that I’m going to write less, though. If I do less for The Effective Perler, I can do more somewhere else. There another edition of Intermediate Perl that needs attention, the Learning Perl website, or a few other things.

Many people have asked for Items about specific modules, but that’s not really the idea of The Effective Perler. We want to teach people about core Perl and thinking in Perl. Modules are essentially all the same—they give you an interface and you do what the interface tells you to do. For the most part, they are just subroutines or method calls. There’s not much interesting there. You already know how to do that. The much more interesting advice is researching modules, but we put that in the book already.

Define grammars in regular expressions

[ This is the 100th Item we've shared with you in the two years this blog has been around. We deserve a holiday and we're taking it, so read us next year! Happy Holidays.]

Perl 5.10 added rudimentary grammar support in its regular expressions. You could define many subpatterns directly in your pattern, use them to define larger subpatterns, and, finally, when you have everything in place, let Perl do the work.

There are other ways, some more powerful, that let you do the same thing. This Item is not about those, however, but you can read about Regex::Grammars, Parse::RecDescent on your own. Also, you’re not going to get much of a recommendation of which one you should use for your task. We don’t know your situation.

To understand this new syntax, you have to study it from the ground up. It’s not simple, and the terse documentation in perlre doesn’t do much to help.

Referencing a subpattern

The first part you need is the ability to call a named part of the pattern (to label a subpattern, see Item 31. Use named captures to label matches). To re-match a labeled subpattern, you use:

(?&NAME)

You can use that syntax to rerun a subpattern later:

use v5.10;

my $pattern = qr/
	(?<cat>Buster|Mimi)
	\s+
	(?&cat)
	/x;

foreach ( 'Buster Mimi', 'Mimi Buster', 'Buster', 'Buster Buster' ) {
	say "$_ ", m/$pattern/p ? "matched" : 'nope!';
	}

The labeled subpattern has an alternation where either cat name can match. When you reference it again, you re-run the alternation and you can match either cat name again:

Buster Mimi matched
Mimi Buster matched
Buster nope!
Buster Buster matched

This is not that same thing as matching the same text a labeled capture group already matched. That’s the \k<NAME>:

\k<NAME>

This pattern is a different beast. Whichever cat name matches first also has to match second:

use v5.10;

my $pattern = qr/
	(?<cat>Buster|Mimi)
	\s+
	\k<cat>
	/x;

foreach ( 'Buster Mimi', 'Mimi Buster', 'Buster', 'Buster Buster' ) {
	say "$_ ", m/$pattern/p ? "matched" : 'nope!';
	}

Now only one of the strings matches because only one string repeats a cat’s name:

Buster Mimi nope!
Mimi Buster nope!
Buster nope!
Buster Buster matched

Although you won’t see it here, the (?&NAME) syntax is the trick to matching a recursive pattern since the reference can appear inside the pattern it references.

Conditional match

The second building block you need starts with a conditional submatch:

(?(condition)yes-pattern|no-pattern)

That condition can be many things, most of which won’t appear in this Item. Although you see the | character, but this isn’t an alteration. It’s like an alternation because the | separates distinct subpatterns, but unlike an alternation because this will only ever try one of the subpatterns and you only get two subpatterns.

The simplest condition is just an ordinal number, which is true only if that capture group matched. Here’s a pattern that has two capture groups:

use v5.10;

my $pattern = qr/
	(?:           # parens for grouping
		(B)     # $1
		|         # alternation
		(M)     # $2
	)     
	(?(1)uster|imi) # conditional match
	/x;

foreach ( qw(Mimi Buster Muster Bimi Roscoe) ) {
	say "$_ ", m/$pattern/p ? "matched ${^MATCH}" : 'nope!';
	}

In this pattern, if the (B) matches, the conditional uses uster from the conditional. Otherwise, it uses imi. However, the only thing that can match besides a (B) is the other part of the alteration, the (M). The output shows that only Mimi or Buster matches:

Mimi matched Mimi
Buster matched Buster
Muster nope!
Bimi nope!
Roscoe nope!

You get the same results if you use (2) as the condition and re-arrange the order of the patterns:

my $pattern = qr/
	(?:(B)|(M))
	(?(2)imi|uster)
	/x;

Putting it together

The condition can also be the literal (DEFINE). In that case, Perl only allows a yes-branch. And, as its condition implies, it merely defines the patterns and does not execute them.

This means that you can create and label the subpatterns that you need, but not actually assert that any of them match the string. The definitions are just there. This pattern defines and labels three subpatterns then uses none of them:

use v5.10;

my $pattern = qr/
	(?(DEFINE)
		(?<cat>Buster)
		(?<dog>Addie)
		(?<bird>Poppy)
	)
	Mimi
	/x;

foreach ( 'Buster Mimi', 'Roscoe', 'Buster', 'Mimi' ) {
	say "$_ ", m/$pattern/ ? "matched" : 'nope!';
	}

It’s as if the DEFINE bit is not even there:

Buster Mimi matched
Roscoe nope!
Buster nope!
Mimi matched

Outside the (DEFINE), you can reference any of the subpatterns that you created:

use v5.10;

my $pattern = qr/
	(?(DEFINE)
		(?<cat>Buster)
		(?<dog>Addie)
		(?<bird>Poppy)
	)
	(?&cat)
	/x;

foreach ( 'Buster Mimi', 'Roscoe', 'Buster', 'Mimi' ) {
	say "$_ ", m/$pattern/ ? "matched" : 'nope!';
	}

Now Buster matches because you reference that defined subpattern:

Buster Mimi matched
Roscoe nope!
Buster matched
Mimi nope!

Now it’s time for the grammar. Inside the (DEFINE), you can reference subpatterns you haven’t defined yet, and your subpatterns can get arbitrarily complex:

use v5.10;

my $pattern = qr/
	(?(DEFINE)
		(?<male> Buster | Roscoe )
		(?<female> Mimi | Juliet )
		(?<cat> (?&male) | (?&female) )
		(?<dog>Addie)
		(?<bird>Poppy)
	)
	(?&cat)
	/x;

foreach ( 'Addie', 'Roscoe', 'Buster', 'Mimi' ) {
	say "$_ ", m/$pattern/ ? "matched" : 'nope!';
	}

Even though the cat names are in two different subpatterns, the cat subpattern unifies them so all the cat names match:

Addie nope!
Roscoe matched
Buster matched
Mimi matched

You should now be able to understand this regular expression from Tom Christainsen (appearing Stackoverflow). You might have to pick it apart, but you know how all the parts fit together to match the Internet Message Format defined in RFC 5322:

$rfc5322 = qr{

   (?(DEFINE)

     (?<address>         (?&mailbox) | (?&group))
     (?<mailbox>         (?&name_addr) | (?&addr_spec))
     (?<name_addr>       (?&display_name)? (?&angle_addr))
     (?<angle_addr>      (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
     (?<group>           (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ; (?&CFWS)?)
     (?<display_name>    (?&phrase))
     (?<mailbox_list>    (?&mailbox) (?: , (?&mailbox))*)

     (?<addr_spec>       (?&local_part) \@ (?&domain))
     (?<local_part>      (?&dot_atom) | (?&quoted_string))
     (?<domain>          (?&dot_atom) | (?&domain_literal))
     (?<domain_literal>  (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
                                   \] (?&CFWS)?)
     (?<dcontent>        (?&dtext) | (?&quoted_pair))
     (?<dtext>           (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])

     (?<atext>           (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+-/=?^_`{|}~])
     (?<atom>            (?&CFWS)? (?&atext)+ (?&CFWS)?)
     (?<dot_atom>        (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
     (?<dot_atom_text>   (?&atext)+ (?: \. (?&atext)+)*)

     (?<text>            [\x01-\x09\x0b\x0c\x0e-\x7f])
     (?<quoted_pair>     \\ (?&text))

     (?<qtext>           (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
     (?<qcontent>        (?&qtext) | (?&quoted_pair))
     (?<quoted_string>   (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
                          (?&FWS)? (?&DQUOTE) (?&CFWS)?)

     (?<word>            (?&atom) | (?&quoted_string))
     (?<phrase>          (?&word)+)

     # Folding white space
     (?<FWS>             (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
     (?<ctext>           (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
     (?<ccontent>        (?&ctext) | (?&quoted_pair) | (?&comment))
     (?<comment>         \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
     (?<CFWS>            (?: (?&FWS)? (?&comment))*
                         (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))

     # No whitespace control
     (?<NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])

     (?<ALPHA>           [A-Za-z])
     (?<DIGIT>           [0-9])
     (?<CRLF>            \x0d \x0a)
     (?<DQUOTE>          ")
     (?<WSP>             [\x20\x09])
   )

   (?&address)

}x;

If that’s not clever enough for you, try Tom’s use of (DEFINE) to properly parse HTML.

Things to remember

  • You can reference a named subpattern with (?&NAME)
  • You can choose a subpattern with a condition (?(condition)yes-pattern|no-pattern)
  • You can define and label subpatterns for later use with (DEFINE)

Create your own dualvars

Perl’s basic data type is the scalar, which takes its name from the mathematical term for “single item”. However, the scalar is really two things. You probably know that a scalar can be either a number or a string, or a number that looks the same as its string, or a string that can be a number. What you probably don’t know is that a scalar can be two separate and unrelated values at the same time, making it a dualvar.

You’re already using a dualvar without knowing it. The $! variable, which holds the value of the last system error, is most often used in its string form:

open my $fh, '>', $filename or die "Error: $!";

If something goes wrong with that open, you’ll get an error such as these:

No such file or directory
Permission denied

Both of those error messages have numbers associated with them. You can output their numeric value

open my $fh, '>', $filename or die $! + 0;

Now the errors are numbers, which correspond to the errno value for the system call:

2
13

These numbers are keys in the %!, and the value for that key is true if that was the last system error.

Perl, on its own, doesn’t give you a way to create this sort of variable yourself. When you assign a new value to a scalar, whether string or number, Perl clears the previous values it has. When you use a number as a string, Perl converts it to a string, and the same the other way around.

You can watch this with Devel::Peek. Here’s a program that sets a string value:

use Devel::Peek;

my $value = 'abc';

Dump( $value );

In the scalar record, the POK flag is set, indicating the variable has a string form, and the PV slot has a value (see perlguts for more details):

SV = PV(0x100801070) at 0x100827810
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x100202870 "abc"\0
  CUR = 3
  LEN = 16

If you set a numeric value, the flags are different:

use Devel::Peek;

my $value = 137;

Dump( $value );

Now there’s an IOK flag, and one of the numeric slots, in this case the IV, is set:

SV = IV(0x100827800) at 0x100827810
  REFCNT = 1
  FLAGS = (PADMY,IOK,pIOK)
  IV = 137

However, if you take the numeric value and use it in a string context, Perl also creates a string version. Now the scalar has both the IOK and POK flags, and both the IV and PV slots have values:

137
SV = PVIV(0x100809208) at 0x100827840
  REFCNT = 1
  FLAGS = (PADMY,IOK,POK,pIOK,pPOK)
  IV = 137
  PV = 0x100202870 "137"\0
  CUR = 3
  LEN = 16

If you change the variable, some of those flags disappear, even though the values don’t necessarily disappear:

use v5.10;
use Devel::Peek;

my $value = 137;

$value = 'Buster';

Dump( $value );

Now the scalar’s value should be just Buster, but the IV slot still has the old 137. However, the IOK flag is gone:

SV = PVIV(0x100809208) at 0x100827840
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  IV = 137
  PV = 0x100202870 "Buster"\0
  CUR = 6
  LEN = 16

When you set the new value, and Perl hadn’t used it in a numeric context yet, there was no need to go through the work to translate the string value to the number value. Instead, it just unset the flag that denotes its okay to use the value as a number. You have to use it as a number to again:

use v5.10;
use Devel::Peek;

my $value = 137;

$value = 'Buster';

say $value + 0;

Dump( $value );

Now Perl converts it to a number and sets the numeric flags again:

0
SV = PVNV(0x100801e30) at 0x100827840
  REFCNT = 1
  FLAGS = (PADMY,POK,pIOK,pNOK,pPOK)
  IV = 0
  NV = 0
  PV = 0x100202870 "Buster"\0
  CUR = 6
  LEN = 16

That’s how it works if you go through the Perl interface, but if you play it a scalar through the XS interface, you can set whatever flags and values that you like. That’s exactly what Scalar::Util‘s dualvar does that for you:

use v5.10;
use Devel::Peek;
use Scalar::Util qw(dualvar);

my $value = dualvar 137, 'Buster';

Dump( $value );

say "$value";
say $value + 0;

Now you have a scalar which has unrelated numeric and string values, and the flags for both values are set:

SV = PVNV(0x100802010) at 0x1008277c8
  REFCNT = 1
  FLAGS = (PADMY,IOK,POK,pIOK,pPOK)
  IV = 137
  NV = 0
  PV = 0x100202870 "Buster"\0
  CUR = 6
  LEN = 16
Buster
137

Things to remember

  • Scalars can have both numeric and string values at the same time
  • Those two values can be unrelated
  • You can create your own dualvar with Scalar::Util

Make disposable web servers for testing

If you project depends on a interaction with a web server, especially a remote one, you have some challenges with testing that portion. Even if you can get it working for you, when you distribute your code, someone else might not be able to reach your server for testing. Instead of relying on an external server, you can use a local server that you write especially for your test suite.

This problem has a couple tricky parts. To run a test server from your own test suite, your server needs to bind to a port that’s not already in use. When it has a port, it needs to communicate that to your test script.

Another problem, which you won’t consider for this Item, involves the configurability of your program so you can change the hostname and port during the test. This Item assumes you’ve taken are of that bit.

The Test::Fake::HTTPD module can create a web server directly from your test script. This example creates a web server that returns the same JSON response for every request:

use Test::More;
use Test::Fake::HTTPD;

use Mojo::UserAgent;

my $httpd = run_http_server {
	my $request = shift;

	return [ 
		200, 
		[ 'Content-Type' => 'application/json' ], 
		[ '{ "cat": "Buster" }' ] 
		];
	};

ok( defined $httpd, 'Got a web server' );

diag( sprintf "You can connect to your server at %s.\n", $httpd->host_port );

my $response = Mojo::UserAgent->new->get(
	$httpd->endpoint
	)->res;

diag( $response->to_string );
is( $response->json->{cat}, 'Buster', 'Cat is Buster' );

done_testing();

The test output shows the a message telling you the host and port, as well as the response:

ok 1 - Got a web server
# You can connect to your server at 127.0.0.1:50602.
# HTTP/1.1 200 OK
# Content-Type: application/json
# Date: Wed, 07 Dec 2011 08:44:40 GMT
# Content-Length: 19
# Server: libwww-perl-daemon/6.00
# 
# { "cat": "Buster" }
ok 2 - Cat is Buster
1..2

When the test script ends (or the web server variable goes out of scope, so there’s nothing for you to cleanup. It’s a disposable web server.

In your web server, you can do anything that you like. It doesn’t have to implement everything, or even close to everything, that the production server does. It just has to return responses that you can use in your tests. That means that you get to control not only the success, but also the failures.

That Server line gives you a hint about what created the $request object—it’s LWP behind the scenes, so it’s HTTP::Request. You can get the requested path with the uri method and decide what to do:

use strict;
use warnings;

use Test::More;
use Test::Fake::HTTPD;

use Mojo::UserAgent;
use URI;

my $httpd = run_http_server {
	my $request = shift;

	my $uri = $request->uri;

	return do {
		if( $uri->path eq '/' ) {
			[ 
				200, 
				[ 'Content-Type' => 'text/plain' ], 
				[ "Ask about our cats!" ], 
			]
			}
		elsif( $uri->path eq '/cats' ) {
			[ 
				200, 
				[ 'Content-Type' => 'application/json' ], 
				[ '{ "cat": "Buster" }' ], 
			]
			}
		elsif( $uri->path eq '/dogs' ) {
			[ 
				408, 
				[ 'Content-Type' => 'text/plain' ], 
				[ "We don't walk dogs" ], 
			]
			}
		else {
			[ 
				404, 
				[ 'Content-Type' => 'text/plain' ], 
				[ "Not Found" ], 
			]
			}
		}		
	};

ok( defined $httpd, 'Got a web server' );
diag( sprintf "You can connect to your server at %s.\n", $httpd->host_port );

my $uri = URI->new( $httpd->endpoint );
isa_ok( $uri, 'URI' );


subtest '/' => sub {
	plan tests => 3;
	
	my $this = $uri->clone;
	isa_ok( $this, 'URI' );
	$this->path( '/' );
	
	my $response = Mojo::UserAgent->new->get( $this )->res;
	
	is( $response->headers->content_type, 
		'text/plain', 'Top level is plain text' );
	like( $response->body , qr/cats/, 'Top level has cats' );
	};
	
subtest '/cats' => sub {
	plan tests => 2;
	
	my $this = $uri->clone;
	isa_ok( $this, 'URI' );
	$this->path( '/cats' );
	
	my $response = Mojo::UserAgent->new->get( $this )->res;
	
	is( $response->headers->content_type, 
		'application/json', '/cats returns JSON' );
	};

subtest '/not_there' => sub {
	plan tests => 2;
	
	my $this = $uri->clone;
	isa_ok( $this, 'URI' );
	$this->path( '/not_there' );
	
	my $response = Mojo::UserAgent->new->get( $this )->res;

	is( $response->code, 
		'404', '/not_there returns 404' );
	};

done_testing();

With subtests organized around accesses to paths, the TAP isn’t so hard to read (although you probably won’t have to look at it yourself):

ok 1 - Got a web server
# You can connect to your server at 127.0.0.1:50498.
ok 2 - The object isa URI
    1..3
    ok 1 - The object isa URI
    ok 2 - Top level is plain text
    ok 3 - Top level has cats
ok 3 - /
    1..2
    ok 1 - The object isa URI
    ok 2 - /cats returns JSON
ok 4 - /cats
    1..2
    ok 1 - The object isa URI
    ok 2 - /not_there returns 404
ok 5 - /not_there
1..5

If you need to use the same test webserver in more than one test script, you move it into its own file.

use strict;
use warnings;

use Test::More;

use Mojo::UserAgent;
use URI;

require 'server.pl';
my $httpd = get_http_server();

ok( defined $httpd, 'Got a web server' );
diag( sprintf "You can connect to your server at %s.\n", $httpd->host_port );

my $uri = URI->new( $httpd->endpoint );
isa_ok( $uri, 'URI' );

...

The server.pl file wraps the call to run_http_server. But, if you are going to do that, you might as well skip the convenience method and set up your own object. You can change the timeout value, for instance:

use Test::Fake::HTTPD;

sub get_http_server {
	my $httpd = Test::Fake::HTTPD->new(
		timeout => 30,
		);

	$httpd->run( sub {
		my $request = shift;
		...;		
		} );
		
	$httpd;
	}

Different test scripts each get their own test web server. If you’re running several tests scripts in parallel, you’ll start several servers at the same time, which not be that kind to your system or to the other people using it. If you don’t like that, you could set up a single server at the start of your test run, share it with all tests, and shut down everything at the end, although you won’t see that in this Item.

Things to remember

  • Test web interactions locally
  • Use Test::Fake::HTTPD to create cheap, disposable web servers

Know split’s special cases

Perl’s split has some special cases and some perhaps surprising cases. The empty pattern, zero width match, the special argument ' ', and the /^/ act differently than you might expect from the general rule.

The empty pattern, //

The empty pattern is a special case that’s designed to give you a list of characters. This pattern specifically has nothing in it and is different than a pattern that matches an empty string (that’s next). For this, split returns a list of characters:

use utf8;

use Data::Printer;
my @characters = split //, 'Büster';

p( @characters );

The output shows a list of characters:

[
    [0] "B",
    [1] "ü",
    [2] "s",
    [3] "t",
    [4] "e",
    [5] "r"
]

This is specifically characters, not grapheme clusters. Depending on the normalization of your source code or input, you can get different results:

use utf8;

use Data::Printer;
use Unicode::Normalize qw(NFD);

my @characters = split //, NFD( 'Büster' );

p( @characters );

Now the grapheme cluster ü is actually two characters, the u (U+0075 ʟᴀᴛɪɴ ꜱᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ᴜ) and the ¨ (U+0308 ᴄᴏᴍʙɪɴɪɴɢ ᴅɪᴀᴇʀᴇꜱɪꜱ), instead of the single ü (U+00FC ʟᴀᴛɪɴ ꜱᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ᴜ ᴡɪᴛʜ ᴅɪᴀᴇʀᴇꜱɪꜱ):

[
    [0] "B",
    [1] "u",
    [2] "¨",
    [3] "s",
    [4] "t",
    [5] "e",
    [6] "r"
]

You can review grapheme clusters in Treat Unicode strings as grapheme clusters.

Matching the empty string

Successfully matching no characters isn’t really a special case, but people are sometimes surprised about it because it seems special. Your split pattern might match an empty string. This is different from the empty pattern because you actually have a pattern, even if it might match zero characters. This is also distinct from a pattern that doesn’t match:

use v5.10;

my $_ = 'Mimi';

say "Matched empty pattern" if //;
say "Matched optional whitespace" if /\s*/;
say "Matched zero width assertion" if /(?=\w+)/;
say "How did Buster match?" if /Buster/;

The first three of these patterns match successfully but matches zero characters, while the fourth fails:

Matched empty pattern
Matched optional whitespace
Matched zero width assertion

It’s easy to construct a pattern that will match zero characters even though it matches successfully. The ? (zero or one) and * (zero or more) quantifiers do that quite nicely. Zero width assertions, such as the boundaries and lookarounds, do that too. If the pattern can match zero characters successfully, Perl splits into characters:

use Data::Printer;
my @characters = split /\s*/, 'Buster';

p( @characters );
[
    [0] "B",
    [1] "u",
    [2] "s",
    [3] "t",
    [4] "e",
    [5] "r"
]

The pattern doesn’t have to match zero characters for all separators.

use Data::Printer;
my @characters = split /\s*/, 'Buster and Mimi';

p( @characters );

Notice that there are no spaces in @characters, since split matched those as separator characters:

[
    [0]  "B",
    [1]  "u",
    [2]  "s",
    [3]  "t",
    [4]  "e",
    [5]  "r",
    [6]  "a",
    [7]  "n",
    [8]  "d",
    [9]  "M",
    [10] "i",
    [11] "m",
    [12] "i"
]

The single space, ‘ ‘

The single space in quotes, single or double, is a special case. It splits on whitespace, but unlike the pattern that is a single space, the one in quotes discards empty leading fields:

use Data::Printer;
my @characters = split ' ', '  Buster and Mimi';

p( @characters );

You get just the non-whitespace with no empty fields:

[
    [0] "Buster",
    [1] "and",
    [2] "Mimi"
]

This behavior comes from awk:

#!/usr/bin/awk -f
BEGIN {
    string="  Buster Mimi Roscoe";
    search=" ";
    n=split(string,array," ");
    print("[");
    for (i=1;i<=n;i++) {
        printf("    [%d] \"%s\"\n",i,array[i]);
    }
    print("]");
    exit;
}

You end up with almost the same input, although the indices are one greater:

[
    [1] "Buster"
    [2] "Mimi"
    [3] "Roscoe"
]

Back in Perl, if you tried that with the normal match operator delimiters, you get a different result:

use Data::Printer;
my @characters = split / /, '  Buster and Mimi';

p( @characters );

This time you kept the empty leading fields:

[
    [0] "",
    [1] "",
    [2] "Buster",
    [3] "and",
    [4] "Mimi"
]

If you include the m in front of the quotes though, you lose the special magic:

use Data::Printer;
my @characters = split m' ', '  Buster and Mimi';

p( @characters );

The empty leading fields are back:

[
    [0] "",
    [1] "",
    [2] "Buster",
    [3] "and",
    [4] "Mimi"
]

Splitting lines

The special pattern of just the beginning-of-line anchor, even without the /m flag, breaks a multi-line string into lines:

use Data::Printer;

my $string = <<'HERE';
Line one
Line two
Line three
HERE

my @lines = split /^/, $string;

p( @lines );

Even without the /m you get separate lines:

[
    [0] "Line one
",
    [1] "Line two
",
    [2] "Line three
"
]

This only works if the pattern is exactly /^/. If you put anything else in the pattern, you don't get the special behavior, even if it's a zero width match:

...; # same as before

my @lines = split /^(?=Line)/, $string;  # Oops

p( @lines );

Now there's only one field:

[
    [0] "Line one
Line two
Line three
Line four
"
]

Things to remember

  • The empty pattern // splits on characters, but not grapheme clusters
  • A zero width successful match splits on characters too
  • The single space in quotes splits on whitespace and discards leading empty fields
  • The ^ anchor by itself splits into lines, even without the /m

Trace your Perl programs

You can write your own mini (or micro) debuggers to watch your program run. You might want to do this when the other Perl debuggers are too heavyweight (or even too interactive) for your immediate problem.

The Devel::Trace module is a simple debugging tool that lets you watch the flow of control as your program runs. It’s not a debugger and you don’t get to interact with it, but it can be useful in some operations.

If you were a shell programmer in a former life, you probably used the -x switch to watch your shell script execute. Here’s simple script:

#!/bin/sh

echo $HOME
cd ~/Dev/effective-perl-programming
du -s

When you run it with the -x, it prints each line of the script before it executes it:

% sh -x trace.sh
+ echo /Users/brian
/Users/brian
+ cd /Users/brian/Dev/effective-perl-programming
+ du -s
200320  .

You can do the same thing with a Perl program. Here’s a small program to find factorials:

use v5.10;
use strict;
use warnings;

my $n = $ARGV[0];

my $product = 1;

foreach ( 1 .. $n ) {
	$product *= $_;
	}
	
say $product;

The output shows each line as perl executes it:

% perl -d:Trace factorial.pl
>> factorial.pl:5: my $n = $ARGV[0];
>> factorial.pl:7: my $product = 1;
>> factorial.pl:9: foreach ( 1 .. $n ) {
>> factorial.pl:10: 	$product *= $_;
>> factorial.pl:10: 	$product *= $_;
>> factorial.pl:10: 	$product *= $_;
>> factorial.pl:13: say $product;
6

Devel::Trace doesn’t limit itself to your program though. If you use the bignum module, the multiplication happens on objects instead of Perl numbers:

use v5.10;
use strict;
use warnings;
use bignum;

my $n = $ARGV[0];

my $product = 1;

foreach ( 1 .. $n ) {
	$product *= $_;
	}
	
say $product;

Try it with a number that requires bignum, there’s quite a bit of output:

% perl -d:Trace fact.pl 200
>> fact.pl:6: my $n = $ARGV[0];
>> fact.pl:8: my $product = 1;
>> fact.pl:10: foreach ( 1 .. $n ) {
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt.pm:152: '0+' => sub { $_[0]->numify(); }
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt.pm:839:   my $x = shift; $x = $class->new($x) unless ref $x;
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt.pm:839:   my $x = shift; $x = $class->new($x) unless ref $x;
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt.pm:841:   return $x->bstr() if $x->{sign} !~ /^[+-]$/;
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt.pm:842:   my $num = $CALC->_num($x->{value});
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt/Calc.pm:276:     my $x = $_[1];
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt/Calc.pm:278:     return 0 + $x->[0] if scalar @$x == 1;      # below $BASE
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt.pm:843:   return -$num if $x->{sign} eq '-';
.... # about 23,800 more lines
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt/Calc.pm:268:     $l--;
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt/Calc.pm:267:     $ret .= substr($z.$ar->[$l],-$BASE_LEN); # fastest way I could think of
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt/Calc.pm:268:     $l--;
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt/Calc.pm:270:   $ret;
788657867364790503552363213932185062295135977687173263294742533244359449963403342920304284011984623904177212138919638830257642790242637105061926624952829931113462857270763317237396988943922445621451664240254033291864131227428294853277524242407573903240321257405579568660226031904170324062351700858796178922222789623703897374720000000000000000000000000000000000000000000000000

That’s pretty ugly, and the Devel::Trace module doesn’t give you any way to change it. That’s not surprising consider the source code, which is

package Devel::Trace;
$VERSION = '0.11';
$TRACE = 1;

# This is the important part.  The rest is just fluff.
sub DB::DB {
  return unless $TRACE;
  my ($p, $f, $l) = caller;
  my $code = \@{"::_<$f"};
  print STDERR ">> $f:$l: $code->[$l]";
}


sub import {
  my $package = shift;
  foreach (@_) {
    if ($_ eq 'trace') {
      my $caller = caller;
      *{$caller . '::trace'} = \&{$package . '::trace'};
    } else {
      use Carp;
      croak "Package $package does not export `$_'; aborting";
    }
  }
}

my %tracearg = ('on' => 1, 'off' => 0);
sub trace {
  my $arg = shift;
  $arg = $tracearg{$arg} while exists $tracearg{$arg};
  $TRACE = $arg;
}

1;

The magic happens in DB::DB, which, when perl calls before every statement. This module is Devel::Trace, but its defining its meat in the package DB. The debugger stashes lots of information about the program in various package variables. The @{"::_<$f"} is a symbolic reference, using the filename in $f to translate the filename and line number from caller to a line of code.

You don't need the module, really. The import and trace subroutines are there just to turn things on and off. To do that, you'd have to modify your code to litter calls to trace throughout the code. You usually don't want to modify the code.

You can reduce that module to a command line. When you run the debugger, perl inserts the value in the PERL5BD environment variable before the source in your program. The reduces Devel::Trace to:

PERL5DB='sub DB::DB {my @c=caller;print STDERR qq|@c[1,2] ${"::_<$c[1]"}[$c[2]]|}' perl5.14.1 -d fact.pl 4

That's a bit hard to use, especially on a command line that tries to interpret those sigils characters. You don't need to worry about that now that you know how simple it is. You can create your own throw-away module to trace exactly what you want. Since the current directory is usually already in @INC, you can create a Devel/MyTrace.pm. This one saves to the file you specify:

use v5.10;
use autodie;

BEGIN {

my $trace_file = $ENV{TRACE_FILE} // "mytrace.$$";
print STDERR "Saving trace to $trace_file\n";
open my $fh, '>>', $trace_file;

sub DB::DB {
	my( $package, $file, $line ) = caller;
	my $code = \@{"::_<$file"};
	print $fh "[@{[time]}] $file $line $code->[$line]";
	}
}

1;

That is only slightly improved on Devel::Trace because it lets you specify the file in which to save your trace. You can improve that a bit to also handle the special - or STDERR virtual filenames:

use v5.10;
use autodie;

BEGIN {

my $trace_file = $ENV{TRACE_FILE} // "mytrace.$$";
print STDERR "Saving trace to $trace_file\n";

my $fh = do {
	   if( $trace_file eq '-'      ) { \*STDOUT }
	elsif( $trace_file eq 'STDERR' ) { \*STDERR }
	else { 
		open my $fh, '>>', $trace_file;
		$fh;
		}
	};

sub DB::DB {
	my( $package, $file, $line ) = caller;
	my $code = \@{"::_<$file"};
	print $fh "[@{[time]}] $file $l $code->[$line]";
	}
}

1;

You can go one step further to exclude all code except for the primary file:

use v5.10;
use autodie;

BEGIN {

my $trace_file = $ENV{TRACE_FILE} // "mytrace.$$";
print STDERR "Saving trace to $trace_file\n";

my $fh = do {
	   if( $trace_file eq '-'      ) { \*STDOUT }
	elsif( $trace_file eq 'STDERR' ) { \*STDERR }
	else { 
		open my $fh, '>>', $trace_file;
		$fh;
		}
	};

sub DB::DB {
	my( $package, $file, $line ) = caller;
	return unless $file eq $0;
	my $code = \@{"::_<$file"};
	print $fh "[@{[time]}] $file $l $code->[$line]";
	}
}

1;

From there, what your tracer does is up to you and what problem you are trying to track down. It doesn't have to be fancy and you don't have to spend a lot of time making it perfect or "pure".

Things to remember

  • You can trace your Perl program just like a shell program
  • Under the -d switch, perl calls &DB::DB before each statement
  • You can write your own throw-away tracer

Use Data::Printer to debug data structures

You can use several different Perl modules to inspect data structures. Many of these modules, however, are really two tools in one. Besides showing a data structure as a string, they also serialize the data as Perl code so you can reconstruct the data structure. That second job often makes things hard for you. If you don’t need the serialization
job, don’t use a module that insists on it.

The Data::Dumper module is popular because it comes with Perl. Here’s a program that we’ll use for the rest of the Item, save for changes to the module dumping the structure:

use Data::Dumper qw(Dumper);
use DateTime;
use HTTP::Request;

my $request = HTTP::Request->new(
		GET => 'http://www.perl.org',
		);

$request->header( 'X-Perl' => '5.12.2' );
$request->header( 'Cat'    => 'Buster' );

my $data = {
	hash => {
		cat  => 'Buster',
		dog  => 'Addy',
		bird => 'Poppy',
		},
	array => [ qw( a b c ) ],
	datetime => DateTime->now,
	reqeust  => $request,
	};

print Dumper( $data );

The output is a Perl data structure, suitable for eval. That makes it a bit verbose and ugly:

$VAR1 = {
      'array' => [
             'a',
             'b',
             'c'
           ],
      'hash' => {
            'cat' => 'Buster',
            'dog' => 'Addy',
            'bird' => 'Poppy'
          },
      'reqeust' => bless( {
                '_content' => '',
                '_uri' => bless( do{\(my $o = 'http://www.perl.org')}, 'URI::http' ),
                '_headers' => bless( {
                             'cat' => 'Buster',
                             'x-perl' => '5.12.2'
                           }, 'HTTP::Headers' ),
                '_method' => 'GET'
                }, 'HTTP::Request' ),
      'datetime' => bless( {
                 'local_rd_secs' => 68540,
                 'local_rd_days' => 734452,
                 'rd_nanosecs' => 0,
                 'locale' => bless( {
                            'default_time_format_length' => 'medium',
                            'native_territory' => 'United States',
                            'native_language' => 'English',
                            'native_complete_name' => 'English United States',
                            'en_language' => 'English',
                            'id' => 'en_US',
                            'default_date_format_length' => 'medium',
                            'en_complete_name' => 'English United States',
                            'en_territory' => 'United States'
                          }, 'DateTime::Locale::en_US' ),
                 'local_c' => {
                        'hour' => 19,
                        'second' => 20,
                        'month' => 11,
                        'quarter' => 4,
                        'day_of_year' => 315,
                        'day_of_quarter' => 42,
                        'minute' => 2,
                        'day' => 11,
                        'day_of_week' => 5,
                        'year' => 2011
                        },
                 'utc_rd_secs' => 68540,
                 'formatter' => undef,
                 'tz' => bless( {
                          'name' => 'UTC'
                        }, 'DateTime::TimeZone::UTC' ),
                 'utc_year' => 2012,
                 'utc_rd_days' => 734452,
                 'offset_modifier' => 0
                 }, 'DateTime' )
    };

The Data::Dump also serializes, and is a cleaner Data::Dumper. In void context, it automatically prints for you:

use Data::Dump qw(pp);

...; # same $data thing as before

pp( $data );

The output looks a lot like the Data::Dumper output because it has to be a Perl data structure:

{
  array  => ["a", "b", "c"],
  datetime => bless({
        formatter     => undef,
        local_c     => {
                   day => 11,
                   day_of_quarter => 42,
                   day_of_week => 5,
                   day_of_year => 315,
                   hour => 19,
                   minute => 18,
                   month => 11,
                   quarter => 4,
                   second => 33,
                   year => 2011,
                   },
        local_rd_days   => 734452,
        local_rd_secs   => 69513,
        locale      => bless({
                   default_date_format_length => "medium",
                   default_time_format_length => "medium",
                   en_complete_name => "English United States",
                   en_language => "English",
                   en_territory => "United States",
                   id => "en_US",
                   native_complete_name => "English United States",
                   native_language => "English",
                   native_territory => "United States",
                   }, "DateTime::Locale::en_US"),
        offset_modifier => 0,
        rd_nanosecs   => 0,
        tz        => bless({ name => "UTC" }, "DateTime::TimeZone::UTC"),
        utc_rd_days   => 734452,
        utc_rd_secs   => 69513,
        utc_year    => 2012,
        }, "DateTime"),
  hash   => { bird => "Poppy", cat => "Buster", dog => "Addy" },
  reqeust  => bless({
        _content => "",
        _headers => bless({ "cat" => "Buster", "x-perl" => "5.12.2" }, "HTTP::Headers"),
        _method  => "GET",
        _uri   => bless(do{\(my $o = "http://www.perl.org")}, "URI::http"),
        }, "HTTP::Request"),
}

Steven Haryanto added a filter feature to an existing interface, which you can see in Use Data::Dump filters for nicer pretty-printing. You can get better control of the parts you’d want to distill, such as that DateTime:

{
  array => # some items hidden
  [2011-02-03, "d", "...", "n"],
  datetime => 2011-02-03,
  hash => { bird => "Poppy", cat => "Buster", dog => "Addy" },
}

If you forget about the serialization, though, you can do much better. Often, you want to inspect a data structure to see what’s on the inside without saving it for future use. If that’s the case, you don’t need to see the data structure as Perl code and the pretty printer can organizer the data much better and provide more information. The Data::Printer module doesn’t care at all about serialization. In void context, its p automatically prints:

use Data::Printer;

...; # same $data thing as before

p( $data );

The output is as verbose, but it’s also much more dense. When it prints an object, it shows you the methods in the class:

\ {
    array      [
        [0] "a",
        [1] "b",
        [2] "c"
    ],
    datetime   DateTime  {
        public methods (134) : add, add_duration, am_or_pm, bootstrap, ce_year, christian_era, clone, compare, compare_ignore_floating, date, datetime, day, day_abbr, day_name, day_of_month, day_of_month_0, day_of_quarter, day_of_quarter_0, day_of_week, day_of_week_0, day_of_year, day_of_year_0, day_0, DefaultLanguage, DefaultLocale, delta_days, delta_md, delta_ms, dmy, doq, doq_0, dow, dow_0, doy, doy_0, duration_class, epoch, era, era_abbr, era_name, format_cldr, formatter, fractional_second, from_day_of_year, from_epoch, from_object, hires_epoch, hms, hour, hour_1, hour_12, hour_12_0, INFINITY, is_dst, is_finite, is_infinite, is_leap_year, iso8601, jd, language, last_day_of_month, leap_seconds, local_day_of_week, local_rd_as_seconds, local_rd_values, locale, MAX_NANOSECONDS, mday, mday_0, mdy, microsecond, millisecond, min, minute, mjd, mon, mon_0, month, month_abbr, month_name, month_0, NAN, nanosecond, NEG_INFINITY, new, now, offset, quarter, quarter_abbr, quarter_name, quarter_0, sec, second, SECONDS_PER_DAY, secular_era, set, set_day, set_formatter, set_hour, set_locale, set_minute, set_month, set_nanosecond, set_second, set_time_zone, set_year, STORABLE_freeze, STORABLE_thaw, strftime, subtract, subtract_datetime, subtract_datetime_absolute, subtract_duration, time, time_zone, time_zone_long_name, time_zone_short_name, today, truncate, utc_rd_as_seconds, utc_rd_values, utc_year, wday, wday_0, week, week_number, week_of_month, week_year, weekday_of_month, year, year_with_christian_era, year_with_era, year_with_secular_era, ymd
        private methods (38) : _accumulated_leap_seconds, _add_overload, _adjust_for_positive_difference, _calc_local_components, _calc_local_rd, _calc_utc_components, _calc_utc_rd, _cldr_pattern, _compare, _compare_overload, _day_has_leap_second, _day_length, _era_index, _format_nanosecs, _handle_offset_modifier, _is_leap_year, _month_length, _new, _new_from_self, _normalize_leap_seconds, _normalize_nanoseconds, _normalize_seconds, _normalize_tai_seconds, _offset_for_local_datetime, _rd2ymd, _seconds_as_components, _space_padded_string, _string_compare_overload, _string_equals_overload, _string_not_equals_overload, _stringify, _subtract_overload, _time_as_seconds, _utc_hms, _utc_ymd, _weeks_in_year, _ymd2rd, _zero_padded_number
        internals: {
            formatter         undef,
            local_c           {
                day              11,
                day_of_quarter   42,
                day_of_week      5,
                day_of_year      315,
                hour             19,
                minute           41,
                month            11,
                quarter          4,
                second           42,
                year             2011
            },
            local_rd_days     734452,
            local_rd_secs     70902,
            locale            DateTime::Locale::en_US,
            offset_modifier   0,
            rd_nanosecs       0,
            tz                DateTime::TimeZone::UTC,
            utc_rd_days       734452,
            utc_rd_secs       70902,
            utc_year          2012
        }
    },
    hash       {
        bird   "Poppy",
        cat    "Buster",
        dog    "Addy"
    },
    request    HTTP::Request  {
        Parents       HTTP::Message
        Linear @ISA   HTTP::Request, HTTP::Message
        public methods (10) : accept_decodable, as_string, clone, dump, method, new, parse, uri, uri_canonical, url
        private methods (0)
        internals: {
            _content   "",
            _headers   HTTP::Headers,
            _method    "GET",
            _uri       URI::http
        }
    }
}

You probably don’t want to see all that internal gunk from DateTime or HTTP::Request, so you can set filters from them to print them however you like:

use Data::Printer {
    filters => {
       'DateTime'      => sub { "DateTime => $_[0]" },
       'HTTP::Request' => sub { "URL => " . $_[0]->uri }, 
    },
};

...; # same $data thing as before

p( $data );

Now you can see what you need to see much easier:

\ {
    array      [
        [0] "a",
        [1] "b",
        [2] "c"
    ],
    datetime   DateTime => 2011-11-11T19:51:06,
    hash       {
        bird   "Poppy",
        cat    "Buster",
        dog    "Addy"
    },
    request    URL => http://www.perl.org
}

So far, you’ve changed the import list to specify what you wanted, but you can change it each time that you want to dump something:

p( $data, { index => 0 } );

Now you don’t have array indices:

\ {
    array      [
        "a",
        "b",
        "c"
    ],
    datetime   DateTime => 2011-11-11T20:25:22,
    hash       {
        bird   "Poppy",
        cat    "Buster",
        dog    "Addy"
    },
    request    URL => http://www.perl.org
}

You can make colorized output too by setting another property:

use Data::Printer {
	colored => 1,
	filters => {
		'DateTime'      => sub { "DateTime => $_[0]" },
		'HTTP::Request' => sub { "URL => " . $_[0]->uri }, 
	},
};

...; # same $data thing as before

p( $data );

And you can change the colors if you don’t like the default set. You have to choose a valid Term::ANSI color:

use Data::Printer {
	colored => 1,
	color => {
		array       => 'yellow',
		string      => 'cyan',
		hash        => 'green',
	},
	filters => {
		'DateTime'      => sub { "DateTime => $_[0]" },
		'HTTP::Request' => sub { "URL => " . $_[0]->uri }, 
	},
};

...; # same $data thing as before

p( $data );

This might be more pleasing to you:

Lastly, one of the most annoying “features” of a pretty printer is the constant reference passing. Since Perl flattens its argument list into a single list, to maintain data structure identities, you have to pass them as a reference:

use Data::Dumper;
print Dumper( \%hash );
use Data::Dump;
pp( \%hash );

You can do that with Data::Printer too:

use Data::Printer;
p( \%hash );

Data::Printer uses prototypes to make that easier for you. The Dumper and pp each dump a list of structures, but Data::Printer‘s p dumps exactly one structure. As such, it can use prototypes to recognize a whole hash or array as the first argument:

use Data::Printer;
p( %hash );

This feature has a few oddities, but Breno explains them in the documentation.

Things to remember

  • Serialization and inspection are different tasks
  • Most Perl pretty printers try to serialize
  • The Data::Printer

Make grep-like syntax

To create grep- or map-like syntax, you need to use Perl’s prototypes, despite whatever we told you in Understand why you probably don’t need prototypes. Perl needs the special hints that prototypes to parse a block as an argument to a subroutine.

First, remember the forms of grep. There’s a single expression version and a block version:

grep EXPR, @input       # with a comma
grep { ... } @input     # no comma

That block, the {...}, is an inline subroutine where the current element shows up in $_:

my @odds = grep { $_ % 2 } @input;

For either form, there’s a scalar and list return value, depending on context (Item 12. Understand context and how it affects operations):

my @array = grep ...;
my $count = grep ...;

You can make your own subroutines that work just like grep. The prototype character & tells perl to expect a subroutine reference. However, it can only be the first argument. To try it, define a subroutine that takes a single argument, a code reference:

sub run_it (&) {
	my $sub = shift;
	$sub->();
	}

You can call that subroutine in several ways. You can use a block, the sub keyword with a block, or a reference to a subroutine, or a reference to a named subroutine:

use v5.10;

sub named { say "I have a name" }

my $result = run_it { say "Hello!" };
   $result = run_it sub { say "I have a keyword!" };
   $result = run_it \&named;

However, perl is not smart enough to recognize other forms. It won’t like a scalar variable that might have a code reference later, and it can’t take a bareword that is the name of a defined subroutine (like sort will). These are compile-time errors:

use v5.10;

sub named { say "I have a name" }
my $code_ref = \&named;

my $result = run_it $code_ref;
   $result = run_it named;
   $result = run_it &named;

Handling grep’s second argument

Now you can take a code reference as an argument. The next part of the grep syntax is the the input. You could use the @ character to denote a list of arguments (not an array argument), and that appears to work:

use v5.10;
use warnings;

sub do_with_array (&@) { 
	my( $sub, @args ) = @_;
	my @output;
	
	foreach my $elem ( @args ) {
		local $_ = $elem;
		push @output, $sub->();
		}
	return @output;
}

sub other_cats { qw(Ellie Ginger) }
 
my @cats = qw(Buster Mimi Roscoe);

@result = do_with_array { say $_ } @cats;
@result = do_with_array { say $_ } qw(Buster Mimi Roscoe);
@result = do_with_array { say $_ } 1 .. 10;
@result = do_with_array { say $_ } other_cats();

Like grep which can alias $_ to the original data, you can also means that you can change the original data with your subroutine argument if you use @_ (Item 114. Know when arrays are modified in a loop):

use v5.10;
use warnings;

sub do_with_array (&@) {
   my $sub = shift;
   my @output;
   local $_;

   foreach ( @_ ) {
	   push @output, $sub->();
   }
   return @output;
}

my @original   = qw(1 2 3);
my @new        = do_with_array { $_ += 2 } @original;
say "new = @new";             # 3 4 5
say "original = @original";  # 3 4 5

You might try the \@ prototype, but that limits you in other ways. Now perl expects a named array as an argument. You cannot use an array reference, range, literal list, or the return values from a subroutine call. It’s a named array or an error. That’s no good.

Likewise, you might use the + prototype introduced in Perl 5.14. This allows you to use an array or an array reference argument. Perl doesn’t complain if you use a range or a subroutine call, but it also doesn’t do the right thing:

use v5.10;
use warnings;

sub do_with_array (&+) { 
	my( $sub, $array ) = @_;
	my @output;
	
	foreach my $elem ( @$array ) {
		local $_ = $elem;
		push @output, $sub->();
		}
	return @output;
}

sub other_cats { qw(Ellie Ginger) }

my @cats = qw(Buster Mimi Roscoe);

@result = do_with_array { say $_ } @cats;
@result = do_with_array { say $_ } [ 'a' .. 'g' ];
@result = do_with_array { say $_ } 1 .. 10;
@result = do_with_array { say $_ } other_cats();

The named subroutine and the array reference work just fine. Perl does something weird with the range, and the subroutine call appears to not happen at all:

Buster
Mimi
Roscoe
a
b
c
d
e
f
g
Use of uninitialized value $. in range (or flip) at run_it.pl line 21.

Not only that, but the + prototype character also allows named hashes and hash references:

use v5.10;
use warnings;

sub do_with_array (&+) { 
	my( $sub, $array ) = @_;
	my @output;
	
	foreach my $elem ( @$array ) {
		local $_ = $elem;
		push @output, $sub->();
		}
	return @output;
}

my %cats = qw(Buster Mimi Roscoe Ellie);

@result = do_with_array { say $_ } %cats;
@result = do_with_array { say $_ } { 'a' => 'b' };

You don’t get an error until runtime when you try the array dereference:

Not an ARRAY reference at run_it.pl line 8.

You can check these things at runtime, though. This reminds you, as we said in Understand why you probably don’t need prototypes, that prototypes probably don’t do what you think. You end up doing a lot of the work that most people think prototypes do for you:

use v5.10;
use warnings;
use Carp;

sub do_with_array (&+) { 
	my( $sub, $array ) = @_;
	croak "do_with_array takes an array argument"
		unless ref $array eq ref [];

	my @output;
	
	foreach my $elem ( @$array ) {
		local $_ = $elem;
		push @output, $sub->();
		}

	return @output;
}

my @cats = qw(Buster Mimi Roscoe Ellie);
my %cats = map { $_, 1 } @cats;

@result = do_with_array { say $_ } @cats;
@result = do_with_array { say $_ } %cats;

Handling context

We explained context in Item 12. Understand context and how it affects operations, and if you want to emulate grep you have to handle them. In list context grep returns a list, in scalar context it returns a count, and in void context it potentially does nothing. You really only need to handle the void case. In this case, you’ll simply return without doing anything:

use v5.10;
use warnings;
use Carp;

sub do_with_array (&+) {
	return unless defined wantarray;
	my( $sub, $array ) = @_;
	croak "do_with_array takes an array argument"
		unless ref $array eq ref [];

	my @output;
	
	foreach my $elem ( @$array ) {
		local $_ = $elem;
		push @output, $sub->();
		}

	return @output;
}

The list and scalar contexts come from returning an array. When you return a named array, you get the same results as assigning an array. In list context you get the list elements, and in scalar context you get the count. If you want to return something different, such as a list instead of a named array, you have to do more work:

	return wantarray ? qw( a b c ) : 3;

So, once again, prototypes half solve the problem, but leave you with more work to do.

Things to remember

  • Use the & prototype character to specify a code reference argument
  • If the code reference argument argument is the first argument, you can leave off the sub keyword
  • The reference has to be a block of code or a reference to a named subroutine, and specifically not a scalar variable

Profile with Devel::NYTProf

Profile before you decide where to optimize—you might be surprised where you’re losing all of your performance.

We won’t go into all the details of profiling in this Item, but you can read about those in Mastering Perl. In short, profilers count something then report the results. They can track any of the things that you might care about. The Devel::NYTProf module, like most profilers, tracks time, counting the statements you run and how long they take.

Basic profiling

Perl profilers are really debuggers, using the internal hooks to detect movement in and out of statements. As such, you can load it with the -d switch just like any other debugger:

% perl -d:NYTProf some_program

You can also load it as a module and NYTProf will figure it out:

% perl -MDevel::NYTProf some_program

Either way, you can set the NYTProf environment variable to specify various settings (which you’ll have to read the documentation). For instance, to start profiling after the compilation phase, you use the start field:

% export NYTPROF=start=init
% perl -MDevel::NYTProf some_program

That’s it. Your program will run slower as NYTProf watches it and records that data, so don’t worry about the wallclock time. If your program is already agonizingly slow, it’s going to seem worse. Perhaps not a lot worse, but you’ll probably notice the difference. Maybe it’s time to get a cup of coffee.

The reports

NYTProf stores its data in the ./nytprof.out file (although you can change that). That’s just the data, though. To look at it, you need to convert it to a particular format. The distribution comes with several programs to do that already. The easiest and most popular might be HTML:

% nytprofhtml
Reading nytprof.out
Processing nytprof.out data
Writing sub reports to nytprof directory
 100% ... 
Writing block reports to nytprof directory
 100% ... 
Writing line reports to nytprof directory
 100% ... 

Inside the nytprof/ directory, you’ll find index.html, which presents the summary of the profile:

Here’s a full profile from the Pod::Perldoc module looking through perlfunc.

Call graphs

NYTProf also generates call graphs that you can inspect with Graphviz.

7ads6x98y