Define grammars in regular expressions

[ This is the 100th Item we've shared with you in the two years this blog has been around. We deserve a holiday and we're taking it, so read us next year! Happy Holidays.]

Perl 5.10 added rudimentary grammar support in its regular expressions. You could define many subpatterns directly in your pattern, use them to define larger subpatterns, and, finally, when you have everything in place, let Perl do the work.

There are other ways, some more powerful, that let you do the same thing. This Item is not about those, however, but you can read about Regex::Grammars, Parse::RecDescent on your own. Also, you’re not going to get much of a recommendation of which one you should use for your task. We don’t know your situation.

To understand this new syntax, you have to study it from the ground up. It’s not simple, and the terse documentation in perlre doesn’t do much to help.

Referencing a subpattern

The first part you need is the ability to call a named part of the pattern (to label a subpattern, see Item 31. Use named captures to label matches). To re-match a labeled subpattern, you use:

(?&NAME)

You can use that syntax to rerun a subpattern later:

use v5.10;

my $pattern = qr/
	(?<cat>Buster|Mimi)
	\s+
	(?&cat)
	/x;

foreach ( 'Buster Mimi', 'Mimi Buster', 'Buster', 'Buster Buster' ) {
	say "$_ ", m/$pattern/p ? "matched" : 'nope!';
	}

The labeled subpattern has an alternation where either cat name can match. When you reference it again, you re-run the alternation and you can match either cat name again:

Buster Mimi matched
Mimi Buster matched
Buster nope!
Buster Buster matched

This is not that same thing as matching the same text a labeled capture group already matched. That’s the \k<NAME>:

\k<NAME>

This pattern is a different beast. Whichever cat name matches first also has to match second:

use v5.10;

my $pattern = qr/
	(?<cat>Buster|Mimi)
	\s+
	\k<cat>
	/x;

foreach ( 'Buster Mimi', 'Mimi Buster', 'Buster', 'Buster Buster' ) {
	say "$_ ", m/$pattern/p ? "matched" : 'nope!';
	}

Now only one of the strings matches because only one string repeats a cat’s name:

Buster Mimi nope!
Mimi Buster nope!
Buster nope!
Buster Buster matched

Although you won’t see it here, the (?&NAME) syntax is the trick to matching a recursive pattern since the reference can appear inside the pattern it references.

Conditional match

The second building block you need starts with a conditional submatch:

(?(condition)yes-pattern|no-pattern)

That condition can be many things, most of which won’t appear in this Item. Although you see the | character, but this isn’t an alteration. It’s like an alternation because the | separates distinct subpatterns, but unlike an alternation because this will only ever try one of the subpatterns and you only get two subpatterns.

The simplest condition is just an ordinal number, which is true only if that capture group matched. Here’s a pattern that has two capture groups:

use v5.10;

my $pattern = qr/
	(?:           # parens for grouping
		(B)     # $1
		|         # alternation
		(M)     # $2
	)     
	(?(1)uster|imi) # conditional match
	/x;

foreach ( qw(Mimi Buster Muster Bimi Roscoe) ) {
	say "$_ ", m/$pattern/p ? "matched ${^MATCH}" : 'nope!';
	}

In this pattern, if the (B) matches, the conditional uses uster from the conditional. Otherwise, it uses imi. However, the only thing that can match besides a (B) is the other part of the alteration, the (M). The output shows that only Mimi or Buster matches:

Mimi matched Mimi
Buster matched Buster
Muster nope!
Bimi nope!
Roscoe nope!

You get the same results if you use (2) as the condition and re-arrange the order of the patterns:

my $pattern = qr/
	(?:(B)|(M))
	(?(2)imi|uster)
	/x;

Putting it together

The condition can also be the literal (DEFINE). In that case, Perl only allows a yes-branch. And, as its condition implies, it merely defines the patterns and does not execute them.

This means that you can create and label the subpatterns that you need, but not actually assert that any of them match the string. The definitions are just there. This pattern defines and labels three subpatterns then uses none of them:

use v5.10;

my $pattern = qr/
	(?(DEFINE)
		(?<cat>Buster)
		(?<dog>Addie)
		(?<bird>Poppy)
	)
	Mimi
	/x;

foreach ( 'Buster Mimi', 'Roscoe', 'Buster', 'Mimi' ) {
	say "$_ ", m/$pattern/ ? "matched" : 'nope!';
	}

It’s as if the DEFINE bit is not even there:

Buster Mimi matched
Roscoe nope!
Buster nope!
Mimi matched

Outside the (DEFINE), you can reference any of the subpatterns that you created:

use v5.10;

my $pattern = qr/
	(?(DEFINE)
		(?<cat>Buster)
		(?<dog>Addie)
		(?<bird>Poppy)
	)
	(?&cat)
	/x;

foreach ( 'Buster Mimi', 'Roscoe', 'Buster', 'Mimi' ) {
	say "$_ ", m/$pattern/ ? "matched" : 'nope!';
	}

Now Buster matches because you reference that defined subpattern:

Buster Mimi matched
Roscoe nope!
Buster matched
Mimi nope!

Now it’s time for the grammar. Inside the (DEFINE), you can reference subpatterns you haven’t defined yet, and your subpatterns can get arbitrarily complex:

use v5.10;

my $pattern = qr/
	(?(DEFINE)
		(?<male> Buster | Roscoe )
		(?<female> Mimi | Juliet )
		(?<cat> (?&male) | (?&female) )
		(?<dog>Addie)
		(?<bird>Poppy)
	)
	(?&cat)
	/x;

foreach ( 'Addie', 'Roscoe', 'Buster', 'Mimi' ) {
	say "$_ ", m/$pattern/ ? "matched" : 'nope!';
	}

Even though the cat names are in two different subpatterns, the cat subpattern unifies them so all the cat names match:

Addie nope!
Roscoe matched
Buster matched
Mimi matched

You should now be able to understand this regular expression from Tom Christainsen (appearing Stackoverflow). You might have to pick it apart, but you know how all the parts fit together to match the Internet Message Format defined in RFC 5322:

$rfc5322 = qr{

   (?(DEFINE)

     (?<address>         (?&mailbox) | (?&group))
     (?<mailbox>         (?&name_addr) | (?&addr_spec))
     (?<name_addr>       (?&display_name)? (?&angle_addr))
     (?<angle_addr>      (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
     (?<group>           (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ; (?&CFWS)?)
     (?<display_name>    (?&phrase))
     (?<mailbox_list>    (?&mailbox) (?: , (?&mailbox))*)

     (?<addr_spec>       (?&local_part) \@ (?&domain))
     (?<local_part>      (?&dot_atom) | (?&quoted_string))
     (?<domain>          (?&dot_atom) | (?&domain_literal))
     (?<domain_literal>  (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
                                   \] (?&CFWS)?)
     (?<dcontent>        (?&dtext) | (?&quoted_pair))
     (?<dtext>           (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])

     (?<atext>           (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+-/=?^_`{|}~])
     (?<atom>            (?&CFWS)? (?&atext)+ (?&CFWS)?)
     (?<dot_atom>        (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
     (?<dot_atom_text>   (?&atext)+ (?: \. (?&atext)+)*)

     (?<text>            [\x01-\x09\x0b\x0c\x0e-\x7f])
     (?<quoted_pair>     \\ (?&text))

     (?<qtext>           (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
     (?<qcontent>        (?&qtext) | (?&quoted_pair))
     (?<quoted_string>   (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
                          (?&FWS)? (?&DQUOTE) (?&CFWS)?)

     (?<word>            (?&atom) | (?&quoted_string))
     (?<phrase>          (?&word)+)

     # Folding white space
     (?<FWS>             (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
     (?<ctext>           (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
     (?<ccontent>        (?&ctext) | (?&quoted_pair) | (?&comment))
     (?<comment>         \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
     (?<CFWS>            (?: (?&FWS)? (?&comment))*
                         (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))

     # No whitespace control
     (?<NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])

     (?<ALPHA>           [A-Za-z])
     (?<DIGIT>           [0-9])
     (?<CRLF>            \x0d \x0a)
     (?<DQUOTE>          ")
     (?<WSP>             [\x20\x09])
   )

   (?&address)

}x;

If that’s not clever enough for you, try Tom’s use of (DEFINE) to properly parse HTML.

Things to remember

  • You can reference a named subpattern with (?&NAME)
  • You can choose a subpattern with a condition (?(condition)yes-pattern|no-pattern)
  • You can define and label subpatterns for later use with (DEFINE)

Create your own dualvars

Perl’s basic data type is the scalar, which takes its name from the mathematical term for “single item”. However, the scalar is really two things. You probably know that a scalar can be either a number or a string, or a number that looks the same as its string, or a string that can be a number. What you probably don’t know is that a scalar can be two separate and unrelated values at the same time, making it a dualvar.

You’re already using a dualvar without knowing it. The $! variable, which holds the value of the last system error, is most often used in its string form:

open my $fh, '>', $filename or die "Error: $!";

If something goes wrong with that open, you’ll get an error such as these:

No such file or directory
Permission denied

Both of those error messages have numbers associated with them. You can output their numeric value

open my $fh, '>', $filename or die $! + 0;

Now the errors are numbers, which correspond to the errno value for the system call:

2
13

These numbers are keys in the %!, and the value for that key is true if that was the last system error.

Perl, on its own, doesn’t give you a way to create this sort of variable yourself. When you assign a new value to a scalar, whether string or number, Perl clears the previous values it has. When you use a number as a string, Perl converts it to a string, and the same the other way around.

You can watch this with Devel::Peek. Here’s a program that sets a string value:

use Devel::Peek;

my $value = 'abc';

Dump( $value );

In the scalar record, the POK flag is set, indicating the variable has a string form, and the PV slot has a value (see perlguts for more details):

SV = PV(0x100801070) at 0x100827810
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x100202870 "abc"\0
  CUR = 3
  LEN = 16

If you set a numeric value, the flags are different:

use Devel::Peek;

my $value = 137;

Dump( $value );

Now there’s an IOK flag, and one of the numeric slots, in this case the IV, is set:

SV = IV(0x100827800) at 0x100827810
  REFCNT = 1
  FLAGS = (PADMY,IOK,pIOK)
  IV = 137

However, if you take the numeric value and use it in a string context, Perl also creates a string version. Now the scalar has both the IOK and POK flags, and both the IV and PV slots have values:

137
SV = PVIV(0x100809208) at 0x100827840
  REFCNT = 1
  FLAGS = (PADMY,IOK,POK,pIOK,pPOK)
  IV = 137
  PV = 0x100202870 "137"\0
  CUR = 3
  LEN = 16

If you change the variable, some of those flags disappear, even though the values don’t necessarily disappear:

use v5.10;
use Devel::Peek;

my $value = 137;

$value = 'Buster';

Dump( $value );

Now the scalar’s value should be just Buster, but the IV slot still has the old 137. However, the IOK flag is gone:

SV = PVIV(0x100809208) at 0x100827840
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  IV = 137
  PV = 0x100202870 "Buster"\0
  CUR = 6
  LEN = 16

When you set the new value, and Perl hadn’t used it in a numeric context yet, there was no need to go through the work to translate the string value to the number value. Instead, it just unset the flag that denotes its okay to use the value as a number. You have to use it as a number to again:

use v5.10;
use Devel::Peek;

my $value = 137;

$value = 'Buster';

say $value + 0;

Dump( $value );

Now Perl converts it to a number and sets the numeric flags again:

0
SV = PVNV(0x100801e30) at 0x100827840
  REFCNT = 1
  FLAGS = (PADMY,POK,pIOK,pNOK,pPOK)
  IV = 0
  NV = 0
  PV = 0x100202870 "Buster"\0
  CUR = 6
  LEN = 16

That’s how it works if you go through the Perl interface, but if you play it a scalar through the XS interface, you can set whatever flags and values that you like. That’s exactly what Scalar::Util‘s dualvar does that for you:

use v5.10;
use Devel::Peek;
use Scalar::Util qw(dualvar);

my $value = dualvar 137, 'Buster';

Dump( $value );

say "$value";
say $value + 0;

Now you have a scalar which has unrelated numeric and string values, and the flags for both values are set:

SV = PVNV(0x100802010) at 0x1008277c8
  REFCNT = 1
  FLAGS = (PADMY,IOK,POK,pIOK,pPOK)
  IV = 137
  NV = 0
  PV = 0x100202870 "Buster"\0
  CUR = 6
  LEN = 16
Buster
137

Things to remember

  • Scalars can have both numeric and string values at the same time
  • Those two values can be unrelated
  • You can create your own dualvar with Scalar::Util

Make disposable web servers for testing

If you project depends on a interaction with a web server, especially a remote one, you have some challenges with testing that portion. Even if you can get it working for you, when you distribute your code, someone else might not be able to reach your server for testing. Instead of relying on an external server, you can use a local server that you write especially for your test suite.

This problem has a couple tricky parts. To run a test server from your own test suite, your server needs to bind to a port that’s not already in use. When it has a port, it needs to communicate that to your test script.

Another problem, which you won’t consider for this Item, involves the configurability of your program so you can change the hostname and port during the test. This Item assumes you’ve taken are of that bit.

The Test::Fake::HTTPD module can create a web server directly from your test script. This example creates a web server that returns the same JSON response for every request:

use Test::More;
use Test::Fake::HTTPD;

use Mojo::UserAgent;

my $httpd = run_http_server {
	my $request = shift;

	return [ 
		200, 
		[ 'Content-Type' => 'application/json' ], 
		[ '{ "cat": "Buster" }' ] 
		];
	};

ok( defined $httpd, 'Got a web server' );

diag( sprintf "You can connect to your server at %s.\n", $httpd->host_port );

my $response = Mojo::UserAgent->new->get(
	$httpd->endpoint
	)->res;

diag( $response->to_string );
is( $response->json->{cat}, 'Buster', 'Cat is Buster' );

done_testing();

The test output shows the a message telling you the host and port, as well as the response:

ok 1 - Got a web server
# You can connect to your server at 127.0.0.1:50602.
# HTTP/1.1 200 OK
# Content-Type: application/json
# Date: Wed, 07 Dec 2011 08:44:40 GMT
# Content-Length: 19
# Server: libwww-perl-daemon/6.00
# 
# { "cat": "Buster" }
ok 2 - Cat is Buster
1..2

When the test script ends (or the web server variable goes out of scope, so there’s nothing for you to cleanup. It’s a disposable web server.

In your web server, you can do anything that you like. It doesn’t have to implement everything, or even close to everything, that the production server does. It just has to return responses that you can use in your tests. That means that you get to control not only the success, but also the failures.

That Server line gives you a hint about what created the $request object—it’s LWP behind the scenes, so it’s HTTP::Request. You can get the requested path with the uri method and decide what to do:

use strict;
use warnings;

use Test::More;
use Test::Fake::HTTPD;

use Mojo::UserAgent;
use URI;

my $httpd = run_http_server {
	my $request = shift;

	my $uri = $request->uri;

	return do {
		if( $uri->path eq '/' ) {
			[ 
				200, 
				[ 'Content-Type' => 'text/plain' ], 
				[ "Ask about our cats!" ], 
			]
			}
		elsif( $uri->path eq '/cats' ) {
			[ 
				200, 
				[ 'Content-Type' => 'application/json' ], 
				[ '{ "cat": "Buster" }' ], 
			]
			}
		elsif( $uri->path eq '/dogs' ) {
			[ 
				408, 
				[ 'Content-Type' => 'text/plain' ], 
				[ "We don't walk dogs" ], 
			]
			}
		else {
			[ 
				404, 
				[ 'Content-Type' => 'text/plain' ], 
				[ "Not Found" ], 
			]
			}
		}		
	};

ok( defined $httpd, 'Got a web server' );
diag( sprintf "You can connect to your server at %s.\n", $httpd->host_port );

my $uri = URI->new( $httpd->endpoint );
isa_ok( $uri, 'URI' );


subtest '/' => sub {
	plan tests => 3;
	
	my $this = $uri->clone;
	isa_ok( $this, 'URI' );
	$this->path( '/' );
	
	my $response = Mojo::UserAgent->new->get( $this )->res;
	
	is( $response->headers->content_type, 
		'text/plain', 'Top level is plain text' );
	like( $response->body , qr/cats/, 'Top level has cats' );
	};
	
subtest '/cats' => sub {
	plan tests => 2;
	
	my $this = $uri->clone;
	isa_ok( $this, 'URI' );
	$this->path( '/cats' );
	
	my $response = Mojo::UserAgent->new->get( $this )->res;
	
	is( $response->headers->content_type, 
		'application/json', '/cats returns JSON' );
	};

subtest '/not_there' => sub {
	plan tests => 2;
	
	my $this = $uri->clone;
	isa_ok( $this, 'URI' );
	$this->path( '/not_there' );
	
	my $response = Mojo::UserAgent->new->get( $this )->res;

	is( $response->code, 
		'404', '/not_there returns 404' );
	};

done_testing();

With subtests organized around accesses to paths, the TAP isn’t so hard to read (although you probably won’t have to look at it yourself):

ok 1 - Got a web server
# You can connect to your server at 127.0.0.1:50498.
ok 2 - The object isa URI
    1..3
    ok 1 - The object isa URI
    ok 2 - Top level is plain text
    ok 3 - Top level has cats
ok 3 - /
    1..2
    ok 1 - The object isa URI
    ok 2 - /cats returns JSON
ok 4 - /cats
    1..2
    ok 1 - The object isa URI
    ok 2 - /not_there returns 404
ok 5 - /not_there
1..5

If you need to use the same test webserver in more than one test script, you move it into its own file.

use strict;
use warnings;

use Test::More;

use Mojo::UserAgent;
use URI;

require 'server.pl';
my $httpd = get_http_server();

ok( defined $httpd, 'Got a web server' );
diag( sprintf "You can connect to your server at %s.\n", $httpd->host_port );

my $uri = URI->new( $httpd->endpoint );
isa_ok( $uri, 'URI' );

...

The server.pl file wraps the call to run_http_server. But, if you are going to do that, you might as well skip the convenience method and set up your own object. You can change the timeout value, for instance:

use Test::Fake::HTTPD;

sub get_http_server {
	my $httpd = Test::Fake::HTTPD->new(
		timeout => 30,
		);

	$httpd->run( sub {
		my $request = shift;
		...;		
		} );
		
	$httpd;
	}

Different test scripts each get their own test web server. If you’re running several tests scripts in parallel, you’ll start several servers at the same time, which not be that kind to your system or to the other people using it. If you don’t like that, you could set up a single server at the start of your test run, share it with all tests, and shut down everything at the end, although you won’t see that in this Item.

Things to remember

  • Test web interactions locally
  • Use Test::Fake::HTTPD to create cheap, disposable web servers

Know split’s special cases

Perl’s split has some special cases and some perhaps surprising cases. The empty pattern, zero width match, the special argument ' ', and the /^/ act differently than you might expect from the general rule.

The empty pattern, //

The empty pattern is a special case that’s designed to give you a list of characters. This pattern specifically has nothing in it and is different than a pattern that matches an empty string (that’s next). For this, split returns a list of characters:

use utf8;

use Data::Printer;
my @characters = split //, 'Büster';

p( @characters );

The output shows a list of characters:

[
    [0] "B",
    [1] "ü",
    [2] "s",
    [3] "t",
    [4] "e",
    [5] "r"
]

This is specifically characters, not grapheme clusters. Depending on the normalization of your source code or input, you can get different results:

use utf8;

use Data::Printer;
use Unicode::Normalize qw(NFD);

my @characters = split //, NFD( 'Büster' );

p( @characters );

Now the grapheme cluster ü is actually two characters, the u (U+0075 ʟᴀᴛɪɴ ꜱᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ᴜ) and the ¨ (U+0308 ᴄᴏᴍʙɪɴɪɴɢ ᴅɪᴀᴇʀᴇꜱɪꜱ), instead of the single ü (U+00FC ʟᴀᴛɪɴ ꜱᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ᴜ ᴡɪᴛʜ ᴅɪᴀᴇʀᴇꜱɪꜱ):

[
    [0] "B",
    [1] "u",
    [2] "¨",
    [3] "s",
    [4] "t",
    [5] "e",
    [6] "r"
]

You can review grapheme clusters in Treat Unicode strings as grapheme clusters.

Matching the empty string

Successfully matching no characters isn’t really a special case, but people are sometimes surprised about it because it seems special. Your split pattern might match an empty string. This is different from the empty pattern because you actually have a pattern, even if it might match zero characters. This is also distinct from a pattern that doesn’t match:

use v5.10;

my $_ = 'Mimi';

say "Matched empty pattern" if //;
say "Matched optional whitespace" if /\s*/;
say "Matched zero width assertion" if /(?=\w+)/;
say "How did Buster match?" if /Buster/;

The first three of these patterns match successfully but matches zero characters, while the fourth fails:

Matched empty pattern
Matched optional whitespace
Matched zero width assertion

It’s easy to construct a pattern that will match zero characters even though it matches successfully. The ? (zero or one) and * (zero or more) quantifiers do that quite nicely. Zero width assertions, such as the boundaries and lookarounds, do that too. If the pattern can match zero characters successfully, Perl splits into characters:

use Data::Printer;
my @characters = split /\s*/, 'Buster';

p( @characters );
[
    [0] "B",
    [1] "u",
    [2] "s",
    [3] "t",
    [4] "e",
    [5] "r"
]

The pattern doesn’t have to match zero characters for all separators.

use Data::Printer;
my @characters = split /\s*/, 'Buster and Mimi';

p( @characters );

Notice that there are no spaces in @characters, since split matched those as separator characters:

[
    [0]  "B",
    [1]  "u",
    [2]  "s",
    [3]  "t",
    [4]  "e",
    [5]  "r",
    [6]  "a",
    [7]  "n",
    [8]  "d",
    [9]  "M",
    [10] "i",
    [11] "m",
    [12] "i"
]

The single space, ‘ ‘

The single space in quotes, single or double, is a special case. It splits on whitespace, but unlike the pattern that is a single space, the one in quotes discards empty leading fields:

use Data::Printer;
my @characters = split ' ', '  Buster and Mimi';

p( @characters );

You get just the non-whitespace with no empty fields:

[
    [0] "Buster",
    [1] "and",
    [2] "Mimi"
]

This behavior comes from awk:

#!/usr/bin/awk -f
BEGIN {
    string="  Buster Mimi Roscoe";
    search=" ";
    n=split(string,array," ");
    print("[");
    for (i=1;i<=n;i++) {
        printf("    [%d] \"%s\"\n",i,array[i]);
    }
    print("]");
    exit;
}

You end up with almost the same input, although the indices are one greater:

[
    [1] "Buster"
    [2] "Mimi"
    [3] "Roscoe"
]

Back in Perl, if you tried that with the normal match operator delimiters, you get a different result:

use Data::Printer;
my @characters = split / /, '  Buster and Mimi';

p( @characters );

This time you kept the empty leading fields:

[
    [0] "",
    [1] "",
    [2] "Buster",
    [3] "and",
    [4] "Mimi"
]

If you include the m in front of the quotes though, you lose the special magic:

use Data::Printer;
my @characters = split m' ', '  Buster and Mimi';

p( @characters );

The empty leading fields are back:

[
    [0] "",
    [1] "",
    [2] "Buster",
    [3] "and",
    [4] "Mimi"
]

Splitting lines

The special pattern of just the beginning-of-line anchor, even without the /m flag, breaks a multi-line string into lines:

use Data::Printer;

my $string = <<'HERE';
Line one
Line two
Line three
HERE

my @lines = split /^/, $string;

p( @lines );

Even without the /m you get separate lines:

[
    [0] "Line one
",
    [1] "Line two
",
    [2] "Line three
"
]

This only works if the pattern is exactly /^/. If you put anything else in the pattern, you don't get the special behavior, even if it's a zero width match:

...; # same as before

my @lines = split /^(?=Line)/, $string;  # Oops

p( @lines );

Now there's only one field:

[
    [0] "Line one
Line two
Line three
Line four
"
]

Things to remember

  • The empty pattern // splits on characters, but not grapheme clusters
  • A zero width successful match splits on characters too
  • The single space in quotes splits on whitespace and discards leading empty fields
  • The ^ anchor by itself splits into lines, even without the /m

Trace your Perl programs

You can write your own mini (or micro) debuggers to watch your program run. You might want to do this when the other Perl debuggers are too heavyweight (or even too interactive) for your immediate problem.

The Devel::Trace module is a simple debugging tool that lets you watch the flow of control as your program runs. It’s not a debugger and you don’t get to interact with it, but it can be useful in some operations.

If you were a shell programmer in a former life, you probably used the -x switch to watch your shell script execute. Here’s simple script:

#!/bin/sh

echo $HOME
cd ~/Dev/effective-perl-programming
du -s

When you run it with the -x, it prints each line of the script before it executes it:

% sh -x trace.sh
+ echo /Users/brian
/Users/brian
+ cd /Users/brian/Dev/effective-perl-programming
+ du -s
200320  .

You can do the same thing with a Perl program. Here’s a small program to find factorials:

use v5.10;
use strict;
use warnings;

my $n = $ARGV[0];

my $product = 1;

foreach ( 1 .. $n ) {
	$product *= $_;
	}
	
say $product;

The output shows each line as perl executes it:

% perl -d:Trace factorial.pl
>> factorial.pl:5: my $n = $ARGV[0];
>> factorial.pl:7: my $product = 1;
>> factorial.pl:9: foreach ( 1 .. $n ) {
>> factorial.pl:10: 	$product *= $_;
>> factorial.pl:10: 	$product *= $_;
>> factorial.pl:10: 	$product *= $_;
>> factorial.pl:13: say $product;
6

Devel::Trace doesn’t limit itself to your program though. If you use the bignum module, the multiplication happens on objects instead of Perl numbers:

use v5.10;
use strict;
use warnings;
use bignum;

my $n = $ARGV[0];

my $product = 1;

foreach ( 1 .. $n ) {
	$product *= $_;
	}
	
say $product;

Try it with a number that requires bignum, there’s quite a bit of output:

% perl -d:Trace fact.pl 200
>> fact.pl:6: my $n = $ARGV[0];
>> fact.pl:8: my $product = 1;
>> fact.pl:10: foreach ( 1 .. $n ) {
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt.pm:152: '0+' => sub { $_[0]->numify(); }
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt.pm:839:   my $x = shift; $x = $class->new($x) unless ref $x;
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt.pm:839:   my $x = shift; $x = $class->new($x) unless ref $x;
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt.pm:841:   return $x->bstr() if $x->{sign} !~ /^[+-]$/;
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt.pm:842:   my $num = $CALC->_num($x->{value});
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt/Calc.pm:276:     my $x = $_[1];
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt/Calc.pm:278:     return 0 + $x->[0] if scalar @$x == 1;      # below $BASE
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt.pm:843:   return -$num if $x->{sign} eq '-';
.... # about 23,800 more lines
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt/Calc.pm:268:     $l--;
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt/Calc.pm:267:     $ret .= substr($z.$ar->[$l],-$BASE_LEN); # fastest way I could think of
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt/Calc.pm:268:     $l--;
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt/Calc.pm:270:   $ret;
788657867364790503552363213932185062295135977687173263294742533244359449963403342920304284011984623904177212138919638830257642790242637105061926624952829931113462857270763317237396988943922445621451664240254033291864131227428294853277524242407573903240321257405579568660226031904170324062351700858796178922222789623703897374720000000000000000000000000000000000000000000000000

That’s pretty ugly, and the Devel::Trace module doesn’t give you any way to change it. That’s not surprising consider the source code, which is

package Devel::Trace;
$VERSION = '0.11';
$TRACE = 1;

# This is the important part.  The rest is just fluff.
sub DB::DB {
  return unless $TRACE;
  my ($p, $f, $l) = caller;
  my $code = \@{"::_<$f"};
  print STDERR ">> $f:$l: $code->[$l]";
}


sub import {
  my $package = shift;
  foreach (@_) {
    if ($_ eq 'trace') {
      my $caller = caller;
      *{$caller . '::trace'} = \&{$package . '::trace'};
    } else {
      use Carp;
      croak "Package $package does not export `$_'; aborting";
    }
  }
}

my %tracearg = ('on' => 1, 'off' => 0);
sub trace {
  my $arg = shift;
  $arg = $tracearg{$arg} while exists $tracearg{$arg};
  $TRACE = $arg;
}

1;

The magic happens in DB::DB, which, when perl calls before every statement. This module is Devel::Trace, but its defining its meat in the package DB. The debugger stashes lots of information about the program in various package variables. The @{"::_<$f"} is a symbolic reference, using the filename in $f to translate the filename and line number from caller to a line of code.

You don't need the module, really. The import and trace subroutines are there just to turn things on and off. To do that, you'd have to modify your code to litter calls to trace throughout the code. You usually don't want to modify the code.

You can reduce that module to a command line. When you run the debugger, perl inserts the value in the PERL5BD environment variable before the source in your program. The reduces Devel::Trace to:

PERL5DB='sub DB::DB {my @c=caller;print STDERR qq|@c[1,2] ${"::_<$c[1]"}[$c[2]]|}' perl5.14.1 -d fact.pl 4

That's a bit hard to use, especially on a command line that tries to interpret those sigils characters. You don't need to worry about that now that you know how simple it is. You can create your own throw-away module to trace exactly what you want. Since the current directory is usually already in @INC, you can create a Devel/MyTrace.pm. This one saves to the file you specify:

use v5.10;
use autodie;

BEGIN {

my $trace_file = $ENV{TRACE_FILE} // "mytrace.$$";
print STDERR "Saving trace to $trace_file\n";
open my $fh, '>>', $trace_file;

sub DB::DB {
	my( $package, $file, $line ) = caller;
	my $code = \@{"::_<$file"};
	print $fh "[@{[time]}] $file $line $code->[$line]";
	}
}

1;

That is only slightly improved on Devel::Trace because it lets you specify the file in which to save your trace. You can improve that a bit to also handle the special - or STDERR virtual filenames:

use v5.10;
use autodie;

BEGIN {

my $trace_file = $ENV{TRACE_FILE} // "mytrace.$$";
print STDERR "Saving trace to $trace_file\n";

my $fh = do {
	   if( $trace_file eq '-'      ) { \*STDOUT }
	elsif( $trace_file eq 'STDERR' ) { \*STDERR }
	else { 
		open my $fh, '>>', $trace_file;
		$fh;
		}
	};

sub DB::DB {
	my( $package, $file, $line ) = caller;
	my $code = \@{"::_<$file"};
	print $fh "[@{[time]}] $file $l $code->[$line]";
	}
}

1;

You can go one step further to exclude all code except for the primary file:

use v5.10;
use autodie;

BEGIN {

my $trace_file = $ENV{TRACE_FILE} // "mytrace.$$";
print STDERR "Saving trace to $trace_file\n";

my $fh = do {
	   if( $trace_file eq '-'      ) { \*STDOUT }
	elsif( $trace_file eq 'STDERR' ) { \*STDERR }
	else { 
		open my $fh, '>>', $trace_file;
		$fh;
		}
	};

sub DB::DB {
	my( $package, $file, $line ) = caller;
	return unless $file eq $0;
	my $code = \@{"::_<$file"};
	print $fh "[@{[time]}] $file $l $code->[$line]";
	}
}

1;

From there, what your tracer does is up to you and what problem you are trying to track down. It doesn't have to be fancy and you don't have to spend a lot of time making it perfect or "pure".

Things to remember

  • You can trace your Perl program just like a shell program
  • Under the -d switch, perl calls &DB::DB before each statement
  • You can write your own throw-away tracer

Use Data::Printer to debug data structures

You can use several different Perl modules to inspect data structures. Many of these modules, however, are really two tools in one. Besides showing a data structure as a string, they also serialize the data as Perl code so you can reconstruct the data structure. That second job often makes things hard for you. If you don’t need the serialization
job, don’t use a module that insists on it.

The Data::Dumper module is popular because it comes with Perl. Here’s a program that we’ll use for the rest of the Item, save for changes to the module dumping the structure:

use Data::Dumper qw(Dumper);
use DateTime;
use HTTP::Request;

my $request = HTTP::Request->new(
		GET => 'http://www.perl.org',
		);

$request->header( 'X-Perl' => '5.12.2' );
$request->header( 'Cat'    => 'Buster' );

my $data = {
	hash => {
		cat  => 'Buster',
		dog  => 'Addy',
		bird => 'Poppy',
		},
	array => [ qw( a b c ) ],
	datetime => DateTime->now,
	reqeust  => $request,
	};

print Dumper( $data );

The output is a Perl data structure, suitable for eval. That makes it a bit verbose and ugly:

$VAR1 = {
      'array' => [
             'a',
             'b',
             'c'
           ],
      'hash' => {
            'cat' => 'Buster',
            'dog' => 'Addy',
            'bird' => 'Poppy'
          },
      'reqeust' => bless( {
                '_content' => '',
                '_uri' => bless( do{\(my $o = 'http://www.perl.org')}, 'URI::http' ),
                '_headers' => bless( {
                             'cat' => 'Buster',
                             'x-perl' => '5.12.2'
                           }, 'HTTP::Headers' ),
                '_method' => 'GET'
                }, 'HTTP::Request' ),
      'datetime' => bless( {
                 'local_rd_secs' => 68540,
                 'local_rd_days' => 734452,
                 'rd_nanosecs' => 0,
                 'locale' => bless( {
                            'default_time_format_length' => 'medium',
                            'native_territory' => 'United States',
                            'native_language' => 'English',
                            'native_complete_name' => 'English United States',
                            'en_language' => 'English',
                            'id' => 'en_US',
                            'default_date_format_length' => 'medium',
                            'en_complete_name' => 'English United States',
                            'en_territory' => 'United States'
                          }, 'DateTime::Locale::en_US' ),
                 'local_c' => {
                        'hour' => 19,
                        'second' => 20,
                        'month' => 11,
                        'quarter' => 4,
                        'day_of_year' => 315,
                        'day_of_quarter' => 42,
                        'minute' => 2,
                        'day' => 11,
                        'day_of_week' => 5,
                        'year' => 2011
                        },
                 'utc_rd_secs' => 68540,
                 'formatter' => undef,
                 'tz' => bless( {
                          'name' => 'UTC'
                        }, 'DateTime::TimeZone::UTC' ),
                 'utc_year' => 2012,
                 'utc_rd_days' => 734452,
                 'offset_modifier' => 0
                 }, 'DateTime' )
    };

The Data::Dump also serializes, and is a cleaner Data::Dumper. In void context, it automatically prints for you:

use Data::Dump qw(pp);

...; # same $data thing as before

pp( $data );

The output looks a lot like the Data::Dumper output because it has to be a Perl data structure:

{
  array  => ["a", "b", "c"],
  datetime => bless({
        formatter     => undef,
        local_c     => {
                   day => 11,
                   day_of_quarter => 42,
                   day_of_week => 5,
                   day_of_year => 315,
                   hour => 19,
                   minute => 18,
                   month => 11,
                   quarter => 4,
                   second => 33,
                   year => 2011,
                   },
        local_rd_days   => 734452,
        local_rd_secs   => 69513,
        locale      => bless({
                   default_date_format_length => "medium",
                   default_time_format_length => "medium",
                   en_complete_name => "English United States",
                   en_language => "English",
                   en_territory => "United States",
                   id => "en_US",
                   native_complete_name => "English United States",
                   native_language => "English",
                   native_territory => "United States",
                   }, "DateTime::Locale::en_US"),
        offset_modifier => 0,
        rd_nanosecs   => 0,
        tz        => bless({ name => "UTC" }, "DateTime::TimeZone::UTC"),
        utc_rd_days   => 734452,
        utc_rd_secs   => 69513,
        utc_year    => 2012,
        }, "DateTime"),
  hash   => { bird => "Poppy", cat => "Buster", dog => "Addy" },
  reqeust  => bless({
        _content => "",
        _headers => bless({ "cat" => "Buster", "x-perl" => "5.12.2" }, "HTTP::Headers"),
        _method  => "GET",
        _uri   => bless(do{\(my $o = "http://www.perl.org")}, "URI::http"),
        }, "HTTP::Request"),
}

Steven Haryanto added a filter feature to an existing interface, which you can see in Use Data::Dump filters for nicer pretty-printing. You can get better control of the parts you’d want to distill, such as that DateTime:

{
  array => # some items hidden
  [2011-02-03, "d", "...", "n"],
  datetime => 2011-02-03,
  hash => { bird => "Poppy", cat => "Buster", dog => "Addy" },
}

If you forget about the serialization, though, you can do much better. Often, you want to inspect a data structure to see what’s on the inside without saving it for future use. If that’s the case, you don’t need to see the data structure as Perl code and the pretty printer can organizer the data much better and provide more information. The Data::Printer module doesn’t care at all about serialization. In void context, its p automatically prints:

use Data::Printer;

...; # same $data thing as before

p( $data );

The output is as verbose, but it’s also much more dense. When it prints an object, it shows you the methods in the class:

\ {
    array      [
        [0] "a",
        [1] "b",
        [2] "c"
    ],
    datetime   DateTime  {
        public methods (134) : add, add_duration, am_or_pm, bootstrap, ce_year, christian_era, clone, compare, compare_ignore_floating, date, datetime, day, day_abbr, day_name, day_of_month, day_of_month_0, day_of_quarter, day_of_quarter_0, day_of_week, day_of_week_0, day_of_year, day_of_year_0, day_0, DefaultLanguage, DefaultLocale, delta_days, delta_md, delta_ms, dmy, doq, doq_0, dow, dow_0, doy, doy_0, duration_class, epoch, era, era_abbr, era_name, format_cldr, formatter, fractional_second, from_day_of_year, from_epoch, from_object, hires_epoch, hms, hour, hour_1, hour_12, hour_12_0, INFINITY, is_dst, is_finite, is_infinite, is_leap_year, iso8601, jd, language, last_day_of_month, leap_seconds, local_day_of_week, local_rd_as_seconds, local_rd_values, locale, MAX_NANOSECONDS, mday, mday_0, mdy, microsecond, millisecond, min, minute, mjd, mon, mon_0, month, month_abbr, month_name, month_0, NAN, nanosecond, NEG_INFINITY, new, now, offset, quarter, quarter_abbr, quarter_name, quarter_0, sec, second, SECONDS_PER_DAY, secular_era, set, set_day, set_formatter, set_hour, set_locale, set_minute, set_month, set_nanosecond, set_second, set_time_zone, set_year, STORABLE_freeze, STORABLE_thaw, strftime, subtract, subtract_datetime, subtract_datetime_absolute, subtract_duration, time, time_zone, time_zone_long_name, time_zone_short_name, today, truncate, utc_rd_as_seconds, utc_rd_values, utc_year, wday, wday_0, week, week_number, week_of_month, week_year, weekday_of_month, year, year_with_christian_era, year_with_era, year_with_secular_era, ymd
        private methods (38) : _accumulated_leap_seconds, _add_overload, _adjust_for_positive_difference, _calc_local_components, _calc_local_rd, _calc_utc_components, _calc_utc_rd, _cldr_pattern, _compare, _compare_overload, _day_has_leap_second, _day_length, _era_index, _format_nanosecs, _handle_offset_modifier, _is_leap_year, _month_length, _new, _new_from_self, _normalize_leap_seconds, _normalize_nanoseconds, _normalize_seconds, _normalize_tai_seconds, _offset_for_local_datetime, _rd2ymd, _seconds_as_components, _space_padded_string, _string_compare_overload, _string_equals_overload, _string_not_equals_overload, _stringify, _subtract_overload, _time_as_seconds, _utc_hms, _utc_ymd, _weeks_in_year, _ymd2rd, _zero_padded_number
        internals: {
            formatter         undef,
            local_c           {
                day              11,
                day_of_quarter   42,
                day_of_week      5,
                day_of_year      315,
                hour             19,
                minute           41,
                month            11,
                quarter          4,
                second           42,
                year             2011
            },
            local_rd_days     734452,
            local_rd_secs     70902,
            locale            DateTime::Locale::en_US,
            offset_modifier   0,
            rd_nanosecs       0,
            tz                DateTime::TimeZone::UTC,
            utc_rd_days       734452,
            utc_rd_secs       70902,
            utc_year          2012
        }
    },
    hash       {
        bird   "Poppy",
        cat    "Buster",
        dog    "Addy"
    },
    request    HTTP::Request  {
        Parents       HTTP::Message
        Linear @ISA   HTTP::Request, HTTP::Message
        public methods (10) : accept_decodable, as_string, clone, dump, method, new, parse, uri, uri_canonical, url
        private methods (0)
        internals: {
            _content   "",
            _headers   HTTP::Headers,
            _method    "GET",
            _uri       URI::http
        }
    }
}

You probably don’t want to see all that internal gunk from DateTime or HTTP::Request, so you can set filters from them to print them however you like:

use Data::Printer {
    filters => {
       'DateTime'      => sub { "DateTime => $_[0]" },
       'HTTP::Request' => sub { "URL => " . $_[0]->uri }, 
    },
};

...; # same $data thing as before

p( $data );

Now you can see what you need to see much easier:

\ {
    array      [
        [0] "a",
        [1] "b",
        [2] "c"
    ],
    datetime   DateTime => 2011-11-11T19:51:06,
    hash       {
        bird   "Poppy",
        cat    "Buster",
        dog    "Addy"
    },
    request    URL => http://www.perl.org
}

So far, you’ve changed the import list to specify what you wanted, but you can change it each time that you want to dump something:

p( $data, { index => 0 } );

Now you don’t have array indices:

\ {
    array      [
        "a",
        "b",
        "c"
    ],
    datetime   DateTime => 2011-11-11T20:25:22,
    hash       {
        bird   "Poppy",
        cat    "Buster",
        dog    "Addy"
    },
    request    URL => http://www.perl.org
}

You can make colorized output too by setting another property:

use Data::Printer {
	colored => 1,
	filters => {
		'DateTime'      => sub { "DateTime => $_[0]" },
		'HTTP::Request' => sub { "URL => " . $_[0]->uri }, 
	},
};

...; # same $data thing as before

p( $data );

And you can change the colors if you don’t like the default set. You have to choose a valid Term::ANSI color:

use Data::Printer {
	colored => 1,
	color => {
		array       => 'yellow',
		string      => 'cyan',
		hash        => 'green',
	},
	filters => {
		'DateTime'      => sub { "DateTime => $_[0]" },
		'HTTP::Request' => sub { "URL => " . $_[0]->uri }, 
	},
};

...; # same $data thing as before

p( $data );

This might be more pleasing to you:

Lastly, one of the most annoying “features” of a pretty printer is the constant reference passing. Since Perl flattens its argument list into a single list, to maintain data structure identities, you have to pass them as a reference:

use Data::Dumper;
print Dumper( \%hash );
use Data::Dump;
pp( \%hash );

You can do that with Data::Printer too:

use Data::Printer;
p( \%hash );

Data::Printer uses prototypes to make that easier for you. The Dumper and pp each dump a list of structures, but Data::Printer‘s p dumps exactly one structure. As such, it can use prototypes to recognize a whole hash or array as the first argument:

use Data::Printer;
p( %hash );

This feature has a few oddities, but Breno explains them in the documentation.

Things to remember

  • Serialization and inspection are different tasks
  • Most Perl pretty printers try to serialize
  • The Data::Printer

Make grep-like syntax

To create grep- or map-like syntax, you need to use Perl’s prototypes, despite whatever we told you in Understand why you probably don’t need prototypes. Perl needs the special hints that prototypes to parse a block as an argument to a subroutine.

First, remember the forms of grep. There’s a single expression version and a block version:

grep EXPR, @input       # with a comma
grep { ... } @input     # no comma

That block, the {...}, is an inline subroutine where the current element shows up in $_:

my @odds = grep { $_ % 2 } @input;

For either form, there’s a scalar and list return value, depending on context (Item 12. Understand context and how it affects operations):

my @array = grep ...;
my $count = grep ...;

You can make your own subroutines that work just like grep. The prototype character & tells perl to expect a subroutine reference. However, it can only be the first argument. To try it, define a subroutine that takes a single argument, a code reference:

sub run_it (&) {
	my $sub = shift;
	$sub->();
	}

You can call that subroutine in several ways. You can use a block, the sub keyword with a block, or a reference to a subroutine, or a reference to a named subroutine:

use v5.10;

sub named { say "I have a name" }

my $result = run_it { say "Hello!" };
   $result = run_it sub { say "I have a keyword!" };
   $result = run_it \&named;

However, perl is not smart enough to recognize other forms. It won’t like a scalar variable that might have a code reference later, and it can’t take a bareword that is the name of a defined subroutine (like sort will). These are compile-time errors:

use v5.10;

sub named { say "I have a name" }
my $code_ref = \&named;

my $result = run_it $code_ref;
   $result = run_it named;
   $result = run_it &named;

Handling grep’s second argument

Now you can take a code reference as an argument. The next part of the grep syntax is the the input. You could use the @ character to denote a list of arguments (not an array argument), and that appears to work:

use v5.10;
use warnings;

sub do_with_array (&@) { 
	my( $sub, @args ) = @_;
	my @output;
	
	foreach my $elem ( @args ) {
		local $_ = $elem;
		push @output, $sub->();
		}
	return @output;
}

sub other_cats { qw(Ellie Ginger) }
 
my @cats = qw(Buster Mimi Roscoe);

@result = do_with_array { say $_ } @cats;
@result = do_with_array { say $_ } qw(Buster Mimi Roscoe);
@result = do_with_array { say $_ } 1 .. 10;
@result = do_with_array { say $_ } other_cats();

Like grep which can alias $_ to the original data, you can also means that you can change the original data with your subroutine argument if you use @_ (Item 114. Know when arrays are modified in a loop):

use v5.10;
use warnings;

sub do_with_array (&@) {
   my $sub = shift;
   my @output;
   local $_;

   foreach ( @_ ) {
	   push @output, $sub->();
   }
   return @output;
}

my @original   = qw(1 2 3);
my @new        = do_with_array { $_ += 2 } @original;
say "new = @new";             # 3 4 5
say "original = @original";  # 3 4 5

You might try the \@ prototype, but that limits you in other ways. Now perl expects a named array as an argument. You cannot use an array reference, range, literal list, or the return values from a subroutine call. It’s a named array or an error. That’s no good.

Likewise, you might use the + prototype introduced in Perl 5.14. This allows you to use an array or an array reference argument. Perl doesn’t complain if you use a range or a subroutine call, but it also doesn’t do the right thing:

use v5.10;
use warnings;

sub do_with_array (&+) { 
	my( $sub, $array ) = @_;
	my @output;
	
	foreach my $elem ( @$array ) {
		local $_ = $elem;
		push @output, $sub->();
		}
	return @output;
}

sub other_cats { qw(Ellie Ginger) }

my @cats = qw(Buster Mimi Roscoe);

@result = do_with_array { say $_ } @cats;
@result = do_with_array { say $_ } [ 'a' .. 'g' ];
@result = do_with_array { say $_ } 1 .. 10;
@result = do_with_array { say $_ } other_cats();

The named subroutine and the array reference work just fine. Perl does something weird with the range, and the subroutine call appears to not happen at all:

Buster
Mimi
Roscoe
a
b
c
d
e
f
g
Use of uninitialized value $. in range (or flip) at run_it.pl line 21.

Not only that, but the + prototype character also allows named hashes and hash references:

use v5.10;
use warnings;

sub do_with_array (&+) { 
	my( $sub, $array ) = @_;
	my @output;
	
	foreach my $elem ( @$array ) {
		local $_ = $elem;
		push @output, $sub->();
		}
	return @output;
}

my %cats = qw(Buster Mimi Roscoe Ellie);

@result = do_with_array { say $_ } %cats;
@result = do_with_array { say $_ } { 'a' => 'b' };

You don’t get an error until runtime when you try the array dereference:

Not an ARRAY reference at run_it.pl line 8.

You can check these things at runtime, though. This reminds you, as we said in Understand why you probably don’t need prototypes, that prototypes probably don’t do what you think. You end up doing a lot of the work that most people think prototypes do for you:

use v5.10;
use warnings;
use Carp;

sub do_with_array (&+) { 
	my( $sub, $array ) = @_;
	croak "do_with_array takes an array argument"
		unless ref $array eq ref [];

	my @output;
	
	foreach my $elem ( @$array ) {
		local $_ = $elem;
		push @output, $sub->();
		}

	return @output;
}

my @cats = qw(Buster Mimi Roscoe Ellie);
my %cats = map { $_, 1 } @cats;

@result = do_with_array { say $_ } @cats;
@result = do_with_array { say $_ } %cats;

Handling context

We explained context in Item 12. Understand context and how it affects operations, and if you want to emulate grep you have to handle them. In list context grep returns a list, in scalar context it returns a count, and in void context it potentially does nothing. You really only need to handle the void case. In this case, you’ll simply return without doing anything:

use v5.10;
use warnings;
use Carp;

sub do_with_array (&+) {
	return unless defined wantarray;
	my( $sub, $array ) = @_;
	croak "do_with_array takes an array argument"
		unless ref $array eq ref [];

	my @output;
	
	foreach my $elem ( @$array ) {
		local $_ = $elem;
		push @output, $sub->();
		}

	return @output;
}

The list and scalar contexts come from returning an array. When you return a named array, you get the same results as assigning an array. In list context you get the list elements, and in scalar context you get the count. If you want to return something different, such as a list instead of a named array, you have to do more work:

	return wantarray ? qw( a b c ) : 3;

So, once again, prototypes half solve the problem, but leave you with more work to do.

Things to remember

  • Use the & prototype character to specify a code reference argument
  • If the code reference argument argument is the first argument, you can leave off the sub keyword
  • The reference has to be a block of code or a reference to a named subroutine, and specifically not a scalar variable

Profile with Devel::NYTProf

Profile before you decide where to optimize—you might be surprised where you’re losing all of your performance.

We won’t go into all the details of profiling in this Item, but you can read about those in Mastering Perl. In short, profilers count something then report the results. They can track any of the things that you might care about. The Devel::NYTProf module, like most profilers, tracks time, counting the statements you run and how long they take.

Basic profiling

Perl profilers are really debuggers, using the internal hooks to detect movement in and out of statements. As such, you can load it with the -d switch just like any other debugger:

% perl -d:NYTProf some_program

You can also load it as a module and NYTProf will figure it out:

% perl -MDevel::NYTProf some_program

Either way, you can set the NYTProf environment variable to specify various settings (which you’ll have to read the documentation). For instance, to start profiling after the compilation phase, you use the start field:

% export NYTPROF=start=init
% perl -MDevel::NYTProf some_program

That’s it. Your program will run slower as NYTProf watches it and records that data, so don’t worry about the wallclock time. If your program is already agonizingly slow, it’s going to seem worse. Perhaps not a lot worse, but you’ll probably notice the difference. Maybe it’s time to get a cup of coffee.

The reports

NYTProf stores its data in the ./nytprof.out file (although you can change that). That’s just the data, though. To look at it, you need to convert it to a particular format. The distribution comes with several programs to do that already. The easiest and most popular might be HTML:

% nytprofhtml
Reading nytprof.out
Processing nytprof.out data
Writing sub reports to nytprof directory
 100% ... 
Writing block reports to nytprof directory
 100% ... 
Writing line reports to nytprof directory
 100% ... 

Inside the nytprof/ directory, you’ll find index.html, which presents the summary of the profile:

Here’s a full profile from the Pod::Perldoc module looking through perlfunc.

Call graphs

NYTProf also generates call graphs that you can inspect with Graphviz.

Use lookarounds to split to avoid special cases

There are some regular expression tricks that can help you deal with balanced delimiters in a string. The split command takes a pattern, removes the parts of a string that match that pattern, and give you a list of the parts of the string between those separators. Said another way, split works when the parts you don’t need are between the values.

Single character separators are easy:

use v5.10;

my @letters = split /:/, 'a:b:c:d:e';
say "@letters";

The list comes out just as you expect:

a b c d e

Even multiple or variable width patterns are fine:

use v5.10;

my @cats = split /\s+/, 'Buster 
	Mimi     Roscoe';
say "@cats";

The list comes out just as you expect:

Buster Mimi Roscoe

It gets more tricky when you have balanced delimiters, when there’s something that marks the start and the end of a value. The problem is that there is something in front of the first element and something after the last element. You can’t split on the pattern of characters between the values because you don’t remove everything:

use v5.10;

my @cats = split /></, '<Buster><Mimi><Roscoe>';
say "@cats";

The first and last delimiter characters are still attached to their values:

<Buster Mimi Roscoe>

You might be tempted to live with that and process those values after the split:

use v5.10;

my @cats = split /></, '<Buster><Mimi><Roscoe>';
$cats[0] =~ s/<//;
$cats[-1] =~ s/>//;
say "@cats";

Some people might be satisfied with that, and it does work, but it’s much better to remove the special cases. If you limit yourself to matching just the character that you want to remove, you’re a bit limited. One problem is the empty leading field that you get if you try to match the first delimiter character:

use v5.10;

my @cats = split /><|\A<|>\z/, '<Buster><Mimi><Roscoe>';

say "@cats";

There’s a space at the beginning of the output because there’s an empty leading field, but the list at least doesn’t have any of the delimiter characters:

 Buster Mimi Roscoe

To fix this, you still need to handle the leading field, perhaps by shifting it off. Again, this works, even if it’s unsightly:

use v5.10;

my @cats = split /><|\A<|>\z/, '<Buster><Mimi><Roscoe>';
shift @cats;

say "@cats";

The special processing isn’t as bad, but you have to remember to handle that one element.

Instead of matching characters, you can use lookarounds to split on the the middle of the balanced delimiter by using a zero-width assertion. The lookarounds match a condition in the string but do not consume any characters. These are conditions in the string, not characters to match.

If you use a lookbehind next to a lookahead, you can split on the position in the string where both conditions match. You want to match in the middle of a >< so the > ends up with the preceding element and the < stays with the succeeding element.

The positive lookbehind has the general form (?<=PATTERN). That pattern, which must be fixed-width, must match before the position. In this case, you want to match a > before the position, so the assertion is (?<=>).

The positive lookahead is almost the same thing, with the form (?=PATTERN). You want to match a < after the position, so your assertion is (?=<).

Putting them together, the lookbehind next to the lookahead, splits the values:

use v5.10;

my @cats = split /(?<=>)(?=<)/, '<Buster><Mimi><Roscoe>';

say "@cats";

The output list still has the delimiter characters, but now each element needs the same processing, so there are no special cases:

<Buster> <Mimi> <Roscoe>

Once you have the values in their own elements, you can remove the delimiters:

use v5.14;

my @cats = 
	map { s/\A<|>\z//rg }    # return the modified value
	split /(?<=>)(?=<)/, 
	'<Buster><Mimi><Roscoe>';

say "@cats";

That might seem a bit silly, but we’re only using a simple example to illustrate the point.

Consider a slightly more complicated case, where the fields are quoted, but then separated by commas. Unless your learning to re-invent the wheel (a valid exercise to sharpen your skills), you should probably use a module (Item 115. Don’t use regular expressions for comma-separated values). For this example, you’ll do it yourself:

use v5.10;

my @cats = 
	split /(?<="),(?=")/, 
	'"Buster","Mimi","Roscoe"';

say "@cats";

This removes the commas, as long as they are between quotes. However, you leave the quotes in place so you don't treat the first and last values specially:

"Buster" "Mimi" "Roscoe"

To get rid of the quotes, you process each item in the same way:

use v5.14;

my @cats = 
	map { s/\A"|"\z//rg }       # return the modified value
	split /(?<="),(?=")/, 
	'"Buster","Mimi","Roscoe"';

say "@cats";

You might try to construct a more complicated regular expression to also remove the quotes, but that's going to be harder to read and maintain than doing it in two simple steps.

Things to remember

  • You don't have to remove delimiters in one step
  • You can use a lookbehind next to a lookahead to specify a position in a string

Understand why you probably don’t need prototypes

You should understand how Perl’s prototypes work, not so you’ll use them but so you won’t be tempted to use them. Although prototypes can solve some problems, they don’t solve the problems most people want.

Some languages, such as C, have function prototypes. You tell your function how many arguments it has and what sort they are, as well as what type of thing it returns:

char* some_function( int start, int length );

Java has a method signature:

public class SomeClass {
   public String mySubstr( String aString, int i, int j ) {
     ...
   }
 }

Aside from syntax-warping modules, such as Devel::Declare, or source filters, such as Filter::Simple, Perl doesn’t have that as a fundamental feature. It’s subroutines and methods take lists and return lists (or a single item).

Perl does have prototypes as a compile-time aid, documented in perlsub. It’s not there to ensure you give a subroutines particular sorts of arguments but to help the compiler figure out what you typed and how you want it to interpret it. Perl doesn’t require you to surround your arguments in parentheses, so prototypes gives you a way to tell the compiler where the arguments start and end. Consider these examples, which you’ll understand by the end of this Item:

use utf8;

my $value = π +1;
my @array = ( sin π*5/2, cos π, 1, 2, 3 );

A lesser part of that includes the specification of Perl core types (scalar, array, hash, subroutine, or globs). Method calls, which require the parentheses to surround their arguments, completely ignore prototypes because Perl doesn’t need any help to parse them.

You (optionally) specify the prototype after the subroutine name. The simplest prototype is the empty prototype, meaning the subroutine takes no arguments:

use utf8;

sub TRUE  () { 1 }
sub FALSE () { 0 }
sub π     () { 3.1415926 }

The simplest prototype

The empty prototype tells perl not to consume any arguments when it sees that function name. How does perl know how to interpret this?

use utf8;

say π +1;

Since π is a subroutine, it can take arguments. Since you wrote it without parentheses, perl needs a hint to parse that. It could be two forms, each with possibly different answers:

use utf8;

say π( +1 );
say π() + 1;

The empty prototype tells perl to parse it as π() + 1. This makes the empty prototype a way that you can declare constants.

This means, however, that perl needs to know about the prototype before it parses that bit of code. This works because the prototype shows up first because the subroutine is completely defined before it’s called:

use utf8;

sub π () { 3.1415926 }
say π +1;   # 4.1415926

You don’t need to define the subroutine ahead of time, but you have to declare its prototype to get the behavior that you expect:

use utf8;

sub π ();
say π +1;   # 4.1415926

BEGIN {
*π = sub { 3.1415926 }
}

This isn’t a recommendation to write code like this, but it illustrates the point. Even with warnings turned off, you’ll still get a warning about the mismatch in prototypes:

use v5.10;
use utf8;

sub π ();
say π +1;   # 4.1415926

BEGIN {
*π = sub { 3.1415926 }
}

You can make the prototypes in the forward definition and the full definitions match:

use v5.10;
use utf8;

sub π ();
say π +1;   # 4.1415926

BEGIN {
	*π = sub (){ 3.1415926 }
	}

The prototype matters only for the calls to the subroutines after its definition. The prototype doesn’t matter for subroutine calls before its definition. However, to use the subroutine as a bareword, you still have to have a forward declaration:

use v5.10;
use utf8;

sub π;
say π +1;   # 3.1415926

sub π ();
say π +1;   # 4.1415926

BEGIN {
	*π = sub (){ 3.1415926 }
	}

Not only that, we can change the prototype as perl parses your program.

use v5.10;
use utf8;

sub π;
say π +1;

sub π ();
say π +1;   # 4.1415926

sub π ($$);
say π +1;   

BEGIN {
	*π = sub (){ 3.1415926 }
	}

The sub π ($$) tells perl to expect two arguments for the subsequent calls. Since you don’t give it enough arguments

Prototype mismatch: sub main::π () vs ($$) at proto.pl line 10.
Not enough arguments for main::π at proto.pl line 11, near "1;"
BEGIN not safe after errors--compilation aborted at proto.pl line 15.

More than zero arguments

To take more or more arguments, you just specify that number of items in the prototype. So far, you’ve only seen scalar arguments, which you specify as a $ in the prototype:

sub twofer    ($$);    # exactly two arguments
sub hat_trick ($$$);   # exactly three arguments

This does not mean that the subroutine gets that number of arguments. It does not mean that it takes two scalar variables as arguments. Perl parses each argument in scalar context:

use v5.10;

sub twofer ($$) { say "@_" };

my @array = qw( Buster Mimi Roscoe );
twofer @array, 2;

my %hash = map { $_ => 1 } 'a' .. 'z';
twofer %hash, 2;

The @array is a single argument, the first one, and is taken in scalar context, giving the number of elements in it. The %hash is treated in the same way, providing the mostly useless “hash statistics” scalar value:

3 2
19/32 2

Putting a \ in front of a prototype character specifies that the argument is a named variable. Instead of the value, you get a reference to the value:

use v5.10;

sub twofer (\$) { say "@_" };

my $scalar = 'Buster';
twofer $scalar;

The output shows the reference:

SCALAR(0x10082e548)

If you try to give it a non-variable, perl complains at compile-time:

Type of arg 1 to main::twofer must be scalar (not constant item) at proto.pl line 6, near "2;"
Execution of proto.pl aborted due to compilation errors.

A non-backslashed @ in a prototype specifies a list and forces list context on the rest of the arguments. It does not require and array argument:

use v5.10;

sub twofer (@) { say "@_" };

my @array = qw( Buster Mimi Roscoe );
twofer @array, 2;

my %hash = map { $_ => 1 } 'a' .. 'z';
twofer %hash, 2;

The output shows the normal list flattening behavior you expect from a Perl subroutine call. Notice that it does not care about the number or type of arguments:

Buster Mimi Roscoe 2
w 1 r 1 a 1 x 1 d 1 j 1 y 1 u 1 k 1 h 1 g 1 f 1 t 1 i 1 e 1 n 1 v 1 m 1 s 1 l 1 c 1 p 1 q 1 b 1 z 1 o 1 2

If you wanted to keep the array together, you would put a backslash in front of the \@. The argument must be a named array, and not an anonymous array or a reference to an array. Even though the argument is an array, the value in @_ will be a reference to that array:

use v5.10;

sub twofer (\@$) { say "@_" };

my @array = qw( Buster Mimi Roscoe );
twofer @array, 2;

The output shows two arguments:

ARRAY(0x100827810) 2

You can’t then sneak in a scalar variable or a hash variable:

use v5.10;

sub twofer (\@$) { say "@_" };

my @array = qw( Buster Mimi Roscoe );
twofer @array, 2;

my %hash = map { $_ => 1 } 'a' .. 'z';
twofer %hash, 2;

perl catches that at compile-time:

Type of arg 1 to main::twofer must be array (not private hash) at proto.pl line 9, near "2;"
Execution of proto.pl aborted due to compilation errors.

If you want to take more than one type of argument at a particular position, you can specify the possible types in brackets. To take either an array or a hash, you use [@%]:

use v5.10;

sub twofer (\[@%]$) { say "@_" };

my @array = qw( Buster Mimi Roscoe );
twofer @array, 2;

my %hash = map { $_ => 1 } 'a' .. 'z';
twofer %hash, 2;

Now the output takes either:

ARRAY(0x100827810) 2
HASH(0x10082e0c8) 2

If you want to take two separate arrays,

use v5.10;

sub twofer (\@\@) { say "@_" };

my @array1 = qw( Buster Mimi Roscoe );
my @array2 = qw( Ginger Ellie );
twofer @array1, @array2;

You get one reference for each array:

ARRAY(0x100827810) ARRAY(0x10082dff0)

Even though you can specify the variable type with the backslashed form, you can’t specify anything about the values that they hold, or limits to the number of elements they contain.

You can also specify prototypes for subroutines and globs, which we’ll cover in a separate Item since you can have a lot more fun with those.

Optional arguments

So far, you’ve used prototypes that specify an exact number of elements. If you want to specify optional arguments, you can divide the mandatory and optional prototype characters with a semicolon. If you wanted to take at least two but possible three arguments, you’d use the prototype ($$;$)

use v5.10;

sub hat_trick ($$;$) { say "@_" };

hat_trick 'Buster', 'Mimi';
hat_trick 'Buster', 'Mimi', 'Roscoe';

Both of those work just fine, but if you try to give it four arguments, you get a compilation error telling you that there are “Too many arguments”.

If you want a minimum number of arguments, but no maximum, you can use the @ as the optional argument. All of these are fine:

use v5.10;

sub hat_trick ($$;@) { say "@_" };

hat_trick 'Buster', 'Mimi';
hat_trick 'Buster', 'Mimi', 'Roscoe';
hat_trick 'Buster', 'Mimi', 'Roscoe', 'Ginger';

Here are some interesting prototypes from List::MoreUtils:

# from List::MoreUtils
sub each_array (\@;\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@)

sub natatime ($@)

sub mesh (\@\@;\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@)

A final warning

Subroutine prototypes exist chiefly so perl can parse calls to your subroutines just like it would its built-ins—without parentheses. They can set the context for the arguments or the variable types, but they can’t specify the sorts of values. Prototypes aren’t the tools that you want if any of those are your goal. It’s also easier to just use parentheses to mark your argument list.

Things to remember

  • Prototypes are not function signatures
  • Non-backslashed prototype characters enforce a context, not a type
  • Backslashed prototype characters enforce a variable type
  • You can specify optional arguments after a semicolon in the prototype.
7ads6x98y