Trace your Perl programs

You can write your own mini (or micro) debuggers to watch your program run. You might want to do this when the other Perl debuggers are too heavyweight (or even too interactive) for your immediate problem.

The Devel::Trace module is a simple debugging tool that lets you watch the flow of control as your program runs. It’s not a debugger and you don’t get to interact with it, but it can be useful in some operations.

If you were a shell programmer in a former life, you probably used the -x switch to watch your shell script execute. Here’s simple script:

#!/bin/sh

echo $HOME
cd ~/Dev/effective-perl-programming
du -s

When you run it with the -x, it prints each line of the script before it executes it:

% sh -x trace.sh
+ echo /Users/brian
/Users/brian
+ cd /Users/brian/Dev/effective-perl-programming
+ du -s
200320  .

You can do the same thing with a Perl program. Here’s a small program to find factorials:

use v5.10;
use strict;
use warnings;

my $n = $ARGV[0];

my $product = 1;

foreach ( 1 .. $n ) {
	$product *= $_;
	}

say $product;

The output shows each line as perl executes it:

% perl -d:Trace factorial.pl
>> factorial.pl:5: my $n = $ARGV[0];
>> factorial.pl:7: my $product = 1;
>> factorial.pl:9: foreach ( 1 .. $n ) {
>> factorial.pl:10: 	$product *= $_;
>> factorial.pl:10: 	$product *= $_;
>> factorial.pl:10: 	$product *= $_;
>> factorial.pl:13: say $product;
6

Devel::Trace doesn’t limit itself to your program though. If you use the bignum module, the multiplication happens on objects instead of Perl numbers:

use v5.10;
use strict;
use warnings;
use bignum;

my $n = $ARGV[0];

my $product = 1;

foreach ( 1 .. $n ) {
	$product *= $_;
	}

say $product;

Try it with a number that requires bignum, there’s quite a bit of output:

% perl -d:Trace fact.pl 200
>> fact.pl:6: my $n = $ARGV[0];
>> fact.pl:8: my $product = 1;
>> fact.pl:10: foreach ( 1 .. $n ) {
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt.pm:152: '0+' => sub { $_[0]->numify(); }
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt.pm:839:   my $x = shift; $x = $class->new($x) unless ref $x;
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt.pm:839:   my $x = shift; $x = $class->new($x) unless ref $x;
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt.pm:841:   return $x->bstr() if $x->{sign} !~ /^[+-]$/;
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt.pm:842:   my $num = $CALC->_num($x->{value});
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt/Calc.pm:276:     my $x = $_[1];
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt/Calc.pm:278:     return 0 + $x->[0] if scalar @$x == 1;      # below $BASE
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt.pm:843:   return -$num if $x->{sign} eq '-';
.... # about 23,800 more lines
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt/Calc.pm:268:     $l--;
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt/Calc.pm:267:     $ret .= substr($z.$ar->[$l],-$BASE_LEN); # fastest way I could think of
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt/Calc.pm:268:     $l--;
>> /usr/local/perls/perl-5.14.1/lib/5.14.1/Math/BigInt/Calc.pm:270:   $ret;
788657867364790503552363213932185062295135977687173263294742533244359449963403342920304284011984623904177212138919638830257642790242637105061926624952829931113462857270763317237396988943922445621451664240254033291864131227428294853277524242407573903240321257405579568660226031904170324062351700858796178922222789623703897374720000000000000000000000000000000000000000000000000

That’s pretty ugly, and the Devel::Trace module doesn’t give you any way to change it. That’s not surprising consider the source code, which is

package Devel::Trace;
$VERSION = '0.11';
$TRACE = 1;

# This is the important part.  The rest is just fluff.
sub DB::DB {
  return unless $TRACE;
  my ($p, $f, $l) = caller;
  my $code = \@{"::_<$f"};
  print STDERR ">> $f:$l: $code->[$l]";
}

sub import {
  my $package = shift;
  foreach (@_) {
    if ($_ eq 'trace') {
      my $caller = caller;
      *{$caller . '::trace'} = \&{$package . '::trace'};
    } else {
      use Carp;
      croak "Package $package does not export `$_'; aborting";
    }
  }
}

my %tracearg = ('on' => 1, 'off' => 0);
sub trace {
  my $arg = shift;
  $arg = $tracearg{$arg} while exists $tracearg{$arg};
  $TRACE = $arg;
}

1;

The magic happens in DB::DB, which, when perl calls before every statement. This module is Devel::Trace, but its defining its meat in the package DB. The debugger stashes lots of information about the program in various package variables. The @{"::_<$f"} is a symbolic reference, using the filename in $f to translate the filename and line number from caller to a line of code.

You don't need the module, really. The import and trace subroutines are there just to turn things on and off. To do that, you'd have to modify your code to litter calls to trace throughout the code. You usually don't want to modify the code.

You can reduce that module to a command line. When you run the debugger, perl inserts the value in the PERL5BD environment variable before the source in your program. The reduces Devel::Trace to:

PERL5DB='sub DB::DB {my @c=caller;print STDERR qq|@c[1,2] ${"::_<$c[1]"}[$c[2]]|}' perl5.14.1 -d fact.pl 4

That's a bit hard to use, especially on a command line that tries to interpret those sigils characters. You don't need to worry about that now that you know how simple it is. You can create your own throw-away module to trace exactly what you want. Since the current directory is usually already in @INC, you can create a Devel/MyTrace.pm. This one saves to the file you specify:

use v5.10;
use autodie;

BEGIN {

my $trace_file = $ENV{TRACE_FILE} // "mytrace.$$";
print STDERR "Saving trace to $trace_file\n";
open my $fh, '>>', $trace_file;

sub DB::DB {
	my( $package, $file, $line ) = caller;
	my $code = \@{"::_<$file"};
	print $fh "[@{[time]}] $file $line $code->[$line]";
	}
}

1;

That is only slightly improved on Devel::Trace because it lets you specify the file in which to save your trace. You can improve that a bit to also handle the special - or STDERR virtual filenames:

use v5.10;
use autodie;

BEGIN {

my $trace_file = $ENV{TRACE_FILE} // "mytrace.$$";
print STDERR "Saving trace to $trace_file\n";

my $fh = do {
	   if( $trace_file eq '-'      ) { \*STDOUT }
	elsif( $trace_file eq 'STDERR' ) { \*STDERR }
	else {
		open my $fh, '>>', $trace_file;
		$fh;
		}
	};

sub DB::DB {
	my( $package, $file, $line ) = caller;
	my $code = \@{"::_<$file"};
	print $fh "[@{[time]}] $file $l $code->[$line]";
	}
}

1;

You can go one step further to exclude all code except for the primary file:

use v5.10;
use autodie;

BEGIN {

my $trace_file = $ENV{TRACE_FILE} // "mytrace.$$";
print STDERR "Saving trace to $trace_file\n";

my $fh = do {
	   if( $trace_file eq '-'      ) { \*STDOUT }
	elsif( $trace_file eq 'STDERR' ) { \*STDERR }
	else {
		open my $fh, '>>', $trace_file;
		$fh;
		}
	};

sub DB::DB {
	my( $package, $file, $line ) = caller;
	return unless $file eq $0;
	my $code = \@{"::_<$file"};
	print $fh "[@{[time]}] $file $l $code->[$line]";
	}
}

1;

From there, what your tracer does is up to you and what problem you are trying to track down. It doesn't have to be fancy and you don't have to spend a lot of time making it perfect or "pure".

Things to remember

  • You can trace your Perl program just like a shell program
  • Under the -d switch, perl calls &DB::DB before each statement
  • You can write your own throw-away tracer

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit

Use Data::Printer to debug data structures

You can use several different Perl modules to inspect data structures. Many of these modules, however, are really two tools in one. Besides showing a data structure as a string, they also serialize the data as Perl code so you can reconstruct the data structure. That second job often makes things hard for you. If you don’t need the serialization
job, don’t use a module that insists on it.

The Data::Dumper module is popular because it comes with Perl. Here’s a program that we’ll use for the rest of the Item, save for changes to the module dumping the structure:

use Data::Dumper qw(Dumper);
use DateTime;
use HTTP::Request;

my $request = HTTP::Request->new(
		GET => 'http://www.perl.org',
		);

$request->header( 'X-Perl' => '5.12.2' );
$request->header( 'Cat'    => 'Buster' );

my $data = {
	hash => {
		cat  => 'Buster',
		dog  => 'Addy',
		bird => 'Poppy',
		},
	array => [ qw( a b c ) ],
	datetime => DateTime->now,
	reqeust  => $request,
	};

print Dumper( $data );

The output is a Perl data structure, suitable for eval. That makes it a bit verbose and ugly:

$VAR1 = {
      'array' => [
             'a',
             'b',
             'c'
           ],
      'hash' => {
            'cat' => 'Buster',
            'dog' => 'Addy',
            'bird' => 'Poppy'
          },
      'reqeust' => bless( {
                '_content' => '',
                '_uri' => bless( do{\(my $o = 'http://www.perl.org')}, 'URI::http' ),
                '_headers' => bless( {
                             'cat' => 'Buster',
                             'x-perl' => '5.12.2'
                           }, 'HTTP::Headers' ),
                '_method' => 'GET'
                }, 'HTTP::Request' ),
      'datetime' => bless( {
                 'local_rd_secs' => 68540,
                 'local_rd_days' => 734452,
                 'rd_nanosecs' => 0,
                 'locale' => bless( {
                            'default_time_format_length' => 'medium',
                            'native_territory' => 'United States',
                            'native_language' => 'English',
                            'native_complete_name' => 'English United States',
                            'en_language' => 'English',
                            'id' => 'en_US',
                            'default_date_format_length' => 'medium',
                            'en_complete_name' => 'English United States',
                            'en_territory' => 'United States'
                          }, 'DateTime::Locale::en_US' ),
                 'local_c' => {
                        'hour' => 19,
                        'second' => 20,
                        'month' => 11,
                        'quarter' => 4,
                        'day_of_year' => 315,
                        'day_of_quarter' => 42,
                        'minute' => 2,
                        'day' => 11,
                        'day_of_week' => 5,
                        'year' => 2011
                        },
                 'utc_rd_secs' => 68540,
                 'formatter' => undef,
                 'tz' => bless( {
                          'name' => 'UTC'
                        }, 'DateTime::TimeZone::UTC' ),
                 'utc_year' => 2012,
                 'utc_rd_days' => 734452,
                 'offset_modifier' => 0
                 }, 'DateTime' )
    };

The Data::Dump also serializes, and is a cleaner Data::Dumper. In void context, it automatically prints for you:

use Data::Dump qw(pp);

...; # same $data thing as before

pp( $data );

The output looks a lot like the Data::Dumper output because it has to be a Perl data structure:

{
  array  => ["a", "b", "c"],
  datetime => bless({
        formatter     => undef,
        local_c     => {
                   day => 11,
                   day_of_quarter => 42,
                   day_of_week => 5,
                   day_of_year => 315,
                   hour => 19,
                   minute => 18,
                   month => 11,
                   quarter => 4,
                   second => 33,
                   year => 2011,
                   },
        local_rd_days   => 734452,
        local_rd_secs   => 69513,
        locale      => bless({
                   default_date_format_length => "medium",
                   default_time_format_length => "medium",
                   en_complete_name => "English United States",
                   en_language => "English",
                   en_territory => "United States",
                   id => "en_US",
                   native_complete_name => "English United States",
                   native_language => "English",
                   native_territory => "United States",
                   }, "DateTime::Locale::en_US"),
        offset_modifier => 0,
        rd_nanosecs   => 0,
        tz        => bless({ name => "UTC" }, "DateTime::TimeZone::UTC"),
        utc_rd_days   => 734452,
        utc_rd_secs   => 69513,
        utc_year    => 2012,
        }, "DateTime"),
  hash   => { bird => "Poppy", cat => "Buster", dog => "Addy" },
  reqeust  => bless({
        _content => "",
        _headers => bless({ "cat" => "Buster", "x-perl" => "5.12.2" }, "HTTP::Headers"),
        _method  => "GET",
        _uri   => bless(do{\(my $o = "http://www.perl.org")}, "URI::http"),
        }, "HTTP::Request"),
}

Steven Haryanto added a filter feature to an existing interface, which you can see in Use Data::Dump filters for nicer pretty-printing. You can get better control of the parts you’d want to distill, such as that DateTime:

{
  array => # some items hidden
  [2011-02-03, "d", "...", "n"],
  datetime => 2011-02-03,
  hash => { bird => "Poppy", cat => "Buster", dog => "Addy" },
}

If you forget about the serialization, though, you can do much better. Often, you want to inspect a data structure to see what’s on the inside without saving it for future use. If that’s the case, you don’t need to see the data structure as Perl code and the pretty printer can organizer the data much better and provide more information. The Data::Printer module doesn’t care at all about serialization. In void context, its p automatically prints:

use Data::Printer;

...; # same $data thing as before

p( $data );

The output is as verbose, but it’s also much more dense. When it prints an object, it shows you the methods in the class:

\ {
    array      [
        [0] "a",
        [1] "b",
        [2] "c"
    ],
    datetime   DateTime  {
        public methods (134) : add, add_duration, am_or_pm, bootstrap, ce_year, christian_era, clone, compare, compare_ignore_floating, date, datetime, day, day_abbr, day_name, day_of_month, day_of_month_0, day_of_quarter, day_of_quarter_0, day_of_week, day_of_week_0, day_of_year, day_of_year_0, day_0, DefaultLanguage, DefaultLocale, delta_days, delta_md, delta_ms, dmy, doq, doq_0, dow, dow_0, doy, doy_0, duration_class, epoch, era, era_abbr, era_name, format_cldr, formatter, fractional_second, from_day_of_year, from_epoch, from_object, hires_epoch, hms, hour, hour_1, hour_12, hour_12_0, INFINITY, is_dst, is_finite, is_infinite, is_leap_year, iso8601, jd, language, last_day_of_month, leap_seconds, local_day_of_week, local_rd_as_seconds, local_rd_values, locale, MAX_NANOSECONDS, mday, mday_0, mdy, microsecond, millisecond, min, minute, mjd, mon, mon_0, month, month_abbr, month_name, month_0, NAN, nanosecond, NEG_INFINITY, new, now, offset, quarter, quarter_abbr, quarter_name, quarter_0, sec, second, SECONDS_PER_DAY, secular_era, set, set_day, set_formatter, set_hour, set_locale, set_minute, set_month, set_nanosecond, set_second, set_time_zone, set_year, STORABLE_freeze, STORABLE_thaw, strftime, subtract, subtract_datetime, subtract_datetime_absolute, subtract_duration, time, time_zone, time_zone_long_name, time_zone_short_name, today, truncate, utc_rd_as_seconds, utc_rd_values, utc_year, wday, wday_0, week, week_number, week_of_month, week_year, weekday_of_month, year, year_with_christian_era, year_with_era, year_with_secular_era, ymd
        private methods (38) : _accumulated_leap_seconds, _add_overload, _adjust_for_positive_difference, _calc_local_components, _calc_local_rd, _calc_utc_components, _calc_utc_rd, _cldr_pattern, _compare, _compare_overload, _day_has_leap_second, _day_length, _era_index, _format_nanosecs, _handle_offset_modifier, _is_leap_year, _month_length, _new, _new_from_self, _normalize_leap_seconds, _normalize_nanoseconds, _normalize_seconds, _normalize_tai_seconds, _offset_for_local_datetime, _rd2ymd, _seconds_as_components, _space_padded_string, _string_compare_overload, _string_equals_overload, _string_not_equals_overload, _stringify, _subtract_overload, _time_as_seconds, _utc_hms, _utc_ymd, _weeks_in_year, _ymd2rd, _zero_padded_number
        internals: {
            formatter         undef,
            local_c           {
                day              11,
                day_of_quarter   42,
                day_of_week      5,
                day_of_year      315,
                hour             19,
                minute           41,
                month            11,
                quarter          4,
                second           42,
                year             2011
            },
            local_rd_days     734452,
            local_rd_secs     70902,
            locale            DateTime::Locale::en_US,
            offset_modifier   0,
            rd_nanosecs       0,
            tz                DateTime::TimeZone::UTC,
            utc_rd_days       734452,
            utc_rd_secs       70902,
            utc_year          2012
        }
    },
    hash       {
        bird   "Poppy",
        cat    "Buster",
        dog    "Addy"
    },
    request    HTTP::Request  {
        Parents       HTTP::Message
        Linear @ISA   HTTP::Request, HTTP::Message
        public methods (10) : accept_decodable, as_string, clone, dump, method, new, parse, uri, uri_canonical, url
        private methods (0)
        internals: {
            _content   "",
            _headers   HTTP::Headers,
            _method    "GET",
            _uri       URI::http
        }
    }
}

You probably don’t want to see all that internal gunk from DateTime or HTTP::Request, so you can set filters from them to print them however you like:

use Data::Printer {
    filters => {
       'DateTime'      => sub { "DateTime => $_[0]" },
       'HTTP::Request' => sub { "URL => " . $_[0]->uri },
    },
};

...; # same $data thing as before

p( $data );

Now you can see what you need to see much easier:

\ {
    array      [
        [0] "a",
        [1] "b",
        [2] "c"
    ],
    datetime   DateTime => 2011-11-11T19:51:06,
    hash       {
        bird   "Poppy",
        cat    "Buster",
        dog    "Addy"
    },
    request    URL => http://www.perl.org
}

So far, you’ve changed the import list to specify what you wanted, but you can change it each time that you want to dump something:

p( $data, { index => 0 } );

Now you don’t have array indices:

\ {
    array      [
        "a",
        "b",
        "c"
    ],
    datetime   DateTime => 2011-11-11T20:25:22,
    hash       {
        bird   "Poppy",
        cat    "Buster",
        dog    "Addy"
    },
    request    URL => http://www.perl.org
}

You can make colorized output too by setting another property:

use Data::Printer {
	colored => 1,
	filters => {
		'DateTime'      => sub { "DateTime => $_[0]" },
		'HTTP::Request' => sub { "URL => " . $_[0]->uri },
	},
};

...; # same $data thing as before

p( $data );

And you can change the colors if you don’t like the default set. You have to choose a valid Term::ANSI color:

use Data::Printer {
	colored => 1,
	color => {
		array       => 'yellow',
		string      => 'cyan',
		hash        => 'green',
	},
	filters => {
		'DateTime'      => sub { "DateTime => $_[0]" },
		'HTTP::Request' => sub { "URL => " . $_[0]->uri },
	},
};

...; # same $data thing as before

p( $data );

This might be more pleasing to you:

Lastly, one of the most annoying “features” of a pretty printer is the constant reference passing. Since Perl flattens its argument list into a single list, to maintain data structure identities, you have to pass them as a reference:

use Data::Dumper;
print Dumper( \%hash );
use Data::Dump;
pp( \%hash );

You can do that with Data::Printer too:

use Data::Printer;
p( \%hash );

Data::Printer uses prototypes to make that easier for you. The Dumper and pp each dump a list of structures, but Data::Printer‘s p dumps exactly one structure. As such, it can use prototypes to recognize a whole hash or array as the first argument:

use Data::Printer;
p( %hash );

This feature has a few oddities, but Breno explains them in the documentation.

Things to remember

  • Serialization and inspection are different tasks
  • Most Perl pretty printers try to serialize
  • The Data::Printer

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit

Make grep-like syntax

To create grep- or map-like syntax, you need to use Perl’s prototypes, despite whatever we told you in Understand why you probably don’t need prototypes. Perl needs the special hints that prototypes to parse a block as an argument to a subroutine.

First, remember the forms of grep. There’s a single expression version and a block version:

grep EXPR, @input       # with a comma
grep { ... } @input     # no comma

That block, the {...}, is an inline subroutine where the current element shows up in $_:

my @odds = grep { $_ % 2 } @input;

For either form, there’s a scalar and list return value, depending on context (Item 12. Understand context and how it affects operations):

my @array = grep ...;
my $count = grep ...;

You can make your own subroutines that work just like grep. The prototype character & tells perl to expect a subroutine reference. However, it can only be the first argument. To try it, define a subroutine that takes a single argument, a code reference:

sub run_it (&) {
	my $sub = shift;
	$sub->();
	}

You can call that subroutine in several ways. You can use a block, the sub keyword with a block, or a reference to a subroutine, or a reference to a named subroutine:

use v5.10;

sub named { say "I have a name" }

my $result = run_it { say "Hello!" };
   $result = run_it sub { say "I have a keyword!" };
   $result = run_it \&named;

However, perl is not smart enough to recognize other forms. It won’t like a scalar variable that might have a code reference later, and it can’t take a bareword that is the name of a defined subroutine (like sort will). These are compile-time errors:

use v5.10;

sub named { say "I have a name" }
my $code_ref = \&named;

my $result = run_it $code_ref;
   $result = run_it named;
   $result = run_it &named;

Handling grep’s second argument

Now you can take a code reference as an argument. The next part of the grep syntax is the the input. You could use the @ character to denote a list of arguments (not an array argument), and that appears to work:

use v5.10;
use warnings;

sub do_with_array (&@) {
	my( $sub, @args ) = @_;
	my @output;

	foreach my $elem ( @args ) {
		local $_ = $elem;
		push @output, $sub->();
		}
	return @output;
}

sub other_cats { qw(Ellie Ginger) }

my @cats = qw(Buster Mimi Roscoe);

@result = do_with_array { say $_ } @cats;
@result = do_with_array { say $_ } qw(Buster Mimi Roscoe);
@result = do_with_array { say $_ } 1 .. 10;
@result = do_with_array { say $_ } other_cats();

Like grep which can alias $_ to the original data, you can also means that you can change the original data with your subroutine argument if you use @_ (Item 114. Know when arrays are modified in a loop):

use v5.10;
use warnings;

sub do_with_array (&@) {
   my $sub = shift;
   my @output;
   local $_;

   foreach ( @_ ) {
	   push @output, $sub->();
   }
   return @output;
}

my @original   = qw(1 2 3);
my @new        = do_with_array { $_ += 2 } @original;
say "new = @new";             # 3 4 5
say "original = @original";  # 3 4 5

You might try the \@ prototype, but that limits you in other ways. Now perl expects a named array as an argument. You cannot use an array reference, range, literal list, or the return values from a subroutine call. It’s a named array or an error. That’s no good.

Likewise, you might use the + prototype introduced in Perl 5.14. This allows you to use an array or an array reference argument. Perl doesn’t complain if you use a range or a subroutine call, but it also doesn’t do the right thing:

use v5.10;
use warnings;

sub do_with_array (&+) {
	my( $sub, $array ) = @_;
	my @output;

	foreach my $elem ( @$array ) {
		local $_ = $elem;
		push @output, $sub->();
		}
	return @output;
}

sub other_cats { qw(Ellie Ginger) }

my @cats = qw(Buster Mimi Roscoe);

@result = do_with_array { say $_ } @cats;
@result = do_with_array { say $_ } [ 'a' .. 'g' ];
@result = do_with_array { say $_ } 1 .. 10;
@result = do_with_array { say $_ } other_cats();

The named subroutine and the array reference work just fine. Perl does something weird with the range, and the subroutine call appears to not happen at all:

Buster
Mimi
Roscoe
a
b
c
d
e
f
g
Use of uninitialized value $. in range (or flip) at run_it.pl line 21.

Not only that, but the + prototype character also allows named hashes and hash references:

use v5.10;
use warnings;

sub do_with_array (&+) {
	my( $sub, $array ) = @_;
	my @output;

	foreach my $elem ( @$array ) {
		local $_ = $elem;
		push @output, $sub->();
		}
	return @output;
}

my %cats = qw(Buster Mimi Roscoe Ellie);

@result = do_with_array { say $_ } %cats;
@result = do_with_array { say $_ } { 'a' => 'b' };

You don’t get an error until runtime when you try the array dereference:

Not an ARRAY reference at run_it.pl line 8.

You can check these things at runtime, though. This reminds you, as we said in Understand why you probably don’t need prototypes, that prototypes probably don’t do what you think. You end up doing a lot of the work that most people think prototypes do for you:

use v5.10;
use warnings;
use Carp;

sub do_with_array (&+) {
	my( $sub, $array ) = @_;
	croak "do_with_array takes an array argument"
		unless ref $array eq ref [];

	my @output;

	foreach my $elem ( @$array ) {
		local $_ = $elem;
		push @output, $sub->();
		}

	return @output;
}

my @cats = qw(Buster Mimi Roscoe Ellie);
my %cats = map { $_, 1 } @cats;

@result = do_with_array { say $_ } @cats;
@result = do_with_array { say $_ } %cats;

Handling context

We explained context in Item 12. Understand context and how it affects operations, and if you want to emulate grep you have to handle them. In list context grep returns a list, in scalar context it returns a count, and in void context it potentially does nothing. You really only need to handle the void case. In this case, you’ll simply return without doing anything:

use v5.10;
use warnings;
use Carp;

sub do_with_array (&+) {
	return unless defined wantarray;
	my( $sub, $array ) = @_;
	croak "do_with_array takes an array argument"
		unless ref $array eq ref [];

	my @output;

	foreach my $elem ( @$array ) {
		local $_ = $elem;
		push @output, $sub->();
		}

	return @output;
}

The list and scalar contexts come from returning an array. When you return a named array, you get the same results as assigning an array. In list context you get the list elements, and in scalar context you get the count. If you want to return something different, such as a list instead of a named array, you have to do more work:

	return wantarray ? qw( a b c ) : 3;

So, once again, prototypes half solve the problem, but leave you with more work to do.

Things to remember

  • Use the & prototype character to specify a code reference argument
  • If the code reference argument argument is the first argument, you can leave off the sub keyword
  • The reference has to be a block of code or a reference to a named subroutine, and specifically not a scalar variable

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit

Profile with Devel::NYTProf

Profile before you decide where to optimize—you might be surprised where you’re losing all of your performance.

We won’t go into all the details of profiling in this Item, but you can read about those in Mastering Perl. In short, profilers count something then report the results. They can track any of the things that you might care about. The Devel::NYTProf module, like most profilers, tracks time, counting the statements you run and how long they take.

Basic profiling

Perl profilers are really debuggers, using the internal hooks to detect movement in and out of statements. As such, you can load it with the -d switch just like any other debugger:

% perl -d:NYTProf some_program

You can also load it as a module and NYTProf will figure it out:

% perl -MDevel::NYTProf some_program

Either way, you can set the NYTProf environment variable to specify various settings (which you’ll have to read the documentation). For instance, to start profiling after the compilation phase, you use the start field:

% export NYTPROF=start=init
% perl -MDevel::NYTProf some_program

That’s it. Your program will run slower as NYTProf watches it and records that data, so don’t worry about the wallclock time. If your program is already agonizingly slow, it’s going to seem worse. Perhaps not a lot worse, but you’ll probably notice the difference. Maybe it’s time to get a cup of coffee.

The reports

NYTProf stores its data in the ./nytprof.out file (although you can change that). That’s just the data, though. To look at it, you need to convert it to a particular format. The distribution comes with several programs to do that already. The easiest and most popular might be HTML:

% nytprofhtml
Reading nytprof.out
Processing nytprof.out data
Writing sub reports to nytprof directory
 100% ...
Writing block reports to nytprof directory
 100% ...
Writing line reports to nytprof directory
 100% ...

Inside the nytprof/ directory, you’ll find index.html, which presents the summary of the profile:

Here’s a full profile from the Pod::Perldoc module looking through perlfunc.

Call graphs

NYTProf also generates call graphs that you can inspect with Graphviz.

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit

Use lookarounds to split to avoid special cases

There are some regular expression tricks that can help you deal with balanced delimiters in a string. The split command takes a pattern, removes the parts of a string that match that pattern, and give you a list of the parts of the string between those separators. Said another way, split works when the parts you don’t need are between the values.

Single character separators are easy:

use v5.10;

my @letters = split /:/, 'a:b:c:d:e';
say "@letters";

The list comes out just as you expect:

a b c d e

Even multiple or variable width patterns are fine:

use v5.10;

my @cats = split /\s+/, 'Buster
	Mimi     Roscoe';
say "@cats";

The list comes out just as you expect:

Buster Mimi Roscoe

It gets more tricky when you have balanced delimiters, when there’s something that marks the start and the end of a value. The problem is that there is something in front of the first element and something after the last element. You can’t split on the pattern of characters between the values because you don’t remove everything:

use v5.10;

my @cats = split /></, '<Buster><Mimi><Roscoe>';
say "@cats";

The first and last delimiter characters are still attached to their values:

<Buster Mimi Roscoe>

You might be tempted to live with that and process those values after the split:

use v5.10;

my @cats = split /></, '<Buster><Mimi><Roscoe>';
$cats[0] =~ s/<//;
$cats[-1] =~ s/>//;
say "@cats";

Some people might be satisfied with that, and it does work, but it’s much better to remove the special cases. If you limit yourself to matching just the character that you want to remove, you’re a bit limited. One problem is the empty leading field that you get if you try to match the first delimiter character:

use v5.10;

my @cats = split /><|\A<|>\z/, '<Buster><Mimi><Roscoe>';

say "@cats";

There’s a space at the beginning of the output because there’s an empty leading field, but the list at least doesn’t have any of the delimiter characters:

 Buster Mimi Roscoe

To fix this, you still need to handle the leading field, perhaps by shifting it off. Again, this works, even if it’s unsightly:

use v5.10;

my @cats = split /><|\A<|>\z/, '<Buster><Mimi><Roscoe>';
shift @cats;

say "@cats";

The special processing isn’t as bad, but you have to remember to handle that one element.

Instead of matching characters, you can use lookarounds to split on the the middle of the balanced delimiter by using a zero-width assertion. The lookarounds match a condition in the string but do not consume any characters. These are conditions in the string, not characters to match.

If you use a lookbehind next to a lookahead, you can split on the position in the string where both conditions match. You want to match in the middle of a >< so the > ends up with the preceding element and the < stays with the succeeding element.

The positive lookbehind has the general form (?<=PATTERN). That pattern, which must be fixed-width, must match before the position. In this case, you want to match a > before the position, so the assertion is (?<=>).

The positive lookahead is almost the same thing, with the form (?=PATTERN). You want to match a < after the position, so your assertion is (?=<).

Putting them together, the lookbehind next to the lookahead, splits the values:

use v5.10;

my @cats = split /(?<=>)(?=<)/, '<Buster><Mimi><Roscoe>';

say "@cats";

The output list still has the delimiter characters, but now each element needs the same processing, so there are no special cases:

<Buster> <Mimi> <Roscoe>

Once you have the values in their own elements, you can remove the delimiters:

use v5.14;

my @cats =
	map { s/\A<|>\z//rg }    # return the modified value
	split /(?<=>)(?=<)/,
	'<Buster><Mimi><Roscoe>';

say "@cats";

That might seem a bit silly, but we’re only using a simple example to illustrate the point.

Consider a slightly more complicated case, where the fields are quoted, but then separated by commas. Unless your learning to re-invent the wheel (a valid exercise to sharpen your skills), you should probably use a module (Item 115. Don’t use regular expressions for comma-separated values). For this example, you’ll do it yourself:

use v5.10;

my @cats =
	split /(?<="),(?=")/,
	'"Buster","Mimi","Roscoe"';

say "@cats";

This removes the commas, as long as they are between quotes. However, you leave the quotes in place so you don't treat the first and last values specially:

"Buster" "Mimi" "Roscoe"

To get rid of the quotes, you process each item in the same way:

use v5.14;

my @cats =
	map { s/\A"|"\z//rg }       # return the modified value
	split /(?<="),(?=")/,
	'"Buster","Mimi","Roscoe"';

say "@cats";

You might try to construct a more complicated regular expression to also remove the quotes, but that's going to be harder to read and maintain than doing it in two simple steps.

Things to remember

  • You don't have to remove delimiters in one step
  • You can use a lookbehind next to a lookahead to specify a position in a string

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit

Understand why you probably don’t need prototypes

You should understand how Perl’s prototypes work, not so you’ll use them but so you won’t be tempted to use them. Although prototypes can solve some problems, they don’t solve the problems most people want.

Some languages, such as C, have function prototypes. You tell your function how many arguments it has and what sort they are, as well as what type of thing it returns:

char* some_function( int start, int length );

Java has a method signature:

public class SomeClass {
   public String mySubstr( String aString, int i, int j ) {
     ...
   }
 }

Aside from syntax-warping modules, such as Devel::Declare, or source filters, such as Filter::Simple, Perl doesn’t have that as a fundamental feature. It’s subroutines and methods take lists and return lists (or a single item).

Perl does have prototypes as a compile-time aid, documented in perlsub. It’s not there to ensure you give a subroutines particular sorts of arguments but to help the compiler figure out what you typed and how you want it to interpret it. Perl doesn’t require you to surround your arguments in parentheses, so prototypes gives you a way to tell the compiler where the arguments start and end. Consider these examples, which you’ll understand by the end of this Item:

use utf8;

my $value = ? +1;
my @array = ( sin ?*5/2, cos ?, 1, 2, 3 );

A lesser part of that includes the specification of Perl core types (scalar, array, hash, subroutine, or globs). Method calls, which require the parentheses to surround their arguments, completely ignore prototypes because Perl doesn’t need any help to parse them.

You (optionally) specify the prototype after the subroutine name. The simplest prototype is the empty prototype, meaning the subroutine takes no arguments:

use utf8;

sub TRUE  () { 1 }
sub FALSE () { 0 }
sub ?     () { 3.1415926 }

The simplest prototype

The empty prototype tells perl not to consume any arguments when it sees that function name. How does perl know how to interpret this?

use utf8;

say ? +1;

Since ? is a subroutine, it can take arguments. Since you wrote it without parentheses, perl needs a hint to parse that. It could be two forms, each with possibly different answers:

use utf8;

say ?( +1 );
say ?() + 1;

The empty prototype tells perl to parse it as ?() + 1. This makes the empty prototype a way that you can declare constants.

This means, however, that perl needs to know about the prototype before it parses that bit of code. This works because the prototype shows up first because the subroutine is completely defined before it’s called:

use utf8;

sub ? () { 3.1415926 }
say ? +1;   # 4.1415926

You don’t need to define the subroutine ahead of time, but you have to declare its prototype to get the behavior that you expect:

use utf8;

sub ? ();
say ? +1;   # 4.1415926

BEGIN {
*? = sub { 3.1415926 }
}

This isn’t a recommendation to write code like this, but it illustrates the point. Even with warnings turned off, you’ll still get a warning about the mismatch in prototypes:

use v5.10;
use utf8;

sub ? ();
say ? +1;   # 4.1415926

BEGIN {
*? = sub { 3.1415926 }
}

You can make the prototypes in the forward definition and the full definitions match:

use v5.10;
use utf8;

sub ? ();
say ? +1;   # 4.1415926

BEGIN {
	*? = sub (){ 3.1415926 }
	}

The prototype matters only for the calls to the subroutines after its definition. The prototype doesn’t matter for subroutine calls before its definition. However, to use the subroutine as a bareword, you still have to have a forward declaration:

use v5.10;
use utf8;

sub ?;
say ? +1;   # 3.1415926

sub ? ();
say ? +1;   # 4.1415926

BEGIN {
	*? = sub (){ 3.1415926 }
	}

Not only that, we can change the prototype as perl parses your program.

use v5.10;
use utf8;

sub ?;
say ? +1;

sub ? ();
say ? +1;   # 4.1415926

sub ? ($$);
say ? +1;   

BEGIN {
	*? = sub (){ 3.1415926 }
	}

The sub ? ($$) tells perl to expect two arguments for the subsequent calls. Since you don’t give it enough arguments

Prototype mismatch: sub main::? () vs ($$) at proto.pl line 10.
Not enough arguments for main::? at proto.pl line 11, near "1;"
BEGIN not safe after errors--compilation aborted at proto.pl line 15.

More than zero arguments

To take more or more arguments, you just specify that number of items in the prototype. So far, you’ve only seen scalar arguments, which you specify as a $ in the prototype:

sub twofer    ($$);    # exactly two arguments
sub hat_trick ($$$);   # exactly three arguments

This does not mean that the subroutine gets that number of arguments. It does not mean that it takes two scalar variables as arguments. Perl parses each argument in scalar context:

use v5.10;

sub twofer ($$) { say "@_" };

my @array = qw( Buster Mimi Roscoe );
twofer @array, 2;

my %hash = map { $_ => 1 } 'a' .. 'z';
twofer %hash, 2;

The @array is a single argument, the first one, and is taken in scalar context, giving the number of elements in it. The %hash is treated in the same way, providing the mostly useless “hash statistics” scalar value:

3 2
19/32 2

Putting a \ in front of a prototype character specifies that the argument is a named variable. Instead of the value, you get a reference to the value:

use v5.10;

sub twofer (\$) { say "@_" };

my $scalar = 'Buster';
twofer $scalar;

The output shows the reference:

SCALAR(0x10082e548)

If you try to give it a non-variable, perl complains at compile-time:

Type of arg 1 to main::twofer must be scalar (not constant item) at proto.pl line 6, near "2;"
Execution of proto.pl aborted due to compilation errors.

A non-backslashed @ in a prototype specifies a list and forces list context on the rest of the arguments. It does not require and array argument:

use v5.10;

sub twofer (@) { say "@_" };

my @array = qw( Buster Mimi Roscoe );
twofer @array, 2;

my %hash = map { $_ => 1 } 'a' .. 'z';
twofer %hash, 2;

The output shows the normal list flattening behavior you expect from a Perl subroutine call. Notice that it does not care about the number or type of arguments:

Buster Mimi Roscoe 2
w 1 r 1 a 1 x 1 d 1 j 1 y 1 u 1 k 1 h 1 g 1 f 1 t 1 i 1 e 1 n 1 v 1 m 1 s 1 l 1 c 1 p 1 q 1 b 1 z 1 o 1 2

If you wanted to keep the array together, you would put a backslash in front of the \@. The argument must be a named array, and not an anonymous array or a reference to an array. Even though the argument is an array, the value in @_ will be a reference to that array:

use v5.10;

sub twofer (\@$) { say "@_" };

my @array = qw( Buster Mimi Roscoe );
twofer @array, 2;

The output shows two arguments:

ARRAY(0x100827810) 2

You can’t then sneak in a scalar variable or a hash variable:

use v5.10;

sub twofer (\@$) { say "@_" };

my @array = qw( Buster Mimi Roscoe );
twofer @array, 2;

my %hash = map { $_ => 1 } 'a' .. 'z';
twofer %hash, 2;

perl catches that at compile-time:

Type of arg 1 to main::twofer must be array (not private hash) at proto.pl line 9, near "2;"
Execution of proto.pl aborted due to compilation errors.

If you want to take more than one type of argument at a particular position, you can specify the possible types in brackets. To take either an array or a hash, you use [@%]:

use v5.10;

sub twofer (\[@%]$) { say "@_" };

my @array = qw( Buster Mimi Roscoe );
twofer @array, 2;

my %hash = map { $_ => 1 } 'a' .. 'z';
twofer %hash, 2;

Now the output takes either:

ARRAY(0x100827810) 2
HASH(0x10082e0c8) 2

If you want to take two separate arrays,

use v5.10;

sub twofer (\@\@) { say "@_" };

my @array1 = qw( Buster Mimi Roscoe );
my @array2 = qw( Ginger Ellie );
twofer @array1, @array2;

You get one reference for each array:

ARRAY(0x100827810) ARRAY(0x10082dff0)

Even though you can specify the variable type with the backslashed form, you can’t specify anything about the values that they hold, or limits to the number of elements they contain.

You can also specify prototypes for subroutines and globs, which we’ll cover in a separate Item since you can have a lot more fun with those.

Optional arguments

So far, you’ve used prototypes that specify an exact number of elements. If you want to specify optional arguments, you can divide the mandatory and optional prototype characters with a semicolon. If you wanted to take at least two but possible three arguments, you’d use the prototype ($$;$)

use v5.10;

sub hat_trick ($$;$) { say "@_" };

hat_trick 'Buster', 'Mimi';
hat_trick 'Buster', 'Mimi', 'Roscoe';

Both of those work just fine, but if you try to give it four arguments, you get a compilation error telling you that there are “Too many arguments”.

If you want a minimum number of arguments, but no maximum, you can use the @ as the optional argument. All of these are fine:

use v5.10;

sub hat_trick ($$;@) { say "@_" };

hat_trick 'Buster', 'Mimi';
hat_trick 'Buster', 'Mimi', 'Roscoe';
hat_trick 'Buster', 'Mimi', 'Roscoe', 'Ginger';

Here are some interesting prototypes from List::MoreUtils:

# from List::MoreUtils
sub each_array (\@;\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@)

sub natatime ($@)

sub mesh (\@\@;\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@\@)

A final warning

Subroutine prototypes exist chiefly so perl can parse calls to your subroutines just like it would its built-ins—without parentheses. They can set the context for the arguments or the variable types, but they can’t specify the sorts of values. Prototypes aren’t the tools that you want if any of those are your goal. It’s also easier to just use parentheses to mark your argument list.

Things to remember

  • Prototypes are not function signatures
  • Non-backslashed prototype characters enforce a context, not a type
  • Backslashed prototype characters enforce a variable type
  • You can specify optional arguments after a semicolon in the prototype.

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit

Return error objects instead of throwing exceptions

Programmers generally consider two types of error communication: the “modern” and shiny exception throwing, and the old and decrepit return values. When they consider these, they choose one and forsake the other. One is good, and the other is bad. Programmers won’t agree on which is which though.

The return value technique comes from older languages that had no other convenient way to do it:

my $result = some_function( @args );
if( $result ) {
	...     # handle error
	}
...         # continue with program

This has a problem because the values that you want to return for normal operation get mixed up with those that you want to use to signal the error. You could add some buffer argument to fill in the error, but that’s really annoying, especially since Perl doesn’t have function signatures:

my $value = some_function( \$error, @args );
if( $$error ) {
	...     # handle error
	}
...         # continue with program

To get around this, some languages created a third path for error messages. An exception signals a problem, jumps out of the current code context, and hopes that someone handles it (Item 101: Use die to generate an exception). In Perl, you can use an eval to trap this then inspect the $@ variable for the error (Know the two different forms of eval):

my $value = eval { failing_sub( @args ) };
if( my $error = $@ ) {
	...
	}
...         # continue with program

Actually handling an error from an eval is tricky, so some people recommend using something like Try::Tiny, even though behind its interface its doing essentially the same thing:

use Try::Tiny;

try { failing_sub( @args ) }
	catch { ... }    # handle error
	finally { ... }; #fail over

This still isn’t much better, even if it does its best to handle the trickiness of $@. When you compare it to the other examples you’ve seen so far in this Item, you can’t really tell the difference at the syntax level. You call something, then add code to check a value. Exceptions, as an ideal, might have merit. If your language started with them as a core concept (which Perl did not), you probably have the flexibility you need to use them effectively. A language with exception handlers can both handle the error and pick up at the point of the error. In Perl, once you get the exception, you don’t have any way to get back to the spot where you threw the exception. Essentially, you just have a fancy return value. Put a bit more strongly, you have a crude goto that doesn’t even preserve its context.

Most people have failed to consider something Perly instead. The return value and buffer examples are hold-overs from C-like thinking, and the exceptions are object-oriented envy. Or, more correctly, envy for particular implementations of an object-oriented concept.

Since Perl does not have subroutine signatures, you don’t get to declare what you will give to a subroutine or what you get back from a subroutine. You pass it a list, and you get back a list (even if that is one or no items). In a C-like language, you’d return only one kind of thing, which made the return result a problem. You could return a particular type of struct, but that struct has to be the same type for success and failure. You might be able to force that to work, but then every subroutine returns the same struct and you have to translate that into the right values to pass to other routines. Ugh.

Perl doesn’t care what you return, so why not return an error object in case of an error, and anything else otherwise? The fundamental feature of an object is identity�an object knows what it is. Unless you get an error object, everything worked. When you get the result, you could look for objects of the right type:

use Scalar::Util qw(blessed);

my $results = some_function( @args );
if( blessed($result[0]) && $result[0]->isa( 'MyError' ) && $result[0]->is_error } ) {
	...     # handle error
	}
...         # continue with program

It’s easier, though, just to assume it’s an object and call the is_error method. If it’s not an object, you just catch the method call on the non-object with an eval, that you don’t need to trap. This also lets you use any object that has the is_error interface:

my @results = some_function( @args );
if( eval{ $result[0]->is_error } ) {
	...     # handle error
	}
...         # continue with program

That error object could get fancy, too, with a given-when (although with a for, as in Use for() instead of given()):

my @results = some_function( @args );
if( eval{ $result[0]->is_error } ) {
	for ( $result[0]->type ) {
		when( 'output' )       { ... }
		when( 'no_database' )  { ... }
		when( 'bad_request' )  { ... }
		default                { ... }
		}
	}

You haven’t seen anything about the error object though, mostly because it doesn’t matter. Indeed, this particular interface doesn’t matter. You don’t need anything fancy. The error class just carries some data around. It doesn’t do anything with the data and it doesn’t interrupt your flow control:

package Local::MyError {

	sub new {
		my( $class, $type, $message ) = @_;

		bless {
			message => $message,
			type    => $type,
			caller  => [ caller(1) ],
			};
		}

	sub is_error { 1 }
	sub type     { $_[0]->{type} }
	}

When you need to communicate a failure, you return the error object:

sub some_function {
	...;
	open my $fh, '>', $filename or return Local::MyError->new( ... );
	...;
	}

Such a class isn’t limited to this particular technique either, so you can get more use out of it. If you still want to use exceptions, you can use the error object with die:

sub some_function {
	...;
	open my $fh, '>', $filename or die Local::MyError->new( ... );
	...;
	}

my @results = eval { some_function( @args ) };
if( my $error = $@ and eval { $error->is_error } ) {
	for ( $error->type ) {
		when( 'output' )       { ... }
		when( 'no_database' )  { ... }
		when( 'bad_request' )  { ... }
		default                { ... }
		}
	}

This is very similar to the example in the autodie documentation:

eval {
	use autodie;

	open(my $fh, '<', $some_file);

	my @records = <$fh>;

	# Do things with @records...

	close($fh);

};

given ($@) {
	when (undef)   { say "No error";                    }
	when ('open')  { say "Error from open";             }
	when (':io')   { say "Non?open, IO error.";         }
	when (':all')  { say "All other autodie errors."    }
	default        { say "Not an autodie error at all." }
}

Things to remember

  • Exceptions aren’t that different than return values
  • You can’t resume execution after throwing an exception
  • You can return an error object to signal failure

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit

A Chinese translation of Effective Perl Programming

I mentioned a long time ago that a Chinese translation of Effective Perl Programming was in the works, and apparently it’s done. Someone sent me a copy of the Chinese version of the book. I can’t tell you who did it (if it’s you, let me know) and I don’t know where you can buy it (if you know, let me know). Also, I don’t know what I want to do with the copy that I have. I don’t read Chinese, so I can’t really read the book to see how well it translates, and I don’t want to keep the book as a trophy. Does someone else want the book? Is there a Chinese Perl event that would like to give it away as a prize? I’ll get Josh and I to sign it and send it along.

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit

Use lookarounds to eliminate special cases in split

The split built-in takes a string and turns it into a list, discarding the separators that you specify as a pattern. This is easy when the separator is simple, but seems hard if the separator gets more tricky.

For a simple example, you can split an entry from /etc/password (although getpw* functions will do that for you):

root:*:0:0:System Administrator:/var/root:/bin/sh

The colons separate the fields, so you split on a colon:

my @fields = split /:/, $passwd_line;

That works just fine because the separator is a single character, that character is the same between each field, and the separator character doesn’t appear in any of the data.

A slightly more tricky example has a character from the separator also show up in the data. Consider comma-separated values which also allows a comma in the data. If you really have to do this, you would use a module (Item 115. Don’t use regular expressions for comma-separated values). However, this is a good task to illustrate some of the tricks in this Item. You might see these data stored in many ways. You are likely to see all the fields quoted if any one of them has the comma:

"Buster","Roscoe, Cat","Mimi"

You can split on ",", which separates all the fields:

my $string = q("Buster","Roscoe, Cat","Mimi");

my @fields = split /","/, $string;

$" = "\n";
print "@fields\n";

However, the first and last fields have remnants of the quoting:

"Buster
Roscoe, Cat
Mimi"

In this case, the simple split failed because it only removes text between the fields and doesn’t care at all about text at the beginning of the string or the end of the string.

You might think that you can make special cases to handle the beginning and end of the string bits. Creating special cases is almost always what you want to avoid: they make the code more complicated and they make you think about more than you really need to think about. Still, you can do that with alternations in the pattern:

my $string = q("Buster","Roscoe, Cat","Mimi");

my @fields = split /\A"|","|"\z/, $string;

$" = "\n";
print "@fields\n";

And, it doesn’t work. The split maintains leading open fields, so we get an extra field at the start:


Buster
Roscoe, Cat
Mimi

You could handle that by removing the first element, but that’s more duct tape and spit over the other kludge. Not only do you have two special cases in the pattern, but you have a special case in the output.

You don’t have to remove the quotes right away though. You can reduce all the special cases by not matching the quote characters in the split pattern. You can use a lookaround to find the commas surrounded by quotes:

my $string = q("Buster","Roscoe, Cat","Mimi");

my @fields = split /(?<="),(?=")/, $string;

$" = "\n";
print "@fields\n";

The positive lookbehind, (?<=...), is a zero-width assertion. It matches a pattern that exists (hence positive) but doesn't consume the characters it matches. You already know about other zero-width assertions, such as \b and ^. These merely match a condition in the string before the pattern. The positive lookahead, (?<=...), is the same thing, but looks forward of the pattern.

Now all of the fields retain their quotes because the lookarounds do not consume the characters they match, even though they assert those characters must be there:

"Buster"
"Roscoe, Cat"
"Mimi"

You can easily strip off the quotes, handling every element returned by split in the same way:

use v5.14;
my $string = q("Buster","Roscoe, Cat","Mimi");

my @fields =
	map { s/\A"|"\Z//gr }
	split /(?<="),(?=")/, $string;

$" = "\n";
print "@fields\n";

The pattern has no special cases, and the output from split has no special cases. Eliminating special cases reduces the number of things you have to remember and the reduces the likelihood that you'll mess up one of the cases.

Buster
Roscoe, Cat
Mimi

What if the separator where even more complex, with a literal quote mark inside the data? If you can do that, you can imagine a quote character next to a comma in the field:

"Buster","Roscoe "","" Cat","Mimi"

Now you want to split on a comma with quotes around it, but only if it doesn't have two consecutive quotes on either side. You can combine the positive lookarounds with negative lookarounds. The negative versions act the same, but assert that the condition cannot match, just like a \B asserts that the position is not a word boundary:

use v5.14;
my $string = q("Buster","Roscoe "","" Cat","Mimi");

my @fields =
	map { s/"(?=")//gr }
	map { s/\A"|"\z//gr }
	split /(?<!"")(?<="),(?=")(?!"")/, $string;

$" = "\n";
print "@fields\n";

In processing the "", you use another positive lookahead to unescape the doubled double quote character:

Buster
Roscoe "," Cat
Mimi

As a final example, instead of quoted fields, you might see the non-separator comma as an escaped character:

Buster,Roscoe\, Cat,Mimi

In this case, you only want to split on a comma that does not have an escape character before it. You can't use a positive lookbehind because you don't want to match characters before the comma. Instead, you want a negative lookbehind because you want to assert that there are characters that can't appear before the comma. Instead of a =, you use a !:

use v5.14;
my $string = q(Buster,Roscoe\\, Cat,Mimi);

my @fields =
	map { s/\\(?=,)//gr }
	split /(?<!\\),/, $string;

$" = "\n";
print "@fields\n";

Again, you use another positive lookahead, (?=,), in the s/// so you substitution pattern does not match the character that you don't want to replace. Otherwise, you'd have to type the comma twice:

s/\\,/,/gr

You can go even further with these examples, creating much more ugly and complex examples with additional levels of quoting. This should naturally lead you to believe that regular expressions aren't the best tool for this (or at least a single regular expression).

Things to remember

  • If you really have to parse comma-separated values, use a module instead of writing your own patterns
  • Lookarounds assert a condition in the string without consuming any characters
  • The positive lookarounds assert their patterns must match
  • The negative lookarounds assert their pattern must not match
  • Use the lookarounds to eliminate special cases in complex split patterns

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit

Enchant closures for better debugging output

When you’re using code references heavily, you’re going to have a problem figuring out which one of them is having a problem. You define them in possibly several and far-flung parts of your program, but when it comes to using them, you don’t know which one you are using. You can’t really print its value like you would for a scalar, making it more difficult for you to debug things. You can dereference a scalar or an array to see what it is, but you can’t dereference a code reference without making it do something.

Consider this code, which defines a code reference along with other
references (also see Item 59. Compare reference types to prototypes). You can print the values in most reference types to see what they are, but you can’t do that directly with the code reference:

use v5.10;

my @array = ( \'xyz', [qw(a b c)], sub { say 'Buster' } );

foreach ( @array ) {
	say "$_";
	when( ref eq ref \ ''   ) { say "Scalar $$_" }
	when( ref eq ref []     ) { say "Array @$_" }
	when( ref eq ref sub {} ) { say "Sub ???" }
	}

When you dereference a value, nothing happens (aside from any tie magic). When you dereference a subroutine, you run its code with whatever arguments you give to it. If the subroutine needs arguments, which arguments would you use if you wanted to see what the subroutine would do?

There’s a clever way around this, first noted by Randal Schwartz in his Perlmonks post Track the filename/line number of an anonymous coderef. He proposed a new subroutine, main::Sub to use in place of the sub keyword.

BEGIN {
  package MagicalCoderef;

  use overload '""' => sub {
    require B;

    my $ref = shift;
    my $gv = B::svref_2object($ref)->GV;
    sprintf "%s:%d", $gv->FILE, $gv->LINE;
  };

  sub main::Sub (&) {
    return bless shift, __PACKAGE__;
  }
}

This technique actually made the code reference an object so he could overload stringification.

my $s = Sub { say +shift };
print "$s\n";

This stringified the code reference as the filename and line number where you created it:

/Users/Buster/Desktop/magic_coderef.pl:19

You can go farther than this, though, and make this a bit more useful. In response to Randal’s post, I suggested turning the idea inside-out. Instead of using Sub, I exposed the object creation. That way, you don’t have to worry about which package Sub might be in:

use v5.14;

package MagicalCodeRef 0.90 {
    use overload '""' => sub
        {
        require B;

        my $ref = shift;
        my $gv = B::svref_2object($ref)->GV;
        sprintf "%s:%d", $gv->FILE, $gv->LINE;
        };

    sub enchant { bless $_[1], $_[0] }
    }

You can apply this magic to code references that you already have to get the same result (you could also do this with Sub, but it looks odd):

my $s = MagicalCodeRef->enchant( sub { say +shift } );
print "$s\n";

Still, that’s not good enough. You where where you created the subroutine, but that might not be enough information for you. You can use even more B magic. The B::Deparse module can decompile code to show you what perl thinks it is (we used this briefly in Item 7. Know which values are false and test them accordingly).

use v5.14;

package MagicalCodeRef 1.00 {
    use overload '""' => sub
        {
        require B;

        my $ref = shift;
        my $gv = B::svref_2object($ref)->GV;

		require B::Deparse;
		my $deparse = B::Deparse->new;
		my $code = $deparse->coderef2text($ref);

        my $string = sprintf "---code ref---\n%s:%d\n%s\n---",
        $gv->FILE, $gv->LINE, $code;

        };

    sub enchant { bless $_[1], $_[0] }
    }

With the same bit of code, you get additional output:

my $s = MagicalCodeRef->enchant( sub { say +shift } );
print "$s\n";

The output shows everything that perl thinks it needs to reproduce that behavior, including some pragma settings and compiler hints:

---code ref---
/Users/brian/Desktop/magic:25
{
    use strict 'refs';
    BEGIN {
        $^H{'feature_unicode'} = q(1);
        $^H{'feature_say'} = q(1);
        $^H{'feature_state'} = q(1);
        $^H{'feature_switch'} = q(1);
    }
    print shift();
}
---

The code doesn’t look the same as the code reference you initially created, but at least you have an idea what the code reference does.

If the code reference is a closure, you might also need to know which variables it closed over and what their values are. You can get these from the PadWalker module (which doesn’t come with Perl so you’ll need to get it from CPAN):

use v5.14;

package MagicalCodeRef 1.01 {
    use overload '""' => sub
        {
        require B;

        my $ref = shift;
        my $gv = B::svref_2object($ref)->GV;

		require B::Deparse;
		my $deparse = B::Deparse->new;
		my $code = $deparse->coderef2text($ref);

        require PadWalker;
        my $hash = PadWalker::closed_over( $ref );

		require Data::Dumper;
		local $Data::Dumper::Terse = 1;
        my $string = sprintf "---code ref---\n%s:%d\n%s\n---\n%s---",
        $gv->FILE, $gv->LINE,
        $code,
        Data::Dumper::Dumper( $hash );

        };

    sub enchant { bless $_[1], $_[0] }
    }

Give this new version of MagicalCodeRef a closure:

my $sub = do {
	my( $x, $y ) = qw( Buster Mimi );

	sub { print "$x $y @_" }
	};

my $s = MagicalCodeRef->enchant( $sub );
say $s;

Now you see that which variables in the code reference refer to lexical variables that are out of scope instead of package or special variables. Only the lexical variables show up in the Dumper output:

---code ref---
/Users/brian/Desktop/magic:35
{
    use strict 'refs';
    BEGIN {
        $^H{'feature_unicode'} = q(1);
        $^H{'feature_say'} = q(1);
        $^H{'feature_state'} = q(1);
        $^H{'feature_switch'} = q(1);
    }
    print "$x $y @_";
}
---
{
  '$y' => \'Mimi',
  '$x' => \'Buster'
}
---

Things to remember

  • You can bless code references and overload their stringification to output what you like
  • The B module can tell you the filename and line number where you created the closure
  • The B::Deparse module can decompile a code reference
  • The PadWalker module can give you the closed-over
    lexical variables and their values.

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit