Manage your Perl modules with git

In Item 110: Compile and install your own perls, you saw how to install multiple versions of perl and how to maintain each installation separately. Doing something with one version of Perl doesn’t affect any of the other versions.

You can take that a step further. Within each installation, you can use a source control system to manage your Perl modules. In this post you’ll use git, which has the advantage that you don’t need a server.

First, install your perl into its own directory:

% ./Configure -des -Dprefix=/usr/local/perls/perl-5.10.1
% make test
% make install

Second, before you do anything else with your newly installed perl, put your new directory into source control:

% cd /usr/local/perls/perl-5.10.1
% git init
% git add .
% git commit -a -m "Initial installation of Perl 5.10.1"

You’re not quite done there, though. You’re on the master branch:

% git branch
* master

You want to keep at least one pristine branch that is the initial state of your perl installation. You can always come back to it:

% git checkout -b pristine
Switched to a new branch 'pristine'

Leave that branch alone and switch back to master:

% git checkout master
Switched to branch 'master'

From here you can do many things, but you probably want to consider the master branch your “stable” branch. You don’t want to commit anything to that branch until you know it works. When you install new modules, use a different branch until you know you want to keep them:

% git checkout -b unstable
Switched to a new branch 'unstable'
% cpan LWP::Simple
% git add .
% git commit -a -m "* Installed LWP::Simple"

After using your newly installed modules for a while and deciding that they’re stable, merge your unstable branch with master. Once merged, switch back to the unstable branch to repeat the process:

% git checkout master
Switched to branch 'master'
% git merge unstable
% git checkout unstable

Anytime that you want to start working with a clean installation, you start at the pristine branch and make a new branch from there:

% git checkout pristine
% git checkout -b newbranch

If you aren’t tracking your perl in source control already, just tracking a master and unstable branch can give you an immediate benefit. However, you can take this idea a step further.

With just one perl installation, you can create multiple branches to try out different module installations. Instead of merging these branches, you keep them separate. When you want to test your application with a certain set of modules, you merely switch to that branch and run your tests. When you want to test against a different set, change branches again. That can be quite a bit simpler than managing multiple directories that you have to constantly add or remove from @INC.

Know what creates a scope

Scopes can be confusing. Perl 5 introduced lexical, or my, variables that are only visible in the scope in which you define them. To scope your variables properly, you need to know what defines a scope and what doesn’t.

You commonly see lexical variables for subroutine arguments, for instance:

sub foo {
    my( $self, @args ) = @_;
    ...;
    }

The variables $self and @args don’t exist outside of that subroutine (ignoring black magic with things such as PadWalker). Lexical variables have limited effect and no action at a distance, making them invaluable for robust programming. Not only that, but since lexical variable names only matter in their own scope, you don’t have to know about all of the variables you’ve already defined elsewhere to choose variable names in the current scope.
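
For instance, here is a small sketch (the subroutine and variable names are made up) showing that two subroutines can each use a lexical named $count without stepping on each other:

sub first_sub {
    my $count = 10;    # this $count exists only inside first_sub
    return $count;
    }

sub second_sub {
    my $count = 20;    # a completely separate $count
    return $count;
    }

print first_sub(), "\n";     # 10
print second_sub(), "\n";    # 20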

Before Perl 5, all variables were package variables (so, global). Perl 5 couldn’t just ignore all of the existing Perl 4 programs, so it ended up supporting both the global package variables and lexical variables. That can make things confusing if you don’t understand the difference.
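
As a quick sketch of the difference, a package variable has a fully qualified name that any code can reach, while a lexical exists only in its enclosing scope (the variable names here are only for illustration):

our $global = 'package variable';    # really $main::global, reachable from anywhere

{
my $lexical = 'lexical variable';    # visible only inside this block
print "$global and $lexical\n";      # both visible here
}

print "$main::global\n";    # package variables answer to their full name
# print $lexical;           # not visible here; under strict this doesn't even compile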

First, you need to know what makes a scope. Most people can give you at least one answer: a block creates a scope. Blocks show up in the syntax of many of Perl’s commonly used features:

# a subroutine definition block, perhaps anonymous
sub foo { ... }
my $foo = sub { ... };

# blocks for control structures
foreach ( @array ) { ... }
while( $condition ) { ... }
if( $condition ) { ... }

# blocks related to functions:
my $result = do { ... };
my @transformed = map { ... } @input;
my @filtered = grep { ... } @input;

# blocks in regular expressions
m/(?{...})/

Sometimes you can create the lexical variable outside of the block even though it’s scoped to the block. You can declare the lexical variable in the test for while or if (and their cousins), or as the control variable you want to use with foreach:

foreach my $index ( 0 .. 5 ) {
	print "index: $index\n";
	}

while( my $line = <DATA> ) {
	print "line: $line";
	}

if( my $foo = 'abc' ) {
	print "foo is $foo\n";
	}

You don’t need a control structure or operator to use a block to define the scope. You can use a bare block to create a scope:

# bare blocks
{
my $cat = 'Buster';
...;
}

Most Perlers can identify blocks as scope definers, but there’s another scope definer that many people miss. File this away for your job interview trivia: a file is a scope too. You can’t see lexical variables outside of the file in which you define them, even if you don’t explicitly create the scope with a block. It’s as if there is a virtual block around the entire file.

You can use the file scope to create private class variables. The methods you define in the same file can see the private variables, but code in other files, such as subclasses, can’t mess with them:

package Some::Class;

my $private = 0; # only visible in this file

sub some_method {
   ...; # can see $private
   }

If you want other parts of the program to get or set the value in this private variable despite its scope, you can provide accessor methods. This gives you a chance to head off any shenanigans before you allow someone to change the value:

package Some::Class;

my $private = 0;

sub get_private { $private }
sub set_private { $private = $_[1] }

Some people extend the idea of private class variables too far because they think that a package creates a scope. It doesn’t. A package statement merely defines the default package unless you explicitly specify one. Since lexical variables aren’t connected to packages, they don’t care what the current package is. A lexical variable stays visible even if you change the package, even in another block:

package Some::Class;

my $n = 'Can you see me?';

{
package main;
# $n still visible here
}

package Some::Class::Subclass;

# $n still visible

There are some more tricks with scopes and what constitutes a scoped variable, but that’s a matter for a future Item.

Things to remember

  • Lexical variables are only visible in their scope.
  • A block defines a scope.
  • A file defines a scope.
  • A package does not define a scope.

Memory-map files instead of slurping them

The conventional wisdom for slurping a file into a Perl program is to actually load the entire file into memory. We showed some of those techniques in Item 53: Consider different ways of reading from a stream.

There are several idioms for doing it, ranging from doing it yourself:

my $text = do { local( @ARGV, $/ ) = $file; <> };

to using an optimized module such as File::Slurp:

use File::Slurp qw(read_file);

my $text = read_file( $file );

Given a large file, say, something that is 2 GB, you end up with a memory footprint that is at least the file size. This program took 11 seconds to load a 2 GB file on my Mac Pro. The memory footprint rose to 2.25 GB and stayed there even after $text went out of scope:

#!/usr/bin/perl
use strict;
use warnings;

print "I am $$\n";

use File::Slurp;

{
my $start = time;
my $text = read_file( $ARGV[0] );
my $loadtime = time - $start;
print "Loaded file in $loadtime seconds\n";

my $count = () = $text =~ /abc/g;

print "Found $count occurrences\n";
}

print "Press enter to continue...";

<STDIN>;

The problem is the idea that you have to somehow capture and hold all of the data yourself to make use of it.

To solve this, you should avoid the painful part. That is, don’t load the file at all. That I/O is really slow! You can memory-map, or mmap, the file. The name comes from the system call that makes it possible.

Instead of loading the file, you use mmap to make a connection between your address space and the file on disk. You don’t have to worry about how this happens; you simply use part of a disk file as if it were actually in memory. The advantage is that you skip the I/O overhead, so there is no load time, and since you don’t have to make space to hold the file in memory, you don’t pay the memory footprint either.

This program uses File::Map. You “load” the file instantly, and its actual memory footprint was under 3 MB (three orders of magnitude less!):

#!/usr/bin/perl
use strict;
use warnings;

use File::Map qw(map_file);

print "I am $$\n";

{
my $start = time;
map_file my $map, $ARGV[0];
my $loadtime = time - $start;
print "Loaded file in $loadtime seconds\n";

my $count = () = $map =~ /abc/g;

print "Found $count occurrences\n";
}

<STDIN>;

The $map acts just like a normal Perl string, and you don’t have to worry about any of the mmap details. When the variable goes out of scope, the map is broken and your program doesn’t suffer from a large chunk of unused memory.

In Tim Bray’s Wide Finder contest to find the fastest way to process log files with “wider” rather than “faster” processors, the winning solution was a Perl implementation using mmap (although through the older Sys::Mmap module). Perl wasn’t special in that regard, because most of the top solutions used mmap to avoid the I/O penalty.

mmap is especially handy when you have to do this with several files at the same time (or even sequentially, if Perl would otherwise need to find a chunk of contiguous memory for each one). Since you don’t have the data in real memory, you can mmap as many files as you like and work with them simultaneously.
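
For example, here is a minimal sketch that maps two large files at once and searches each of them; the filenames are hypothetical:

#!/usr/bin/perl
use strict;
use warnings;

use File::Map qw(map_file);

# map both files; neither one is read into memory
map_file my $access_log, 'access.log';
map_file my $error_log,  'error.log';

foreach my $map ( $access_log, $error_log ) {
	my $count = () = $map =~ /abc/g;    # same counting idiom as before
	print "Found $count occurrences\n";
	}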

Also, since the data actually live on the disk, different programs running at the same time can share the data, including seeing the changes each program makes (although you have to work out the normal concurrency issues yourself). That is, mmap is a way to share memory.

The File::Map module can do much more, too. It allows you to lock maps and to synchronize access from threads in the same process.

If you don’t actually need the data in your program, don’t ever load it: mmap it instead.

What’s the difference between a list and an array?

I recently updated perlfaq4’s answer to “What’s the difference between a list and an array?”. The difference between data and variables is often lost on the person who starts their programming career in a high-level language.

We hit this subject pretty hard in the first chapter of Effective Perl Programming, 2nd Edition in at least three Items:

  • Item 9: Know the difference between lists and arrays.
  • Item 10: Don’t assign undef when you want an empty array.
  • Item 12: Understand context and how it affects operations.

Here’s the current answer in perlfaq4:


A list is a fixed collection of scalars. An array is a variable that holds a variable collection of scalars. An array can supply its collection for list operations, so list operations also work on arrays:

# slices
( 'dog', 'cat', 'bird' )[2,3];
@animals[2,3];

# iteration
foreach ( qw( dog cat bird ) ) { ... }
foreach ( @animals ) { ... }

my @three = grep { length == 3 } qw( dog cat bird );
my @three = grep { length == 3 } @animals;

# supply an argument list
wash_animals( qw( dog cat bird ) );
wash_animals( @animals );

Array operations, which change the scalars, rearrange them, or add or subtract some scalars, only work on arrays. These can’t work on a list, which is fixed. Array operations include shift, unshift, push, pop, and splice.
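
For example, push works on the @animals array, but there is no way to push onto a literal list:

push @animals, 'fish';              # fine: @animals is an array
# push qw( dog cat bird ), 'fish';  # compile error: push needs an array, not a list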

An array can also change its length:

$#animals = 1;  # truncate to two elements
$#animals = 10000; # pre-extend to 10,001 elements

You can change an array element, but you can’t change a list element:

$animals[0] = 'Rottweiler';
qw( dog cat bird )[0] = 'Rottweiler'; # syntax error!

foreach ( @animals ) {
	s/^d/fr/;  # works fine
	}

foreach ( qw( dog cat bird ) ) {
	s/^d/fr/;  # Error! Modification of read only value!
	}

However, if the list element is itself a variable, it may appear that you can change a list element. The list element, though, is the variable, not the data; you’re not changing the list element but the data that the variable holds. The list element itself doesn’t change: it’s still the same variable.
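
For example, if $dog is a variable holding 'dog', the substitution changes the data in $dog rather than the list:

my $dog = 'dog';

foreach ( $dog, 'cat', 'bird' ) {
	s/^d/fr/;    # only the element aliased to $dog matches
	}

print "$dog\n";    # frog: the data in $dog changed, but the list element is still the same variable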

You also have to be careful about context. You can assign an array to a scalar to get the number of elements in the array. This only works for arrays, though:

my $count = @animals;  # only works with arrays

If you try to do the same thing with what you think is a list, you get a quite different result. Although it looks like you have a list on the righthand side, Perl actually sees a bunch of scalars separated by a comma:

my $scalar = ( 'dog', 'cat', 'bird' );  # $scalar gets bird

Since you’re assigning to a scalar, the righthand side is in scalar context. The comma operator (yes, it’s an operator!) in scalar context evaluates its lefthand side, throws away the result, then evaluates its righthand side and returns that result. In effect, that list-lookalike assigns to $scalar its rightmost value. Many people mess this up because they choose a list-lookalike whose last element is also the count they expect:

my $scalar = ( 1, 2, 3 );  # $scalar gets 3, accidentally

Get a review copy of Effective Perl Programming

Addison-Wesley is ramping up its promotion of Effective Perl Programming, 2nd Edition. They’ll send out several free copies to people who like to review books and tell their friends about it. If you’re interested in getting one of those review copies, send me a note at . If I don’t already know you, introduce yourself and send me some links to your previous book reviews, your blog, and so on.

Effective Perl Programming at Frozen Perl 2010

On Friday, February 5, 2010 at Frozen Perl, I’m teaching a new master class based on our book, Effective Perl Programming, 2nd Edition. The master class is a low-cost, condensed class format given at a public Perl event. The cost of this one-day class is $100, and there are a limited number of $25 seats for full-time students.

In the one-day class for intermediate Perl programmers, I’ll cover selected topics from the book, including:

  • Working with Unicode in Perl
  • Tricks with filehandles
  • New regex features in Perl 5.10 and later
  • Playing with pack
  • Using closures to make things simpler
  • and other topics as time allows

Although the book hasn’t been published yet, it is available for pre-order on Amazon, and attendees of the class will get a sneak peek at the working manuscript as well as a soft copy of the course slides.

Effective Perl Programming, the blog

Josh McAdams and I recently gave Addison-Wesley our final manuscript for Effective Perl Programming, 2nd Edition, an update to Joseph Hall’s excellent first edition published in 1998. Not only have we updated the existing material up to Perl 5.10, but we’ve also added quite a bit of new material covering the evolution of Perl and its practices over the last 12 years.

We’re not stopping there, though. We had plenty more that we wanted to write but couldn’t fit in the book. Sometimes we had more to say on an Item in the book, and sometimes we wanted to cover a topic that we simply didn’t have the space for. We will trickle out that extra material here. Additionally, we’d like to keep up a conversation with the readers of the latest edition to refine and expand the material that is already there.

Perl 5.8 features

Perl 5.8, mostly a release that improved the unseen parts of Perl, did have some notable, user-visible features:


  • Restricted hashes with Hash::Util (5.8.0)
  • Use sitecustomize.pl to affect all programs (5.8.7)
  • PerlIO is now the default (5.8.0)