Use Perl 5.22’s <<>> operator for safe command-line handling

We’ve had the three-argument open since Perl 5.6. This allows you to separate the way you want to interact with the file from the filename.

Old Perl requires you to include the mode and filename together, giving Perl the opportunity to interpret what you mean.
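Here’s a rough sketch of what that post covers (the filenames are only illustrative): the three-argument open keeps the mode out of the filename, and the double-diamond loop applies the same safety to @ARGV.

#!/usr/bin/perl
use v5.22;
use warnings;

# Three-argument open: the mode is its own argument, so nothing in the
# filename can turn the open into a pipe or an append.
open my $fh, '<', 'report.txt' or die "Could not open report.txt: $!";
print while <$fh>;
close $fh;

# Perl 5.22's <<>> reads the files named in @ARGV with the three-argument
# open, so an argument such as '|mail attacker' is treated as a literal
# filename rather than a command to run.
while ( <<>> ) {
    print;
}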

Avoid modifying scalars connected to string filehandles

Since Perl 5.8, you can treat a string as a file (Item 54. Open filehandles to and from strings). You can open a filehandle, read from the string, write to the string, and do most of the other things that you can do with a file. There are some gotchas, though, when you deal with that string as a normal string and as a filehandle at the same time. We’ve filed this as RT 78980: Odd behavior when string filehandles and scalar assignment collide.
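The basic technique looks like this (a quick sketch; the variable names are just for illustration): open a filehandle on a reference to the scalar and treat it like any other file.

use strict;
use warnings;

my $string = '';

# Write to the scalar through a filehandle...
open my $out, '>', \$string or die "Could not open string for writing: $!";
print {$out} "Hello, string filehandle!\n";
close $out;

# ...then read it back the same way.
open my $in, '<', \$string or die "Could not open string for reading: $!";
print while <$in>;
close $in;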

Effective Perl free sample chapter: Files and Filehandles

Addison-Wesley converted our chapter on “Files and Filehandles” to HTML and put it online as a free sample chapter. I selected this chapter as the free sample because it was not only the most fun to write but also the most valuable to new Perl programmers. Filehandles are the way you interact with the world, and using them wisely can give your program quite a bit of flexibility and make many tasks much easier.

Here’s the list of Items from that chapter, each of which you can read for free online:

We’ve also added more Items for “Files and Filehandles” in this blog, which you can also read for free. However, don’t forget about the Donate button on the right-hand side of the page if you find this site valuable. Or, buy our book and encourage all your friends to buy our book. Donations and book sales give us a little motivational boost to keep going. :)

Memory-map files instead of slurping them

The conventional wisdom for slurping a file into a Perl program is to actually load the entire file into memory. We showed some of these techniques in Item 53: Consider different ways of reading from a stream.

There are several idioms for doing it, from doing it yourself:

my $text = do { local( @ARGV, $/ ) = $file; <> };

to using an optimized module such as File::Slurp:

use File::Slurp qw(read_file);

my $text = read_file( $file );
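Another do-it-yourself variation (a sketch, not from the original Item) skips the @ARGV trick, opening the file explicitly and localizing $/ for just that read:

my $text = do {
    open my $fh, '<', $file or die "Could not open $file: $!";
    local $/;    # no record separator, so <$fh> returns everything at once
    <$fh>;
};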

Given a large file, say, something that is 2 GB, you end up with a memory footprint that is at least the file size. This program took 11 seconds to load a 2 GB file on my Mac Pro. The memory footprint rose to 2.25 GB and stayed there even after $text went out of scope:

#!/usr/bin/perl
use strict;
use warnings;

print "I am $$\n";

use File::Slurp;

{
    my $start = time;
    my $text  = read_file( $ARGV[0] );    # reads the whole file into memory
    my $loadtime = time - $start;
    print "Loaded file in $loadtime seconds\n";

    # assign the match list to an empty list, then count it
    my $count = () = $text =~ /abc/g;

    print "Found $count occurrences\n";
}

print "Press enter to continue...";

<STDIN>;

The problem is the idea that you have to capture and hold on to all of the data yourself to make use of it.

To solve this, you should avoid the painful part. That is, don’t load the file at all. That I/O is really slow! You can memory-map, or mmap, the file. The name comes from the system call that makes it possible.

Instead of loading the file, you use mmap to make a connection between your address space and the file on the disk. You don’t have to worry about how this happens; you simply use part of a disk file as if it were actually in memory. The advantage is that you skip the I/O overhead, so there is no load time, and since you don’t have to make space to hold the file in memory, you don’t pay a large memory penalty either.

This program uses File::Map. You “load” the file instantly, and its actual memory footprint was under 3 MB (three orders of magnitude less!):

#!/usr/bin/perl
use strict;
use warnings;

use File::Map qw(map_file);

print "I am $$\n";

{
    my $start = time;
    map_file my $map, $ARGV[0];    # map the file instead of reading it
    my $loadtime = time - $start;
    print "Loaded file in $loadtime seconds\n";

    # assign the match list to an empty list, then count it
    my $count = () = $map =~ /abc/g;

    print "Found $count occurrences\n";
}

<STDIN>;    # pause here so you can check the memory footprint

The $map acts just like a normal Perl string, and you don’t have to worry about any of the mmap details. When the variable goes out of scope, the mapping is released, so your program isn’t left holding a large chunk of unused memory.

In Tim Bray’s Wide Finder contest to find the fastest way to process log files with “wider” rather than “faster” processors, the winning solution was a Perl implementation using mmap (although using the older Sys::Mmap). Perl wasn’t special in that regard; most of the top solutions used mmap to avoid the I/O penalty.

mmap is especially handy when you have to work with several files at the same time (or even sequentially, if Perl would otherwise need to find a chunk of contiguous memory for each one). Since you don’t have the data in real memory, you can mmap as many files as you like and work with them simultaneously.
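For example, this sketch (the pattern is only illustrative) maps every file named on the command line at once without paying for their combined sizes:

use File::Map qw(map_file);

my @maps;
for my $file ( @ARGV ) {
    map_file my $map, $file;    # each map costs address space, not real memory
    push @maps, \$map;          # keep a reference so the mapping stays alive
}

my $total = 0;
$total += () = ${ $_ } =~ /abc/g for @maps;

print "Found $total occurrences across ", scalar @maps, " files\n";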

Also, since the data actually live on the disk, different programs running at the same time can share the data, including seeing the changes each program makes (although you have to work out the normal concurrency issues yourself). That is, mmap is a way to share memory.

The File::Map module can do much more, too. It allows you to lock maps, and you can also synchronize access from threads in the same process.

If you don’t actually need your own copy of the data in memory, don’t load it: mmap it instead.