Memory-map files instead of slurping them

The conventional way to slurp a file in a Perl program is to actually load the whole file into memory. We showed some ways to do this in Item 53: Consider different ways of reading from a stream.

There are several idioms for doing it, from doing it yourself:

my $text = do { local( @ARGV, $/ ) = $file; <> };

or using an optimized module such as File::Slurp:

use File::Slurp qw(read_file);

my $text = read_file( $file );
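If the one-line idiom is too terse, the do-it-yourself approach can be spelled out with an explicit filehandle. This is only a minimal sketch using core Perl; slurp_file is a hypothetical helper name for illustration, not part of any module:

```perl
use strict;
use warnings;

# slurp_file is a hypothetical helper name used only for this sketch.
sub slurp_file {
    my( $file ) = @_;

    open my $fh, '<', $file or die "Could not open $file: $!";

    # Undefine the input record separator for this scope so that a
    # single read returns the whole file instead of one line.
    local $/;
    my $text = <$fh>;

    close $fh;
    return $text;
}
```

Like the other idioms, this still copies the entire file into a Perl scalar.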

Given a large file, say 2 GB, you end up with a memory footprint that is at least the file size. This program took 11 seconds to load a 2 GB file on my Mac Pro. The memory footprint rose to 2.25 GB and stayed there even after $text went out of scope:

#!/usr/bin/perl
use strict;
use warnings;

print "I am $$\n";

use File::Slurp;

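{
    my $start = time;
    my $text = read_file( $ARGV[0] );
    my $loadtime = time - $start;
    print "Loaded file in $loadtime seconds\n";

    # The /g flag returns every match; assigning to the empty list
    # forces list context so that $count gets the number of matches.
    my $count = () = $text =~ /abc/g;

    print "Found $count occurrences\n";
}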

print "Press enter to continue...";

<STDIN>;

The problem is the assumption that you have to capture and hold on to the data to make use of it.

To solve this, you should avoid the painful part. That is, don’t load the file at all. That I/O is really slow! You can memory-map, or mmap, the file. The name comes from the system call that makes it possible.

Instead of loading the file, you use mmap to make a connection between your address space and the file on the disk. You don’t have to worry about how this happens, but basically you use part of a disk file as if it were actually in memory. The advantage is that you don’t pay the I/O cost up front, so there is effectively no load time: the operating system reads in pages of the file only as you touch them. And since your program doesn’t have to allocate space to hold its own copy of the file, you don’t pay the same memory footprint.

This program uses File::Map. You “load” the file instantly, and its actual memory footprint was under 3 MB (three orders of magnitude less!):

#!/usr/bin/perl
use strict;
use warnings;

use File::Map qw(map_file);

print "I am $$\n";

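{
    my $start = time;
    # Map the file read-only; no data is copied into $map here.
    map_file my $map, $ARGV[0];
    my $loadtime = time - $start;
    print "Loaded file in $loadtime seconds\n";

    # The /g flag returns every match; assigning to the empty list
    # forces list context so that $count gets the number of matches.
    my $count = () = $map =~ /abc/g;

    print "Found $count occurrences\n";
}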

<STDIN>;

The $map acts just like a normal Perl string, and you don’t have to worry about any of the mmap details. When the variable goes out of scope, the mapping is released and your program doesn’t suffer from a large chunk of unused memory.
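To illustrate the string-like behavior, here is a small sketch that applies ordinary string functions to a mapped file. It assumes File::Map is installed; first_line_and_size is a hypothetical helper name made up for this example:

```perl
use strict;
use warnings;

use File::Map qw(map_file);

# first_line_and_size is a hypothetical helper name for this sketch.
sub first_line_and_size {
    my( $file ) = @_;

    # Read-only mapping; $map can be used like an ordinary string.
    map_file my $map, $file;

    my $size = length $map;
    my $end  = index $map, "\n";

    # Copy out only the first line (or everything if there is no newline).
    my $first = $end < 0 ? $map : substr( $map, 0, $end );

    # $map goes out of scope on return, which releases the mapping.
    return( $first, $size );
}
```

Only the small copy in $first survives the call; the mapping itself goes away with $map.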

In Tim Bray’s Wide Finder contest to find the fastest way to process log files with “wider” rather than “faster” processors, the winning solution was a Perl implementation using mmap (although using the older Sys::Mmap). Perl had nothing special in that regard because most of the top solutions used mmap to avoid the I/O penalty.

Memory mapping is especially handy when you have to work with several large files at the same time (or even sequentially if Perl needs to find a chunk of contiguous memory). Since you don’t have the data in real memory, you can mmap as many files as you like and work with them simultaneously.
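As a sketch of working with more than one map at a time, this compares two files without slurping either of them. It assumes File::Map is installed; files_identical is a hypothetical helper name for illustration:

```perl
use strict;
use warnings;

use File::Map qw(map_file);

# files_identical is a hypothetical helper name for this sketch.
sub files_identical {
    my( $first, $second ) = @_;

    # Both files are mapped at the same time; neither one is copied
    # into a Perl scalar.
    map_file my $left,  $first;
    map_file my $right, $second;

    # The maps compare like ordinary strings.
    return $left eq $right ? 1 : 0;
}
```

Both mappings are released when the subroutine returns.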

Also, since the data actually live on the disk, different programs running at the same time can share the data, including seeing the changes each program makes (although you have to work out the normal concurrency issues yourself). That is, mmap is a way to share memory.
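A minimal sketch of changing a file through its map follows. It assumes File::Map is installed and that the file is mapped in read-write mode ('+<'). The helper name replace_in_place is hypothetical, and the sketch ignores the concurrency issues mentioned above. Because a map cannot change the size of the file, the replacement must be the same length as the text it replaces:

```perl
use strict;
use warnings;

use File::Map qw(map_file);

# replace_in_place is a hypothetical helper name for this sketch.
sub replace_in_place {
    my( $file, $old, $new ) = @_;

    # A map cannot grow or shrink the file, so insist on equal lengths.
    die "Replacement must be the same length\n"
        unless length( $old ) == length( $new );

    # '+<' maps the file read-write, so changes go back to the file.
    map_file my $map, $file, '+<';

    # s///g returns the number of substitutions, or false if none.
    my $count = ( $map =~ s/\Q$old\E/$new/g );

    return $count || 0;
}
```

For example, replace_in_place( 'access.log', 'foo', 'baz' ) would rewrite every foo in a hypothetical access.log without reading the whole file into a scalar first.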

The File::Map module can do much more too. It allows you to lock filehandles, and you can also synchronize access from threads in the same process.

If you don’t actually need the data in your program, don’t ever load it: mmap it instead.
