Use scalar references to pass large data without copying.

References aren’t just for data structures, and many people overlook the benefit of references to simple scalars. With references to arrays and hashes you can keep those data structures intact when you pass them to or return them from subroutines (Item 46: Pass references instead of copies). You don’t need to worry about that with scalar values because they are a single item in both the non-reference and reference form.

The benefit of a scalar reference comes when you realize that perl is stack-based. That is, when it wants to pass things to a subroutine, it puts things on a stack and calls the subroutine. The subroutine takes the right number of things off the stack, does its processing, and puts its return values on the stack. The caller takes the return values off the stack and the program continues. Moving those data onto and off the stack can involve quite a bit of copying.

When you call a subroutine, perl really only puts pointers to the data on the stack. These aren’t pointers in the C sense, but we use the word to distinguish them from Perl’s specialized definition of “reference”. It’s not until you assign the values to variables that perl has to copy the original data to initialize the new variables. This lazy initialization is a performance optimization perl uses to avoid copying data it doesn’t need to. When you unpack your argument list into variables, you copy the data into those variables.
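You can see that distinction without any special tools by comparing addresses. The following is a minimal sketch, not one of the examples in this Item; the subroutine names alias_is_original and copy_is_original are made up for illustration. Taking a reference to $_[0] gives you the address of the caller's variable, while a my variable is a separate scalar at its own address:

```perl
use strict;
use warnings;

my $text = 'abc' x 10;

# @_ aliases the caller's scalars, so \$_[0] points at the caller's variable.
sub alias_is_original {
    return \$_[0] == \$text ? 1 : 0;
}

# Unpacking @_ into a lexical variable creates a new, separate scalar.
sub copy_is_original {
    my( $copy ) = @_;
    return \$copy == \$text ? 1 : 0;
}

print "alias is original: ", alias_is_original( $text ), "\n";   # prints 1
print "copy is original:  ", copy_is_original( $text ), "\n";    # prints 0
```

Comparing two references with == compares the addresses they point to, which is enough to tell an alias from a copy.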

Here’s a sequence of events that shows that you make several copies of the same data when you pass a string as an argument, store it in a temporary variable, and then return the unchanged string. This example uses a short string of abc replicated 10 times, but imagine this to be a large XML document or something similarly daunting:

use Devel::Peek;
use 5.010;

my $old_version = 'abc' x 10;
say Dump( $old_version );

my $new_version = copy_string( $old_version );
say Dump( $new_version );

sub copy_string {
	my( $copy ) = @_;
	say Dump( $copy );
	$copy;
	}

The output shows that you made three different copies of the same data, one each for the original, the temporary variable in the subroutine, and the return value. It’s the address after PV = and right before the string that shows the location of the data (rather than the location of the variable). In this case, those addresses are 0x219490, 0x2194b0, and 0x219430:

SV = PV(0x801080) at 0x80f680
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x219490 "abcabcabcabcabcabcabcabcabcabc"\0
  CUR = 30
  LEN = 32

SV = PV(0x801028) at 0x81b3d0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x2194b0 "abcabcabcabcabcabcabcabcabcabc"\0
  CUR = 30
  LEN = 32

SV = PV(0x8015f8) at 0x80f670
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x219430 "abcabcabcabcabcabcabcabcabcabc"\0
  CUR = 30
  LEN = 32

There’s a way, not necessarily a best practice, that people use to get around that first copy. When you use @_ directly, you use and affect the original data because you actually use the original data instead of a copy of it. That’s why you sometimes see people use it directly in subroutines to change data in place:

use Devel::Peek;
use 5.010;

my $old_version = 'abc' x 10;
say Dump( $old_version );

my $new_version = copy_string( $old_version );
say Dump( $new_version );

sub copy_string {
	say Dump( $_[0] );
	$_[0];
	}

However, since copy_string now works directly on its argument instead of a copy, you don’t need the return value:

use Devel::Peek;
use 5.010;

my $old_version = 'abc' x 10;
say Dump( $old_version );

copy_string( $old_version );

sub copy_string {
	say Dump( $_[0] );
	$_[0];
	}

Now the output shows that $old_version and $_[0] are the same data (not just equivalent strings). Notice that the addresses of the two variables (0x801080 in this case) are the same. That’s the special aliasing magic:

SV = PV(0x801080) at 0x80f680
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x219440 "abcabcabcabcabcabcabcabcabcabc"\0
  CUR = 30
  LEN = 32

SV = PV(0x801080) at 0x80f680
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x219440 "abcabcabcabcabcabcabcabcabcabc"\0
  CUR = 30
  LEN = 32
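For a case where the subroutine really does change its argument, here is a minimal sketch. The helper trim_in_place is hypothetical and not part of the examples above. Because $_[0] is an alias, a substitution on it changes the caller's variable:

```perl
use strict;
use warnings;

# Hypothetical helper: strips leading and trailing whitespace in place
# by working on $_[0], the alias to the caller's scalar.
sub trim_in_place {
    $_[0] =~ s/\A\s+//;
    $_[0] =~ s/\s+\z//;
    return;
}

my $line = "   some text   ";
trim_in_place( $line );
print "[$line]\n";   # prints "[some text]"
```

The subroutine returns nothing on purpose; the caller's variable already holds the result.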

Using @_ directly might be fine for very light jobs, but what if the argument list is more complicated? You don’t want to track the position of each argument in @_.

Instead, you use a scalar reference. The argument list contains a real Perl reference, so you aren’t copying anything other than the connection to the data. You can store the reference value in a named variable without making a copy of the end values:

use Devel::Peek;
use 5.010;

my $old_version = 'abc' x 10;
say Dump( $old_version );

my $new_version = copy_string( \$old_version );

sub copy_string {
	my( $ref ) = @_;
	say Dump( $$ref ); # deref to see data, not ref to it
	}

The output shows essentially the same effect as using @_ directly:

SV = PV(0x801080) at 0x80f680
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x2042f0 "abcabcabcabcabcabcabcabcabcabc"\0
  CUR = 30
  LEN = 32

SV = PV(0x801080) at 0x80f680
  REFCNT = 3
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x2042f0 "abcabcabcabcabcabcabcabcabcabc"\0
  CUR = 30
  LEN = 32
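The named variable pays off once there is more than one argument. As a rough sketch (count_matches and its parameters are invented for illustration), a subroutine can take the large string by reference alongside an ordinary short argument and still give both of them readable names:

```perl
use strict;
use warnings;

# Hypothetical subroutine: counts occurrences of a substring in a large
# string that arrives as a scalar reference.
sub count_matches {
    # Copies only the reference and the short needle, not the large string.
    my( $text_ref, $needle ) = @_;
    my $count = () = $$text_ref =~ /\Q$needle\E/g;
    return $count;
}

my $document = 'abc' x 10;
my $count = count_matches( \$document, 'bc' );
print "$count\n";   # prints 10
```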

The advantage with a reference is that you won’t unintentionally make a copy when you store the value in a variable. Say, for instance, that you need to call another subroutine and pass along one of the arguments, and that second subroutine stores it in a temporary variable:

use Devel::Peek;
use 5.010;

my $old_version = 'abc' x 10;
say Dump( $old_version );

copy_string( $old_version );

sub copy_string {
	say Dump( $_[0] );
	another_sub( $_[0] );
	}

sub another_sub {
	my( $string ) = @_;
	say Dump( $string );
	}

Now you’ve made a copy of the data in another_sub:

SV = PV(0x801080) at 0x80f680
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x21d250 "abcabcabcabcabcabcabcabcabcabc"\0
  CUR = 30
  LEN = 32

SV = PV(0x801080) at 0x80f680
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x21d250 "abcabcabcabcabcabcabcabcabcabc"\0
  CUR = 30
  LEN = 32

SV = PV(0x801028) at 0x81b4e0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x204340 "abcabcabcabcabcabcabcabcabcabc"\0
  CUR = 30
  LEN = 32

If you pass along references instead, it’s always a reference to the same data, so there’s no danger of creating another copy:

use Devel::Peek;
use 5.010;

my $old_version = 'abc' x 10;
say Dump( $old_version );

copy_string( \ $old_version );

sub copy_string {
	my( $ref ) = @_;
	say Dump( $$ref );
	another_sub( $ref );
	}

sub another_sub {
	my( $ref ) = @_;
	say Dump( $$ref );
	}

The output shows only one copy, even though you used two different temporary variables:

SV = PV(0x801080) at 0x80f680
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x219350 "abcabcabcabcabcabcabcabcabcabc"\0
  CUR = 30
  LEN = 32

SV = PV(0x801080) at 0x80f680
  REFCNT = 3
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x219350 "abcabcabcabcabcabcabcabcabcabc"\0
  CUR = 30
  LEN = 32

SV = PV(0x801080) at 0x80f680
  REFCNT = 4
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x219350 "abcabcabcabcabcabcabcabcabcabc"\0
  CUR = 30
  LEN = 32
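If you would rather check this in code than read Dump output, one possible sketch uses refaddr from Scalar::Util, which returns the address of the thing a reference points to. The subroutine names outer_sub and inner_sub are placeholders for this illustration:

```perl
use strict;
use warnings;
use Scalar::Util qw(refaddr);

my $old_version = 'abc' x 10;

# Each subroutine stores the reference in its own lexical variable and
# reports the address of the scalar that the reference points to.
sub outer_sub {
    my( $ref ) = @_;
    return ( refaddr( $ref ), inner_sub( $ref ) );
}

sub inner_sub {
    my( $ref ) = @_;
    return refaddr( $ref );
}

my( $outer_addr, $inner_addr ) = outer_sub( \$old_version );
my $original_addr = refaddr( \$old_version );

print "same scalar at every level\n"
    if $outer_addr == $original_addr && $inner_addr == $original_addr;
```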

Things to remember

  • perl makes copies of data when you assign it to different variables.
  • Pass scalar references as arguments to avoid unnecessary string duplications.
  • You can modify variables in-place through references.
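
As a sketch of that last point (normalize_newlines is a hypothetical helper), dereferencing the reference on the left side of a substitution changes the caller's string without copying it:

```perl
use strict;
use warnings;

# Hypothetical helper: converts CRLF line endings to LF in place
# through a scalar reference.
sub normalize_newlines {
    my( $text_ref ) = @_;
    $$text_ref =~ s/\r\n/\n/g;
    return;
}

my $text = "one\r\ntwo\r\n";
normalize_newlines( \$text );
print length( $text ), "\n";   # prints 8
```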

4 Comments.

  1. This is definitely one of the major Perl disadvantages. Python and PHP use a lazy copying mechanism (aka copy-on-write). Hope Perl 6 will use it too.

    • I think you’re confusing the copy-on-write semantics with maintaining a single reference to the same data. Perl lets you have it both ways so you can choose what you would like to do.

      Of all the disadvantages that Perl might have, I don’t think this is a major one. :)

  2. brian, this behavior, in my opinion, affects Perl’s performance and memory usage because Perl copies all non-reference subroutine arguments to another address in memory. Of course the programmer may avoid this if necessary by passing the argument by reference as you mentioned above, but that is rarely done by Perl programmers (otherwise your code will be full of “\” characters, and besides, Perl programmers are not C programmers, who are used to pointers because of the nature of C).
