Avoid perl housekeeping for hot loop optimization

David Golden gave a talk at The Perl Conference 2017 where he showed Real World Optimization for the MongoDB Perl driver. He spoke about many big performance gains and you can watch the talk for that, but at the end he talked about various micro-optimizations.

Small gains in “hot loops” (code that executes many, many times) can add up to significant savings. David was able to cut off 20% of the runtime with some of these micro-optimizations. All of these are his ideas but they are the very thing the Effective Perl programmer is curious about.

Continue reading “Avoid perl housekeeping for hot loop optimization”

Profile with Devel::NYTProf

Profile before you decide where to optimize—you might be surprised where you’re losing all of your performance.

We won’t go into all the details of profiling in this Item, but you can read about those in Mastering Perl. In short, profilers count something then report the results. They can track any of the things that you might care about. The Devel::NYTProf module, like most profilers, tracks time, counting the statements you run and how long they take. Continue reading “Profile with Devel::NYTProf”

Don’t make Perl do more work than it needs to

My choice of algorithms and data organization can lead to orders of magnitude of performance differences between functionally equivalent implementations of my programs. Choosing the right way to do something can save me orders of magnitude in processing.

I wrote some code to loop over lines in a file and modify a couple of elements in each line. My code seems very innocent, but the actual field modifications were a little more complex; I replaced them with lc and uc to make things simple for this example:

while (<>) {
  chomp;
  my @x = split "\t";
  $x[3] = lc $x[3];
  $x[5] = uc $x[5];
  print join("\t", @x), "\n";
}

I have a loop that splits each line of a file on the tab character then modifies the fourth and sixth field.
Look closely at the loop: I’m making Perl do a lot more work than it needs to.

First, I cal chomp on each line, but I’m adding the newline right back. There is no reason to do that. If I leave it there I get the same result:

while (<>) {
  my @x = split "\t";
  $x[3] = lc $x[3];
  $x[5] = uc $x[5];
  print join("\t", @x);
}

chomping when I don’t need to is a minor issue, but it is not likely going to destroy the performance of any program.

Looking a little closer, I see a much bigger inefficiency if I know the format of the data. Assume each line of data contains many fields; many being some number greater than six in this specific example. Since I’m are only acting on fields four and six, why do I split the entire line just to put it back together?

With a couple of more arguments, I can tell split to limit the number of fields it creates. I can limit my results to seven items since I don’t care to modify any beyond that:

while (<>) {
  my @x = split "\t", $_, 7;
  $x[3] = lc $x[3];
  $x[5] = uc $x[5];
  print join("\t", @x);
}

If each line only contains seven, or even ten, elements, I won’t see much if any improvement. However, if each line contains dozens or hundreds of fields, my potential speed-up is huge.

There is even more I can do to milk performance out of my loop if I control the data format. If I move the columns that I need to modify to the front of each row, I don’t need to split into so many fields:

while (<>) {
  my @x = split "\t", $_, 3;
  $x[0] = lc $x[0];
  $x[1] = uc $x[1];
  print join("\t", @x);
}

Measure the improvement

Just to be sure that these changes are really making my code faster, I do a little benchmarking to get a feel for the relative performance differences:

use warnings;
use strict;
use Benchmark qw(timethese);

my @data;

for (0..10_000) {
  $data[$_] = join "\t", map { chr(65 + (int rand(52))) } (0..100);
}

timethese(500, {
  'standard' => sub {
    for (@data) {
      chomp;
      my @x = split "\t";
      $x[3] = lc $x[3];
      $x[5] = uc $x[5];
      $_ = join("\t", @x) . "\n";
    }
  },
  'no_chomp' => sub {
    for (@data) {
      my @x = split "\t";
      $x[3] = lc $x[3];
      $x[5] = uc $x[5];
      $_ = join("\t", @x);
    }
  },
  'smaller' => sub {
    for (@data) {
      my @x = split "\t", $_, 7;
      $x[3] = lc $x[3];
      $x[5] = uc $x[5];
      $_ = join("\t", @x);
    }
  },
  'smallest' => sub {
    for (@data) {
      my @x = split "\t", $_, 3;
      $x[0] = lc $x[0];
      $x[1] = uc $x[1];
      $_ = join("\t", @x);
    }
  },
});

In this benchmark I experimented on ten thousand records, each with a hundred fields. The benchmarks measure:

  • the initial, or “standard” case.
  • the case where I just removed a chomp, “no_chomp”.
  • the case where I limit split, “smaller”.
  • the case where I reorder the inbound data, “smallest”.

The [reordered] results tell me quite a bit:

Benchmark: timing 500 iterations of no_chomp, smaller, smallest, standard…
standard: 451 wallclock secs (449.66 usr + 0.34 sys = 450.00 CPU) @ 1.11/s (n=500)
no_chomp: 451 wallclock secs (446.18 usr + 0.41 sys = 446.59 CPU) @ 1.12/s (n=500)
smaller: 39 wallclock secs (39.15 usr + 0.03 sys = 39.18 CPU) @ 12.76/s (n=500)
smallest: 19 wallclock secs (18.98 usr + 0.01 sys = 18.99 CPU) @ 26.33/s (n=500)

Removing chomp had an almost unnoticeable effect. However, I recuded my processing tenfold by limiting split. I made my code even faster by reordering the inbound data.

Just to see what effect number of fields really has, I reduced the size of the data so that each record had ten fields. The results were obviously less impressive, though still visible:

Benchmark: timing 500 iterations of no_chomp, smaller, smallest, standard…
standard: 58 wallclock secs (57.50 usr + 0.05 sys = 57.55 CPU) @ 8.69/s (n=500)
no_chomp: 55 wallclock secs (55.13 usr + 0.04 sys = 55.17 CPU) @ 9.06/s (n=500)
smaller: 38 wallclock secs (37.67 usr + 0.03 sys = 37.70 CPU) @ 13.26/s (n=500)
smallest: 18 wallclock secs (18.46 usr + 0.01 sys = 18.47 CPU) @ 27.07/s (n=500)

So what does this tell me? Well, as far as a specific optimization goes, limiting split is probably a good idea if I’m trying to optimize my processing and really don’t need every field.

However, there is a larger moral to this story and that is: When performance is a concern, don’t make your code do work that it doesn’t have too. The most important is my choice of algorithms and related data structures for solving my problem. This can make or break my performance. After that, knowing specific optimizations such as limiting split fields, help me get just a little more performance out of my code.

Detect regular expression match variables in your code

[UPDATE: this is not a problem in v5.18 and later.]

In Item 33: “Watch out for match variables”, you found out that the match variable $`, $&, and $` come with a performance hit. With all of the module code that you might use, you might be using those variables even though you didn’t code with them yourself. Continue reading “Detect regular expression match variables in your code”

Compare dates as strings when you can.

Just because you find a module that does something doesn’t mean that you have to use it. There are many excellent date and time modules on CPAN, including most people’s favorite, DateTime. In your heady rush for program purity and elegance, don’t think that you always have to use objects to do your work. Sometimes the overhead of objects, which have to call (perhaps many) subroutines to do their work, is too expensive. Continue reading “Compare dates as strings when you can.”