Compare dates as strings when you can.

Just because you find a module that does something doesn’t mean that you have to use it. There are many excellent date and time modules on CPAN, including most people’s favorite, DateTime. In your heady rush for program purity and elegance, don’t think that you always have to use objects to do your work. Sometimes the overhead of objects, which have to call (perhaps many) subroutines to do their work, is too expensive.

You’ve already seen this in the various techniques that you can use to sort a list in Item 22: Learn the myriad ways of sorting.

If you don’t need to use the dates in your input for anything other than ordering (for example, you don’t need to compute durations or align the same week of different years), you might consider ditching the date modules altogether. That sounds like odd advice when most of Effective Perl Programming is about using the right module, but as you become an experienced and Effective Perler, it’s up to you to judiciously apply any advice you get, including whether or not you should use modules.

Consider a format where you have some input data that contains a date on each line:

11/19/2010 some line of data
6/4/2009 some other line of data
5/1/2009 more data

Create a script to make some sample data:

# datemaker
my @years   = 1768 .. 2010;
my @months  = 1 .. 12;
my @dates   = 1 .. 28;
my @strings = qw( cat dog bird );

foreach ( 1 .. $ARGV[0] )
	{
	printf "%d/%d/%d\t%s\n",
		map { $_->[ int rand @$_ ] }
		\( @months, @dates, @years, @strings );
	}

Create some files of various sizes:

% perl datemaker 10 > dates10.txt
% perl datemaker 1000 > dates1000.txt
% perl datemaker 10000 > dates10000.txt
% perl datemaker 100000 > dates100000.txt
% perl datemaker 1000000 > dates1000000.txt

If all you want to do is put those lines in order, you can split each line, split each date field, and use the reconstituted YYYYMMDD date in a Schwartzian Transform:

# sort_dates_sprintf
print
map  {  $_->[1] }
sort {  $a->[0] <=> $b->[0]  }
map  { 
	my( $date ) = split;
	my( $m, $d, $y ) = split m|/|, $date;
	[ 
		sprintf( "%4d%02d%02d", $y, $m, $d ),
		$_
	]
	}
<>;
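Because the key is a fixed-width YYYYMMDD string, it orders the same way whether you compare it numerically with <=> or as a string with cmp. The following is a minimal sketch of that property, not part of the original programs; ymd_key and the sample dates are made up for illustration, and it zero-pads the year with %04d so the key stays fixed-width for any year.

```perl
use strict;
use warnings;

# Hypothetical helper: turn M/D/YYYY into a fixed-width YYYYMMDD key
sub ymd_key {
    my( $m, $d, $y ) = split m|/|, $_[0];
    return sprintf "%04d%02d%02d", $y, $m, $d;
}

# Made-up sample dates in the same format as the input files
my @dates = qw( 11/19/2010 6/4/2009 5/1/2009 12/25/1768 );

# Numeric and string comparison of the keys give the same order
my @by_number = sort { ymd_key($a) <=> ymd_key($b) } @dates;
my @by_string = sort { ymd_key($a) cmp ymd_key($b) } @dates;

# Sorting the raw M/D/YYYY strings does not give chronological order
my @raw = sort @dates;

print "numeric: @by_number\n";
print "string:  @by_string\n";
print "raw:     @raw\n";
```

This version recomputes the key on every comparison, which is fine for a four-element list but is exactly the cost that the Schwartzian Transform avoids on large inputs.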

If you want to do that with DateTime, the program looks almost the same. At the program level the complexity appears unchanged, since the structure barely changes:

# sort_dates_datetime
use DateTime;

print
map  {  $_->[1] }
sort {  $a->[0] <=> $b->[0]  } # DateTime objects!
map  { 
	my( $date ) = split;
	my( $m, $d, $y ) = split m|/|, $date;
	[ 
		DateTime->new( year => $y, month => $m, day => $d ),
		$_
	]
	}
<>;

However, when you compare run times, you find that the DateTime version is slower by at least an order of magnitude. The plot of those run times uses a logarithmic scale on the Y axis, so lines that look parallel on the plot are actually diverging very quickly.

That’s not to say that you shouldn’t use DateTime. What if you also need to validate the dates? You don’t validate the dates in the sort_dates_sprintf version, so you’re missing part of the service that DateTime provides. Can you keep that portion of DateTime without the speed penalty? Sure.

After you construct the DateTime object, you can immediately turn the date back into a string. That way, you don’t force the sort portion of the Schwartzian Transform to use the overloaded <=> repeatedly on the DateTime objects in $a and $b:

# sort_dates_hybrid
use DateTime;

print
map  {  $_->[1] }
sort {  $a->[0] <=> $b->[0]  } # no DateTime objects here!
map  { 
	my( $date ) = split;
	my( $m, $d, $y ) = split m|/|, $date;
	[ 
		DateTime->new( year => $y, month => $m, day => $d )->ymd(''),
		$_
	]
	}
<>;
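As written, the hybrid program stops at the first invalid date, because DateTime->new throws an exception when a value is out of range. If you would rather skip bad lines, one possible approach is to wrap the constructor in eval. This is only a sketch under that assumption; validated_key and the sample lines are hypothetical names and data, not part of the original program.

```perl
use strict;
use warnings;
use DateTime;

# Hypothetical helper: return a YYYYMMDD sort key, or undef for a bad date
sub validated_key {
    my( $line ) = @_;
    my( $date ) = split ' ', $line;
    my( $m, $d, $y ) = split m|/|, $date // '';

    # DateTime->new dies on missing or out-of-range values, so the
    # scalar assignment leaves $key undefined for an invalid date
    my $key = eval {
        DateTime->new( year => $y, month => $m, day => $d )->ymd('');
    };
    return $key;
}

# Made-up input; the second line has an impossible date
my @lines = ( "6/4/2009\tdog\n", "2/30/2009\tcat\n", "5/1/2009\tbird\n" );

print
    map  { $_->[1] }
    sort { $a->[0] <=> $b->[0] }
    grep { defined $_->[0] }     # drop lines whose date did not validate
    map  { [ validated_key( $_ ), $_ ] }
    @lines;
```

The sort still compares plain strings, so the extra eval costs you once per line rather than once per comparison.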

Although you still pay a speed penalty for using DateTime, it’s not as bad as it was. Remember that the plot uses a logarithmic scale, so the line for the hybrid solution is much better than the line for the full DateTime solution.

The particular numbers here don’t matter as much as their relative values, and these numbers were generated on a standard MacBook Air (1,1) with a standard perl-5.10.1. As always, you should benchmark everything on your own systems.
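If you want to repeat the comparison yourself, the core Benchmark module is one way to do it. The sketch below is an assumption about how you might set that up, not the method used to produce the numbers above: it builds a hypothetical in-memory sample instead of reading the data files, uses a made-up sort_by_key helper, and only adds the DateTime cases when that module is installed.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical in-memory sample in the same M/D/YYYY<TAB>word format
my @lines = map {
    sprintf "%d/%d/%d\tcat\n",
        1 + int rand 12, 1 + int rand 28, 1768 + int rand 243;
} 1 .. 1_000;

# Schwartzian Transform with a pluggable key maker (hypothetical helper)
sub sort_by_key {
    my( $make_key, @input ) = @_;
    my @sorted =
        map  { $_->[1] }
        sort { $a->[0] <=> $b->[0] }
        map  {
            my( $date ) = split;
            my( $m, $d, $y ) = split m|/|, $date;
            [ $make_key->( $y, $m, $d ), $_ ];
        }
        @input;
    return @sorted;
}

my %tests = (
    sprintf => sub {
        my @s = sort_by_key( sub { sprintf "%4d%02d%02d", @_ }, @lines );
    },
);

# DateTime is not a core module, so add its cases only if it is installed
if( eval { require DateTime; 1 } ) {
    my $object = sub {
        DateTime->new( year => $_[0], month => $_[1], day => $_[2] );
    };
    $tests{datetime} = sub { my @s = sort_by_key( $object, @lines ) };
    $tests{hybrid}   = sub {
        my @s = sort_by_key( sub { $object->( @_ )->ymd('') }, @lines );
    };
}

# Run each case for at least one CPU second and print a comparison table
cmpthese( -1, \%tests );
```

The absolute rates depend on your machine and perl, so treat the table as a relative comparison only.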

Things to remember

  • Modules can have significant overhead that you don’t need.
  • Even a tiny bit of overhead can have dramatic performance implications over millions of iterations.
  • Converting dates to fixed-width YYYYMMDD strings and comparing those strings, numerically or lexically, has a significant performance advantage over comparing date objects.