Find dates with Regexp::Common

[This is a mid-week bonus item]

Suppose you want to find some dates inside a big string. The problem with dates is that there are so many ways to write them, and even if you can come up with a pattern to get the structure right, can you handle the different locales and languages that use different words to refer to the same day or month?

In Item 42. Don’t reinvent the regex, you saw the Regexp::Common module. It creates the regular expressions that many people often get wrong because they miss some subtle part of the pattern.

Regexp::Common::time’s date handling is quite amazing, though. It’s a plugin, so you need to install it separately. Instead of specifying a regular expression, you use the -pat option to specify the structure of the date, using a string much like the one for strftime, although with some regular expression bits added. From that semi-pattern, it constructs a much more complicated pattern that does the right thing.

In this example, you build a pattern for the two date styles that ls -l prints. Since the module gives you a regex object, you can print it to see the pattern:

use strict;
use warnings;

use Regexp::Common qw(time);

# The two date styles in ls -l output:
# May  3  2010
# Jan 17 18:21
my $date_re = $RE{time}{strftime}{
	-pat => '%b\s+%_d\s+(?:%Y|%_H:%M)'
	};

print "Pattern is------\n$date_re\n-------\n";

This pattern reflects the national representation for the en_US locale:

Pattern is------
(?=[SAFOJNMD])(?>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+(?:0[1-9]|[12]\d|3[01]|(?<!\d)[1-9])\s+(?:\d{4}|(?:(?=\d)(?:[01]\d|2[0123]|(?<!\d)\d)):(?:[0-5]\d))
-------

You can change your locale, in this case, to tr_TR for Turkish, to get a different pattern that has the same structure, although I don’t know if the Turks write their dates like this:

Pattern is------
(?=[AOTNKEHMŞ])(?>Oca|Şub|Mar|Nis|May|Haz|Tem|Ağu|Eyl|Eki|Kas|Ara)\s+(?:0[1-9]|[12]\d|3[01]|(?<!\d)[1-9])\s+(?:\d{4}|(?:(?=\d)(?:[01]\d|2[0123]|(?<!\d)\d)):(?:[0-5]\d))
-------
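
To get a feel for where a locale-specific alternation like that can come from, here is a minimal sketch that asks POSIX::strftime for the abbreviated month names and joins them into one. This is only an illustration, not necessarily how Regexp::Common::time does it internally, and month_alternation is a hypothetical helper name. The sketch pins the locale to C so the result is predictable; on a system with the Turkish locale installed, you could try a name such as tr_TR.UTF-8 instead.

```perl
use strict;
use warnings;
use POSIX qw(strftime setlocale LC_TIME);

# Pin the locale so the month names are predictable. Swapping in
# another installed locale name changes the names you get back.
setlocale( LC_TIME, 'C' );

# month_alternation() is a hypothetical helper, not part of Regexp::Common
sub month_alternation {
	# strftime( format, sec, min, hour, mday, mon, year )
	# mon counts from 0, and year is the number of years since 1900
	my @names = map { strftime( '%b', 0, 0, 0, 1, $_, 110 ) } 0 .. 11;

	# quotemeta protects any characters that are special in a pattern
	return join '|', map { quotemeta } @names;
	}

print month_alternation(), "\n";
```

The first part of each pattern shown above has the same shape: a list of month abbreviations separated by |.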

You can now use this pattern to match dates in text. Here’s a program that takes in a line and puts ^ characters under the parts it thinks are dates:

use strict;
use warnings;

use Regexp::Common qw(time);

# May  3  2010
# Jan 17 18:21
my $date_re = $RE{time}{strftime}{
	-pat => '%b\s+%_d\s+(?:%Y|%_H:%M)'
	};

while( defined( my $line = <> ) ) {
	next unless $line =~ /$date_re/;

	# @- and @+ hold the start and end offsets of the last match
	my $start = $-[0];
	my $stop  = $+[0];

	my $underline = ( ' ' x $start ) . ( '^' x ($stop - $start) );

	print $line;
	print $underline, "\n\n";
	}
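
The program above marks only the first date in each line. If a line might hold more than one, a variation with the /g flag can mark them all. This is a sketch under some assumptions: underline_all is a hypothetical helper, and the hand-written $date_re is a simplified English-only stand-in so the example runs without the plugin. In a real program you would pass the object from $RE{time}{strftime} instead.

```perl
use strict;
use warnings;

# Stand-in pattern (an assumption): English month abbreviation, day,
# then either a four-digit year or HH:MM
my $date_re = qr/
	(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
	\s+ \d{1,2} \s+
	(?: \d{4} | \d{1,2}:\d{2} )
	/x;

# underline_all() is a hypothetical helper name
sub underline_all {
	my( $line, $re ) = @_;

	# start with all spaces, then overwrite each matched span with ^
	my $underline = ' ' x length $line;
	while( $line =~ /$re/g ) {
		my $length = $+[0] - $-[0];
		substr( $underline, $-[0], $length, '^' x $length );
		}

	$underline =~ s/\s+\z//;   # no trailing whitespace
	return $underline;
	}

my $line = 'created May  3  2010, touched Jan 17 18:21';
print $line, "\n", underline_all( $line, $date_re ), "\n";
```

Each pass through the while loop updates @- and @+ for the latest match, so the same offsets trick works for every date in the line.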

You can test this by piping some output into this program. Here’s an extract of output from the Unix ls command. Notice that the first date has a time instead of a year, but you still find it:

$ ls -l /usr/local/perls/perl-5.10.1/lib/site_perl/5.10.1 | perl date_finder.pl
drwxr-xr-x   4 brian  wheel    136 Dec  9 01:58 Acme
                                   ^^^^^^^^^^^^

-r--r--r--   1 brian  wheel  32517 Jul  6  2007 AppConfig.pm
                                   ^^^^^^^^^^^^


-r--r--r--   1 brian  wheel  54725 Jul 19  2007 Expect.pm
                                   ^^^^^^^^^^^^

-r--r--r--   1 brian  wheel  43735 Jul 19  2007 Expect.pod
                                   ^^^^^^^^^^^^

drwxr-xr-x   3 brian  wheel    102 May 16  2010 ExtUtils
                                   ^^^^^^^^^^^^

drwxr-xr-x   3 brian  wheel    102 Jun 17  2010 local
                                   ^^^^^^^^^^^^

-r--r--r--   1 brian  wheel   9137 Jun 15  2009 lwpcook.pod
                                   ^^^^^^^^^^^^

-r--r--r--   1 brian  wheel  25447 Jun 15  2009 lwptut.pod
                                   ^^^^^^^^^^^^

drwxr-xr-x   4 brian  wheel    136 May 28  2010 namespace
                                   ^^^^^^^^^^^^

-r--r--r--   1 brian  wheel   1931 Sep 22  2009 oose.pm
                                   ^^^^^^^^^^^^

Notice that this would be hard to do with split if you run into filenames that have spaces. You can’t depend on fixed column widths because the file sizes can move things around. It turns out to be pretty annoying.
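
Here is a minimal sketch of that difference. Once you know where the date ends, everything after it is the filename, spaces and all. The listing line is made up for illustration, name_after_date is a hypothetical helper, and the hand-written $date_re is again a simplified stand-in for the object you would get from $RE{time}{strftime}.

```perl
use strict;
use warnings;

# Stand-in pattern (an assumption), English month names only
my $date_re = qr/
	(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
	\s+ \d{1,2} \s+
	(?: \d{4} | \d{1,2}:\d{2} )
	/x;

# name_after_date() is a hypothetical helper name
sub name_after_date {
	my( $line, $re ) = @_;
	chomp $line;
	return unless $line =~ /$re/;

	# $+[0] is the offset just past the end of the matched date
	my $name = substr $line, $+[0];
	$name =~ s/\A\s+//;
	return $name;
	}

# A made-up listing line with a space in the filename
my $line = '-rw-r--r--   1 brian  wheel  1024 Jun 15  2009 My Notes.txt';

my @fields = split ' ', $line;
print "split: $fields[-1]\n";                              # just the last word
print "regex: ", name_after_date( $line, $date_re ), "\n"; # the whole name
```

The split version keeps only the last whitespace-separated word, while the date match gives you a reliable anchor no matter how wide the earlier columns are.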
