Use Regexp::Common to find locale-specific dates

[This is a mid-week bonus item, and it’s a bit of a departure from much of what you have already seen on this blog. This is just some code that I had to write this week and I thought you’d like to see it.]

I had to find some dates inside a big string, and the problem with dates is that there are so many ways to write them. Even if I get the format right, some of the machines might use another locale. My string comes from an ls I run as a remote command, which might show the date in one of two formats. For files changed in the last six months, ls replaces the year with the time:

$ ls -l
total 7400
-rw-r--r--@  1 brian  staff      433 Jun 22  2010 Makefile
-rw-r--r--@  1 brian  staff   107721 Jan 19 09:08 appa.xml
-rw-rw-r--@  1 brian  staff    76873 Jan 19 00:18 appb.xml
-rw-rw-r--   1 brian  staff     1802 Jan 14 21:17 book.xml
-rw-rw-r--   1 brian  staff  2457812 Jul 21  2010 book.xml.pdf
-rw-rw-r--   1 brian  staff     4360 Jul 21  2010 bookinfo.xml
-rw-r--r--@  1 brian  staff    25626 Jan 19 09:07 ch00.xml

Here’s the program I wrote to figure out which part of each line is the date, using Regexp::Common (Item 42. Don’t reinvent the regex):

use Regexp::Common qw(time);

my @lines = `ls -l`;

# May  3  2010
# Jan 17 18:21
my $date_re = $RE{time}{strftime}{
	-pat => '%b\s+%_d\s+(?:%Y|%_H:%M)'
	};

foreach my $line ( @lines ) {
	next unless $line =~ /$date_re/;
	my $start = $-[0];
	my $stop  = $+[0];
	
	my $underline = ( ' ' x $start ) . ( '^' x ($stop - $start) );
	
	print $line;
	print $underline, "\n\n";	
	}

That regex is more sophisticated than it looks. I didn’t have to do anything to deal with month names and abbreviations because the module figures them out for me based on the locale of the machine on which I run the command. The regex changes depending on the language that I decide to use:

$ LC_ALL=tr_TR perl -MRegexp::Common=time -le 'print $RE{time}{strftime}{-pat=>"%b"}'
(?:(?=[AOTNKEHMŞ])(?>Oca|Şub|Mar|Nis|May|Haz|Tem|Ağu|Eyl|Eki|Kas|Ara))

$ LC_ALL=en_US.UTF-8 perl -MRegexp::Common=time -le 'print $RE{time}{strftime}{-pat=>"%b"}'
(?:(?=[SAFOJNMD])(?>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))

$ LC_ALL=es_ES.UTF-8 perl -MRegexp::Common=time -le 'print $RE{time}{strftime}{-pat=>"%b"}'
(?:(?=[enadmsjfo])(?>ene|feb|mar|abr|may|jun|jul|ago|sep|oct|nov|dic))

To show that the regex found the right part of each line, my code underlines what it matched. That’s what all the fooling around with the @- and @+ special variables is for. They hold the string offsets for the start and end of the match and of each capture group. The values at index 0 apply to the whole match ($&), the values at index 1 apply to $1, and so on:

-rw-r--r--@  1 brian  staff      433 Jun 22  2010 Makefile
                                     ^^^^^^^^^^^^

-rw-r--r--@  1 brian  staff   107721 Jan 19 09:08 appa.xml
                                     ^^^^^^^^^^^^

-rw-rw-r--@  1 brian  staff    76873 Jan 19 00:18 appb.xml
                                     ^^^^^^^^^^^^

-rw-rw-r--   1 brian  staff     1802 Jan 14 21:17 book.xml
                                     ^^^^^^^^^^^^

-rw-rw-r--   1 brian  staff  2457812 Jul 21  2010 book.xml.pdf
                                     ^^^^^^^^^^^^

-rw-rw-r--   1 brian  staff     4360 Jul 21  2010 bookinfo.xml
                                     ^^^^^^^^^^^^

-rw-r--r--@  1 brian  staff    25626 Jan 19 09:07 ch00.xml
                                     ^^^^^^^^^^^^
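
To make the @- and @+ indexing concrete, here’s a minimal sketch that uses only core Perl. The match_offsets subroutine, the shortened sample line, and the hand-written English-only pattern are my own inventions for illustration. They are not part of the program above and they are not what Regexp::Common generates:

```perl
use strict;
use warnings;

# Return [start, end] offset pairs for the whole match and each capture group.
# Index 0 describes the whole match ($&), index 1 describes $1, and so on.
# (match_offsets is a hypothetical helper, named only for this example.)
sub match_offsets {
    my( $string, $regex ) = @_;
    return unless $string =~ $regex;
    return map { [ $-[$_], $+[$_] ] } 0 .. $#+;
}

# A shortened, made-up line in the style of the ls output
my $line = 'staff 433 Jun 22  2010 Makefile';

# Hand-written pattern with three capture groups, English month names only
my @offsets = match_offsets( $line, qr/([A-Z][a-z]{2})\s+(\d+)\s+(\d{4})/ );

foreach my $i ( 0 .. $#offsets ) {
    my( $start, $end ) = @{ $offsets[$i] };
    printf "%d: %2d to %2d: %s\n", $i, $start, $end,
        substr( $line, $start, $end - $start );
}
```

The end offset in @+ is one past the last matched character, so the length of each piece is the end minus the start. That’s the same subtraction the underlining code uses.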

This code also has to work on systems with ancient versions of ls. Some switches could have made this much easier, especially one that turns the date column into the epoch time so it’s not itself a combination of whitespace-separated fields.

The -T switch on Mac OS X and FreeBSD displays all dates in the same format, even for the recently changed ones.
Linux versions might have the --time-style switch.
FreeBSD has the -D switch to specify the date format.

I’d much rather use perl, but the equivalent is much uglier even though I can choose my field separator. Perl is ultra-portable and available in most places, but I have to do more work on a one-liner:

$ perl -le 'for(glob(q|*|)){print join qq|\t|, stat(), $_}'

However, this causes headaches later when I need to run this as a remote command and I still have to process the results to turn the data into human-readable output. The ls -l is much nicer without requiring more work than I’d do normally.
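
For comparison, here’s roughly what the longer Perl version might look like once it also produces something a person can read. This is only a sketch under my own assumptions: the file_record name, the choice of fields (size and modification time), and the timestamp format are all made up for the example, and it uses only core modules:

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Build one tab-separated record per file: size, mtime as epoch seconds,
# mtime as a local timestamp, and the name.
# (file_record is a hypothetical helper, named only for this example.)
sub file_record {
    my( $file ) = @_;
    my( $size, $mtime ) = ( stat $file )[7, 9];
    return unless defined $mtime;   # stat failed, e.g. the file vanished
    return join "\t", $size, $mtime,
        strftime( '%Y-%m-%d %H:%M:%S', localtime $mtime ), $file;
}

foreach my $file ( glob '*' ) {
    my $record = file_record( $file );
    print "$record\n" if defined $record;
}
```

The epoch column sorts numerically without any date parsing, which is the part ls -l can’t give me on old systems, but it’s a lot more to ship to a remote machine than two letters and a switch.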

And, as a bonus to this bonus, I discovered that Date::Parse is smart enough to deal with a date like Dec 31 12:34. It works out that the date means last December, not the December still to come in the current year. I can feed both formats into that module and still have the dates sort correctly.
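
To show the idea behind that guess, here’s a core-Perl sketch of one way to fill in a missing year. This is not Date::Parse’s code, and its rule may differ from mine. The ls_date_to_epoch name is hypothetical, the month table is English-only, and the rule I assume is “use the current year unless that puts the date in the future, in which case use the previous year”:

```perl
use strict;
use warnings;
use Time::Local qw(timelocal);

# English abbreviations only; a real version would need the locale's names
my %month_index;
@month_index{ qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec) } = 0 .. 11;

# Turn "Jun 22  2010" or "Jan 19 09:08" into epoch seconds. When the year is
# missing, assume the most recent such date that is not after $now.
# Note that timelocal dies if the resulting date is impossible.
sub ls_date_to_epoch {
    my( $date, $now ) = @_;
    $now = time unless defined $now;

    my( $mon, $day, $rest ) = $date =~ /\A(\w{3})\s+(\d+)\s+(\S+)\z/
        or return;
    return unless exists $month_index{$mon};

    # Older files: the third field is the year, and ls gives no time of day
    return timelocal( 0, 0, 0, $day, $month_index{$mon}, $rest )
        if $rest =~ /\A\d{4}\z/;

    # Recent files: the third field is HH:MM, so the year has to be guessed
    my( $hour, $min ) = $rest =~ /\A(\d+):(\d+)\z/ or return;
    my $year  = ( localtime $now )[5] + 1900;
    my $epoch = timelocal( 0, $min, $hour, $day, $month_index{$mon}, $year );
    $epoch = timelocal( 0, $min, $hour, $day, $month_index{$mon}, $year - 1 )
        if $epoch > $now;
    return $epoch;
}

my @sorted = sort { ls_date_to_epoch($a) <=> ls_date_to_epoch($b) }
    ( 'Jan 19 09:08', 'Jun 22  2010', 'Jul 21  2010' );
print "$_\n" for @sorted;
```

The optional second argument lets me pin “now” to a fixed time, which makes the guess repeatable instead of depending on the day I run it.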