<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The Effective Perler &#187; regular expressions</title>
	<atom:link href="http://www.effectiveperlprogramming.com/blog/category/book/chapters/regular-expressions/feed" rel="self" type="application/rss+xml" />
	<link>http://www.effectiveperlprogramming.com</link>
	<description>Effective Perl Programming - write better, more idiomatic Perl</description>
	<lastBuildDate>Sat, 28 Jan 2012 02:19:01 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Define grammars in regular expressions</title>
		<link>http://www.effectiveperlprogramming.com/blog/1479</link>
		<comments>http://www.effectiveperlprogramming.com/blog/1479#comments</comments>
		<pubDate>Sun, 18 Dec 2011 22:30:20 +0000</pubDate>
		<dc:creator>brian d foy</dc:creator>
				<category><![CDATA[5.10]]></category>
		<category><![CDATA[item]]></category>
		<category><![CDATA[regular expressions]]></category>

		<guid isPermaLink="false">http://www.effectiveperlprogramming.com/?p=1479</guid>
		<description><![CDATA[[ This is the 100th Item we've shared with you in the two years this blog has been around. We deserve a holiday and we're taking it, so read us next year! Happy Holidays.] Perl 5.10 added rudimentary grammar support in its regular expressions. You could define many subpatterns directly in your pattern, use them [...]]]></description>
			<content:encoded><![CDATA[<p><i>[ This is the 100th Item we've shared with you in the two years this blog has been around. We deserve a holiday and we're taking it, so read us next year! Happy Holidays.]</i></p>
<p>Perl 5.10 added rudimentary grammar support in its regular expressions. You could define many subpatterns directly in your pattern, use them to define larger subpatterns, and, finally, when you have everything in place, let Perl do the work.</p>
<p>There are other ways, some more powerful, that let you do the same thing. This Item is not about those, but you can read about <a href="https://www.metacpan.org/module/Regex::Grammar">Regexp::Grammars</a> and <a href="https://www.metacpan.org/module/Parse::RecDescent">Parse::RecDescent</a> on your own. Also, you&#8217;re not going to get much of a recommendation of which one you should use for your task. We don&#8217;t know your situation.</p>
<p>To understand this new syntax, you have to study it from the ground up. It&#8217;s not simple, and the terse documentation in <a href="">perlre</a> doesn&#8217;t do much to help. </p>
<h2>Referencing a subpattern</h2>
<p>The first part you need is the ability to call a named part of the pattern (to label a subpattern, see <span class="item">Item 31. Use named captures to label matches</span>). To re-match a labeled subpattern, you use: </p>
<pre class="brush:plain">
(?&#038;NAME)
</pre>
<p>You can use that syntax to rerun a subpattern later:</p>
<pre class="brush:perl">
use v5.10;

my $pattern = qr/
	(?&lt;cat>Buster|Mimi)
	\s+
	(?&#038;cat)
	/x;

foreach ( 'Buster Mimi', 'Mimi Buster', 'Buster', 'Buster Buster' ) {
	say "$_ ", m/$pattern/p ? "matched" : 'nope!';
	}
</pre>
<p>The labeled subpattern has an alternation where either cat name can match. When you reference it again, you re-run the alternation and you can match either cat name again:</p>
<pre class="brush:plain">
Buster Mimi matched
Mimi Buster matched
Buster nope!
Buster Buster matched
</pre>
<p>This is not the same thing as matching the same text that a labeled capture group already matched. For that, you use a named backreference, <code>\k&lt;NAME></code>:</p>
<pre class="brush:perl">
\k&lt;NAME>
</pre>
<p>This pattern is a different beast. Whichever cat name matches first also has to match second:</p>
<pre class="brush:perl">
use v5.10;

my $pattern = qr/
	(?&lt;cat>Buster|Mimi)
	\s+
	\k&lt;cat>
	/x;

foreach ( 'Buster Mimi', 'Mimi Buster', 'Buster', 'Buster Buster' ) {
	say "$_ ", m/$pattern/p ? "matched" : 'nope!';
	}
</pre>
<p>Now only one of the strings matches because only one string repeats a cat&#8217;s name:</p>
<pre class="brush:plain">
Buster Mimi nope!
Mimi Buster nope!
Buster nope!
Buster Buster matched
</pre>
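<p>To see the difference side by side, here is a minimal sketch that runs both forms over the same strings. The variable names <code>$rerun</code> and <code>$same_text</code>, and the anchors around the patterns, are our own choices for illustration:</p>
<pre class="brush:perl">
use v5.10;

# (?&#038;cat) re-runs the subpattern, so either name can match the second time
my $rerun     = qr/\A (?&lt;cat>Buster|Mimi) \s+ (?&#038;cat) \z/x;

# \k&lt;cat> must match the same text that the cat group captured
my $same_text = qr/\A (?&lt;cat>Buster|Mimi) \s+ \k&lt;cat> \z/x;

foreach my $string ( 'Buster Mimi', 'Buster Buster' ) {
	say "$string: ",
		( $string =~ $rerun     ? 'rerun matched'     : 'rerun failed' ), ', ',
		( $string =~ $same_text ? 'same text matched' : 'same text failed' );
	}
</pre>
<p>Only the backreference version should reject <code>Buster Mimi</code>, since the second name differs from the captured text.</p>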
<p>Although you won&#8217;t see it here, the <code>(?&#038;NAME)</code> syntax is the trick to matching a recursive pattern since the reference can appear inside the pattern it references.</p>
<h2>Conditional match</h2>
<p>The second building block you need starts with a conditional submatch:</p>
<pre class="brush:plain">
(?(condition)yes-pattern|no-pattern)
</pre>
<p>That <i>condition</i> can be many things, most of which won&#8217;t appear in this Item. Although you see the <code>|</code> character, this isn&#8217;t an alternation. It&#8217;s like an alternation because the <code>|</code> separates distinct subpatterns, but unlike an alternation because Perl only ever tries one of the subpatterns, and you get at most two of them.</p>
<p>The simplest condition is just an ordinal number, which is true only if that capture group matched. Here&#8217;s a pattern that has two capture groups:</p>
<pre class="brush:perl">
use v5.10;

my $pattern = qr/
	(?:           # parens for grouping
		(B)     # $1
		|         # alternation
		(M)     # $2
	)
	(?(1)uster|imi) # conditional match
	/x;

foreach ( qw(Mimi Buster Muster Bimi Roscoe) ) {
	say "$_ ", m/$pattern/p ? "matched ${^MATCH}" : 'nope!';
	}
</pre>
<p>In this pattern, if the <code>(B)</code> matches, the conditional uses its yes-pattern, <code>uster</code>. Otherwise, it uses its no-pattern, <code>imi</code>. The only thing that can match besides the <code>(B)</code> is the other part of the alternation, the <code>(M)</code>. The output shows that only <code>Mimi</code> or <code>Buster</code> matches:</p>
<pre class="brush:plain">
Mimi matched Mimi
Buster matched Buster
Muster nope!
Bimi nope!
Roscoe nope!
</pre>
<p>You get the same results if you use <code>(2)</code> as the condition and re-arrange the order of the patterns:</p>
<pre class="brush:perl">
my $pattern = qr/
	(?:(B)|(M))
	(?(2)imi|uster)
	/x;
</pre>
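<p>A conditional is handy when one part of the pattern is optional but another part depends on it. As a sketch, this pattern (our own example, not from the cat patterns above) allows a name with or without parentheses, but requires the closing parenthesis only when the opening one matched. It uses the form with just a yes-pattern, which <code>perlre</code> also documents:</p>
<pre class="brush:perl">
use v5.10;

my $pattern = qr/
	\A
	( \( )?        # group 1 matches only if there is an opening parenthesis
	[^()]+         # the name itself
	(?(1) \) )     # require the closing parenthesis only if group 1 matched
	\z
	/x;

foreach ( 'Buster', '(Buster)', '(Buster', 'Buster)' ) {
	say "$_ ", m/$pattern/ ? 'matched' : 'nope!';
	}
</pre>
<p>The first two strings should match and the unbalanced ones should not.</p>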
<h2>Putting it together</h2>
<p>The condition can also be the literal <code>(DEFINE)</code>. In that case, Perl only allows a yes-branch. And, as its condition implies, it merely <i>defines</i> the patterns and does not execute them. </p>
<p>This means that you can create and label the subpatterns that you need, but not actually assert that any of them match the string. The definitions are just there. This pattern defines and labels three subpatterns then uses none of them:</p>
<pre class="brush:perl">
use v5.10;

my $pattern = qr/
	(?(DEFINE)
		(?&lt;cat>Buster)
		(?&lt;dog>Addie)
		(?&lt;bird>Poppy)
	)
	Mimi
	/x;

foreach ( 'Buster Mimi', 'Roscoe', 'Buster', 'Mimi' ) {
	say "$_ ", m/$pattern/ ? "matched" : 'nope!';
	}
</pre>
<p>It&#8217;s as if the <code>DEFINE</code> bit is not even there:</p>
<pre class="brush:plain">
Buster Mimi matched
Roscoe nope!
Buster nope!
Mimi matched
</pre>
<p>Outside the <code>(DEFINE)</code>, you can reference any of the subpatterns that you created:</p>
<pre class="brush:perl">
use v5.10;

my $pattern = qr/
	(?(DEFINE)
		(?&lt;cat>Buster)
		(?&lt;dog>Addie)
		(?&lt;bird>Poppy)
	)
	(?&#038;cat)
	/x;

foreach ( 'Buster Mimi', 'Roscoe', 'Buster', 'Mimi' ) {
	say "$_ ", m/$pattern/ ? "matched" : 'nope!';
	}
</pre>
<p>Now <code>Buster</code> matches because you reference that defined subpattern:</p>
<pre class="brush:plain">
Buster Mimi matched
Roscoe nope!
Buster matched
Mimi nope!
</pre>
<p>Now it&#8217;s time for the grammar. Inside the <code>(DEFINE)</code>, you can reference subpatterns you haven&#8217;t defined yet, and your subpatterns can get arbitrarily complex:</p>
<pre class="brush:perl">
use v5.10;

my $pattern = qr/
	(?(DEFINE)
		(?&lt;male> Buster | Roscoe )
		(?&lt;female> Mimi | Juliet )
		(?&lt;cat> (?&#038;male) | (?&#038;female) )
		(?&lt;dog>Addie)
		(?&lt;bird>Poppy)
	)
	(?&#038;cat)
	/x;

foreach ( 'Addie', 'Roscoe', 'Buster', 'Mimi' ) {
	say "$_ ", m/$pattern/ ? "matched" : 'nope!';
	}
</pre>
<p>Even though the cat names are in two different subpatterns, the <code>cat</code> subpattern unifies them so all the cat names match:</p>
<pre class="brush:plain">
Addie nope!
Roscoe matched
Buster matched
Mimi matched
</pre>
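<p>Because a subpattern can reference one that you define later, two subpatterns can also reference each other, which gives you recursion. Here is a minimal sketch, assuming a made-up grammar with the hypothetical rule names <code>group</code> and <code>item</code>, that matches balanced parentheses:</p>
<pre class="brush:perl">
use v5.10;

my $balanced = qr/
	(?(DEFINE)
		(?&lt;group> \( (?&#038;item)* \) )       # references item before its definition
		(?&lt;item>  [^()]++ | (?&#038;group) )   # plain text, or a nested group
	)
	\A (?&#038;group) \z
	/x;

foreach ( '(Buster)', '(Buster(Mimi)Roscoe)', '(Buster(Mimi)', 'Buster)' ) {
	say "$_ ", m/$balanced/ ? 'matched' : 'nope!';
	}
</pre>
<p>The possessive <code>++</code> is a design choice to keep the plain-text branch from backtracking; the grammar should accept the first two strings and reject the two with unbalanced parentheses.</p>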
<p>You should now be able to understand this regular expression from Tom Christiansen (appearing on <a href="http://stackoverflow.com/a/4843579/8817">Stack Overflow</a>). You might have to pick it apart, but you know how all the parts fit together to match the Internet Message Format defined in <a href="http://tools.ietf.org/html/rfc5322">RFC 5322</a>:</p>
<pre class="brush:perl">
$rfc5322 = qr{

   (?(DEFINE)

     (?&lt;address>         (?&#038;mailbox) | (?&#038;group))
     (?&lt;mailbox>         (?&#038;name_addr) | (?&#038;addr_spec))
     (?&lt;name_addr>       (?&#038;display_name)? (?&#038;angle_addr))
     (?&lt;angle_addr>      (?&#038;CFWS)? &lt; (?&#038;addr_spec) > (?&#038;CFWS)?)
     (?&lt;group>           (?&#038;display_name) : (?:(?&#038;mailbox_list) | (?&#038;CFWS))? ; (?&#038;CFWS)?)
     (?&lt;display_name>    (?&#038;phrase))
     (?&lt;mailbox_list>    (?&#038;mailbox) (?: , (?&#038;mailbox))*)

     (?&lt;addr_spec>       (?&#038;local_part) \@ (?&#038;domain))
     (?&lt;local_part>      (?&#038;dot_atom) | (?&#038;quoted_string))
     (?&lt;domain>          (?&#038;dot_atom) | (?&#038;domain_literal))
     (?&lt;domain_literal>  (?&#038;CFWS)? \[ (?: (?&#038;FWS)? (?&#038;dcontent))* (?&#038;FWS)?
                                   \] (?&#038;CFWS)?)
     (?&lt;dcontent>        (?&#038;dtext) | (?&#038;quoted_pair))
     (?&lt;dtext>           (?&#038;NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])

     (?&lt;atext>           (?&#038;ALPHA) | (?&#038;DIGIT) | [!#\$%&#038;'*+-/=?^_`{|}~])
     (?&lt;atom>            (?&#038;CFWS)? (?&#038;atext)+ (?&#038;CFWS)?)
     (?&lt;dot_atom>        (?&#038;CFWS)? (?&#038;dot_atom_text) (?&#038;CFWS)?)
     (?&lt;dot_atom_text>   (?&#038;atext)+ (?: \. (?&#038;atext)+)*)

     (?&lt;text>            [\x01-\x09\x0b\x0c\x0e-\x7f])
     (?&lt;quoted_pair>     \\ (?&#038;text))

     (?&lt;qtext>           (?&#038;NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
     (?&lt;qcontent>        (?&#038;qtext) | (?&#038;quoted_pair))
     (?&lt;quoted_string>   (?&#038;CFWS)? (?&#038;DQUOTE) (?:(?&#038;FWS)? (?&#038;qcontent))*
                          (?&#038;FWS)? (?&#038;DQUOTE) (?&#038;CFWS)?)

     (?&lt;word>            (?&#038;atom) | (?&#038;quoted_string))
     (?&lt;phrase>          (?&#038;word)+)

     # Folding white space
     (?&lt;FWS>             (?: (?&#038;WSP)* (?&#038;CRLF))? (?&#038;WSP)+)
     (?&lt;ctext>           (?&#038;NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
     (?&lt;ccontent>        (?&#038;ctext) | (?&#038;quoted_pair) | (?&#038;comment))
     (?&lt;comment>         \( (?: (?&#038;FWS)? (?&#038;ccontent))* (?&#038;FWS)? \) )
     (?&lt;CFWS>            (?: (?&#038;FWS)? (?&#038;comment))*
                         (?: (?:(?&#038;FWS)? (?&#038;comment)) | (?&#038;FWS)))

     # No whitespace control
     (?&lt;NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])

     (?&lt;ALPHA>           [A-Za-z])
     (?&lt;DIGIT>           [0-9])
     (?&lt;CRLF>            \x0d \x0a)
     (?&lt;DQUOTE>          ")
     (?&lt;WSP>             [\x20\x09])
   )

   (?&#038;address)

}x;
</pre>
<p>If that&#8217;s not clever enough for you, try <a href="http://stackoverflow.com/a/4286326/8817">Tom&#8217;s use of <code>(DEFINE)</code> to properly parse HTML</a>.</p>
<h2>Things to remember</h2>
<ul>
<li>You can reference a named subpattern with <code>(?&#038;NAME)</code>
<li>You can choose a subpattern with a condition <code>(?(condition)yes-pattern|no-pattern)</code>
<li>You can define and label subpatterns for later use with <code>(DEFINE)</code>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.effectiveperlprogramming.com/blog/1479/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Use lookarounds to split to avoid special cases</title>
		<link>http://www.effectiveperlprogramming.com/blog/1411</link>
		<comments>http://www.effectiveperlprogramming.com/blog/1411#comments</comments>
		<pubDate>Sun, 16 Oct 2011 17:01:38 +0000</pubDate>
		<dc:creator>brian d foy</dc:creator>
				<category><![CDATA[item]]></category>
		<category><![CDATA[regular expressions]]></category>

		<guid isPermaLink="false">http://www.effectiveperlprogramming.com/?p=1411</guid>
		<description><![CDATA[There are some regular expression tricks that can help you deal with balanced delimiters in a string. The split command takes a pattern, removes the parts of a string that match that pattern, and gives you a list of the parts of the string between those separators. Said another way, split works when the parts [...]]]></description>
			<content:encoded><![CDATA[<p>There are some regular expression tricks that can help you deal with balanced delimiters in a string. The <a href="http://perldoc.perl.org/functions/split.html">split</a> command takes a pattern, removes the parts of a string that match that pattern, and gives you a list of the parts of the string between those separators. Said another way, <a href="http://perldoc.perl.org/functions/split.html">split</a> works when the parts you don&#8217;t need are between the values.</p>
<p>Single character separators are easy:</p>
<pre class="brush:perl">
use v5.10;

my @letters = split /:/, 'a:b:c:d:e';
say "@letters";
</pre>
<p>The list comes out just as you expect:</p>
<pre class="brush:plain">
a b c d e
</pre>
<p>Even multiple or variable width patterns are fine:</p>
<pre class="brush:perl">
use v5.10;

my @cats = split /\s+/, 'Buster
	Mimi     Roscoe';
say "@cats";
</pre>
<p>The list comes out just as you expect:</p>
<pre class="brush:plain">
Buster Mimi Roscoe
</pre>
<p>It gets more tricky when you have balanced delimiters, when there&#8217;s something that marks the start and the end of a value. The problem is that there is something in front of the first element and something after the last element. You can&#8217;t split on the pattern of characters between the values because you don&#8217;t remove everything:</p>
<pre class="brush:perl">
use v5.10;

my @cats = split />&lt;/, '&lt;Buster>&lt;Mimi>&lt;Roscoe>';
say "@cats";
</pre>
<p>The first and last delimiter characters are still attached to their values:</p>
<pre class="brush:plain">
&lt;Buster Mimi Roscoe&gt;
</pre>
<p>You might be tempted to live with that and process those values after the <a href="http://perldoc.perl.org/functions/split.html">split</a>:</p>
<pre class="brush:perl">
use v5.10;

my @cats = split />&lt;/, '&lt;Buster>&lt;Mimi>&lt;Roscoe>';
$cats[0] =~ s/&lt;//;
$cats[-1] =~ s/>//;
say "@cats";
</pre>
<p>Some people might be satisfied with that, and it does work, but it&#8217;s much better to remove the special cases. If you only match the characters that you want to remove, you run into a different problem: you get an empty leading field when you match the first delimiter character:</p>
<pre class="brush:perl">
use v5.10;

my @cats = split />&lt;|\A&lt;|>\z/, '&lt;Buster>&lt;Mimi>&lt;Roscoe>';

say "@cats";
</pre>
<p>There&#8217;s a space at the beginning of the output because there&#8217;s an empty leading field, but the list at least doesn&#8217;t have any of the delimiter characters:</p>
<pre class="brush:plain">
 Buster Mimi Roscoe
</pre>
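<p>The empty field is easy to miss as a single leading space. As a sketch, you can wrap each element in brackets (the <code>$fields</code> variable and the bracket format are just our choices here) to make it visible and count the elements:</p>
<pre class="brush:perl">
use v5.10;

my @cats = split />&lt;|\A&lt;|>\z/, '&lt;Buster>&lt;Mimi>&lt;Roscoe>';

# wrap each element in brackets so the empty one is visible
my $fields = join ' ', map { "[$_]" } @cats;
say scalar(@cats), " fields: $fields";
</pre>
<p>You should see four fields, the first of which is empty. There is no empty field at the end because <a href="http://perldoc.perl.org/functions/split.html">split</a> discards trailing empty fields by default.</p>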
<p>To fix this, you still need to handle the leading field, perhaps by shifting it off. Again, this works, even if it&#8217;s unsightly:</p>
<pre class="brush:perl">
use v5.10;

my @cats = split />&lt;|\A&lt;|>\z/, '&lt;Buster>&lt;Mimi>&lt;Roscoe>';
shift @cats;

say "@cats";
</pre>
<p>The special processing isn&#8217;t as bad, but you have to remember to handle that one element.</p>
<p>Instead of matching characters, you can use lookarounds to <a href="http://perldoc.perl.org/functions/split.html">split</a> on the middle of the balanced delimiter by using a <i>zero-width assertion</i>. The lookarounds match a condition in the string but do not consume any characters. These are conditions in the string, not characters to match.</p>
<p>If you use a lookbehind next to a lookahead, you can <a href="http://perldoc.perl.org/functions/split.html">split</a> on the position in the string where both conditions match. You want to match in the middle of a <code>>&lt;</code> so the <code>&gt;</code> ends up with the preceding element and the <code>&lt;</code> stays with the succeeding element.</p>
<p>The positive lookbehind has the general form <code>(?&lt;=PATTERN)</code>. That pattern, which must be fixed-width, must match immediately before the position. In this case, you want to match a <code>&gt;</code> before the position, so the assertion is <code>(?&lt;=&gt;)</code>.</p>
<p>The positive lookahead is almost the same thing, with the form <code>(?=PATTERN)</code>. You want to match a <code>&lt;</code> after the position, so your assertion is <code>(?=&lt;)</code>.</p>
<p>Putting them together, the lookbehind next to the lookahead, splits the values:</p>
<pre class="brush:perl">
use v5.10;

my @cats = split /(?&lt;=>)(?=&lt;)/, '&lt;Buster>&lt;Mimi>&lt;Roscoe>';

say "@cats";
</pre>
<p>The output list still has the delimiter characters, but now each element needs the same processing, so there are no special cases:</p>
<pre class="brush:plain">
&lt;Buster> &lt;Mimi> &lt;Roscoe>
</pre>
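<p>Since the lookarounds consume nothing, the match has zero width: it is a position, not a run of characters. A small sketch, using a global match and <code>pos</code> (our own way of inspecting it, with the hypothetical array <code>@positions</code>), shows where those positions are:</p>
<pre class="brush:perl">
use v5.10;

my $string = '&lt;Buster>&lt;Mimi>&lt;Roscoe>';

my @positions;
while ( $string =~ /(?&lt;=>)(?=&lt;)/g ) {
	push @positions, pos($string);   # the offset where the split happens
	}

say "split positions: @positions";
</pre>
<p>The two offsets, 8 and 14, sit between each <code>&gt;</code> and the <code>&lt;</code> that follows it, which is exactly where <a href="http://perldoc.perl.org/functions/split.html">split</a> cuts the string.</p>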
<p>Once you have the values in their own elements, you can remove the delimiters:</p>
<pre class="brush:perl">
use v5.14;

my @cats =
	map { s/\A&lt;|>\z//rg }    # return the modified value
	split /(?&lt;=>)(?=&lt;)/,
	'&lt;Buster>&lt;Mimi>&lt;Roscoe>';

say "@cats";
</pre>
<p>That might seem a bit silly, but we&#8217;re only using a simple example to illustrate the point.</p>
<p>Consider a slightly more complicated case, where the fields are quoted, but then separated by commas. Unless you&#8217;re learning to re-invent the wheel (a valid exercise to sharpen your skills), you should probably use a module (<span class="item">Item 115. Don’t use regular expressions for comma-separated values</span>). For this example, you&#8217;ll do it yourself:</p>
<pre class="brush:perl">
use v5.10;

my @cats =
	split /(?&lt;="),(?=")/,
	'"Buster","Mimi","Roscoe"';

say "@cats";
</pre>
<p>This removes the commas, as long as they are between quotes. However, you leave the quotes in place so you don't treat the first and last values specially:</p>
<pre class="brush:plain">
"Buster" "Mimi" "Roscoe"
</pre>
<p>To get rid of the quotes, you process each item in the same way:</p>
<pre class="brush:perl">
use v5.14;

my @cats =
	map { s/\A"|"\z//rg }       # return the modified value
	split /(?&lt;="),(?=")/,
	'"Buster","Mimi","Roscoe"';

say "@cats";
</pre>
<p>You might try to construct a more complicated regular expression to also remove the quotes, but that's going to be harder to read and maintain than doing it in two simple steps.</p>
<h2>Things to remember</h2>
<ul>
<li>You don't have to remove delimiters in one step
<li>You can use a lookbehind next to a lookahead to specify a position in a string
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.effectiveperlprogramming.com/blog/1411/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Use lookarounds to eliminate special cases in split</title>
		<link>http://www.effectiveperlprogramming.com/blog/1386</link>
		<comments>http://www.effectiveperlprogramming.com/blog/1386#comments</comments>
		<pubDate>Sun, 25 Sep 2011 16:30:20 +0000</pubDate>
		<dc:creator>brian d foy</dc:creator>
				<category><![CDATA[item]]></category>
		<category><![CDATA[regular expressions]]></category>

		<guid isPermaLink="false">http://www.effectiveperlprogramming.com/?p=1386</guid>
		<description><![CDATA[The split built-in takes a string and turns it into a list, discarding the separators that you specify as a pattern. This is easy when the separator is simple, but seems hard if the separator gets more tricky. For a simple example, you can split an entry from /etc/passwd (although getpw* functions will do that [...]]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://perldoc.perl.org/functions/split.html">split</a> built-in takes a string and turns it into a list, discarding the separators that you specify as a pattern. This is easy when the separator is simple, but seems hard if the separator gets more tricky.</p>
<p>For a simple example, you can split an entry from <i>/etc/passwd</i> (although the <code>getpw*</code> functions will do that for you):</p>
<pre class="brush:plain">
root:*:0:0:System Administrator:/var/root:/bin/sh
</pre>
<p>The colons separate the fields, so you split on a colon:</p>
<pre class="brush:perl">
my @fields = split /:/, $passwd_line;
</pre>
<p>That works just fine because the separator is a single character, that character is the same between each field, and the separator character doesn&#8217;t appear in any of the data.</p>
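<p>As a short sketch of that simple case, you can assign the list from <a href="http://perldoc.perl.org/functions/split.html">split</a> directly to named variables. The variable names here are our own labels for the seven fields in the sample line:</p>
<pre class="brush:perl">
use v5.10;

my $passwd_line = 'root:*:0:0:System Administrator:/var/root:/bin/sh';

# one variable per colon-separated field, in order
my( $user, $password, $uid, $gid, $gecos, $home, $shell ) =
	split /:/, $passwd_line;

say "$user uses $shell and lives in $home";
</pre>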
<p>A slightly more tricky example has a character from the separator also show up in the data. Consider comma-separated values that also allow a comma in the data. If you really have to do this, you would use a module (<span class="item">Item 115. Don’t use regular expressions for comma-separated values</span>). However, this is a good task to illustrate some of the tricks in this Item. You might see these data stored in many ways. You are likely to see all the fields quoted if any one of them has the comma:</p>
<pre class="brush:plain">
"Buster","Roscoe, Cat","Mimi"
</pre>
<p>You can split on <code>","</code>, which separates all the fields:</p>
<pre class="brush:perl">
my $string = q("Buster","Roscoe, Cat","Mimi");

my @fields = split /","/, $string;

$" = "\n";
print "@fields\n";
</pre>
<p>However, the first and last fields have remnants of the quoting:</p>
<pre class="brush:plain">
"Buster
Roscoe, Cat
Mimi"
</pre>
<p>In this case, the simple <a href="http://perldoc.perl.org/functions/split.html">split</a> failed because it only removes text between the fields and doesn&#8217;t care at all about text at the beginning of the string or the end of the string.</p>
<p>You might think that you can make special cases to handle the beginning and end of the string bits. Creating special cases is almost always what you want to avoid: they make the code more complicated and they make you think about more than you really need to think about. Still, you can do that with alternations in the pattern:</p>
<pre class="brush:perl">
my $string = q("Buster","Roscoe, Cat","Mimi");

my @fields = split /\A"|","|"\z/, $string;

$" = "\n";
print "@fields\n";
</pre>
<p>And, it doesn&#8217;t work. The <a href="http://perldoc.perl.org/functions/split.html">split</a> preserves leading empty fields, so you get an extra field at the start:</p>
<pre class="brush:plain">

Buster
Roscoe, Cat
Mimi
</pre>
<p>You could handle that by removing the first element, but that&#8217;s more duct tape and spit over the other kludge. Not only do you have two special cases in the pattern, but you have a special case in the output.</p>
<p>You don&#8217;t have to remove the quotes right away though. You can reduce all the special cases by not matching the quote characters in the <a href="http://perldoc.perl.org/functions/split.html">split</a> pattern. You can use a <i>lookaround</i> to find the commas surrounded by quotes:</p>
<pre class="brush:perl">
my $string = q("Buster","Roscoe, Cat","Mimi");

my @fields = split /(?&lt;="),(?=")/, $string;

$" = "\n";
print "@fields\n";
</pre>
<p>The <i>positive lookbehind</i>, <code>(?&lt;=...)</code>, is a <i>zero-width assertion</i>. It matches a pattern that exists (hence <i>positive</i>) but doesn't consume the characters it matches. You already know about other zero-width assertions, such as <code>\b</code> and <code>^</code>. The lookbehind merely asserts a condition in the string before the current position. The <i>positive lookahead</i>, <code>(?=...)</code>, is the same thing, but looks ahead of the current position.</p>
<p>Now all of the fields retain their quotes because the lookarounds do not consume the characters they match, even though they assert those characters must be there:</p>
<pre class="brush:plain">
"Buster"
"Roscoe, Cat"
"Mimi"
</pre>
<p>You can easily strip off the quotes, handling every element returned by split in the same way:</p>
<pre class="brush:perl">
use v5.14;
my $string = q("Buster","Roscoe, Cat","Mimi");

my @fields =
	map { s/\A"|"\z//gr }
	split /(?&lt;="),(?=")/, $string;

$" = "\n";
print "@fields\n";
</pre>
<p>The pattern has no special cases, and the output from <a href="http://perldoc.perl.org/functions/split.html">split</a> has no special cases. Eliminating special cases reduces the number of things you have to remember and reduces the likelihood that you'll mess up one of the cases. Now the quotes are gone:</p>
<pre class="brush:plain">
Buster
Roscoe, Cat
Mimi
</pre>
<p>What if the separator were even more complex, with a literal quote mark inside the data? If you can do that, you can imagine a quote character next to a comma in the field:</p>
<pre class="brush:plain">
"Buster","Roscoe "","" Cat","Mimi"
</pre>
<p>Now you want to split on a comma with quotes around it, but only if it doesn't have two consecutive quotes on either side. You can combine the positive lookarounds with <i>negative lookarounds</i>. The negative versions act the same, but assert that the condition cannot match, just like a <code>\B</code> asserts that the position is not a word boundary:</p>
<pre class="brush:perl">
use v5.14;
my $string = q("Buster","Roscoe "","" Cat","Mimi");

my @fields =
	map { s/"(?=")//gr }
	map { s/\A"|"\z//gr }
	split /(?&lt;!"")(?&lt;="),(?=")(?!"")/, $string;

$" = "\n";
print "@fields\n";
</pre>
<p>In processing the <code>""</code>, you use another positive lookahead to unescape the doubled double quote character:</p>
<pre class="brush:plain">
Buster
Roscoe "," Cat
Mimi
</pre>
<p>As a final example, instead of quoted fields, you might see the non-separator comma as an escaped character: </p>
<pre class="brush:plain">
Buster,Roscoe\, Cat,Mimi
</pre>
<p>In this case, you only want to split on a comma that does <i>not</i> have an escape character before it. You can't use a positive lookbehind because you don't want to match characters before the comma. Instead, you want a negative lookbehind because you want to assert that there are characters that can't appear before the comma. Instead of a <code>=</code>, you use a <code>!</code>:</p>
<pre class="brush:perl">
use v5.14;
my $string = q(Buster,Roscoe\\, Cat,Mimi);

my @fields =
	map { s/\\(?=,)//gr }
	split /(?&lt;!\\),/, $string;

$" = "\n";
print "@fields\n";
</pre>
<p>Again, you use another positive lookahead, <code>(?=,)</code>, in the <code>s///</code> so your substitution pattern does not match the character that you don't want to replace. Otherwise, you'd have to type the comma twice: </p>
<pre class="brush:perl">
s/\\,/,/gr
</pre>
<p>You can go even further with these examples, creating much uglier and more complex patterns with additional levels of quoting. That should lead you to suspect that regular expressions (or at least a single regular expression) aren't the best tool for this.</p>
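<p>For comparison, here is a rough sketch that lets a module do the work. It uses the core <code>Text::ParseWords</code> module as a stand-in, which is an assumption made for illustration: it handles quoted fields that contain the separator, but it does not understand the doubled-quote convention, so for real comma-separated data you would more likely reach for a dedicated CSV module from CPAN such as <code>Text::CSV</code>:</p>
<pre class="brush:perl">
use v5.14;
use Text::ParseWords qw(parse_line);

my $string = q("Buster","Roscoe, Cat","Mimi");

# Arguments: the delimiter pattern, a keep flag (0 strips the quotes),
# and the line to break apart
my @fields = parse_line( ',', 0, $string );

say for @fields;
</pre>
<p>There are no lookarounds to get wrong here, and the quoting rules live in one tested place instead of in every pattern you write.</p>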
<h2>Things to remember</h2>
<ul>
<li>If you really have to parse comma-separated values, use a module instead of writing your own patterns
<li>Lookarounds assert a condition in the string without consuming any characters
<li>The positive lookarounds assert their patterns must match
<li>The negative lookarounds assert their pattern must not match
<li>Use the lookarounds to eliminate special cases in complex split patterns
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.effectiveperlprogramming.com/blog/1386/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Set default regular expression modifiers</title>
		<link>http://www.effectiveperlprogramming.com/blog/1063</link>
		<comments>http://www.effectiveperlprogramming.com/blog/1063#comments</comments>
		<pubDate>Sun, 19 Jun 2011 17:34:28 +0000</pubDate>
		<dc:creator>brian d foy</dc:creator>
				<category><![CDATA[5.14]]></category>
		<category><![CDATA[item]]></category>
		<category><![CDATA[regular expressions]]></category>

		<guid isPermaLink="false">http://www.effectiveperlprogramming.com/?p=1063</guid>
		<description><![CDATA[Are you tired of adding the same modifiers to all of your regular expressions? For instance, you might always add the /u modifier to turn on Unicode semantics on all of your patterns, including qr//, m//, and s///. Instead of remembering to do that to every pattern, the re that ships with Perl 5.14 [...]]]></description>
			<content:encoded><![CDATA[<p>Are you tired of adding the same modifiers to all of your regular expressions? For instance, you might always add the <code class="regex">/u</code> modifier to turn on Unicode semantics for all of your patterns, including <code class="builtin">qr//</code>, <code class="builtin">m//</code>, and <code class="builtin">s///</code>. Instead of remembering to do that for every pattern, the <code class="module">re</code> pragma that ships with Perl 5.14 lets you do that for all patterns in the current lexical scope. You can also turn <i>off</i> a modifier for the rest of the scope.</p>
<p>You can use any modifier that affects the pattern, but not the modifiers that affect the operator (see <a href="http://www.effectiveperlprogramming.com/blog/174">Know the difference between regex and match operator flags</a>). Try an example with an easier modifier. The <code class="regex">/i</code> modifier makes the pattern case insensitive. Instead of adding that for all of your match operations, you use the <code class="module">re</code> pragma&#8217;s <i>flags</i> mode. In this example, you use the <a href="http://www.effectiveperlprogramming.com/blog/95">Test::More module to experiment with new ideas</a>. </p>
<p>First, write some tests that you expect to fail. Since the pattern is all lowercase, but the target string has an uppercase letter, these should fail:</p>
<pre class="brush:perl">
use 5.014;
use Test::More;

like( 'Buster', qr/buster/, 'Buster matches with case insensitivity' );
ok( 'Buster' =~ m/buster/, 'Buster matches with m//' );
done_testing();
</pre>
<p>And they do fail. That&#8217;s a good thing, because you want to magically make them pass by adding a default modifier:</p>
<pre class="brush:plain">
not ok 1 - Buster matches with case insensitivity
not ok 2 - Buster matches with m//
1..2
#   Failed test 'Buster matches with case insensitivity'
#                   'Buster'
#     doesn't match '(?^u:buster)'
#   Failed test 'Buster matches with m//'
# Looks like you failed 2 tests of 2.
</pre>
<p>Now, add the default modifiers. You add those through the import list for the <code class="module">re</code> module. The list of modifiers starts with a slash to distinguish it from other imports. To make the <code class="regex">/i</code> the default, that&#8217;s exactly what you import:</p>
<pre class="brush:perl">
use 5.014;
use Test::More;

use re '/i';
like( 'Buster', qr/buster/, 'Buster matches with case insensitivity' );
ok( 'Buster' =~ m/buster/, 'Buster matches with m//' );

done_testing();
</pre>
<p>Now the tests pass because they are case insensitive:</p>
<pre class="brush:plain">
ok 1 - Buster matches with case insensitivity
ok 2 - Buster matches with m//
1..2
</pre>
<p>These default modifiers are only lexically scoped, and that&#8217;s how you should use them. You don&#8217;t want to change more than you intend, and the next programmer who comes along might not realize that you set the default modifiers at the top of the file. Try it with a lexical scope to check that it&#8217;s limited to that scope (see <a href="http://www.effectiveperlprogramming.com/blog/48">Know what creates a scope</a>):</p>
<pre class="brush:perl">
use 5.014;
use Test::More;

SCOPE: {
	use re '/i';
	like( 'Buster', qr/buster/, 'Buster matches with case insensitivity' );
	ok( 'Buster' =~ m/buster/, 'Buster matches with m//' );
	outside_scope();
	}

sub outside_scope {
	unlike( 'Buster', qr/buster/, 'Buster does not match with case insensitivity' );
	ok( !( 'Buster' =~ m/buster/ ), 'Buster does not match with m//' );
	}

done_testing();
</pre>
<p>Now you see that the default modifier is limited to the lexical scope:</p>
<pre class="brush:plain">
ok 1 - Buster matches with case insensitivity
ok 2 - Buster matches with m//
ok 3 - Buster does not match with case insensitivity
ok 4 - Buster does not match with m//
1..4
</pre>
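<p>The default modifiers apply where Perl compiles the pattern, not where you later use it. As a minimal sketch (the variable name is made up for illustration), a <code class="builtin">qr//</code> object built inside the scope keeps its <code class="regex">/i</code> after the scope ends, while a pattern written outside the scope does not get it:</p>
<pre class="brush:perl">
use v5.14;

my $re;
{
	use re '/i';
	$re = qr/buster/;   # compiled with the default /i in effect
}

# No default modifiers out here, but $re keeps the /i it was compiled with
say 'The qr// object still matches' if 'Buster' =~ $re;

# A pattern compiled outside the scope is case sensitive again
say 'A new pattern does not match' unless 'Buster' =~ /buster/;
</pre>
<p>That matters when you pass compiled patterns around: the code that receives the <code class="builtin">qr//</code> object gets the modifiers that were in effect where you created it.</p>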
<p>So far, you&#8217;ve used only one modifier as the default, but you can stack them just like you would with <code class="builtin">qr//</code> or <code class="builtin">m//</code> or <code class="builtin">s///</code>. Suppose you want to turn on both <code class="regex">/i</code> and <code class="regex">/s</code> at the same time so you get case insensitivity and let the <code class="regex">.</code> match a newline:</p>
<pre class="brush:perl">
use 5.014;
use Test::More;

use re '/is';
like( "Bu\nter", qr/bu.ter/, 'Bu\\nter matches with case insensitivity' );
ok( 'Buster' =~ m/bu.ter/, 'Buster matches with m//' );

done_testing();
</pre>
<p>Both of those work their magic as default values:</p>
<pre class="brush:plain">
ok 1 - Bu\nter matches with case insensitivity
ok 2 - Buster matches with m//
1..2
</pre>
<p>You don&#8217;t have to stack them, though. You can specify them separately and it works just as well, although each group must start with a <code>/</code>:</p>
<pre class="brush:perl">
use re '/i', '/s';

use re qw(/i /s);
</pre>
<h3>Turn off default modifiers</h3>
<p>Once turned on, these modifiers apply to all the patterns in the pragma&#8217;s scope, but sometimes you don&#8217;t want an enabled modifier in part of a pattern. Suppose, for instance, that one part of the pattern absolutely should not be case insensitive. You can <a href="http://www.effectiveperlprogramming.com/blog/735">use the <code>(?^:)</code> sequence to turn off modifiers for a subpattern</a>:</p>
<pre class="brush:perl">
use 5.014;
use Test::More;
use re '/i';

foreach my $string ( qw(Buster bUSTER buster BuStEr) ) {
	say "$string matches" if  $string =~ /(?^:B)uster/;
	}
</pre>
<p>The output shows that only the strings starting with an uppercase <i>B</i> match because the <code class="regex">(?^:B)</code> portion turns off all modifiers for that subpattern. You should consider using <code>(?^:)</code> if you are also going to have default flags.</p>
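<p>To turn a default modifier off for the rest of the scope, rather than for one subpattern, you can unimport it with <code>no re</code>. This is a minimal sketch with made-up variable names; check the <code class="module">re</code> documentation for your version of Perl:</p>
<pre class="brush:perl">
use v5.14;
use re '/i';

# The default /i is in effect here
my $with_default = 'Buster' =~ /buster/ ? 1 : 0;

no re '/i';   # drop the default for the rest of this scope

# This pattern is case sensitive again
my $without_default = 'Buster' =~ /buster/ ? 1 : 0;

say "With the default: $with_default";
say "Without the default: $without_default";
</pre>
<p>Like <code>use re</code>, the <code>no re</code> form is lexically scoped, so it only affects patterns that Perl compiles after it in the same scope.</p>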
<h2>Things to remember</h2>
<ul>
<li>Set default regular expression modifiers with the <code class="pragma">re</code> pragma.
<li>You can only use modifiers that apply to the pattern, not the operator.
<li>You can stack multiple modifiers in a single import string, such as <code class="regex">/is</code>.
<li>Turn off modifiers for a subpattern with <code>(?^:)</code>.
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.effectiveperlprogramming.com/blog/1063/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Find dates with Regexp::Common</title>
		<link>http://www.effectiveperlprogramming.com/blog/1002</link>
		<comments>http://www.effectiveperlprogramming.com/blog/1002#comments</comments>
		<pubDate>Wed, 16 Feb 2011 20:20:08 +0000</pubDate>
		<dc:creator>brian d foy</dc:creator>
				<category><![CDATA[midweek bonus item]]></category>
		<category><![CDATA[regular expressions]]></category>

		<guid isPermaLink="false">http://www.effectiveperlprogramming.com/?p=1002</guid>
		<description><![CDATA[[This is a mid-week bonus item] Suppose you want to find some dates inside a big string. The problem with dates is that there are so many ways to write them, and even if you can come up with a pattern to get the structure right, can you handle the different locales and languages that [...]]]></description>
			<content:encoded><![CDATA[<p>[<i>This is a mid-week bonus item</i>]</p>
<p>Suppose you want to find some dates inside a big string. The problem with dates is that there are so many ways to write them, and even if you can come up with a pattern to get the structure right, can you handle the different locales and languages that use different words to refer to the same day or month?</p>
<p>In <span class="item">Item 42. Don&#8217;t reinvent the regex</span>, you saw the <a class="external cpan" href="http://search.cpan.org/dist/Regexp-Common">Regexp::Common</a> module. It creates the regular expressions that many people often get wrong because they miss some subtle part of the pattern.</p>
<p><a class="external cpan" href="http://search.cpan.org/dist/Regexp-Common-time">Regexp::Common::time</a>&#8217;s date handling is quite amazing though. It&#8217;s a plugin, so you need to install it separately. Instead of specifying a regular expression, you can use the <code>-pat</code> option to specify the <i>structure</i> of the date, using a string much like that for <code>strftime</code>, although with some regular expression bits added. From the semi-pattern, it constructs a much more complicated pattern that does the right thing. Since the module gives you a regex object, you can print it to see the pattern:</p>
<p>In this example, you build a pattern for the two date formats that <code class="binary">ls -l</code> prints:</p>
<pre class="brush:perl">
use Regexp::Common qw(time);

# The two date formats that ls -l prints:
# May  3  2010
# Jan 17 18:21
my $date_re = $RE{time}{strftime}{
	-pat => '%b\s+%_d\s+(?:%Y|%_H:%M)'
	};

print "Pattern is------\n$date_re\n-------\n";
</pre>
<p>This pattern reflects the national representation for the en_US locale:</p>
<pre class="brush:plain">
Pattern is------
(?=[SAFOJNMD])(?&gt;Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+(?:0[1-9]|[12]\d|3[01]|(?&lt;!\d)[1-9])\s+(?:\d{4}|(?:(?=\d)(?:[01]\d|2[0123]|(?&lt;!\d)\d)):(?:[0-5]\d))
-------
</pre>
<p>You can change your locale, in this case, to tr_TR for Turkish, to get a different pattern that has the same structure, although I don&#8217;t know if the Turks write their dates like this:</p>
<pre class="brush:plain">
Pattern is------
(?=[AOTNKEHMŞ])(?&gt;Oca|Şub|Mar|Nis|May|Haz|Tem|Ağu|Eyl|Eki|Kas|Ara)\s+(?:0[1-9]|[12]\d|3[01]|(?&lt;!\d)[1-9])\s+(?:\d{4}|(?:(?=\d)(?:[01]\d|2[0123]|(?&lt;!\d)\d)):(?:[0-5]\d))
-------
</pre>
<p>You can now use this pattern to match dates in text. Here&#8217;s a program that takes in a line and puts <code class="string">^</code> characters under the parts it thinks are dates:</p>
<pre class="brush:perl">
use Regexp::Common qw(time);

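# The two date formats that ls -l prints:
# May  3  2010
# Jan 17 18:21
my $date_re = $RE{time}{strftime}{
	-pat => '%b\s+%_d\s+(?:%Y|%_H:%M)'
	};

# Read lines from standard input, or from files named on the command line
while( defined( my $line = &lt;&gt; ) ) {
	next unless $line =~ /$date_re/;
	# @- and @+ hold the start and end offsets of the last successful match
	my $start = $-[0];
	my $stop  = $+[0];

	my $underline = ( ' ' x $start ) . ( '^' x ($stop - $start) );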

	print $line;
	print $underline, "\n\n";
	}
</pre>
<p>You can test this by piping some output into this program. Here&#8217;s an extract of output from the Unix <code class="binary">ls</code> command. Notice that the first date has a time instead of a year, but you still find it:</p>
<pre class="brush:plain">
$ ls -l /usr/local/perls/perl-5.10.1/lib/site_perl/5.10.1 | perl date_finder.pl
drwxr-xr-x   4 brian  wheel    136 Dec  9 01:58 Acme
                                   ^^^^^^^^^^^^

-r--r--r--   1 brian  wheel  32517 Jul  6  2007 AppConfig.pm
                                   ^^^^^^^^^^^^

-r--r--r--   1 brian  wheel  54725 Jul 19  2007 Expect.pm
                                   ^^^^^^^^^^^^

-r--r--r--   1 brian  wheel  43735 Jul 19  2007 Expect.pod
                                   ^^^^^^^^^^^^

drwxr-xr-x   3 brian  wheel    102 May 16  2010 ExtUtils
                                   ^^^^^^^^^^^^

drwxr-xr-x   3 brian  wheel    102 Jun 17  2010 local
                                   ^^^^^^^^^^^^

-r--r--r--   1 brian  wheel   9137 Jun 15  2009 lwpcook.pod
                                   ^^^^^^^^^^^^

-r--r--r--   1 brian  wheel  25447 Jun 15  2009 lwptut.pod
                                   ^^^^^^^^^^^^

drwxr-xr-x   4 brian  wheel    136 May 28  2010 namespace
                                   ^^^^^^^^^^^^

-r--r--r--   1 brian  wheel   1931 Sep 22  2009 oose.pm
                                   ^^^^^^^^^^^^
</pre>
<p>Notice that this would be hard to do with <code class="builtin">split</code> if you run into filenames that have spaces. You can&#8217;t depend on fixed column widths because the file sizes can move things around. It turns out to be pretty annoying.</p>
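<p>The <code>@-</code> and <code>@+</code> offset arrays do the real work in that program, and they work with any pattern. As a rough sketch, the hypothetical <code>underline_matches</code> helper below marks every match on a line rather than just the first one. The sample line and the hand-written date pattern are made up for illustration and stand in for the pattern from Regexp::Common::time:</p>
<pre class="brush:perl">
use v5.14;

# Hypothetical helper: return a string with ^ under every match of $re
sub underline_matches {
	my( $line, $re ) = @_;
	my $marks = ' ' x length $line;

	while( $line =~ /$re/g ) {
		# @- and @+ hold the offsets of the match that just succeeded
		my( $start, $stop ) = ( $-[0], $+[0] );
		substr( $marks, $start, $stop - $start, '^' x ( $stop - $start ) );
		}

	return $marks;
	}

my $line  = 'created Jul 19 2007, updated May 16 2010';
my $marks = underline_matches( $line, qr/[A-Z][a-z]{2}\s+\d{1,2}\s+\d{4}/ );

say $line;
say $marks;
</pre>
<p>Because the helper takes the pattern as an argument, you could hand it the <code>$date_re</code> from the earlier program instead of the simplified pattern used here.</p>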
]]></content:encoded>
			<wfw:commentRss>http://www.effectiveperlprogramming.com/blog/1002/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

