Define grammars in regular expressions

[ This is the 100th Item we’ve shared with you in the two years this blog has been around. We deserve a holiday and we’re taking it, so read us next year! Happy Holidays.]

Perl 5.10 added rudimentary grammar support to its regular expressions. You can define many subpatterns directly in your pattern, use them to define larger subpatterns, and, finally, when you have everything in place, let Perl do the work.

There are other ways, some more powerful, to do the same thing. This Item is not about those, but you can read about Regexp::Grammars or Parse::RecDescent on your own. You also won’t get a recommendation here about which one to use for your task, because we don’t know your situation.

To understand this new syntax, you have to study it from the ground up. It’s not simple, and the terse documentation in perlre doesn’t do much to help.

Referencing a subpattern

The first part you need is the ability to call a named part of the pattern (to label a subpattern, see Item 31. Use named captures to label matches). To re-match a labeled subpattern, you use:

(?&NAME)

You can use that syntax to rerun a subpattern later:

use v5.10;

my $pattern = qr/
	(?<cat>Buster|Mimi)
	\s+
	(?&cat)
	/x;

foreach ( 'Buster Mimi', 'Mimi Buster', 'Buster', 'Buster Buster' ) {
	say "$_ ", m/$pattern/p ? "matched" : 'nope!';
	}

The labeled subpattern has an alternation where either cat name can match. When you reference it again, you re-run the alternation and you can match either cat name again:

Buster Mimi matched
Mimi Buster matched
Buster nope!
Buster Buster matched

This is not the same thing as matching the same text that a labeled capture group already matched. For that, you use \k<NAME>:

\k<NAME>

This pattern is a different beast. Whichever cat name matches first also has to match second:

use v5.10;

my $pattern = qr/
	(?<cat>Buster|Mimi)
	\s+
	\k<cat>
	/x;

foreach ( 'Buster Mimi', 'Mimi Buster', 'Buster', 'Buster Buster' ) {
	say "$_ ", m/$pattern/p ? "matched" : 'nope!';
	}

Now only one of the strings matches because only one string repeats a cat’s name:

Buster Mimi nope!
Mimi Buster nope!
Buster nope!
Buster Buster matched

Although you won’t see it here, the (?&NAME) syntax is the trick to matching a recursive pattern since the reference can appear inside the pattern it references.

Conditional match

The second building block you need starts with a conditional submatch:

(?(condition)yes-pattern|no-pattern)

That condition can be many things, most of which won’t appear in this Item. Although you see the | character, this isn’t an alternation. It’s like an alternation because the | separates distinct subpatterns, but unlike one because Perl only ever tries one of the subpatterns, and you get at most two of them.

The simplest condition is just an ordinal number, which is true only if that capture group matched. Here’s a pattern that has two capture groups:

use v5.10;

my $pattern = qr/
	(?:           # parens for grouping
		(B)     # $1
		|         # alternation
		(M)     # $2
	)     
	(?(1)uster|imi) # conditional match
	/x;

foreach ( qw(Mimi Buster Muster Bimi Roscoe) ) {
	say "$_ ", m/$pattern/p ? "matched ${^MATCH}" : 'nope!';
	}

In this pattern, if the (B) matches, the conditional uses its yes-pattern, uster. Otherwise, it uses the no-pattern, imi. The only thing that can match besides the (B) is the other branch of the alternation, the (M). The output shows that only Mimi or Buster matches:

Mimi matched Mimi
Buster matched Buster
Muster nope!
Bimi nope!
Roscoe nope!

You get the same results if you use (2) as the condition and re-arrange the order of the patterns:

my $pattern = qr/
	(?:(B)|(M))
	(?(2)imi|uster)
	/x;
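
The condition does not have to be a number. As a small sketch that is not part of the original examples, here is the same idea with a named capture as the condition, using the (?(<NAME>)yes-pattern) form that perlre documents. The label quote and the test strings are made up for illustration. The closing quote is required only when the opening quote matched, and there is no no-pattern at all:

```perl
use v5.10;

# Hypothetical example: a word that is either bare or fully double-quoted.
my $pattern = qr/
	\A
	(?<quote>")?        # optional opening quote, labeled for the condition
	\w+
	(?(<quote>)")       # closing quote only if the opening one matched
	\z
	/x;

foreach ( 'Buster', '"Buster"', '"Buster', 'Buster"' ) {
	say "$_ ", m/$pattern/ ? "matched" : 'nope!';
	}
```

Only the bare word and the fully quoted word should match. The two strings with a single stray quote fail because the anchors leave the extra character nowhere to go.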

Putting it together

The condition can also be the literal (DEFINE). In that case, Perl only allows a yes-pattern. And, as the name implies, it merely defines the patterns and does not execute them.

This means that you can create and label the subpatterns that you need, but not actually assert that any of them match the string. The definitions are just there. This pattern defines and labels three subpatterns then uses none of them:

use v5.10;

my $pattern = qr/
	(?(DEFINE)
		(?<cat>Buster)
		(?<dog>Addie)
		(?<bird>Poppy)
	)
	Mimi
	/x;

foreach ( 'Buster Mimi', 'Roscoe', 'Buster', 'Mimi' ) {
	say "$_ ", m/$pattern/ ? "matched" : 'nope!';
	}

It’s as if the DEFINE bit is not even there:

Buster Mimi matched
Roscoe nope!
Buster nope!
Mimi matched

Outside the (DEFINE), you can reference any of the subpatterns that you created:

use v5.10;

my $pattern = qr/
	(?(DEFINE)
		(?<cat>Buster)
		(?<dog>Addie)
		(?<bird>Poppy)
	)
	(?&cat)
	/x;

foreach ( 'Buster Mimi', 'Roscoe', 'Buster', 'Mimi' ) {
	say "$_ ", m/$pattern/ ? "matched" : 'nope!';
	}

Now Buster matches because you reference that defined subpattern:

Buster Mimi matched
Roscoe nope!
Buster matched
Mimi nope!
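
One wrinkle that perlre points out: what a (?&NAME) call matches is not available in the labeled group after the call returns. If you want the text, you have to wrap the call in a capture of its own. This is a minimal sketch of that; the outer label pet is a hypothetical name chosen for the example:

```perl
use v5.10;

my $pattern = qr/
	(?(DEFINE)
		(?<cat>Buster|Mimi)
	)
	(?<pet>(?&cat))     # wrap the call in a capture of its own
	/x;

if( 'Hello Mimi' =~ $pattern ) {
	say "pet: $+{pet}";
	say "cat: ", defined $+{cat} ? $+{cat} : '(not set)';
	}
```

The outer pet capture holds Mimi, while the cat label inside the (DEFINE) stays unset because that group never matched on its own.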

Now it’s time for the grammar. Inside the (DEFINE), you can reference subpatterns you haven’t defined yet, and your subpatterns can get arbitrarily complex:

use v5.10;

my $pattern = qr/
	(?(DEFINE)
		(?<male> Buster | Roscoe )
		(?<female> Mimi | Juliet )
		(?<cat> (?&male) | (?&female) )
		(?<dog>Addie)
		(?<bird>Poppy)
	)
	(?&cat)
	/x;

foreach ( 'Addie', 'Roscoe', 'Buster', 'Mimi' ) {
	say "$_ ", m/$pattern/ ? "matched" : 'nope!';
	}

Even though the cat names are in two different subpatterns, the cat subpattern unifies them so all the cat names match:

Addie nope!
Roscoe matched
Buster matched
Mimi matched
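
A subpattern can also reference itself, which is what makes nested structures possible. As a rough sketch that is not from the original Item, this pattern matches a parenthesized group that may contain other parenthesized groups. The label group and the test strings are invented for the example:

```perl
use v5.10;

my $pattern = qr/
	(?(DEFINE)
		(?<group>
			\(                           # opening paren
			(?: [^()]++ | (?&group) )*   # plain text or a nested group
			\)                           # closing paren
		)
	)
	\A (?&group) \z
	/x;

foreach ( '(Buster)', '(Buster (Mimi))', '(Buster (Mimi)', 'Buster' ) {
	say "$_ ", m/$pattern/ ? "matched" : 'nope!';
	}
```

The first two strings have balanced parentheses and should match; the last two should not. The possessive ++ quantifier, also new in Perl 5.10, keeps the plain-text branch from backtracking when a match fails.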

You should now be able to understand this regular expression from Tom Christiansen (it appeared on Stack Overflow). You might have to pick it apart, but you know how all the parts fit together to match an address from the Internet Message Format defined in RFC 5322:

$rfc5322 = qr{

   (?(DEFINE)

     (?<address>         (?&mailbox) | (?&group))
     (?<mailbox>         (?&name_addr) | (?&addr_spec))
     (?<name_addr>       (?&display_name)? (?&angle_addr))
     (?<angle_addr>      (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
     (?<group>           (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ; (?&CFWS)?)
     (?<display_name>    (?&phrase))
     (?<mailbox_list>    (?&mailbox) (?: , (?&mailbox))*)

     (?<addr_spec>       (?&local_part) \@ (?&domain))
     (?<local_part>      (?&dot_atom) | (?&quoted_string))
     (?<domain>          (?&dot_atom) | (?&domain_literal))
     (?<domain_literal>  (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
                                   \] (?&CFWS)?)
     (?<dcontent>        (?&dtext) | (?&quoted_pair))
     (?<dtext>           (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])

     (?<atext>           (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+-/=?^_`{|}~])
     (?<atom>            (?&CFWS)? (?&atext)+ (?&CFWS)?)
     (?<dot_atom>        (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
     (?<dot_atom_text>   (?&atext)+ (?: \. (?&atext)+)*)

     (?<text>            [\x01-\x09\x0b\x0c\x0e-\x7f])
     (?<quoted_pair>     \\ (?&text))

     (?<qtext>           (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
     (?<qcontent>        (?&qtext) | (?&quoted_pair))
     (?<quoted_string>   (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
                          (?&FWS)? (?&DQUOTE) (?&CFWS)?)

     (?<word>            (?&atom) | (?&quoted_string))
     (?<phrase>          (?&word)+)

     # Folding white space
     (?<FWS>             (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
     (?<ctext>           (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
     (?<ccontent>        (?&ctext) | (?&quoted_pair) | (?&comment))
     (?<comment>         \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
     (?<CFWS>            (?: (?&FWS)? (?&comment))*
                         (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))

     # No whitespace control
     (?<NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])

     (?<ALPHA>           [A-Za-z])
     (?<DIGIT>           [0-9])
     (?<CRLF>            \x0d \x0a)
     (?<DQUOTE>          ")
     (?<WSP>             [\x20\x09])
   )

   (?&address)

}x;

If that’s not clever enough for you, try Tom’s use of (DEFINE) to properly parse HTML.

Things to remember

  • You can reference a named subpattern with (?&NAME)
  • You can choose a subpattern with a condition (?(condition)yes-pattern|no-pattern)
  • You can define and label subpatterns for later use with (DEFINE)

One thought on “Define grammars in regular expressions”

  1. In the above Perl regexp for an email address you have

    (?<atext>           (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+-/=?^_`{|}~])

    I suspect that you want instead

    (?<atext>           (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+\-/=?^_`{|}~])

    I suspect that “+-/” expands to “+,-./” which is probably not what you want.
