Use branch reset grouping to number captures in alternations

Perl’s regular expressions have a simple rule for capture groups: Perl numbers them by counting left parentheses, in order, and assigns the capture variables accordingly. Not every capture group has to match part of the string, and Perl doesn’t care whether it does. Perl numbers the capture groups inside an alternation consecutively, even though only one branch of the alternation can match. Perl 5.10 adds the branch reset, (?|alternation), which mitigates that.

How many captures will a particular pattern produce? Can you tell just by looking at the pattern? How much does the particular string matter? How many capture groups are in this pattern:

(Buster)|(Mimi)|(Ella)

There are three capture groups. Only one of them is going to capture because each group is in a different branch of the alternation. What capture variables will that pattern set?

String              Triggered groups   $1       $2      $3
Buster              (Buster)           Buster   undef   undef
Mimi                (Mimi)             undef    Mimi    undef
Ella                (Ella)             undef    undef   Ella
Buster Mimi Ella    (Buster)           Buster   undef   undef

No matter which string you match against this pattern, a successful match sets all three capture variables, and two of them are undefined.
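Here is a minimal sketch you can run to check those rows yourself. The helper name plain_captures is made up for this example; it returns the values of $1, $2, and $3 after a successful match.

```perl
use strict;
use warnings;

# Hypothetical helper: match against the three-branch alternation and
# return copies of $1, $2, and $3 (an empty list if nothing matches).
sub plain_captures {
    my ($string) = @_;
    return unless $string =~ /(Buster)|(Mimi)|(Ella)/;
    my @captures = ($1, $2, $3);   # copy before another match clobbers them
    return @captures;
}

for my $string ('Buster', 'Mimi', 'Ella', 'Buster Mimi Ella') {
    # Show undefined captures as the word "undef" instead of warning
    my @shown = map { defined $_ ? $_ : 'undef' } plain_captures($string);
    print "$string: @shown\n";
}
```

Each line of output should have three entries, with exactly one of them defined.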

Perl 5.10 introduces the branch reset pattern, (?|alternation). Inside it, Perl numbers the capture buffers from the same starting point in each branch of the alternation. Instead of creating three capture buffers in your alternation, you create just one:

(?|(Buster)|(Mimi)|(Ella))

The three capture groups in this pattern populate the same buffer:

String              Triggered groups   $1
Buster              (Buster)           Buster
Mimi                (Mimi)             Mimi
Ella                (Ella)             Ella
Buster Mimi Ella    (Buster)           Buster
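The same check for the branch reset version might look like this sketch, which needs Perl 5.10 or later. The helper name reset_capture is an assumption for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: every branch writes to the same buffer, so $1 is
# the only capture variable you need to look at.
sub reset_capture {
    my ($string) = @_;
    return unless $string =~ /(?|(Buster)|(Mimi)|(Ella))/;
    return $1;
}

for my $string ('Buster', 'Mimi', 'Ella', 'Buster Mimi Ella') {
    print "$string: \$1 = ", reset_capture($string), "\n";
}
```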

This is more important when the alternation is in the middle of a larger pattern and there are additional capture groups after the alternation:

(?|(Buster)|(Mimi)|(Ella))(Ginger)

That’s a bit easier to read with extended patterns (Item 37: Make regular expressions readable):

(?|             # $1
	(Buster) |
	(Mimi)   |
	(Ella)
)
(               # $2
	Ginger
)

No matter how many branches you add to the alternation, the group for Ginger is always $2:

(?|             # $1
	(Buster) |
	(Mimi)   |
	(Ella)   |
	(Roscoe)
)
(               # $2
	Ginger
)
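As a quick sketch, you can compile that pattern with qr//x and confirm that Ginger lands in $2 whichever branch matches. The comments inside the pattern avoid writing $1 and $2 literally, because Perl interpolates variables in a pattern before it strips /x comments.

```perl
use strict;
use warnings;

my $pattern = qr/
    (?|             # capture 1
        (Buster) |
        (Mimi)   |
        (Ella)   |
        (Roscoe)
    )
    (               # capture 2
        Ginger
    )
/x;

for my $string (qw(BusterGinger RoscoeGinger)) {
    if ($string =~ $pattern) {
        print "$string: \$1 = $1, \$2 = $2\n";
    }
}
```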

That doesn’t mean that the numbering after the alternation is always the same, though. Not every branch must have the same number of captures, but the branch reset grouping always takes up as many buffers as the branch with the most capture groups, even if that’s not the branch that matches. Consider this pattern where one of the branches has two capture groups:

(?|
	(Buster)       |  # $1, $2 is undef
	(Mimi)(Roscoe) |  # $1, $2
	(Ella)            # $1, $2 is undef
)
(                     # $3
	Ginger
)

The $1 variable is always the first capture group of whichever branch matched:

String              Triggered groups   $1
BusterGinger        (Buster)           Buster
MimiRoscoeGinger    (Mimi)             Mimi
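A short sketch of the uneven pattern shows all three buffers at once. Ginger should stay in $3 even when the one-capture branch matches and leaves $2 undefined.

```perl
use strict;
use warnings;

my $pattern = qr/
    (?|
        (Buster)       |  # capture 1 only
        (Mimi)(Roscoe) |  # captures 1 and 2
        (Ella)            # capture 1 only
    )
    (                     # capture 3
        Ginger
    )
/x;

for my $string (qw(BusterGinger MimiRoscoeGinger)) {
    next unless $string =~ $pattern;
    # Show undefined captures as the word "undef" instead of warning
    my @shown = map { defined $_ ? $_ : 'undef' } ($1, $2, $3);
    print "$string: @shown\n";
}
```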

The branch reset can cause problems with named captures (Item 31: Use named captures to label matches), which are really just aliases to the numbered capture variables. Labeling each capture group doesn’t do what you might expect:

(?|
	(?<cat1>Buster)              |  # $1, $2 is undef
	(?<cat2>Mimi)(?<cat3>Roscoe) |  # $1, $2
	(?<cat4>Ella)                   # $1, $2 is undef
)
(?<cat5>
	Ginger
)

Each label is just an alias to its numbered capture variable:

Label   Aliased to
cat1    $1
cat2    $1
cat3    $2
cat4    $1
cat5    $3

The labels don’t apply to the groups you think they do:

String              $1       $2       cat1     cat2     cat3     cat4
BusterGinger        Buster   undef    Buster   Buster   undef    Buster
EllaGinger          Ella     undef    Ella     Ella     undef    Ella
MimiRoscoeGinger    Mimi     Roscoe   Mimi     Mimi     Roscoe   Mimi
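To see the aliasing for yourself, a sketch like this reads the labels back through the %+ hash. It should reproduce the label columns of the table above.

```perl
use strict;
use warnings;

my $pattern = qr/
    (?|
        (?<cat1>Buster)              |
        (?<cat2>Mimi)(?<cat3>Roscoe) |
        (?<cat4>Ella)
    )
    (?<cat5>
        Ginger
    )
/x;

for my $string (qw(BusterGinger EllaGinger MimiRoscoeGinger)) {
    next unless $string =~ $pattern;
    # Look up each label in %+, showing missing ones as "undef"
    my @shown = map { defined $+{$_} ? $+{$_} : 'undef' } qw(cat1 cat2 cat3 cat4);
    print "$string: @shown\n";
}
```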

You should probably use the same labels in each branch and order them the same so you get the results that you expect:

(?|
	(?<cat1>Buster)              |  # $1, $2 is undef
	(?<cat1>Mimi)(?<cat2>Roscoe) |  # $1, $2
	(?<cat1>Ella)                   # $1, $2 is undef
)
(?<cat3>
	Ginger
)
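With consistent labels, the lookups behave the way you would guess. This sketch prints each label for two of the sample strings; cat2 should be defined only when the two-capture branch matches.

```perl
use strict;
use warnings;

my $pattern = qr/
    (?|
        (?<cat1>Buster)              |
        (?<cat1>Mimi)(?<cat2>Roscoe) |
        (?<cat1>Ella)
    )
    (?<cat3>
        Ginger
    )
/x;

for my $string (qw(BusterGinger MimiRoscoeGinger)) {
    next unless $string =~ $pattern;
    # Look up each label in %+, showing missing ones as "undef"
    my @shown = map { defined $+{$_} ? $+{$_} : 'undef' } qw(cat1 cat2 cat3);
    print "$string: @shown\n";
}
```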

Things to remember

  • Perl numbers capture groups by counting the literal order of left parentheses
  • Every capture group in an alternation creates a capture buffer
  • The branch reset grouping, (?|...), restarts the buffer numbering for each branch of the alternation
  • Label captures in alternations with the same labels in the same order

2 Comments.

  1. I was playing with branch reset today, and there’s one thing I can’t understand. Please take a look at this code:

    my $string = "Wilma and Fred went bowling today";
    
    if($string =~ /(?|(Wilma)|(Betsey)|(Helen)) and (Barney)|(Fred) went bowling/){
        print "\$1 is $1\n\$2 is $2\n\$3 is $3\n";
    }
    

    I expected the result would be

    $1 is Wilma
    $2 is Fred
    $3 is
    

    But instead it’s

    $1 is
    $2 is
    $3 is Fred
    

    Could you tell me why I didn’t meet my expectation? Why is $1 undef (although if the string was Wilma and Barney it would be OK)?
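A likely explanation for the result in that comment: the | between (Barney) and (Fred) is not inside any group, so it splits the whole pattern into two alternatives. The first is the branch reset followed by " and (Barney)", and the second is "(Fred) went bowling". Only the second alternative matches this string, so Fred ends up in $3 while $1 and $2 stay undefined. One possible fix, sketched here with the made-up variable names $original and $fixed, is to wrap the second alternation in its own branch reset:

```perl
use strict;
use warnings;

my $string = "Wilma and Fred went bowling today";

# The pattern from the comment: the bare | splits the entire pattern in two
my $original = qr/(?|(Wilma)|(Betsey)|(Helen)) and (Barney)|(Fred) went bowling/;

# Possible fix: confine the second alternation to its own branch reset
my $fixed = qr/(?|(Wilma)|(Betsey)|(Helen)) and (?|(Barney)|(Fred)) went bowling/;

if ($string =~ $original) {
    my @shown = map { defined $_ ? $_ : 'undef' } ($1, $2, $3);
    print "original: @shown\n";
}

if ($string =~ $fixed) {
    print "fixed: \$1 is $1, \$2 is $2\n";
}
```

With the second branch reset in place, both names share $2, so there is no $3 at all.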
