Turn capture groups into cluster groups

Perl v5.22 adds the /n regex flag, which turns every parentheses group in its scope into a non-capturing group. This can be handy when you want to capture almost nothing but still need to cluster many parts of the pattern. You get that with less typing.

(The language for parentheses groups changed in Perl 5.004 to “capturing” and “clustering”. The captures set the match variables. The clusters merely group parts of the pattern. Sometimes the docs said “buffers” and sometimes “groups”.)

Perl already has non-capturing parentheses with the (?:PATTERN) syntax available in all versions of Perl 5. If you have several of those clusters but no captures, you might be annoyed while typing all those ?: characters:

/(?:abc)+(?:def){2}(?:xyz)+/

You could leave off the ?: and let those capture, but that’s a bit slower since Perl has to move data around and assign values.

/(abc)+(def){2}(xyz)+/  # slower

With the /n, these are equivalent:

/(?:abc)+(?:def){2}(?:xyz)+/
/(abc)+(def){2}(xyz)+/n
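
If you want to convince yourself that nothing captures in either form, here's a minimal sketch. The test string and the %groups hash are my own inventions for illustration; $#+ is the number of capture groups in the last successful match. The explicit clusters and the /n version should both report zero groups, while the bare parentheses report three:

```perl
use v5.22;

my $string = 'abcdefdefxyz';   # hypothetical test string

my %groups;   # number of capture groups each form ends up with

if( $string =~ /(?:abc)+(?:def){2}(?:xyz)+/ ) {
	$groups{explicit} = $#+;   # explicit (?:...) clusters
	}

if( $string =~ /(abc)+(def){2}(xyz)+/n ) {
	$groups{n_flag} = $#+;     # bare parentheses under /n
	}

if( $string =~ /(abc)+(def){2}(xyz)+/ ) {
	$groups{capturing} = $#+;  # bare parentheses without /n
	}

say "$_: $groups{$_}" for sort keys %groups;
```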

You can apply /n to parts of a pattern just as you can with other pattern flags (see Know the difference between regex and match operator flags). The (?flags:SUBPATTERN) syntax applies the flags to just that subpattern:

use v5.22;

my $string = "abc1234defdefxyzABCDEf";

$string =~ /
	abc 
	(\w{3,5}) # capture this in $1 

	(?n:    # only clusters in here
		(abc)+|(def){2}|(hij)+
	)

	xyz
	(\w{3,5}) # capture this in $2 
	/x;

say "\$1: $1\n\$2: $2";

There’s a capture before the middle and another capture after. The bare parentheses in the middle (the ones in the alternation) don’t count for the numbering:

$1: 1234
$2: ABCDE

You can interpolate a pre-compiled pattern into the larger pattern and it still works (but recall that precompiled patterns retain their settings and don’t change the larger pattern’s settings):

use v5.22;

my $string = "abc1234defdefxyzABCDEf";

my $subpattern = qr/(abc)+|(def){2}|(hij)+/n;

$string =~ /
	abc 
	(\w{3,5}) # capture this in $1 

	$subpattern

	xyz
	(\w{3,5}) # capture this in $2 
	/x;

say "\$1: $1\n\$2: $2";

There’s a slight problem, though. You can’t count on that subpattern not capturing anything. The /n turns off the capturing parentheses but doesn’t turn off named captures (Item 31: Use named captures to label matches):

use v5.22;

my $string = "abc1234defdefxyzABCDEf";

my $subpattern = qr/(abc)+|(?<middle>def){2}|(hij)+/n;

$string =~ /
	abc 
	(\w{3,5}) # capture this in $1 

	$subpattern

	xyz
	(\w{3,5}) # capture this in $2 
	/x;

say <<"HERE";
\$1: $1
\$2: $2
\$3: $3
HERE

You've labelled one of the captures as middle, but it also sets a numbered capture variable since the named capture hashes are really tied hashes into the numbered capture variables. You get an extra capture here:

$1: 1234
$2: def
$3: ABCDE
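
You can see the link between the name and the number directly. This is a small sketch using the same string and subpattern; the %seen hash is just my own bookkeeping. The named capture and $2 should hold the same text:

```perl
use v5.22;

my $string = "abc1234defdefxyzABCDEf";

my $subpattern = qr/(abc)+|(?<middle>def){2}|(hij)+/n;

my %seen;   # hypothetical hash to collect what the match set
if( $string =~ /abc (\w{3,5}) $subpattern xyz (\w{3,5})/x ) {
	%seen = (
		named => $+{middle},   # the named capture survives /n
		two   => $2,           # and it takes a numbered slot too
		three => $3,           # so the last capture moves to $3
		);
	}

say "$_: $seen{$_}" for sort keys %seen;
```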

You see the same behavior with an interpolated pattern that isn't precompiled, although you have to set the n flag in the larger pattern:

use v5.22;

my $subpattern = "(abc)+|(?<middle>def){2}|(hij)+";
say "subpattern is $subpattern";

my $pattern = qr/
	abc 
	(\w{3,5}) # capture this in $1 

	(?n:  $subpattern  )

	xyz
	(\w{3,5}) # capture this in $2 
	/x;
say "pattern is $pattern";

This means you can't trust the positions of the match variables after an interpolated pattern, even if you had turned off capturing. If you aren't interpolating a pattern, that's not a problem because you can see every named capture yourself. That it's a problem in some cases is itself a problem; Perl has enough special cases already. This is by design so a named capture still works (a poor design, I think). So far, this isn't an experimental feature, but I think it should be; I foresee a lot of confusion over it.
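
One way to sidestep the numbering, sketched here under the assumption that you control the outer pattern, is to name your own captures too and read them from %+. The names first and last and the %got hash are mine, not anything the feature requires:

```perl
use v5.22;

my $string = "abc1234defdefxyzABCDEf";

my $subpattern = qr/(abc)+|(?<middle>def){2}|(hij)+/n;

my %got;   # hypothetical hash to hold the captures you care about
if( $string =~ /
	abc
	(?<first> \w{3,5} )   # read this from $+{first}

	$subpattern

	xyz
	(?<last> \w{3,5} )    # read this from $+{last}
	/x ) {
	$got{first} = $+{first};
	$got{last}  = $+{last};
	}

say "$_: $got{$_}" for sort keys %got;
```

The named lookups don't care how many numbered slots the interpolated pattern uses, although they can still collide if the subpattern reuses one of your names.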

There's another small issue with this feature. You can turn off the no-capture feature, but you need extra parens to do it.

use v5.22;

$_ = 'cat dog bird';

if( /(?-n:cat) (dog) (bird)/n ) {
	say "1: \$1 is <$1>";
	}

if( /(?-n:(cat)) (dog) (bird)/n ) {
	say "2: \$1 is <$1>";
	}

The output shows that it's the second one that works:

1: $1 is <>
2: $1 is <cat>

At first blush, you might think that the -n turns on capturing in the parentheses that contain it. Or you might not. But someone is going to think that and they might even ask about it on StackOverflow. However, the (?-n:...) is itself only a cluster; turning off no-capture applies to the parentheses that come after it, inside that cluster. It's a little odd, but it makes sense after you think about it a bit. Or maybe it doesn't.
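
Here's a small sketch of that scope, using a pattern I made up for the purpose. The two groups inside the (?-n:...) should capture, while the one after it stays a cluster because the outer /n still applies there:

```perl
use v5.22;

$_ = 'cat dog bird';

my %result;   # hypothetical hash to record what the match set
if( /(?-n:(cat) (dog)) (bird)/n ) {
	%result = (
		groups => $#+,              # capture groups in the last match
		one    => $1,
		two    => $2,
		three  => $3 // '(undef)',  # (bird) is outside the (?-n:...)
		);
	}

say "$_: $result{$_}" for sort keys %result;
```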

Make a more complicated situation to push the boundaries of ridiculousness. You can build up a pattern with several levels of interpolation that alternately turn on and turn off capturing:

use v5.22;

my $dog    = qr/(dog)/;
my $kitty  = qr/(cat)/n;
my $cat    = qr/(?-n:($kitty))/n;
my $middle = qr/a( $cat and a)? $dog/n;

$_ = "I have a cat and a dog";

if( /(?-n:I have ($middle))/n ) {
	say <<"HERE";
\$1: $1
\$2: $2
\$3: $3
HERE
	}

You may have thought that cat should not be captured, but it shows up in $2:

$1: a cat and a dog
$2: cat
$3: dog

It's not the pattern from $kitty that's causing that; its /n keeps (cat) from capturing. It's the one in $cat, which has capturing turned off with the /n but then turned back on with (?-n:...). Turning capturing back on doesn't capture anything by itself, though; it's the extra parens around $kitty inside $cat that form a new capture. It's not so easy to see what's causing what anymore, even if this is a silly example.
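
When I lose track like this, I find it helps to test the pieces on their own. This sketch matches the same $kitty and $cat patterns against a bare 'cat' and counts the capture groups with $#+; the %groups hash and the loop are my own scaffolding. It also prints the stringified patterns so you can see the flags each one carries:

```perl
use v5.22;

my $kitty = qr/(cat)/n;            # /n: the parentheses only cluster
my $cat   = qr/(?-n:($kitty))/n;   # (?-n:...) turns capturing back on inside

my %groups;   # hypothetical hash: capture groups per pattern
for my $pair ( [ kitty => $kitty ], [ cat => $cat ] ) {
	my( $name, $regex ) = @$pair;

	# $#+ is the number of capture groups in the last successful match
	$groups{$name} = $#+ if 'cat' =~ $regex;

	say "$name is $regex and has $groups{$name} capture group(s)";
	}
```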
