Perl v5.30 lets you match more with the general quantifier

Does the {N,} really match infinite repetitions in a Perl regular expression? No, it never has. You’ve been limited to 32,766 repetitions. Perl v5.30 is about to double that for you. And, if you are one of the people who needed more, I’d like to hear your story.

That 32,766 is one less than the maximum 15-bit number, 32,767 (16 bits with one reserved for the sign):

$ perl -e 'print 0b0111_1111_1111_1111'
32767

Reading the perldelta for v5.29.4 (the development track leading to v5.30), I saw that you’re about to get double that.

Do you need that sign bit? What if you put a negative number in there? As a literal, perl warns you:

$ perl -e '/.{0,1}/'
$ perl -e '/.{0,-1}/'
Unescaped left brace in regex is deprecated here (and will be fatal in Perl 5.30), passed through in regex; marked by <-- HERE in m/.{ <-- HERE 0,-1}/ at -e line 1.

Read that warning! Since this isn't a valid quantifier, the regex engine thinks that the { is a literal left brace. That's allowed now but has been deprecated. Perl v5.26 tried to make it fatal, but we found out that autoconf was using it. This time for sure!

Curiously, this passes a syntax check:

$ perl -c -e '/.{0,-1}/'
Unescaped left brace in regex is deprecated here (and will be fatal in Perl 5.30), passed through in regex; marked by <-- HERE in m/.{ <-- HERE 0,-1}/ at -e line 1.
-e syntax OK

What if it's in a variable? With a positive value it's no problem. With a negative value you get the warning again:

$ perl -e '$n = 1; /.{0,$n}/'
$ perl -e '$n = -1; /.{0,$n}/'
Unescaped left brace in regex is deprecated here (and will be fatal in Perl 5.30), passed through in regex; marked by <-- HERE in m/.{ <-- HERE 0,-1}/ at -e line 1.

Try it without a value specified in the literal code. The -- is there to end perl's switch processing so perl doesn't think that -1 is a command-line option:

$ perl -e '$n = shift; /.{0,$n}/' 1
$ perl -e '$n = shift; /.{0,$n}/' -- -1
Unescaped left brace in regex is deprecated here (and will be fatal in Perl 5.30), passed through in regex; marked by <-- HERE in m/.{ <-- HERE 0,-1}/ at -e line 1.
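
If the count comes from outside the program, I'd rather check it before it ever reaches the pattern. Here's a minimal sketch of that idea. The helper name bounded_any is mine, not anything from perl, and it assumes you want to reject anything that isn't a plain non-negative integer:

```perl
use strict;
use warnings;

# Hypothetical helper: build a pattern for 0..$n of any character,
# or return undef if $n can't be used as a quantifier count.
sub bounded_any {
    my ($n) = @_;

    # Only plain digits get through, so "-1" never reaches the regex
    # and the brace can't silently turn into a literal.
    return unless defined $n && $n =~ /\A[0-9]+\z/;

    # eval traps the fatal error for counts past this perl's limit;
    # in that case $re is undef.
    my $re = eval { qr/.{0,$n}/ };
    return $re;
}

for my $n (qw(1 -1 abc)) {
    my $re = bounded_any($n);
    printf "%-3s => %s\n", $n, defined $re ? 'usable quantifier' : 'rejected';
}
```

The digits-only check is deliberately strict. A count that's too big for your perl comes back as undef too, so the caller only has one failure case to handle.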

Trying to match too much

Things are a bit different if you go past the upper bound. Instead of a deprecation warning, you get a fatal error when the pattern compiles at runtime:

$ perl -e '$n = shift; /.{0,$n}/' 32766
$ perl -e '$n = shift; /.{0,$n}/' 32767
Quantifier in {,} bigger than 32766 in regex; marked by <-- HERE in m/.{ <-- HERE 0,32767}/ at -e line 1.

It's important to note here that I'm running this on macOS 10.14 using Apple's compiler tools. Your system or compiler might choose different values.
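
To find out what your own perl allows, you can probe for it instead of trusting my numbers. This is a minimal sketch: the names compiles and max_quantifier are mine, the 2**40 ceiling is arbitrary, and it assumes that once one count is rejected every larger count is rejected too:

```perl
use strict;
use warnings;

# Does this perl accept $n as an explicit upper bound?
sub compiles {
    my ($n) = @_;
    return defined eval { qr/.{0,$n}/ };
}

# Hypothetical helper: the largest accepted upper bound, found by
# doubling until a compile fails and then binary searching.
sub max_quantifier {
    my $cap = 2**40;    # arbitrary ceiling so the loop always ends
    my $hi  = 1;
    $hi *= 2 while $hi < $cap && compiles($hi);
    return $hi if compiles($hi);    # nothing failed below the ceiling

    my $lo = $hi / 2;               # last value known to compile
    while ( $hi - $lo > 1 ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        if   ( compiles($mid) ) { $lo = $mid }
        else                    { $hi = $mid }
    }
    return $lo;
}

print "Largest upper bound this perl accepts: ", max_quantifier(), "\n";
```

Run it under each perl you have installed and compare the answers.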

The v5.29.5 docs say:

The maximum number of times a pattern can match has been doubled to 65535

This means if you specify qr/a+/ that there can be anywhere from 1 through 65535 "a"'s in a row, instead of 32267 as previously.

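You can force perl to tell you what the limit is by trying ever-larger counts until one fails. This one-liner raises each number from 2 to 42 to its own power (4, 27, 256, 3125, 46656, ...) and uses the result as a count. The limit it reports is much smaller than the number of repetitions I found that + can handle: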

$ perl -e '$_ **= $_ , / {$_} / for 2 .. 42;'
Quantifier in {,} bigger than 32766 in regex; marked by <-- HERE in m/ { <-- HERE 46656} / at -e line 1.

Let's look at that next.

The + quantifier

The + can be represented as the general quantifier {1,}. That's how we present that in Learning Perl. Here's an extract of perlre:

The "*" quantifier is equivalent to {0,} , the "+" quantifier to {1,} , and the "?" quantifier to {0,1} . n and m are limited to non-negative integral values less than a preset limit defined when perl is built. This is usually 32766 on the most common platforms.

Does that have the same limit? Are they actually the same?

Let's try that in a program. First, I'll make a small string that only has the letter "a" then try to match that it only has the letter "a". The \A matches only at the absolute beginning of the string and the \z matches only at the absolute end of the string (the \Z allows a trailing newline):

use v5.28;

my $length = $ARGV[0] // 10;
my $string = 'a' x $length;
say $string =~ /\A a+ \z/x ? 'Matched' : 'Missed';

This regex should match anything that only has "a" between the absolute beginning and end of the string. The quantifier must match everything up to the end of the string. I run this and it matches, as I expected:

$ perl5.28.0 long_a.pl
Matched

The dangerous part here is that I've confirmed that my regex works. Now that I'm convinced that my program works, I'll take that length from the command line. The + can match much more than 32,766 or 65,534:

$ perl long_a.pl 10
Matched
$ perl long_a.pl 32766
Matched
$ perl long_a.pl 32767
Matched
$ perl5.28.0 long_a.pl 2147483647
String length is 2147483647
Matched
$ perl5.28.0 long_a.pl 2147483648
String length is 2147483648
Missed

Curious! It's not equivalent to {1,}. I match much, much more that way. Again, if this is an important feature for you, I want to hear about it. Who's matching that many repetitions?
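
So far I've compared + to a quantifier with an explicit upper bound. To check the open-ended {1,} spelling directly, here's a minimal sketch that runs all three against the same string. The helper name try_pattern is mine, and the length of 100,000 is just a number past both limits mentioned in this post. Run it on your own perl to see what each one does:

```perl
use strict;
use warnings;

# Hypothetical helper: anchor one pattern (given as a string) and try it
# against $string. eval traps patterns this perl refuses to compile.
sub try_pattern {
    my ( $pattern, $string ) = @_;
    my $re = eval { qr/\A $pattern \z/x };
    return 'Did not compile' unless defined $re;
    return $string =~ $re ? 'Matched' : 'Missed';
}

my $length = 100_000;    # past both 32,766 and 65,534
my $string = 'a' x $length;

for my $pattern ( 'a+', 'a{1,}', "a{1,$length}" ) {
    printf "%-12s %s\n", $pattern, try_pattern( $pattern, $string );
}
```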

Does that 2147483647 number look special? It's the largest signed 32-bit integer, which is half of the 32-bit maximum, rounded down:

$ perl -le 'print 0xFFFFFFFF / 2'
2147483647.5

Trying it with v5.29.5

Now try the same thing with the latest development perl. With v5.29.5, the previous command lines can handle up to 65,534:

$ perl5.29.5 -e '$n = shift; /.{0,$n}/' 32767
$ perl5.29.5 -e '$n = shift; /.{0,$n}/' 65355
$ perl5.29.5 -e '$n = shift; /.{0,$n}/' 65356
$ perl5.29.5 -e '$n = shift; /.{0,$n}/' 65535
Quantifier in {,} bigger than 65534 in regex; marked by <-- HERE in m/.{ <-- HERE 0,65535}/ at -e line 1.

What about the +? The behavior stays the same:

$ perl5.29.5 long_a.pl 10
Matched
$ perl5.29.5 long_a.pl 32766
Matched
$ perl5.29.5 long_a.pl 2147483647
Matched
$ perl5.29.5 long_a.pl 2147483648
Missed

Things to Remember

  • The general quantifier {,} has a maximum number of repetitions, even though that is probably more than you ever need.
  • The + should be equivalent, but can match much more.
  • v5.30 will increase the limit for {,}. If you've been working around that limitation, I want to hear about what you are doing.