Perl v5.30 lets you match more with the general quantifier

Does the {N,} really match infinite repetitions in a Perl regular expression? No, it never has. You’ve been limited to 32,766 repetitions. Perl v5.30 is about to double that for you. And, if you are one of the people who needed more, I’d like to hear your story.

That 32,766 is one less than 32,767, the maximum 15-bit number (16 bits with one reserved for sign):

$ perl -e 'print 0b0111_1111_1111_1111'
32767
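The arithmetic behind these numbers is easy to check. This is only a sketch with names of my own (max_for_bits and usable_limit are not anything from the perl source), and it assumes the top value of each range is reserved to stand for "no upper bound", which would explain why the usable limit is one less than the maximum:

```perl
use strict;
use warnings;

# Largest value that fits in the given number of bits.
sub max_for_bits {
    my ($bits) = @_;
    return 2**$bits - 1;
}

# Assumption for illustration: the top value is reserved to mean
# "no upper bound", so the largest usable count is one less.
sub usable_limit {
    my ($bits) = @_;
    return max_for_bits($bits) - 1;
}

printf "%2d bits: max %5d, usable %5d\n",
    $_, max_for_bits($_), usable_limit($_)
    for 15, 16;
```

The 15-bit row gives the old numbers and the 16-bit row gives the doubled ones.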

Reading the perldelta for v5.29.4 (the development track leading to v5.30), I saw that you’re about to get double that.

Do you need that sign bit? What if you put a negative number in there? As a literal, perl warns you:

$ perl -e '/.{0,1}/'
$ perl -e '/.{0,-1}/'
Unescaped left brace in regex is deprecated here (and will be fatal in Perl 5.30), passed through in regex; marked by <-- HERE in m/.{ <-- HERE 0,-1}/ at -e line 1.

Read that warning! Since this isn't a valid quantifier, the regex engine thinks that the { is a literal left brace. That's allowed now but has been deprecated. Perl v5.26 tried to make it fatal, but we found out that _autoconf_ was using it. This time for sure!
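If a literal left brace is what you actually want, escaping it says so explicitly and avoids relying on the deprecated fallback. Here's a minimal sketch; the helper name has_literal_quantifier_text is made up for this example:

```perl
use strict;
use warnings;

# Escaping the left brace makes it an explicit literal: any one
# character followed by the text "{0,-1}".
my $literal_brace = qr/.\{0,-1\}/;

# Hypothetical helper: true if the string contains that literal text
# after at least one character.
sub has_literal_quantifier_text {
    my ($string) = @_;
    return $string =~ $literal_brace ? 1 : 0;
}

for my $string ('x{0,-1}', 'x', 'xx') {
    printf "%-8s %s\n", $string,
        has_literal_quantifier_text($string) ? 'Matched' : 'Missed';
}
```

Only the first string has the brace text in it, so it's the only one that can match.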

Curiously, this passes a syntax check:

$ perl -c -e '/.{0,-1}/'
Unescaped left brace in regex is deprecated here (and will be fatal in Perl 5.30), passed through in regex; marked by <-- HERE in m/.{ <-- HERE 0,-1}/ at -e line 1.
-e syntax OK

What if it's in a variable? With a positive value it's no problem. With a negative value you get the warning again:

$ perl -e '$n = 1; /.{0,$n}/'
$ perl -e '$n = -1; /.{0,$n}/'
Unescaped left brace in regex is deprecated here (and will be fatal in Perl 5.30), passed through in regex; marked by <-- HERE in m/.{ <-- HERE 0,-1}/ at -e line 1.

Try it without a value specified in the literal code. The -- is there to end the argument list so the shell doesn't think that -1 is a command-line option:

$ perl -e '$n = shift; /.{0,$n}/' 1
$ perl -e '$n = shift; /.{0,$n}/' -- -1
Unescaped left brace in regex is deprecated here (and will be fatal in Perl 5.30), passed through in regex; marked by <-- HERE in m/.{ <-- HERE 0,-1}/ at -e line 1.
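Since a bad value silently turns the quantifier into literal text, it may be worth checking the number before it goes into the pattern. This is one possible sketch, not a recommendation from the perl docs; bounded_any is a hypothetical helper name, and it checks only that the value is a plain non-negative integer, not that it stays under the upper limit:

```perl
use strict;
use warnings;

# Hypothetical helper: build /.{0,$n}/ only when $n is a plain
# non-negative integer; otherwise return undef (in scalar context).
# It does not guard against values above perl's upper limit.
sub bounded_any {
    my ($n) = @_;
    return unless defined $n && $n =~ /\A[0-9]+\z/;
    return qr/.{0,$n}/;
}

for my $candidate (1, -1, '1.5', 'abc') {
    my $regex = bounded_any($candidate);
    printf "%-4s %s\n", $candidate,
        defined $regex ? "ok: $regex" : 'rejected';
}
```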

Trying to match too much

The situation is a bit different if you go past the upper bound. Instead of a warning, you get a runtime error:

$ perl -e '$n = shift; /.{0,$n}/' 32766
$ perl -e '$n = shift; /.{0,$n}/' 32767
Quantifier in {,} bigger than 32766 in regex; marked by <-- HERE in m/.{ <-- HERE 0,32767}/ at -e line 1.

It's important to note here that I'm running this on macOS 10.14 using Apple's compiler tools. Your system or compiler might choose different values.
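Because the value can differ, you can ask your own perl where the line is. This sketch assumes the error can be trapped with eval and that every count up to the limit compiles while every count past it fails; quantifier_ok and find_limit are names I made up, and the search stops at 2**31 - 1 even if a particular perl would accept more:

```perl
use strict;
use warnings;

# True if perl will compile /.{0,$n}/ for this $n.
sub quantifier_ok {
    my ($n) = @_;
    local $@;
    return defined eval { qr/.{0,$n}/ };
}

# Binary search for the largest accepted upper bound.
# Assumes acceptance is monotonic and that 1 is always accepted.
sub find_limit {
    my ($lo, $hi) = (1, 2**31 - 1);
    while ($lo < $hi) {
        my $mid = $lo + int(($hi - $lo + 1) / 2);
        if (quantifier_ok($mid)) { $lo = $mid }
        else                     { $hi = $mid - 1 }
    }
    return $lo;
}

print "Largest upper bound this perl accepts: ", find_limit(), "\n";
```

The pattern is only compiled, never matched against anything, so the probe is cheap even for large counts.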

The v5.29.5 docs say:

The maximum number of times a pattern can match has been doubled to 65535

This means if you specify qr/a+/ that there can be anywhere from 1 through 65535 "a"'s in a row, instead of 32267 as previously.

You can force perl to tell you what the limit is by feeding it ever larger numbers; here each number in the range is raised to its own power until one is too big. The limit it reports is much smaller than the one I discovered for the + quantifier:

$ perl -e '$_ **= $_ , / {$_} / for 2 .. 42;'
Quantifier in {,} bigger than 32766 in regex; marked by <-- HERE in m/ { <-- HERE 46656} / at -e line 1.

Let's look at that next.

The + quantifier

The + can be represented as the general quantifier {1,}. That's how we present that in Learning Perl. Here's an extract of perlre:

The "*" quantifier is equivalent to {0,} , the "+" quantifier to {1,} , and the "?" quantifier to {0,1} . n and m are limited to non-negative integral values less than a preset limit defined when perl is built. This is usually 32766 on the most common platforms.

Does that have the same limit? Are they actually the same?

Let's try that in a program. First, I'll make a small string that has only the letter "a", then check that the pattern matches a string made of nothing but "a". The \A matches only at the absolute beginning of the string and the \z matches only at the absolute end of the string (the \Z allows a trailing newline):

use v5.28;

my $length = $ARGV[0] // 10;
my $string = 'a' x $length;
say $string =~ /\A a+ \z/x ? 'Matched' : 'Missed';

This regex should match anything that only has "a" between the absolute beginning and end of the string. The quantifier must match everything up to the end of the string. I run this and it matches, as I expected:

$ perl5.28.0 long_a.pl
Matched

The dangerous part here is that I've confirmed that my regex works. Now that I've convinced myself that my program works, I'll take that length from the command line. The + can match much more than the 32,766 or 65,535:

$ perl long_a.pl 10
Matched
$ perl long_a.pl 32766
Matched
$ perl long_a.pl 32767
Matched
$ perl5.28.0 long_a.pl 2147483647
String length is 2147483647
Matched
$ perl5.28.0 long_a.pl 2147483648
String length is 2147483648
Missed

Curious! It's not equivalent to {1,}. I match much, much more that way. Again, if this is an important feature for you, I want to hear about it. Who's matching that many repetitions?
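The runs above compare + against an explicit upper bound. To compare it against the open-ended {1,} spelling that perlre mentions, you can run both on the same long string. This is a sketch with a made-up helper name, compare_quantifiers, and a length chosen only because it is past both limits while still being cheap to build:

```perl
use strict;
use warnings;

# Hypothetical helper: report whether each quantifier spelling matches
# a string of $length "a" characters from start to end.
sub compare_quantifiers {
    my ($length) = @_;
    my $string = 'a' x $length;
    return (
        plus       => ($string =~ /\A a+    \z/x ? 'Matched' : 'Missed'),
        open_ended => ($string =~ /\A a{1,} \z/x ? 'Matched' : 'Missed'),
    );
}

# 100_000 is past both 32,766 and 65,534.
my %result = compare_quantifiers(100_000);
print "$_: $result{$_}\n" for sort keys %result;
```

If the two spellings really are the same thing to the regex engine, both lines report the same result.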

Does that 2147483647 number look special? It's the largest signed 32-bit integer, which is half of the unsigned 32-bit maximum, rounded down:

$ perl -le 'print 0xFFFFFFFF / 2'
2147483647.5

Trying it with v5.29.5

Now try the same thing with the latest development perl. With v5.29.5, the previous command lines can handle up to 65,534:

$ perl5.29.5 -e '$n = shift; /.{0,$n}/' 32767
$ perl5.29.5 -e '$n = shift; /.{0,$n}/' 65355
$ perl5.29.5 -e '$n = shift; /.{0,$n}/' 65356
$ perl5.29.5 -e '$n = shift; /.{0,$n}/' 65535
Quantifier in {,} bigger than 65534 in regex; marked by <-- HERE in m/.{ <-- HERE 0,65535}/ at -e line 1.

What about the +? The behavior stays the same:

$ perl5.29.5 long_a.pl 10
Matched
$ perl5.29.5 long_a.pl 32766
Matched
$ perl5.29.5 long_a.pl 2147483647
Matched
$ perl5.29.5 long_a.pl 2147483648
Missed

Things to Remember

  • The general quantifier {,} has a maximum number of repetitions, even though that is probably more than you ever need.
  • The + should be equivalent, but can match much more.
  • v5.30 will increase the limit for {,}. If you've been working around that limitation, I want to hear about what you are doing.

2 Comments.

  1. I was the one who made the change and wrote the perldelta. It turns out that the change is valid, I believe, but the perldelta is wrong due to my incomplete understanding of how things work. It should have read something like

    “The upper limit ‘n’ specifiable in a regular expression quantifier of the form ‘{m,n}’ has been doubled to 65534.

    “The meaning of an unbounded upper quantifier ‘{m,}’ remains unchanged. It matches at least 2**31 – 1 times.”

    Some platforms (maybe not current ones; I don’t know, I don’t keep track) make every number 64 bits wide. It’s on those that the unbounded limit could be larger than 2**31. But Perl takes special care to limit ‘n’ to 65534 max, even on such platforms.

    The bottom line is that the upper limit you can specify is much lower than the infinity that perl uses internally. That could be changed fairly easily, but no one has, to my knowledge, ever complained that it’s too low, and we just doubled it anyway.

    You can say ‘use re qw(Debug ALL)’ before a regular expression you are curious about to see in more detail what is happening. If you do so around a quantifier “{1,}”, you’ll see that what gets generated is identical to what gets generated if you had instead said “+”.

    And, BTW, the reason

    $ perl -c -e ‘/.{0,-1}/’

    passes a syntax check is that it is legal; it just doesn’t match what probably was intended. Again if you say ‘use re qw(Debug ALL)’, you can see what’s going on.

    Final program:
    1: REG_ANY (2)
    2: EXACT (5)
    5: END (0)

    REG_ANY matches any single character except newline. That’s the dot in the input. Then the exact string ‘{0,-1}’ must be matched. The warning message no longer says (in 5.29) that this will be illegal in 5.30. It will remain legal, and the warning message will remain. Pay attention to it. It’s telling you that the left brace is to be matched literally, which also implies that what follows isn’t going to be a quantifier. What will be illegal in 5.30 are just the constructs that we intend to change the meaning of. This is to limit the potential breakage of existing code, while still warning that what they might have thought they were getting is wrong. It allows _autoconf_ to not change, for example. (As a heads-up, after 5.30, we can relax the syntax to allow some spaces and to be able to say “{,n}”. Currently the lower limit must be specified. It will also allow us to extend various escape sequences, so that \w{foo} could mean the “foo” specialization of \w, for whatever specializations we come up with in the future.)

  2. The formatter ate what I typed in for what the EXACT matches. It is “{0,-1}”
