Use atomic matching for complex non-backtracking

You can sometimes improve the performance of your regular expression by preventing parts of it from backtracking when you know that might be useful. Item 38. Avoid unnecessary backtracking had many techniques for this, although it did not mention atomic matching (a feature added in v5.005).

An atomic match treats a subpattern as a single unit. That entire unit matches or it doesn’t and it doesn’t allow the regex engine to backtrack into it.

Consider a pattern that matches some text, some whitespace, and some more text, where I’m using the /x to spread out the pattern with insignificant whitespace (Item 37. Make regular expressions readable.):

use v5.10;
my $pattern = qr/ \A a \s+ b /x;

$_ = "a\t\t\t\tc";
say /$pattern/ ? 'Matched' : 'Missed';

Just looking at that, you can tell that the match will fail because there are no b characters in it. It takes awhile for the regex engine to figure out that.

  • The pattern starts at the beginning of the string and matches the a
  • After the a, the \s+ matches all of the tab characters.
  • Next the pattern need to match a b. It doesn’t match that though.

When the b part fails, the engine backs up to the previous quantifier. In this example, it matched four characters, to the engine reduces that to matching three characters then tries again. That fails because the tab character is not a b. The quantifier give up another character, but the penultimate tab is also not a b. This continues until the quantifier cannot give up anything more since it has to match at least one tab.

The rxrx program from Regexp::Debugger helps you visualize this, with the caveat that it modifies your regex so it can probe it, so take that into account when rxrx reports these steps. It reports that it takes 33 steps before the regex engine gives up.

One way to get around this is to add the non-backtracking modifier to the + quantifier, making it the ++:

use v5.10;
my $pattern = qr/ \A a \s++ b /x;

$_ = "a\t\t\t\tc";
say /$pattern/ ? 'Matched' : 'Missed';

This shaves off a few steps because the + quantifier doesn’t backtrack. It gives up everything it matched and starts again from the beginning of the pattern. The pattern moves over a character and tries the entire pattern again (a move that an anchor would solve, but that’s not the point here). This time it fails in 25 steps. Imagine all of these with much later patterns and many more tabs. The backtracking can be pathological.

Another way is the atomic match group, (?> ... ). This also prevents the regex engine from backtracking, although this time it fails in 30 steps:

But this is too simple for the atomic match because its grouping a single thing with a quantifier. It’s better when you want to group something that’s complex itself. Here’s a only-slightly less simple pattern that has two zero-or-more quantifiers, which fails very quickly:

use v5.10;
my $pattern = qr/ \A a (?> b* c* ) d /x;

$_ = "abbbbccccz";
say /$pattern/ ? 'Matched' : 'Missed';

The backtracking inside a sequence of zero-or-more constructs can quickly get out of hand. Imagine a target string with thousands of bs followed by thousands of cs, but with no d in the string:

use v5.10;
my $pattern = qr/ \A a b* c* d /x;

my $b = 'b' x 1_000;
my $c = 'c' x 1_000;

$_ = "a${b}${c}z";
say /$pattern/ ? 'Matched' : 'Missed';

The regex engine will backtrack repeatedly through those 2,000 characters.