regular expressions – The Effective Perler

Insignificant whitespace in brace constructs

This is a chapter in Perl New Features, a book from Perl School that you can buy on LeanPub or Amazon. Your support helps me to produce more content.

Perl’s coterie of brace constructs becomes a bit more lenient in v5.34. These constructs appear in double-quotish contexts, such as \N{CHARNAME} to specify a character by name. Patterns also count as double-quotish (unless you use ' as the delimiter), so the new rules apply to regex brace constructs such as \k{} (for named backreferences) and the general quantifier, {n,m}.
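
Here's a minimal sketch of what that leniency looks like, assuming you run it on v5.34 or later (older perls refuse the `use v5.34` line). The variable names are just for illustration.

```perl
use v5.34;

# Blanks just inside the braces are ignored in double-quotish constructs.
my $tight = "\N{LATIN SMALL LETTER A}\x{41}";
my $loose = "\N{ LATIN SMALL LETTER A }\x{ 41 }";
say $tight eq $loose ? 'same string' : 'different strings';

# The general quantifier takes the same padding, including around the comma.
my $quantified = 'aaa' =~ /\Aa{ 2, 3 }\z/ ? 1 : 0;
say "quantifier matched: $quantified";
```

Before v5.34, the padded quantifier would have been taken as literal text rather than as a quantifier.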

“Up to N” matches

Perl’s general regex quantifier, {n,m}, takes a minimum and a maximum number of matches. If you leave out the maximum, as in {n,}, the preceding item must match at least n times and can then match as many more times as it is able to: the maximum is unbounded.
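
As a rough sketch, here are the open-ended maximum and the "up to N" form side by side. The `{,3}` spelling assumes v5.34 or later, and the two helper subs are hypothetical names made up for this example.

```perl
use v5.34;

# {2,} has no upper bound; {,3} (new in v5.34) has no explicit lower bound.
sub at_least_two { $_[0] =~ /\Aa{2,}\z/ ? 1 : 0 }
sub up_to_three  { $_[0] =~ /\Aa{,3}\z/ ? 1 : 0 }

for my $input ( '', 'a', 'aaa', 'aaaa' ) {
    printf "%-6s at_least_two=%d up_to_three=%d\n",
        "'$input'", at_least_two($input), up_to_three($input);
}
```

The empty string passes `up_to_three` because a missing minimum means zero.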

Match Unicode property values with a wildcard

Not all characters with a numerical value are “digits” in the Perl sense. You saw this in Match Unicode characters by property value, where you selected characters by a Unicode property value, and it showed up again in Perl v5.18 adds character class set operations. Perl v5.30 adds a slightly easier way to get at multiple numerical values at the same time. Now you can match Unicode property values with wildcards (Unicode TR 18), which are sorta like Perl patterns. Don’t get too excited, though, because these can be expensive.
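
To make the first point concrete, here's a small sketch that uses the plain (non-wildcard) property-value syntax, `\p{nv=5}`, so it doesn't depend on the experimental wildcard feature. The helper subs and the three sample characters are my own picks for illustration.

```perl
use strict;
use warnings;

# \d matches decimal digits; \p{nv=5} matches anything whose numeric value is 5.
sub is_digit       { $_[0] =~ /\A\d\z/        ? 1 : 0 }
sub has_value_five { $_[0] =~ /\A\p{nv=5}\z/  ? 1 : 0 }

my %candidates = (
    'DIGIT FIVE'              => "\x{35}",
    'ARABIC-INDIC DIGIT FIVE' => "\x{665}",
    'ROMAN NUMERAL FIVE'      => "\x{2164}",
);

for my $name ( sort keys %candidates ) {
    my $char = $candidates{$name};
    printf "%-24s digit=%d value_five=%d\n",
        $name, is_digit($char), has_value_five($char);
}
```

The Roman numeral has the numeric value 5 but is a letter-number, not a decimal digit, so only the property match catches it.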

Match Unicode character names with a pattern

Perl has some of the best Unicode support out there, and it keeps getting better. Perl v5.32 supports Unicode 13, and you can now apply patterns to character names. You probably don’t want to do that though.

First, the Unicode Character Database catalogs each character, giving it a code number, a name, and many other properties.
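
You can poke at that catalog from Perl with the core charnames module. This is only a sketch of the name and code number lookup, not the pattern-matching feature itself.

```perl
use strict;
use warnings;
use charnames ();

# Go from a code number to its name, and from the name back to the code number.
my $name = charnames::viacode(0x41);
my $code = charnames::vianame('LATIN CAPITAL LETTER A');

printf "U+%04X is %s\n", $code, $name;
```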

Use a variable-width lookbehind if it won’t match more than 255 characters

In Ignore part of a substitution’s match, I showed you the match-resetting \K, which is basically a variable-width positive lookbehind assertion. It’s a special feature to work around Perl’s lack of variable-width lookbehinds. However, v5.30 adds an experimental feature to allow a limited version of a variable-width lookbehind.
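
Here's a minimal sketch of both approaches on the same made-up input line, assuming v5.30 or later. The `experimental::vlb` warning is switched off because v5.30 flags the lookbehind as experimental.

```perl
use v5.30;
no warnings 'experimental::vlb';

my $line = 'price:   42';

# \K: match the variable-width prefix, then drop it from what gets replaced.
( my $replaced = $line ) =~ s/price:\s+\K\d+/99/;

# Variable-width lookbehind: this prefix is 7 to 11 characters long,
# well under the 255-character ceiling.
my $found = $line =~ /(?<=price:\s{1,5})\d+/ ? $& : undef;

say $replaced;
say $found // 'no match';
```

An unbounded quantifier such as `\s+` inside the lookbehind would not compile, since Perl can't prove it stays within 255 characters.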

Perl 5.30 fixes single quoted qr'' with \N{}

The qr// operator allows you to compile a regular expression without applying it to anything. You get the pattern without the match, and you can reuse the pattern as often as you like. Before v5.30, it had an inconsistency with \N{} sequences, but that’s fixed now.
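
A short sketch of the idea, assuming v5.30 or later: compile the same pattern once with the usual delimiters and once with single quotes, then reuse both. The sample string is made up.

```perl
use v5.30;

# Compile once, match later. The single-quoted form skips interpolation,
# and from v5.30 it accepts \N{...} too.
my $double = qr/\N{LATIN SMALL LETTER A}+/;
my $single = qr'\N{LATIN SMALL LETTER A}+';

my @runs;
for my $pattern ( $double, $single ) {
    push @runs, 'baaad' =~ /($pattern)/ ? $1 : undef;
}

say join ', ', map { $_ // 'no match' } @runs;
```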

Match only the same Unicode script

Earlier this year, this website was the target of some sort of attack in which a bot sent seemingly random data in its requests. The attack wasn’t that big of a deal since I easily blocked it with Cloudflare, but it was interesting. The apparently random data was actually a mix of Latin, Hangul, and Cyrillic. Domain hacks with unusual Unicode characters shows some of these exploits. Curiously, v5.28 added a regex feature that deals with this sort of nonsense.
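
That feature is the script run. As a rough sketch, assuming v5.28 or later, here's a check that a whole word stays within one script. The `single_script` helper and the lookalike string are hypothetical examples, and the warning is silenced because the feature started out as experimental.

```perl
use v5.28;
no warnings 'experimental::script_run';

# True only if the whole string is word characters from a single script.
sub single_script { $_[0] =~ /\A(*script_run:\w+)\z/ ? 1 : 0 }

my $latin = 'paypal';
my $mixed = "p\x{430}ypal";    # U+0430 is CYRILLIC SMALL LETTER A

say 'all Latin:      ', single_script($latin);
say 'Latin+Cyrillic: ', single_script($mixed);
```

Both strings pass a plain `\A\w+\z`, so it's the script run that tells them apart.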

Use atomic matching for complex non-backtracking

You can sometimes improve the performance of your regular expression by preventing parts of it from backtracking when you know the backtracking won’t help. Item 38, Avoid unnecessary backtracking, had many techniques for this, although it did not mention atomic matching (a feature added in v5.005).
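
The classic way to see what an atomic group does is a small example where backtracking is the only route to a match. This is a sketch of the behavior, not a benchmark.

```perl
use strict;
use warnings;

my $text = 'aaab';

# a+ grabs "aaa", then gives one "a" back so the literal "ab" can match.
my $backtracking = $text =~ /a+ab/ ? 1 : 0;

# (?>a+) refuses to give anything back, so the literal "ab" never fits.
my $atomic = $text =~ /(?>a+)ab/ ? 1 : 0;

print "backtracking=$backtracking atomic=$atomic\n";
```

Here the atomic group changes the result, which is why you only reach for it when you know the surrendered characters could never help.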

Use alpha assertions for more understandable regexes

[This feature stabilizes in Perl v5.32]

Perl v5.28 adds more readable, spelled-out alternate forms for some of its regular expression extended patterns. Then, to make those slightly less readable, there are very short initialisms for them. Although these might seem superfluous now, they give Perl the ability to define new syntax without relying on the limited number of ASCII symbols.
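
Here's a sketch that spells one positive lookahead three ways, assuming v5.28 or later. The `match_offset` helper is a hypothetical name, and the warning is disabled because the feature was experimental until v5.32.

```perl
use v5.28;
no warnings 'experimental::alpha_assertions';

# Return where the pattern matched in a fixed sample string, or -1.
sub match_offset {
    my ($pattern) = @_;
    return 'foobaz foobar' =~ $pattern ? $-[0] : -1;
}

my $symbol = match_offset(qr/foo(?=bar)/);
my $long   = match_offset(qr/foo(*positive_lookahead:bar)/);
my $short  = match_offset(qr/foo(*pla:bar)/);

say "symbol=$symbol long=$long short=$short";
```

All three find the second "foo", since only that one is followed by "bar".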

Perl v5.30 lets you match more with the general quantifier

Does the {N,} really match infinite repetitions in a Perl regular expression? No, it never has. You’ve been limited to 32,766 repetitions. Perl v5.30 is about to double that for you. And, if you are one of the people who needed more, I’d like to hear your story.
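
As a quick sketch, assuming v5.30 or later, you can spell out a maximum above the old limit and match a run longer than 32,766 characters. The 65,534 figure is the new ceiling for an explicit maximum as I understand the v5.30 release notes.

```perl
use v5.30;

# An explicit maximum this large would not compile before v5.30.
my $pattern = qr/\Aa{1,65534}\z/;

# 40,000 repetitions is more than the old 32,766 ceiling allowed.
my $matched = ( 'a' x 40_000 ) =~ $pattern ? 1 : 0;

say "matched a run of 40,000: $matched";
```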