Category Archives: regular expressions

Perl v5.24 adds a line break word boundary

Perl v5.24 adds a linebreak word boundary, \b{lb}, to go along the new word boundaries added in v5.22. This is part of Perl’s increasing conformance with the regular expression requirements in Unicode Technical Standard #18. The Unicode::LineBreak implements the same thing, although you have to do a lot more work.

Perl v5.22 adds fancy Unicode word boundaries

Perl v5.22’s regexes added four Unicode boundaries to go along with the vanilla “word” boundary, \b, that you’ve been using for years. These new assertions aren’t going to match perfectly with your expectations of human languages (the holy grail of natural language processing), but they do okay-ish. Although these appear in v5.22.0, as a late […]

No more -no_match_vars

The English module translates Perl’s cryptic variable names to English equivalents. For instance, $_ becomes $ARG. This means that the match variable $& becomes $MATCH. This also means that using the English module triggered the performance issue associated with the match variables $`, $&, and $’ even if you didn’t use those variables yourself—the module […]

Use /aa to get ASCII semantics in regexes, for reals this time

When Perl made regexes more Unicode aware, starting in v5.6, some of the character class definitions and match modifiers changed. What you expected to match \d, \s, or \w are more expanvise now (Know your character classes under different semantics). Most of us probably didn’t notice because the range of our inputs is limited.

Turn capture groups into cluster groups

Perl v5.22 adds the /n regex flag that turns all parentheses groups in its scope into non-capturing groups. This can be handy when you want to capture almost nothing but still need to many cluster parts. You do less typing to get that.

Perl v5.18 adds character class set operations

Perl v5.18 added experimental character code set operations, a requirement for full Unicode support according to Unicode Technical Standard #18, which specifies what a compliant language must support and divides those into three levels. The perlunicode documentation lists each requirement and its status in Perl. Besides some regular expression anchors handling all forms of line […]

Enforce ASCII semantics when you only want ASCII

When Perl made regexes more Unicode aware, starting in v5.6, some of the character class definitions and match modifiers changed. What you expected to match \d, \s, or \w are more expanvise now (

Perl v5.20 fixes taint problems with locale

Perl v5.20 fixes taint checking in regular expressions that might use the locale in its pattern, even if that part of the pattern isn’t a successful part of the match. The perlsec documentation has noted that taint-checking did that, but until v5.20, it didn’t. The only approved way to untaint a variable is through a […]

The vertical tab is part of \s in Perl 5.18

Up to v5.18, the vertical tab wasn’t part of the \s character class shortcut for ASCII whitespace. No one really knows why. It was curious trivia that I pointed out in Know your character classes under different semantics. Whitespace in ASCII, POSIX, and Unicode represented different sets. Perl whitespace was different from POSIX whitespace by […]

Define grammars in regular expressions

[ This is the 100th Item we’ve shared with you in the two years this blog has been around. We deserve a holiday and we’re taking it, so read us next year! Happy Holidays.] Perl 5.10 added rudimentary grammar support in its regular expressions. You could define many subpatterns directly in your pattern, use them […]