regular expressions – Page 2 – The Effective Perler

Use @{^CAPTURE} to get a list of all the capture buffers

Perl v5.26 adds three new special variables related to captures. The @{^CAPTURE} is an array of all the capture buffers. %{^CAPTURE} is a alias for %+ and stores the actually-matched named capture labels as its keys. %{^CAPTURE_ALL} is an alias for %- and stores all the named capture labels and their matched (or not) values.

Continue reading “Use @{^CAPTURE} to get a list of all the capture buffers”

You must escape the left brace in a regex

Update: v5.30 brings this back. The exceptions to allow people to catch up should have already been fixed.

In v5.26 and later, you have to escape the left brace, {, in Perl regular expressions. You probably never thought about this overworked character, but lots of other people have. This is an important change because it’s a fatal issue that may cause your modules and other tools (such as an old version of autoconf!) to stop working. But, we’ve also known about this for a bit, so if you are up-to-date, things may have already been fixed.

{

Continue reading “You must escape the left brace in a regex”

/xx means more meaningless whitespace (and that’s good)

The /xx match operator flag lets you add insignificant horizontal whitespace to character classes. You’ll have to upgrade to v5.26 to use it though, but that release is right around the corner.

The /x has been around since v5.10. That allows you to spread out the parts of your pattern and to add internal comments. This flag makes horizontal whitespace insignificant.

Continue reading “/xx means more meaningless whitespace (and that’s good)”

Perl v5.24 adds a line break word boundary

Perl v5.24 adds a linebreak word boundary, \b{lb}, to go along the new word boundaries added in v5.22. This is part of Perl’s increasing conformance with the regular expression requirements in Unicode Technical Standard #18. The Unicode::LineBreak implements the same thing, although you have to do a lot more work. Continue reading “Perl v5.24 adds a line break word boundary”

Perl v5.22 adds fancy Unicode word boundaries

Perl v5.22’s regexes added four Unicode boundaries to go along with the vanilla “word” boundary, \b, that you’ve been using for years. These new assertions aren’t going to match perfectly with your expectations of human languages (the holy grail of natural language processing), but they do okay-ish. Although these appear in v5.22.0, as a late edition to the language they were partially broken in the initial release. They were fixed for v5.22.1. Continue reading “Perl v5.22 adds fancy Unicode word boundaries”

No more -no_match_vars

The English module translates Perl’s cryptic variable names to English equivalents. For instance, $_ becomes $ARG. This means that the match variable $& becomes $MATCH. This also means that using the English module triggered the performance issue associated with the match variables $`, $&, and $' even if you didn’t use those variables yourself—the module used them for you. The Devel::NYTProf debugger had a sawampersand feature to tell you one of those variables appeared in the code. We covered this in Item 33. Watch out for the match variables. Continue reading “No more -no_match_vars”

Use /aa to get ASCII semantics in regexes, for reals this time

When Perl made regexes more Unicode aware, starting in v5.6, some of the character class definitions and match modifiers changed. What you expected to match \d, \s, or \w are more expanvise now (Know your character classes under different semantics). Most of us probably didn’t notice because the range of our inputs is limited. Continue reading “Use /aa to get ASCII semantics in regexes, for reals this time”

Turn capture groups into cluster groups

Perl v5.22 adds the /n regex flag that turns all parentheses groups in its scope into non-capturing groups. This can be handy when you want to capture almost nothing but still need to many cluster parts. You do less typing to get that. Continue reading “Turn capture groups into cluster groups”

Perl v5.18 adds character class set operations

Perl v5.18 added experimental character code set operations, a requirement for full Unicode support according to Unicode Technical Standard #18, which specifies what a compliant language must support and divides those into three levels.

The perlunicode documentation lists each requirement and its status in Perl. Besides some regular expression anchors handling all forms of line boundaries (which might break older programs), set subtraction and intersection in character classes was the last feature Perl needed to be Level 1 compliant. Continue reading “Perl v5.18 adds character class set operations”

Enforce ASCII semantics when you only want ASCII

When Perl made regexes more Unicode aware, starting in v5.6, some of the character class definitions and match modifiers changed. What you expected to match \d, \s, or \w are more expanvise now (Know your character classes under different semantics). Most of us probably didn’t notice because the range of our inputs is limited. Continue reading “Enforce ASCII semantics when you only want ASCII”