Use atomic matching for complex non-backtracking

You can sometimes improve the performance of a regular expression by preventing parts of it from backtracking when you know that backtracking can’t help the match. Item 38. Avoid unnecessary backtracking covered many techniques for this, although it did not mention atomic matching (a feature added in v5.005).
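As a minimal sketch of what an atomic group does, the following compares an ordinary group with `(?>...)` on a made-up string. The variable names and the sample text are only for illustration; they are not from the original Item.

```perl
use strict;
use warnings;

my $text = 'aaab';

# Ordinary group: a+ can give back one 'a' so the trailing 'ab' can match.
my $plain  = $text =~ /\A(?:a+)ab\z/ ? 1 : 0;

# Atomic group: a+ keeps every 'a' it took, so 'ab' has nothing left to match.
my $atomic = $text =~ /\A(?>a+)ab\z/ ? 1 : 0;

print "plain: $plain, atomic: $atomic\n";   # plain: 1, atomic: 0
```

The atomic version fails faster because Perl never retries shorter runs of `a`, which is the point when you already know those retries cannot succeed.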

Ignore part of a substitution’s match

Normally, a substitution replaces everything it matched, but v5.10 adds a feature that allows you to ignore part of the match. The \K excludes from $& anything to its left. This feature has already made it into PCRE. It doesn’t have an official name, so I’ll call it the match reset operator because it resets the start of $&.
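A small sketch of `\K` in a substitution might look like this. The path and the replacement text are assumptions chosen for the example.

```perl
use v5.10;
use strict;
use warnings;

my $path = '/usr/local/bin/perl';

# Everything to the left of \K has to match, but it is not part of
# what the substitution replaces.
(my $renamed = $path) =~ s{.*/\Kperl\z}{perl5};

say $renamed;   # /usr/local/bin/perl5
```

Without `\K` you would have to capture the leading directories and put them back in the replacement with `$1`.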

Define grammars in regular expressions

[ This is the 100th Item we’ve shared with you in the two years this blog has been around. We deserve a holiday and we’re taking it, so read us next year! Happy Holidays.]

Perl 5.10 added rudimentary grammar support to its regular expressions. You can define many subpatterns directly in your pattern, use them to define larger subpatterns, and finally, when you have everything in place, let Perl do the work.
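Here is a rough sketch of the idea using a `(?(DEFINE)...)` block. The subpattern names `number` and `list` and the test strings are hypothetical, picked only to show small pieces building a larger one.

```perl
use v5.10;
use strict;
use warnings;

my $list_re = qr{
    \A (?&list) \z

    (?(DEFINE)
        (?<number> \d+ )
        (?<list>   (?&number) (?: \s* , \s* (?&number) )* )
    )
}x;

my @candidates = ( '1, 2, 3', '42', '1,,2', 'a, b' );
my @valid      = grep { $_ =~ $list_re } @candidates;

say for @valid;   # prints "1, 2, 3" and "42"
```

The `DEFINE` block never matches anything by itself; it only gives names to subpatterns that `(?&name)` calls elsewhere in the pattern.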

The \R generic line ending

Perl v5.10 adds a regular expression shortcut, \R, that matches anything the Unicode specification considers a line ending. It looks like a character class shortcut, but it’s not: it can match the two-character carriage-return line-feed sequence, and character classes don’t match sequences.
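To make the difference concrete, this sketch splits some made-up data with mixed line endings, once with `\R` and once with a character class. The sample string is an assumption for illustration.

```perl
use v5.10;
use strict;
use warnings;

my $data = "one\r\ntwo\nthree\rfour";

# \R treats "\r\n" as a single line ending.
my @lines = split /\R/, $data;

# A character class matches one character at a time, so "\r\n"
# counts as two separators with an empty field between them.
my @naive = split /[\r\n]/, $data;

say scalar @lines;   # 4
say scalar @naive;   # 5
```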

Use the > and < pack modifiers to specify the architecture

Byte-order modifiers are one of the Perl 5.10 features listed farther along in perl5100delta, after the really big features. To a pack format that otherwise uses the native byte order, you can append a < or a > to specify that the format is little-endian or big-endian, respectively. This allows you to handle endianness in the formats that don’t already have specific versions for each architecture, as well as apply endianness to groups.
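A short sketch, using an arbitrary 32-bit value, shows both modifiers and a modifier applied to a group. The numbers are illustrative only.

```perl
use v5.10;
use strict;
use warnings;

my $n = 0x01020304;

# L is a 32-bit unsigned integer; the modifier picks the byte order.
my $little = unpack 'H*', pack( 'L<', $n );
my $big    = unpack 'H*', pack( 'L>', $n );

# A modifier on a group applies to every format inside it.
my $group  = unpack 'H*', pack( '(S L)>', 1, 2 );

say $little;   # 04030201
say $big;      # 01020304
say $group;    # 000100000002
```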

Use a smart match to match several patterns at once

The smart match operator (Item 23. Make work easier with smart matching) reduces many common comparisons to a few keystrokes, in keeping with Perl’s goal of making the common things easy. You can use the smart match operator to make even less common tasks, such as matching many regular expressions at the same time, just as easy. This Item shows you how to use the smart match to see if at least one of a series of regexes matches a string.
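One way this could look is sketched below, with hypothetical patterns and strings. Note that later perls marked smart match as experimental and then deprecated it, so the sketch turns warnings off; treat it as an illustration of the v5.10 feature rather than a recommendation for new code.

```perl
use v5.10;
use strict;
no warnings;   # later perls warn that smart match is experimental or deprecated

my @patterns = ( qr/\bcat\b/i, qr/\bdog\b/i, qr/\bbird\b/i );
my @strings  = ( 'The Dog barked', 'A fish swam', 'Cats sleep' );

# A string smart-matched against an array of qr// objects is true
# if at least one of the patterns matches the string.
my @hits = grep { $_ ~~ @patterns } @strings;

say for @hits;   # The Dog barked
```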

Set default values with the defined-or operator

[This is a mid-week bonus Item since it’s so short]

Prior to Perl 5.10, you had to be a bit careful when checking a Perl variable before you set a default value. An undefined value and a defined but false value both acted the same in the logical || short-circuit operator, so the usual idiom for setting a default also replaced values such as 0 and the empty string.
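The difference between `||` and the defined-or operator `//` can be sketched like this. The hash keys and default values are made up for the example.

```perl
use v5.10;
use strict;
use warnings;

my %opt = ( retries => 0 );

# || treats the defined-but-false 0 as missing and replaces it.
my $with_or  = $opt{retries} || 3;

# // only falls through to the default when the left side is undef.
my $with_dor = $opt{retries} // 3;

# The assignment form sets a default only if the variable is undef.
my $timeout;
$timeout //= 30;

say $with_or;    # 3
say $with_dor;   # 0
say $timeout;    # 30
```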

Use branch reset grouping to number captures in alternations

Perl’s regular expressions have a simple rule for capture groups: Perl numbers them in the order of their left parentheses. Not all capture groups must actually match part of the string, and Perl doesn’t care whether they do. Perl numbers capture groups inside an alternation consecutively, even though it knows that only one branch of the alternation can match. Perl 5.10 adds the branch reset, (?|alternation), which restarts the numbering in each branch.
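As an illustration, this sketch matches a made-up `key: value` line with an ordinary alternation and then with a branch reset. The line format and variable names are assumptions for the example.

```perl
use v5.10;
use strict;
use warnings;

my $line = 'name: perl';

# Ordinary alternation: four groups in total, and the second branch
# fills groups 3 and 4 while 1 and 2 stay undefined.
my @plain = $line =~ /\A(?:(\w+)=(\w+)|(\w+):\s*(\w+))\z/;

# Branch reset: each branch numbers its groups from 1 again,
# so there are only two groups no matter which branch matched.
my @reset = $line =~ /\A(?|(\w+)=(\w+)|(\w+):\s*(\w+))\z/;

say scalar @plain;   # 4
say "@reset";        # name perl
```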

Match Unicode characters by property value

A Unicode character has properties; it knows things about itself. Perl v5.10 introduced a way to match a character that has one of the properties v5.10 supports, and in some cases you can match a particular property value. Now v5.12 allows you to match any Unicode property by its value. The newly supported ones include Numeric_Value and Age, for example:

\p{Numeric_Value: 1}
\p{Age: 3.0}
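A minimal sketch of the first property in use, assuming a handful of sample characters chosen for illustration (an ASCII digit, ARABIC-INDIC DIGIT ONE, CIRCLED DIGIT ONE, and two that should not match):

```perl
use v5.12;
use strict;
use warnings;

my @chars = ( "1", "\x{0661}", "\x{2460}", "2", "a" );

# Keep only the characters whose Unicode numeric value is 1.
my @ones = grep { /\A\p{Numeric_Value: 1}\z/ } @chars;

# Print code points rather than the characters themselves so the
# output doesn't depend on the terminal's encoding.
printf "U+%04X\n", ord for @ones;   # U+0031, U+0661, U+2460
```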
