Know your character classes under different semantics

Now that Perl is Unicode-aware (and it has been for a long time, even if you haven’t been), you might have to be more careful in your regular expressions. Some of the character classes are much more inclusive than the ASCIIphile might imagine. In ASCIIland, character and byte semantics are the same thing: no matter which way you treat your strings, you get the same answer. With Unicode, however, Perl might treat certain sequences of bytes as one character, so character and byte semantics have diverged. If you let Perl treat your data as character data when it really isn’t, you can run into problems. If you aren’t already doing something special, you’re probably using character semantics.
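To see how inclusive a character class can be, here is a minimal sketch that tests one non-ASCII digit against \d three ways. The choice of U+0969 (DEVANAGARI DIGIT THREE) is just an illustration, and the /a flag assumes Perl 5.14 or later.

```perl
use strict;
use warnings;

# U+0969 is DEVANAGARI DIGIT THREE: a decimal digit to Unicode, but not ASCII.
my $digit = "\x{0969}";

# Character semantics: \d matches any Unicode decimal digit.
my $unicode_match = $digit =~ /\A\d\z/ ? 1 : 0;

# The /a flag (Perl 5.14 and later) restricts \d, \s, and \w to ASCII.
my $ascii_match = $digit =~ /\A\d\z/a ? 1 : 0;

# An explicit ASCII range says what you mean on any version.
my $range_match = $digit =~ /\A[0-9]\z/ ? 1 : 0;

print "unicode=$unicode_match ascii=$ascii_match range=$range_match\n";
```

If you only ever want the ten ASCII digits, the explicit range or /a keeps the surprise characters out.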

Use the \N regex character class to get “not a newline”

Perl 5.12 introduced an experimental regex character class, \N, that stands in for every character except one: the newline.

This is the same thing that the . metacharacter matches in prior versions of Perl, as long as no one adds the /s flag to the match operator, the substitution operator, or the regex quoting operator, or turns on that option inside the pattern with (?s).
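A short sketch of the difference, assuming Perl 5.12 or later; the sample text and variable names are made up for illustration:

```perl
use strict;
use warnings;

my $text = "first line\nsecond line";

# Without /s, the dot stops at the newline.
my ($dot) = $text =~ /\A(.+)/;

# With /s, the dot crosses the newline and takes the whole string.
my ($dot_s) = $text =~ /\A(.+)/s;

# \N never matches a newline, whether or not /s is there.
my ($not_nl) = $text =~ /\A(\N+)/s;

printf "dot=%d dot_s=%d not_nl=%d\n",
    length $dot, length $dot_s, length $not_nl;
```

The lengths show that \N behaves like the plain dot even when someone else has turned on /s.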

Use a smart match to match several patterns at once

The smart match operator (Item 23. Make work easier with smart matching) reduces many common comparisons to a few keystrokes, in keeping with Perl’s goal of making the common things easy. You can use the smart match operator to make even less common tasks, such as matching many regular expressions at the same time, just as easy. This Item shows you how to use the smart match to see if at least one of a series of regexes matches a string.
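As a rough sketch of the idea, you can put the compiled patterns in an array and smart match a string against it; the match succeeds if any one pattern matches. The patterns and strings here are invented for illustration. Later perls mark smart match as experimental and then deprecated, so this assumes a perl where the operator is still available and silences warnings around it.

```perl
use strict;
use warnings;

# Hypothetical patterns; any list of qr// objects works the same way.
my @patterns = ( qr/\Aperl\b/i, qr/\d{4}/, qr/camel/ );

my ( $hit, $miss );
{
    # Smart match warns as experimental or deprecated on many perls,
    # so this sketch turns warnings off inside this block only.
    no warnings;

    # String ~~ Array tries the string against each element in turn.
    $hit  = ( 'Perl 5.10 added smart match' ~~ @patterns ) ? 1 : 0;
    $miss = ( 'nothing to see here'         ~~ @patterns ) ? 1 : 0;
}

print "hit=$hit miss=$miss\n";
```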

Let perl create your regex stringification

Perl 5.14 changes how regular expression objects stringify. This might not seem like a big deal at first, but it exposes a certain sort of bug that you may have never considered. It even broke several modules on CPAN. If you previously tested for hard-coded stringifications of patterns, Perl 5.14 is probably going to break your code.
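One way to avoid the hard-coded string, sketched here with a made-up pattern, is to let the running perl stringify an equivalent pattern and compare against that, or to compare the pieces with regexp_pattern from the re module (available since Perl 5.10):

```perl
use strict;
use warnings;
use re qw(regexp_pattern);

my $pattern = qr/abc/i;

# Fragile: a literal such as '(?i-xsm:abc)' is only right before Perl 5.14.
# Sturdier: stringify an equivalent pattern on this perl and compare.
my $same = "$pattern" eq ( "" . qr/abc/i ) ? 1 : 0;

# Or look at the pattern source and its flags separately.
my ( $source, $flags ) = regexp_pattern($pattern);

print "same=$same source=$source flags=$flags\n";
```

Either way, the test no longer depends on which version of perl produced the string.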

Use branch reset grouping to number captures in alternations

Perl’s regular expressions have a simple rule for capturing groups: Perl numbers them in the order of their left parentheses. Not all capture groups must actually match parts of the string, and Perl doesn’t care if they do. Perl numbers capture groups inside an alternation consecutively, even though it knows that only one branch of the alternation will match. Perl 5.10 adds the branch reset, (?|alternation), which mitigates that.
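Here is a small sketch of the numbering difference, assuming Perl 5.10 or later. The two-branch pattern and the input are invented for illustration:

```perl
use strict;
use warnings;

my $input = 'name=value';

# Ordinary grouping: four capture groups, numbered 1 to 4 across both
# branches. The second branch matches, so the captures land in $3 and $4.
my @plain = $input =~ /(?:(\d+)-(\d+)|(\w+)=(\w+))/;

# Branch reset: each branch starts numbering again at 1, so the same
# captures land in $1 and $2 no matter which branch matches.
my @reset = $input =~ /(?|(\d+)-(\d+)|(\w+)=(\w+))/;

printf "plain returns %d groups, reset returns %d groups\n",
    scalar @plain, scalar @reset;
```

With the branch reset, the code after the match doesn't need to know which branch matched.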

Match Unicode characters by property value

A Unicode character has properties; it knows things about itself. Perl v5.10 introduced a way to match a character that has one of the properties that version supports, and in some cases you can match a particular property value. Now v5.12 lets you match any Unicode property by its value. The newly supported ones include Numeric_Value and Age, for example:

\p{Numeric_Value: 1}
\p{Nv=7}
\p{Age: 3.0}
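As a quick sketch of matching by value, assuming Perl v5.12 or later, you can filter a list of characters by their numeric value. The sample characters are chosen only for illustration:

```perl
use strict;
use warnings;

# "1", U+0661 ARABIC-INDIC DIGIT ONE, U+2460 CIRCLED DIGIT ONE, "7", "x"
my @chars = ( "1", "\x{0661}", "\x{2460}", "7", "x" );

# Keep the characters whose Numeric_Value is 1, whatever the script.
my @ones = grep { /\p{Numeric_Value: 1}/ } @chars;

# Nv is the short name for the same property.
my @sevens = grep { /\p{Nv=7}/ } @chars;

# Print code points rather than the characters to avoid encoding issues.
print "value 1: ", join( ' ', map { sprintf 'U+%04X', ord } @ones ),   "\n";
print "value 7: ", join( ' ', map { sprintf 'U+%04X', ord } @sevens ), "\n";
```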


Detect regular expression match variables in your code

[UPDATE: this is not a problem in v5.18 and later.]

In Item 33: “Watch out for match variables”, you found out that the match variables $`, $&, and $' come with a performance hit. With all of the module code that you might use, you might be using those variables even though you didn’t code with them yourself.
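A deliberately naive sketch of one way to look for them is to scan source text for the three variable names. The find_match_vars function is hypothetical, not part of any module, and a plain text scan like this reports false positives, such as a single-quoted string that ends in a dollar sign.

```perl
use strict;
use warnings;

# Hypothetical helper: return the match variables that appear in a
# chunk of source text. This is a text scan, not a parse of the code.
sub find_match_vars {
    my ($source) = @_;
    my %seen;
    $seen{$1}++ while $source =~ /(\$[&`'])/g;
    return sort keys %seen;
}

my $code  = 'my $copy = $&; print "before: $`\n";';
my @found = find_match_vars($code);

print "found: @found\n";
```

You could run something like this over the module files that your program loads to get a first list of suspects.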

Use /gc and \G in matches to separate alternations in separate, smaller patterns

Perl keeps track of the last position in a string where it had a successful global match (using the /g flag). You can access this position with the pos operator. With Perl 5.10, you can use the /p switch to get the per-match variable ${^MATCH} instead of the performance-dampening $&.
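Putting those pieces together, here is a minimal tokenizer sketch, assuming Perl 5.10 or later for /p. Each small pattern anchors at the current position with \G, and the /c flag keeps a failed match from resetting pos. The token names are invented for illustration.

```perl
use strict;
use warnings;

my $input = 'abc 123 def';
my @tokens;

pos($input) = 0;    # start matching at the beginning

while ( pos($input) < length $input ) {
    # Try each small pattern where the last match left off.
    if    ( $input =~ /\G\s+/gc )     { next }
    elsif ( $input =~ /\G\d+/gcp )    { push @tokens, "NUMBER:${^MATCH}" }
    elsif ( $input =~ /\G[a-z]+/gcp ) { push @tokens, "WORD:${^MATCH}" }
    else { die 'Unexpected input at position ' . pos($input) }
}

print join( ' ', @tokens ), "\n";
```

Each branch stays small and readable, and you can add a new kind of token without touching the other patterns.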

Know the difference between regex and match operator flags

The match and substitution operators, as well as regex quoting with qr//, use flags to signal certain behavior of the match or interpretation of the pattern. The flags that change the interpretation of the pattern are listed in the documentation for qr// in perlop (and maybe in other places in earlier versions of the documentation).
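A short sketch of the distinction: a pattern flag such as /i travels with a qr// object, while an operator flag such as /g belongs to the match operator and is not allowed on qr// at all. The string eval is only there so the compile error doesn't stop the example.

```perl
use strict;
use warnings;

# /i changes how the pattern is interpreted, so qr// accepts it.
my $word = qr/perl/i;

# /g changes what the match operator does, so it goes on the operator.
# The /i still applies because it is part of the compiled pattern.
my @hits = 'Perl, PERL, perl' =~ /$word/g;

# qr// rejects /g at compile time; the string eval traps that error.
my $compiled = eval( q{ qr/perl/g; 1 } ) ? 1 : 0;

printf "hits=%d qr-with-g-compiles=%d\n", scalar @hits, $compiled;
```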