The vertical tab is part of \s in Perl 5.18

Before v5.18, the vertical tab wasn’t part of the \s character class shortcut for ASCII whitespace. No one really knows why. It was curious trivia that I pointed out in Know your character classes under different semantics. Whitespace in ASCII, POSIX, and Unicode represented different sets, and Perl whitespace differed from POSIX whitespace only by excluding the vertical tab. Now that little oversight is fixed.
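A quick way to see the difference on your own perl is to test the vertical tab against both classes. This is a minimal sketch; the variable names are mine and only for illustration.

```perl
use strict;
use warnings;

# "\x0B" is the vertical tab (VT), code point 11.
my $vt = "\x0B";

# True on v5.18 and later; false on earlier perls.
my $in_backslash_s = $vt =~ /\s/ ? 1 : 0;

# The POSIX bracket class includes the vertical tab.
my $in_posix_space = $vt =~ /[[:space:]]/ ? 1 : 0;

printf "perl v%vd: \\s=%d [[:space:]]=%d\n",
    $^V, $in_backslash_s, $in_posix_space;
```

On v5.18 or later both values should be 1, which is the fix the post describes.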

Ignore part of a substitution’s match

Normally, a substitution replaces everything it matched, but v5.10 adds a feature that allows you to ignore part of the match. The \K excludes from $& anything to its left. This feature has already made it into PCRE. It doesn’t have an official name, so I’ll call it the match reset operator because it resets the start of $&.
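Here is a small sketch of what that looks like in a substitution. The path and the replacement are made up for illustration; the second statement shows the capture-based version that \K lets you avoid.

```perl
use 5.010;    # \K needs v5.10 or later
use strict;
use warnings;

my $path = '/usr/local/lib/Foo.pm';    # hypothetical input

# Everything up to the last slash must match, but \K keeps it
# out of the part that gets replaced.
(my $renamed = $path) =~ s{.*/\KFoo}{Bar};

# The same result without \K: capture the prefix and put it back.
(my $captured = $path) =~ s{(.*/)Foo}{${1}Bar};

print "$renamed\n$captured\n";
```

Both statements leave the directory part alone and change only the file name.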


Define grammars in regular expressions

[This is the 100th Item we’ve shared with you in the two years this blog has been around. We deserve a holiday and we’re taking it, so read us next year! Happy Holidays.]

Perl 5.10 added rudimentary grammar support to its regular expressions. You can define many subpatterns directly in your pattern, use them to define larger subpatterns, and finally, when you have everything in place, let Perl do the work.
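As a rough sketch of the idea, the pattern below uses a (?(DEFINE)...) block to name two subpatterns and then builds the overall match from them. The grammar (a comma-separated list of integers) and the names number and list are my own assumptions, chosen only to keep the example short.

```perl
use 5.010;    # (?(DEFINE)...) and (?&name) need v5.10 or later
use strict;
use warnings;

my $list_re = qr{
    \A \s* (?&list) \s* \z          # the whole string must be one list

    (?(DEFINE)
        (?<number> [0-9]+ )         # smallest building block
        (?<list>   (?&number) (?: \s* , \s* (?&number) )* )
    )
}x;

my $good = '1, 22,333' =~ $list_re ? 1 : 0;
my $bad  = '1,,2'      =~ $list_re ? 1 : 0;

print "good=$good bad=$bad\n";
```

The DEFINE block never matches anything by itself; it only declares the named subpatterns that (?&list) and (?&number) call.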

Use lookarounds to split to avoid special cases

There are some regular expression tricks that can help you deal with balanced delimiters in a string. The split function takes a pattern, removes the parts of a string that match that pattern, and gives you a list of the parts of the string between those separators. Said another way, split works when the parts you don’t need are between the values.
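When there is nothing to throw away between the values, a zero-width pattern built from lookarounds can mark the boundary instead. The sketch below is my own illustration and not necessarily the example from the full Item; it splits a string of parenthesized groups at each point where a closing parenthesis meets an opening one.

```perl
use strict;
use warnings;

my $string = '(a,b)(c)(d,e,f)';    # hypothetical input

# The lookbehind and lookahead consume no characters, so split
# cuts between ")" and "(" without removing either delimiter.
my @groups = split /(?<=\))(?=\()/, $string;

print "$_\n" for @groups;
```

Each element keeps its own parentheses, so there is no need to add them back afterwards.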

Set default regular expression modifiers

Are you tired of adding the same modifiers to all of your regular expressions? For instance, you might always add the /u modifier to turn on Unicode semantics for all of your patterns, including qr//, m//, and s///. Instead of remembering to do that for every pattern, the re pragma that ships with Perl 5.14 lets you do that for all patterns in the current lexical scope. You can also turn off a modifier for the rest of the scope.
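A short sketch of the scoping, using /i and /x rather than /u so the effect is easy to see. The variable names are only for illustration.

```perl
use 5.014;    # default modifiers via the re pragma need v5.14
use strict;
use warnings;

my ($with_ix, $without_x, $outside);

{
    use re '/ix';    # every pattern in this block gets /i and /x
    $with_ix = 'FOO' =~ m/ foo / ? 1 : 0;

    {
        no re '/x';  # /i stays on, but the spaces are literal again
        $without_x = 'FOO' =~ m/ foo / ? 1 : 0;
    }
}

# Outside the block no default modifiers apply.
$outside = 'FOO' =~ m/foo/ ? 1 : 0;

print "with_ix=$with_ix without_x=$without_x outside=$outside\n";
```

Only the first match succeeds: the second needs literal spaces around foo, and the third is case sensitive again.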

The \R generic line ending

Perl v5.10 adds a regular expression shortcut \R that matches anything the Unicode specification thinks is a line ending. It looks similar to a character class shortcut but it’s not. It can match the carriage-return line-feed sequence as a single line ending, whereas a character class matches only one character at a time, never a sequence.
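To see why the sequence matters, compare splitting on \R with splitting on a character class of the two characters. This is a minimal sketch with made-up data.

```perl
use 5.010;    # \R needs v5.10 or later
use strict;
use warnings;

# Mixed line endings: CRLF, LF, and a bare CR.
my $text = "one\r\ntwo\nthree\rfour";

# \R treats "\r\n" as a single line ending.
my @by_R = split /\R/, $text;

# A character class sees "\r" and "\n" as two separators,
# which leaves an empty field between them.
my @by_class = split /[\r\n]/, $text;

printf "\\R: %d fields, class: %d fields\n",
    scalar @by_R, scalar @by_class;
```

The \R version returns the four words, while the character class version returns five fields because of the empty string inside the CRLF pair.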

Find dates with Regexp::Common

[This is a mid-week bonus item]

Suppose you want to find some dates inside a big string. The problem with dates is that there are so many ways to write them, and even if you can come up with a pattern to get the structure right, can you handle the different locales and languages that use different words to refer to the same day or month?

Use Regexp::Common to find locale-specific dates

[This is a mid-week bonus item, and it’s a bit of a departure from much of what you have already seen on this blog. This is just some code that I had to write this week and I thought you’d like to see it.]

I had to find some dates inside a big string, and the problem with dates is that there are so many ways to write them, and even if I get the format right, some of the machines might use another locale. My string comes from an ls I run as a remote command, which might show the date in one of two formats. Files changed in the last six months show the time in place of the year.

Know your character classes under different semantics

Now that Perl is Unicode-aware (and it’s been that way for a long time, even if you haven’t been), you might have to be more careful in your regular expressions. Some of the character classes are much more inclusive than the ASCIIphile might imagine. In ASCIIland, character and byte semantics are the same thing. No matter which way you treat your strings, you get the same answer. With Unicode, however, Perl might now treat certain sequences of bytes as one character. The character and byte semantics have diverged. If you let Perl treat your data as character data when it really isn’t, you can run into problems. If you aren’t already doing something special, you’re probably using character semantics.
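One place this shows up is \d, which matches far more than the ten ASCII digits. The sketch below tests a single non-ASCII digit three ways; the choice of character is mine, and the /a modifier it uses needs v5.14 or later.

```perl
use 5.014;    # the /a modifier needs v5.14 or later
use strict;
use warnings;

# U+0663 ARABIC-INDIC DIGIT THREE: a digit, but not an ASCII one.
my $arabic_three = "\x{0663}";

my $unicode_d = $arabic_three =~ /\d/    ? 1 : 0;   # Unicode semantics
my $ascii_d   = $arabic_three =~ /\d/a   ? 1 : 0;   # /a limits \d to ASCII
my $explicit  = $arabic_three =~ /[0-9]/ ? 1 : 0;   # explicit ASCII range

print "unicode=$unicode_d ascii=$ascii_d explicit=$explicit\n";
```

Only the plain \d matches, so a pattern meant to validate ASCII digits should say [0-9] or use /a.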