regular expressions – Page 3 – The Effective Perler

Perl v5.20 fixes taint problems with locale

Perl v5.20 fixes taint checking in regular expressions that might use the locale in their pattern, even if that part of the pattern isn’t a successful part of the match. The perlsec documentation has long said that taint checking worked that way, but until v5.20, it didn’t.

The only approved way to untaint a variable is through a successful pattern match with captures.
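As a minimal sketch of that idea, the following validates a value and keeps only what the capture returns. The environment variable name `EP_DEMO_ID` and the fallback value are hypothetical, and no `use locale` is in effect. Taint checking only happens when you run the program with `perl -T`; without that switch nothing is tainted and both lines report "no".

```perl
use strict;
use warnings;
use Scalar::Util qw(tainted);

# EP_DEMO_ID is a hypothetical environment variable. Under -T its value
# would be tainted; the literal fallback is not.
my $input = defined $ENV{EP_DEMO_ID} ? $ENV{EP_DEMO_ID} : '12345';

my $clean;
if ( $input =~ /\A([0-9]+)\z/ ) {
    # With no locale in effect, a capture from a successful match is untainted.
    $clean = $1;
}

printf "input tainted: %s\n", tainted($input) ? 'yes' : 'no';
printf "clean tainted: %s\n",
    ( defined $clean && tainted($clean) ) ? 'yes' : 'no';
```

The pattern is deliberately strict: the point of untainting is to prove the data has the shape you expect, not merely to copy it through `(.*)`.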

The vertical tab is part of \s in Perl 5.18

Before v5.18, the vertical tab wasn’t part of the \s character class shortcut for ASCII whitespace. No one really knows why. It was curious trivia that I pointed out in Know your character classes under different semantics. Whitespace in ASCII, POSIX, and Unicode represented different sets. Perl whitespace differed from POSIX whitespace only by excluding the vertical tab. Now that little oversight is fixed.
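A quick way to see which behavior your perl has is to test the vertical tab directly. This is only a sketch; the variable names are mine, and the result depends on the version that runs it.

```perl
use strict;
use warnings;

my $vt      = "\x0B";                 # the vertical tab character
my $matches = $vt =~ /\s/ ? 1 : 0;    # true on v5.18 and later

printf "perl %vd: vertical tab %s \\s\n",
    $^V, $matches ? 'matches' : 'does not match';
```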

Ignore part of a substitution’s match

Normally, a substitution replaces everything it matched, but v5.10 adds a feature that allows you to ignore part of the match. The \K excludes from $& anything to its left. This feature has already made it into PCRE. It doesn’t have an official name, so I’ll call it the match reset operator because it resets the start of $&.
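Here is a small sketch of \K in both a match and a substitution. The sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $line = 'color=red';

# Everything to the left of \K must match, but it is left out of $&.
my $matched;
if ( $line =~ /color=\K\w+/ ) {
    $matched = $&;    # 'red'
}

# In a substitution, only the part after \K is replaced.
( my $changed = $line ) =~ s/color=\K\w+/blue/;

print "matched: $matched\n";
print "changed: $changed\n";
```

The substitution keeps `color=` without you having to capture it and put it back in the replacement.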

Define grammars in regular expressions

[ This is the 100th Item we’ve shared with you in the two years this blog has been around. We deserve a holiday and we’re taking it, so read us next year! Happy Holidays.]

Perl 5.10 added rudimentary grammar support in its regular expressions. You could define many subpatterns directly in your pattern, use them to define larger subpatterns, and, finally, when you have everything in place, let Perl do the work.
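As a rough illustration, here is a tiny grammar for a comma-separated list of integers. The rule names (`digit`, `number`, `list`) are my own; the `(?(DEFINE)...)` block only declares the named subpatterns, and the part after it does the matching by calling them with `(?&name)`.

```perl
use strict;
use warnings;
use 5.010;

my $grammar = qr/
    (?(DEFINE)
        (?<digit>  [0-9] )
        (?<number> (?&digit)+ )
        (?<list>   (?&number) (?: , (?&number) )* )
    )
    \A (?&list) \z
/x;

for my $candidate ( '1,22,333', '1,,2', '42' ) {
    my $verdict = $candidate =~ $grammar ? 'ok' : 'not ok';
    say "$candidate: $verdict";
}
```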

Use lookarounds to split to avoid special cases

There are some regular expression tricks that can help you deal with balanced delimiters in a string. The split command takes a pattern, removes the parts of a string that match that pattern, and gives you a list of the parts of the string between those separators. Said another way, split works when the parts you don’t need are between the values.
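When there is nothing between the values to throw away, a zero-width pattern lets split cut the string without consuming any characters. This sketch, with a made-up input string, splits between a closing and an opening parenthesis and keeps the delimiters with their groups.

```perl
use strict;
use warnings;

my $string = '(alpha)(beta)(gamma)';

# Split at the position that follows ')' and precedes '('.
# Both lookarounds are zero-width, so no characters are removed.
my @groups = split /(?<=\))(?=\()/, $string;

print "$_\n" for @groups;
```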

Use lookarounds to eliminate special cases in split

The split built-in takes a string and turns it into a list, discarding the separators that you specify as a pattern. This is easy when the separator is simple, but seems hard if the separator gets more tricky.
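One example of a trickier separator, assumed here for illustration, is a comma that can be escaped with a backslash. A negative lookbehind keeps split from breaking on the escaped commas.

```perl
use strict;
use warnings;

my $record = 'a,b\,c,d';    # the second comma is escaped

# Split on a comma only when no backslash comes right before it.
my @fields = split /(?<!\\),/, $record;

print "$_\n" for @fields;
```

This sketch does not handle an escaped backslash right before a real separator; that case needs a more careful pattern.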

Set default regular expression modifiers

Are you tired of adding the same modifiers to all of your regular expressions? For instance, you might always add the /u modifier to turn on Unicode semantics for all of your patterns, including qr//, m//, and s///. Instead of remembering to do that for every pattern, the re pragma that ships with Perl 5.14 lets you do that for all patterns in the current lexical scope. You can also turn off a modifier for the rest of the scope.
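The effect is easiest to see with /x rather than /u, since ignored whitespace changes whether a pattern matches at all. In this sketch the same pattern text is tried with the default on, with it turned off again, and outside the scope; the variable names are mine.

```perl
use v5.14;
use warnings;

my ( $with_x, $turned_off, $outside );

{
    use re '/x';    # every pattern in this scope gets /x
    $with_x = '2024' =~ / \A \d{4} \z / ? 1 : 0;

    no re '/x';     # turn it off for the rest of the scope
    $turned_off = '2024' =~ / \A \d{4} \z / ? 1 : 0;
}

# Outside the block the spaces are literal, so the match fails.
$outside = '2024' =~ / \A \d{4} \z / ? 1 : 0;

say "with /x: $with_x, turned off: $turned_off, outside: $outside";
```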

The \R generic line ending

Perl v5.10 adds a regular expression shortcut \R that matches anything the Unicode specification thinks is a line ending. It looks similar to a character class shortcut but it’s not. It can match the two-character sequence of carriage-return line-feed, but character classes don’t match sequences.
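To make the difference concrete, this sketch splits a string with mixed line endings twice: once with \R and once with a character class. The sample text is made up.

```perl
use strict;
use warnings;
use 5.010;

my $text = "one\r\ntwo\nthree\rfour";

# \R treats "\r\n" as a single line ending.
my @with_R = split /\R/, $text;

# A character class matches one character at a time, so "\r\n"
# counts as two separators with an empty field between them.
my @with_class = split /[\r\n]/, $text;

say scalar(@with_R),     ' fields with \R';
say scalar(@with_class), ' fields with [\r\n]';
```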

Find dates with Regexp::Common

[This is a mid-week bonus item]

Suppose you want to find some dates inside a big string. The problem with dates is that there are so many ways to write them, and even if you can come up with a pattern to get the structure right, can you handle the different locales and languages that use different words to refer to the same day or month?

Use Regexp::Common to find locale-specific dates

[This is a mid-week bonus item, and it’s a bit of a departure from much of what you have already seen on this blog. This is just some code that I had to write this week and I thought you’d like to see it.]

I had to find some dates inside a big string, and the problem with dates is that there are so many ways to write them, and even if I get the format right, some of the machines might use another locale. My string comes from an ls I run as a remote command, which might show the date in one of two formats. Files changed in the last six months show the time in place of the year.