Use /aa to get ASCII semantics in regexes, for reals this time

When Perl made regexes more Unicode aware, starting in v5.6, some of the character class definitions and match modifiers changed. What \d, \s, and \w match is more expansive now (Know your character classes under different semantics). Most of us probably didn’t notice because the range of our inputs is limited.

To get the ASCII semantics back, you can use v5.14’s /a match flag to restore their pre-v5.6 meanings. I wrote about that in Enforce ASCII semantics when you only want ASCII.
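Here’s a minimal sketch of what /a changes for \d. I picked U+0663 (ARABIC-INDIC DIGIT THREE) as the sample character, and the variable names are only for illustration:

```perl
use v5.16;

# U+0663 is a decimal digit under Unicode rules, but it is not in 0-9
my $arabic_three = "\x{0663}";

# Default (Unicode) semantics: \d matches any Unicode decimal digit
my $unicode_match = $arabic_three =~ /\A\d\z/  ? 1 : 0;

# With /a, \d only matches the ASCII digits 0 through 9
my $ascii_match   = $arabic_three =~ /\A\d\z/a ? 1 : 0;

say "Without /a: ", $unicode_match ? "matches" : "no match";
say "With /a: ",    $ascii_match   ? "matches" : "no match";
```

The first line should report a match and the second should not, because /a narrows \d back to the ASCII digits.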

That’s not all, though. There’s a problem with case insensitive matches. Some of the “wide” characters lowercase into the ASCII range. Well, so far there’s exactly one, but that doesn’t mean there won’t be more later. The Kelvin symbol, K (U+212A), lowercases to k (U+006B). To avoid problems with fonts and indistinguishable characters, I create the characters with the \x{} sequences:

use v5.16;

my $lower_k = "\x{006B}";

if( $lower_k =~ /\x{212A}/ ) {
	say "Matches without case insensitivity";
	}

if( $lower_k =~ /\x{212A}/i ) {
	say "Matches with case insensitivity";
	}

if( $lower_k =~ /\x{212A}/ia ) {
	say "Matches with case insensitivity, /a";
	}

if( $lower_k =~ /\x{212A}/iaa ) {
	say "Matches with case insensitivity, /aa";
	}

The Kelvin sign matches with case insensitivity, even with the /a flag. However, it doesn’t match when you double up with /aa. That extra /a means, “I mean ASCII, j/k, but seriously”. Only two of the four tests produce output:

Matches with case insensitivity
Matches with case insensitivity, /a

Unless you want to handle that Kelvin symbol specially, you should be safe always doubling up on the /a.
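If you don’t want to type /aa on every pattern, the re pragma can add flags to every regex in a lexical scope. This is a rough sketch, assuming v5.14 or later where the use re '/flags' form is available; $plain and $scoped are names I made up for the example:

```perl
use v5.16;

my $lower_k = "\x{006B}";

# Outside the pragma's scope, /i lets the Kelvin sign match k
my $plain = $lower_k =~ /\x{212A}/i ? 1 : 0;

my $scoped;
{
	# Every regex compiled in this block gets /aa added to its flags
	use re '/aa';
	$scoped = $lower_k =~ /\x{212A}/i ? 1 : 0;
	}

say "Plain /i: ",          $plain  ? "matches" : "no match";
say "/i under re '/aa': ", $scoped ? "matches" : "no match";
```

The pragma is lexically scoped, so patterns outside the block keep their usual behavior.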

Read perlre for the rest of the story.

2 thoughts on “Use /aa to get ASCII semantics in regexes, for reals this time”

  1. So far, the one case I showed is the only case where /aa matters, but I don’t think case-insensitivity is necessarily the only switch that /aa might affect in other revisions of the Unicode standard.
