Enforce ASCII semantics when you only want ASCII

When Perl made regexes more Unicode aware, starting in v5.6, some of the character class definitions and match modifiers changed. What you expect \d, \s, or \w to match is more expansive now (Know your character classes under different semantics). Most of us probably didn’t notice because the range of our inputs is limited.

To get the ASCII semantics back, you can use the /a match flag, added in v5.14, to restore the pre-Unicode meanings of those character classes.

If you look for \d in later Perls, for example, you get a long list:

use charnames qw(:full);

foreach ( 0 .. 0x10_ffff ) {
	next unless chr =~ /\d/;
	printf qq(0x%02X  --> %s\n), $_, charnames::viacode($_);
	}

The number of matches you get depends on the version of Unicode included with that Perl:

0x30  --> DIGIT ZERO
0x31  --> DIGIT ONE
0x32  --> DIGIT TWO
0x33  --> DIGIT THREE
0x34  --> DIGIT FOUR
0x35  --> DIGIT FIVE
0x36  --> DIGIT SIX
0x37  --> DIGIT SEVEN
0x38  --> DIGIT EIGHT
0x39  --> DIGIT NINE
0x660  --> ARABIC-INDIC DIGIT ZERO
0x661  --> ARABIC-INDIC DIGIT ONE
0x662  --> ARABIC-INDIC DIGIT TWO
...
0x1D7FD  --> MATHEMATICAL MONOSPACE DIGIT SEVEN
0x1D7FE  --> MATHEMATICAL MONOSPACE DIGIT EIGHT
0x1D7FF  --> MATHEMATICAL MONOSPACE DIGIT NINE

You can change this with the /a flag:

use v5.14;    # the /a flag needs v5.14 or later
use charnames qw(:full);

foreach ( 0 .. 0x10_ffff ) {
	next unless chr =~ /\d/a;    # /a limits \d to the ASCII digits
	printf qq(0x%02X  --> %s\n), $_, charnames::viacode($_);
	}

That makes \d match only 0 to 9:

0x30  --> DIGIT ZERO
0x31  --> DIGIT ONE
0x32  --> DIGIT TWO
0x33  --> DIGIT THREE
0x34  --> DIGIT FOUR
0x35  --> DIGIT FIVE
0x36  --> DIGIT SIX
0x37  --> DIGIT SEVEN
0x38  --> DIGIT EIGHT
0x39  --> DIGIT NINE

The same goes for whitespace (\s) and word characters (\w).
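Here is a minimal sketch of that claim for all three classes. The sample characters are my own picks for illustration (any non-ASCII digit, space, or letter should behave the same way), and I again write them with \x{} sequences:

```perl
use v5.14;    # the /a flag needs v5.14 or later

# One non-ASCII sample per class, with a default and an /a version
# of the same pattern
my @samples = (
	[ '\d', "\x{0663}", qr/\d/, qr/\d/a ],    # ARABIC-INDIC DIGIT THREE
	[ '\s', "\x{2003}", qr/\s/, qr/\s/a ],    # EM SPACE
	[ '\w', "\x{03B1}", qr/\w/, qr/\w/a ],    # GREEK SMALL LETTER ALPHA
	);

my @report;
foreach my $sample ( @samples ) {
	my( $class, $char, $unicode, $ascii ) = @$sample;
	push @report, sprintf '%s: default %s, /a %s',
		$class,
		( $char =~ $unicode ? 'matches' : 'fails' ),
		( $char =~ $ascii   ? 'matches' : 'fails' );
	}

say for @report;
```

Each sample should match under the default semantics and fail once /a is in effect.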

That’s not all, though. There’s a problem with case insensitive matches. Some of the “wide” characters lowercase into the ASCII range. There aren’t many of those so far, but that doesn’t mean there won’t be more later. The best-known one is the Kelvin sign, K (U+212A), which lowercases to k (U+006B). To avoid problems with fonts, I create the characters with the \x{} sequences:

use v5.16;

my $lower_k = "\x{006B}";

if( $lower_k =~ /\x{212A}/ ) {
	say "Matches without case insensitivity";
	}

if( $lower_k =~ /\x{212A}/i ) {
	say "Matches with case insensitivity";
	}

if( $lower_k =~ /\x{212A}/ia ) {
	say "Matches with case insensitivity, /a";
	}

if( $lower_k =~ /\x{212A}/iaa ) {
	say "Matches with case insensitivity, /aa";
	}

Without /i, nothing matches. The Kelvin sign matches with case insensitivity, even with the /a flag. However, it doesn’t match when you double up with /aa. That extra /a means, “no really, I mean ASCII”. You see only two lines of output:

Matches with case insensitivity
Matches with case insensitivity, /a
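The Kelvin sign is the usual example, but as far as I can tell from Unicode’s case folding data it has company: LATIN SMALL LETTER LONG S (U+017F) folds to a plain s. This sketch, with a table I made up for the purpose, runs both characters through the same three flag combinations:

```perl
use v5.14;

# Wide characters that case-fold to an ASCII letter; this list is an
# illustration, not a complete inventory
my @cases = (
	[ 'KELVIN SIGN',               "\x{212A}", 'k' ],
	[ 'LATIN SMALL LETTER LONG S', "\x{017F}", 's' ],
	);

my @lines;
foreach my $case ( @cases ) {
	my( $name, $wide, $ascii ) = @$case;
	push @lines, sprintf '%s vs %s: /i %s, /ia %s, /iaa %s',
		$name, $ascii,
		( $ascii =~ /$wide/i   ? 'yes' : 'no' ),
		( $ascii =~ /$wide/ia  ? 'yes' : 'no' ),
		( $ascii =~ /$wide/iaa ? 'yes' : 'no' );
	}

say for @lines;
```

Both characters should match under /i and /ia, and stop matching under /iaa.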

If you’re upgrading a huge codebase and want this behavior for every regex in a lexical scope, including a whole file (Know what creates a scope), you can set default regular expression modifiers with the re pragma (Set default regular expression modifiers).
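As a rough sketch of those scoped defaults, the re pragma (v5.14 and later) accepts a string of flags to apply to every regex until the end of the enclosing scope. The bare block here stands in for whatever scope you are upgrading:

```perl
use v5.14;    # use re '/flags' needs v5.14 or later

my @results;

{
	use re '/aa';    # default flags until the end of this block

	# No /a or /aa on these matches; the pragma supplies it
	push @results, "\x{0663}" =~ /\d/        ? 'inside: \d matched' : 'inside: \d failed';
	push @results, 'k'        =~ /\x{212A}/i ? 'inside: /i matched' : 'inside: /i failed';
	}

# Outside the block, the default Unicode semantics are back
push @results, "\x{0663}" =~ /\d/        ? 'outside: \d matched' : 'outside: \d failed';
push @results, 'k'        =~ /\x{212A}/i ? 'outside: /i matched' : 'outside: /i failed';

say for @results;
```

Both matches should fail inside the block and succeed outside it. Putting the pragma at the top of a file, outside any block, covers the rest of that file.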

Read perlre for the rest of the story.
