Know your character classes under different semantics

Now that Perl is Unicode-aware (and it’s been that way for a long time, even if you haven’t been), you might have to be more careful in your regular expressions. Some of the character classes are much more inclusive than the ASCIIphile might imagine. In ASCIIland, character and byte semantics are the same thing. No matter which way you treat your strings, you get the same answer. With Unicode, however, Perl might now treat certain sequences of bytes as one character. The character and byte semantics have diverged. If you let Perl treat your data as character data when it really isn’t, you can run into problems. If you aren’t already doing something special, you’re probably using character semantics.
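
Here’s a minimal sketch of that divergence. It assumes the two octets are the UTF-8 encoding of U+00E9 and uses the core Encode module to decode them; the variable names are just for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two octets that happen to be the UTF-8 encoding of U+00E9
# (LATIN SMALL LETTER E WITH ACUTE).
my $octets = "\xC3\xA9";

# Decoding turns those two octets into a one-character string.
my $chars = decode( 'UTF-8', $octets );

printf "as octets:     length %d\n", length $octets;
printf "as characters: length %d (U+%04X)\n", length $chars, ord $chars;
```

The same data has length 2 when Perl treats it as octets and length 1 once it has been decoded into characters.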

Whitespace

How many whitespace characters can you name? Think about that for a moment; don’t cheat by scanning ahead. If I were clever I’d have some sort of JavaScript thing that would make you wait, or at least take 15 seconds to subvert, but all I have is this sentence.

Ready? How many did you get? Most people can name at least three ASCII whitespace characters. Some people can name four:

  • space (0x20)
  • carriage return (0x0D)
  • newline (0x0A)
  • horizontal tab (0x09)

There are a few others, though. If you’re an olde tyme Unix geek, you might also get the two lesser-known ones:

  • vertical tab (0x0B)
  • form feed (0x0C)

How can you get all of the ASCII whitespace if you didn’t already know what they were? You would think that you could just whip up a quick one-liner to do the trick:

$ perl5.10.1 -le 'for(0..127){next if chr =~ /\S/; printf qq(0x%02X\n), $_}'
0x09
0x0A
0x0C
0x0D
0x20

That’s only five of them, though. Perl’s \s character class apparently doesn’t match the vertical tab. It’s not exactly documented that way in perlre, which just says that \s matches whitespace. However, you can surmise that it’s different from the POSIX definition of whitespace because perlre documents the POSIX character class [[:space:]] as \s along with the vertical tab. (Perl 5.18 later changed \s to match the vertical tab too, so a newer perl gives a different answer here.) Adjusting your one-liner to use the POSIX definition instead (and a negated binding operator) gets you that vertical tab (0x0B):

$ perl -le 'for(0..127){next if chr !~ /[[:space:]]/; printf qq(0x%02X\n), $_}'
0x09
0x0A
0x0B
0x0C
0x0D
0x20

It doesn’t stop there. You didn’t think that you’d get all those fancy characters in Unicode without a bunch of fancy whitespace to go with them, did you? First, adjust your one-liner to use a Unicode property (Item 76. Match Unicode characters and properties) instead of a character class:

$ perl -le 'for(0..127){next if chr !~ /\p{Space}/; printf qq(0x%02X\n), $_}'
0x09
0x0A
0x0B
0x0C
0x0D
0x20

You want to get fancier, though, so you can abandon the one-liner. You want to get the character names too (Item 74. Specify Unicode characters by code point or name). Pull in the charnames module to turn the code point into the name:

use 5.010;

use charnames qw(:full);

foreach ( 0 .. 127 ) {
	next unless chr =~ /\p{Space}/;
	printf qq(0x%02X  --> %s\n), $_, charnames::viacode($_);
	}

Now you know what those numbers represent:

0x09  --> CHARACTER TABULATION
0x0A  --> LINE FEED (LF)
0x0B  --> LINE TABULATION
0x0C  --> FORM FEED (FF)
0x0D  --> CARRIAGE RETURN (CR)
0x20  --> SPACE

Okay, so how much whitespace can you find if you go up to 0x10FFFF (the last Unicode “character” code point)?

0x09  --> CHARACTER TABULATION
0x0A  --> LINE FEED (LF)
0x0B  --> LINE TABULATION
0x0C  --> FORM FEED (FF)
0x0D  --> CARRIAGE RETURN (CR)
0x20  --> SPACE
0x85  --> NEXT LINE (NEL)
0xA0  --> NO-BREAK SPACE
0x1680  --> OGHAM SPACE MARK
0x180E  --> MONGOLIAN VOWEL SEPARATOR
0x2000  --> EN QUAD
0x2001  --> EM QUAD
0x2002  --> EN SPACE
0x2003  --> EM SPACE
0x2004  --> THREE-PER-EM SPACE
0x2005  --> FOUR-PER-EM SPACE
0x2006  --> SIX-PER-EM SPACE
0x2007  --> FIGURE SPACE
0x2008  --> PUNCTUATION SPACE
0x2009  --> THIN SPACE
0x200A  --> HAIR SPACE
0x2028  --> LINE SEPARATOR
0x2029  --> PARAGRAPH SEPARATOR
0x202F  --> NARROW NO-BREAK SPACE
0x205F  --> MEDIUM MATHEMATICAL SPACE
0x3000  --> IDEOGRAPHIC SPACE

Did you want to potentially match those extra characters? Did you even know they existed? And that’s just using the Unicode property \p{Space}. What happens if you use \s instead? It turns out that you get almost the same thing, with a few differences: \s still doesn’t match the vertical tab, and in the table below it also skips NEXT LINE (0x85) and NO-BREAK SPACE (0xA0).

Perl 5.10 added the \h and \v character classes for horizontal and vertical whitespace. How do those hold up? Make a table of all the sorts of whitespace and how they match:

use 5.010;

use charnames qw(:full);

print <<"LEGEND";
s   matches \\s, matches Perl whitespace
h   matches \\h, horizontal whitespace
v   matches \\v, vertical whitespace
p   matches [[:space:]], POSIX whitespace
all characters match Unicode whitespace, \\p{Space}

LEGEND

printf qq(%s %s %s %s  %-7s --> %s\n),
	qw( s h v p  Ordinal  Name );
print '-' x 50, "\n";

foreach my $ord ( 0 .. 0x10ffff ) {
	next unless chr($ord) =~ /\p{Space}/;
	my( $s, $h, $v, $posix ) = 
		map { chr($ord) =~ m/$_/ ? 'x' : ' ' } 
			( qr/\s/, qr/\h/, qr/\v/, qr/[[:space:]]/ );  
	printf qq(%s %s %s %s  0x%04X  --> %s\n),
		$s, $h, $v, $posix,
		$ord, charnames::viacode($ord);
	}

The output shows that there are several different definitions of whitespace:

s   matches \s, matches Perl whitespace
h   matches \h, horizontal whitespace
v   matches \v, vertical whitespace
p   matches [[:space:]], POSIX whitespace
all characters match Unicode whitespace, \p{Space}

s h v p  Ordinal --> Name
--------------------------------------------------
x x   x  0x0009  --> CHARACTER TABULATION
x   x x  0x000A  --> LINE FEED (LF)
    x x  0x000B  --> LINE TABULATION
x   x x  0x000C  --> FORM FEED (FF)
x   x x  0x000D  --> CARRIAGE RETURN (CR)
x x   x  0x0020  --> SPACE
    x    0x0085  --> NEXT LINE (NEL)
  x      0x00A0  --> NO-BREAK SPACE
x x   x  0x1680  --> OGHAM SPACE MARK
x x   x  0x180E  --> MONGOLIAN VOWEL SEPARATOR
x x   x  0x2000  --> EN QUAD
x x   x  0x2001  --> EM QUAD
x x   x  0x2002  --> EN SPACE
x x   x  0x2003  --> EM SPACE
x x   x  0x2004  --> THREE-PER-EM SPACE
x x   x  0x2005  --> FOUR-PER-EM SPACE
x x   x  0x2006  --> SIX-PER-EM SPACE
x x   x  0x2007  --> FIGURE SPACE
x x   x  0x2008  --> PUNCTUATION SPACE
x x   x  0x2009  --> THIN SPACE
x x   x  0x200A  --> HAIR SPACE
x   x x  0x2028  --> LINE SEPARATOR
x   x x  0x2029  --> PARAGRAPH SEPARATOR
x x   x  0x202F  --> NARROW NO-BREAK SPACE
x x   x  0x205F  --> MEDIUM MATHEMATICAL SPACE
x x   x  0x3000  --> IDEOGRAPHIC SPACE
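
If you care about the direction of the whitespace rather than Perl’s idea of \s, you can ask \h and \v directly. This is only a sketch: space_kind is a made-up helper name, not something from Perl or this Item, and the sample characters are picked from the table above:

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper: classify a single character as horizontal
# whitespace, vertical whitespace, or neither, using \h and \v.
sub space_kind {
	my( $char ) = @_;
	return 'horizontal' if $char =~ /\A\h\z/;
	return 'vertical'   if $char =~ /\A\v\z/;
	return 'neither';
	}

# A few code points from the table, plus one non-whitespace character.
foreach my $ord ( 0x09, 0x0B, 0xA0, 0x2028, ord 'a' ) {
	printf qq(0x%04X  --> %s\n), $ord, space_kind( chr $ord );
	}
```

The vertical tab and the NO-BREAK SPACE, which \s skips in the table, both get an answer here.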

Digits

It’s not just whitespace either. What about the digits? Most people expect \d to match only the characters in the set (0, 1, 2, 3, 4, 5, 6, 7, 8, 9). Try it:

use charnames qw(:full);

binmode STDOUT, ':utf8';

foreach ( 0 .. 0x10FFFF ) {
	next unless chr =~ /\d/;
	printf qq(0x%04X  %s  --> %s\n), $_, chr, charnames::viacode($_);
	}

You get hundreds of lines of output:

0x0030  0  --> DIGIT ZERO
0x0031  1  --> DIGIT ONE
0x0032  2  --> DIGIT TWO
0x0033  3  --> DIGIT THREE
0x0034  4  --> DIGIT FOUR
0x0035  5  --> DIGIT FIVE
0x0036  6  --> DIGIT SIX
0x0037  7  --> DIGIT SEVEN
0x0038  8  --> DIGIT EIGHT
0x0039  9  --> DIGIT NINE
0x0660  ٠  --> ARABIC-INDIC DIGIT ZERO
0x0661  ١  --> ARABIC-INDIC DIGIT ONE
0x0662  ٢  --> ARABIC-INDIC DIGIT TWO
0x0663  ٣  --> ARABIC-INDIC DIGIT THREE
...

If you only wanted the Arabic numerals (which aren’t the ones with “ARABIC” in their name), you can’t rely on \d.
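
If you really do mean only 0 through 9, you can say so. Here’s a rough sketch comparing plain \d, an explicit [0-9] class, and \d with the /a modifier. The /a modifier needs Perl 5.14 or later, so that part won’t compile on the perls used above, and the helper names are made up for this example:

```perl
use 5.014;   # the /a modifier needs Perl 5.14 or later
use warnings;

my $ascii_three  = "3";
my $arabic_three = "\x{0663}";   # ARABIC-INDIC DIGIT THREE

# Hypothetical helpers, one per way of asking "is this a digit?"
sub is_unicode_digit { $_[0] =~ /\A\d\z/     ? 1 : 0 }   # any Unicode digit
sub is_class_digit   { $_[0] =~ /\A[0-9]\z/  ? 1 : 0 }   # explicit ASCII class
sub is_ascii_digit   { $_[0] =~ /\A\d\z/a    ? 1 : 0 }   # \d restricted by /a

foreach my $pair (
	[ 'DIGIT THREE',              $ascii_three  ],
	[ 'ARABIC-INDIC DIGIT THREE', $arabic_three ],
	) {
	my( $name, $char ) = @$pair;
	printf qq(%-25s \\d: %d  [0-9]: %d  \\d with /a: %d\n),
		$name,
		is_unicode_digit($char),
		is_class_digit($char),
		is_ascii_digit($char);
	}
```

Only the plain \d accepts the Arabic-Indic digit; the other two stick to 0 through 9.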

Word characters

From the early days of Perl, you’ve been told that \w is the set of characters that you can legally use to name Perl variables (that is, “identifier characters”). Before Perl’s Unicode awareness, that was the rather limited set of [A-Za-z0-9_]. The perlre documentation describes it as “alphanumerics plus underscore”, but it doesn’t define the set of alphanumerics. In Item 72. Use Unicode in your source code, you saw how to use many of the Unicode characters as variable names:

use utf8;   # the source itself contains UTF-8 characters
my $π = 3.14159265;

Those are legal identifier characters, so \w is going to match them. If you were expecting something else, you might be in for a surprise.
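
You can check that for yourself. In this sketch the two helper names are invented for the example, and the Greek letter is written as an escape so the file doesn’t need the utf8 pragma:

```perl
use strict;
use warnings;

my $pi = "\x{03C0}";   # GREEK SMALL LETTER PI

# Hypothetical helpers: Perl's \w versus the old ASCII-only set.
sub is_word_char       { $_[0] =~ /\A\w\z/            ? 1 : 0 }
sub is_ascii_word_char { $_[0] =~ /\A[A-Za-z0-9_]\z/  ? 1 : 0 }

foreach my $char ( 'a', '_', $pi ) {
	printf qq(U+%04X  \\w: %d  [A-Za-z0-9_]: %d\n),
		ord $char,
		is_word_char($char),
		is_ascii_word_char($char);
	}
```

The pi matches \w but not the explicit ASCII class, which is exactly the surprise waiting for you if you expected the old set.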

A possible fix

Besides making more specific character classes without using the character class shortcuts, how can you avoid this? The problem is all the Unicode nonsense and that Perl is handling the strings as Unicode strings. Another way to say that is that Perl uses character semantics normally. If Perl treated your string as octets instead, you’d be back to ASCII semantics for \d, \w, and \s. The bytes pragma is lexical, so it only affects the code in its enclosing block:

use charnames qw(:full);

foreach ( 0 .. 0xFF ) {
	use bytes;   # byte semantics, but only inside this block
	next unless chr =~ /\w/;
	printf qq(0x%04X  %s  --> %s\n), $_, chr, charnames::viacode($_);
	}

If you’re playing with binary data, tell Perl that you’re playing with binary data.

3 thoughts on “Know your character classes under different semantics”

  1. About word characters:

    $ perl -Mutf8 -E'use constant π => atan2( 0,-1 );print π,"\n";'
    Wide character in print at -e line 1.
    π
    $ perl -Mutf8 -E'sub π (){ atan2( 0,-1 )};print π,"\n";'
    3.14159265358979
    $

    I wonder why.

  2. I ran into this problem a couple of weeks ago, but I forget what I was doing. Where I thought I had a bareword filehandle, it assumed the bareword as a string of some sort. I don’t know why that happens.
