Be careful with Unicode character ranges

Unicode character ranges have the same gotchas as the ASCII character ranges, although they become more apparent and more important. You’re probably used to creating a range for all the letters, like the character classes [A-Z] or [a-z], the range 'a' .. 'z', or the range in a transliteration, and not having a problem. If you look at the ASCII sequence, you see that there is an unbroken series of letters in those ranges.

The range operator

The range operator, .., is magical when you give it strings. It does not follow any one rule. That you can get a list of the letters in ASCII is a special case, not a general feature. You can get the lowercase letters or the uppercase letters:

$ perl5.14.1 -E "say 'a' .. 'z'"
abcdefghijklmnopqrstuvwxyz
$ perl5.14.1 -E "say 'A' .. 'Z'"
ABCDEFGHIJKLMNOPQRSTUVWXYZ

If you try to get a different range, you get odd results. For instance, starting with an uppercase A to a lowercase z, or the other way around, you don’t get the upper- and lowercase letters:

$ perl5.14.1 -E "say 'A' .. 'z'"
ABCDEFGHIJKLMNOPQRSTUVWXYZ
$ perl5.14.1 -E "say 'a' .. 'Z'"
abcdefghijklmnopqrstuvwxyz

This is documented in a single sentence in perlop:

If the final value specified is not in the sequence that the magical increment would produce, the sequence goes until the next value would be longer than the final value specified.

You have to look back at the documentation for the increment operators:

The auto-increment operator has a little extra builtin magic to it. If you increment a variable that is numeric, or that has ever been used in a numeric context, you get a normal increment. If, however, the variable has been used in only string contexts since it was set, and has a value that is not the empty string and matches the pattern /^[a-zA-Z]*[0-9]*\z/, the increment is done as a string, preserving each character within its range…

There are a few things that you might not notice there. First, this magic only works on three different sets: the lowercase letters of the English alphabet, the uppercase letters of the English alphabet, and the Arabic numerals. These three sets are special with increments simply because Perl decided that they are special. It’s built into ++:

$ perl5.14.1 -E '$s = q(a); say $s++ for (0 .. 5)'
a
b
c
d
e
f

This magical increment will also carry, but only within the same set:

$ perl5.14.1 -E '$s = q(w); say $s++ for (0 .. 5)'
w
x
y
z
aa
ab
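As a small sketch of that carry behavior, the following wraps the increment in a helper (next_string is a name made up for this illustration) and tries a few strings that mix the sets. The expected values in the comment are the ones perlop gives for the same inputs:

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical helper: return the string that follows $string under
# Perl's magical string increment. Working on a copy leaves the
# caller's value alone.
sub next_string {
    my ($string) = @_;
    $string++;
    return $string;
}

# Each character carries into its own set: letters stay letters,
# digits stay digits, and case is preserved.
say join ' ', map { next_string($_) } qw(Az zz a9 Zz);
# prints: Ba aaa b0 AAa
```

The carry never crosses from one set into another: a z rolls over to a, a Z to A, and a 9 to 0.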

The range operator merely does an increment from the starting point to the ending point, which don’t have to be single characters:

$ perl5.14.1 -E 'say join q( ), q(am) .. q(bc)'
am an ao ap aq ar as at au av aw ax ay az ba bb bc

If you try to mix the sets, you get odd (but documented) results. You can’t get a range that is all the uppercase and lowercase letters. The range operator starts incrementing the uppercase A, progressing through its set. When it gets to the end of the set, the next value (AA) would be longer than the endpoint, so it stops:

$ perl5.14.1 -E 'say q(A) .. q(z)'
ABCDEFGHIJKLMNOPQRSTUVWXYZ
$ perl5.14.1 -E "say 'A' .. ';'"
ABCDEFGHIJKLMNOPQRSTUVWXYZ
$ perl5.14.1 -E "say 'A' .. '0'"
ABCDEFGHIJKLMNOPQRSTUVWXYZ
$ perl5.14.1 -E "say 'a' .. '0'"
abcdefghijklmnopqrstuvwxyz
$ perl5.14.1 -E "say 'a' .. 'A'"
abcdefghijklmnopqrstuvwxyz
$ perl5.14.1 -E "say '5' .. 'A'"
56789
$ perl5.14.1 -E "say '5' .. 'b'"
56789

You get what appears to be a partial answer and you don’t get a warning.
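To make the stopping rule concrete, here is a rough emulation of the string form of the range operator, built only from the two perlop passages quoted earlier. The magic_range name is invented for this sketch, and it is a simplification for illustration, not how perl implements the operator internally:

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical helper: a simplified emulation of the range operator
# for string endpoints.
sub magic_range {
    my ($start, $end) = @_;

    # A start value outside the magical pattern yields only itself.
    return ($start)
        unless length($start) && $start =~ /\A[a-zA-Z]*[0-9]*\z/;

    my @list;
    my $current = $start;
    # Keep going until the next value would be longer than the endpoint.
    while ( length($current) <= length($end) ) {
        push @list, $current;
        last if $current eq $end;   # reached the endpoint exactly
        $current++;                 # magical string increment
    }
    return @list;
}

say join '',  magic_range( 'A',  'z' );
say join ' ', magic_range( 'am', 'bc' );
```

Under these assumptions the first call stops after Z because the next value, AA, is longer than z, and the second call walks from am to bc just as the real operator does.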

If you try the range operator with anything other than what’s in one of those three sets, you only get the first element back. For instance, you can’t get the range of Greek letters:

$ perl5.14.1 -CS -Mcharnames=greek -E 'say "\N{alpha}" .. "\N{omega}"'
α

If you want to get all of the Greek letters with the range operator, you have to start with their ordinal value with ord then convert back to the character with chr:

$ perl5.14.1 -CS -Mcharnames=greek -E 'say map chr, ord "\N{alpha}" .. ord "\N{omega}"'
αβγδεζηθικλμνξοπρςστυφχψω

But, don’t be fooled. That doesn’t really work. Try it with the uppercase Greek letters:

$ perl5.14.1 -CS -Mcharnames=greek -E 'say map chr, ord "\N{Alpha}" .. ord "\N{Omega}"'
ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡ΢ΣΤΥΦΧΨΩ

Can you see the problem? Between Ρ and Σ there’s ΢. What the heck is that? Ρ is GREEK CAPITAL LETTER RHO (U+03A1) and Σ is GREEK CAPITAL LETTER SIGMA (U+03A3). There’s a gap! U+03A2 is a reserved code point. Why is that?

There’s another problem which you might have missed at first. Look at the list of lowercase characters again. What’s odd about it (if you have never experienced Greek, you probably won’t see it)? Look at the sequence ρςστυ. There are two sigma characters there: ς is GREEK SMALL LETTER FINAL SIGMA (U+03C2), and σ is GREEK SMALL LETTER SIGMA (U+03C3). You use the ς form when it appears as the last letter in the word. However, ς doesn’t have its own uppercase form. As such, the Unicode Character Set leaves a gap where that non-existent capital ς would go.
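If you want to inspect a range before you trust it, one option is to ask the charnames module for the name of each code point. This is only a sketch; names_in_range is a made-up helper, and it relies on charnames::viacode returning undef when it knows no name for a code point:

```perl
use strict;
use warnings;
use 5.010;
use charnames ();

# Hypothetical helper: pair every code point in a numeric range with
# its Unicode name, or undef when charnames has no name for it.
sub names_in_range {
    my ($first, $last) = @_;
    return map { [ $_, charnames::viacode($_) ] } $first .. $last;
}

# Look around the capital gap, then around the two small sigmas.
for my $range ( [ 0x03A1, 0x03A3 ], [ 0x03C1, 0x03C3 ] ) {
    for my $pair ( names_in_range(@$range) ) {
        my ($code, $name) = @$pair;
        printf "U+%04X %s\n", $code, $name // '(no name)';
    }
}
```

The entry for U+03A2 comes back without a name, while U+03C2 reports itself as the final sigma.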

That’s going to cause problems in other areas too.

Transliteration

You often find ranges in the transliteration operator:

$string =~ tr/a-z/A-Z/;

In that case, each character on the righthand side replaces every instance of the character in the same position on the lefthand side. That’s what happens with the Greek ranges too, but one character goes missing:

use utf8;
use 5.014;
my $string = 'αβγδεζηθικλμνξοπρςστυφχψω';
$string =~ tr/α-ω/Α-Ω/;
say $string;

The ς disappears, and the reserved code point U+03A2 shows up in its place:

ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡ΢ΣΤΥΦΧΨΩ
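One way to see why the positional mapping goes wrong is to measure the distance from each lowercase code point to whatever uc says its uppercase partner is. The sketch below does that for U+03B1 (alpha) through U+03C9 (omega); uppercase_offsets is a name invented for the illustration:

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical helper: for each code point in a range, compute the
# distance to its uppercase partner as reported by uc.
sub uppercase_offsets {
    my ($first, $last) = @_;
    return map { [ $_, ord( uc chr $_ ) - $_ ] } $first .. $last;
}

# Group the small Greek letters by their offset.
my %by_offset;
for my $pair ( uppercase_offsets( 0x03B1, 0x03C9 ) ) {
    my ($code, $offset) = @$pair;
    push @{ $by_offset{$offset} }, sprintf 'U+%04X', $code;
}

say "$_: @{ $by_offset{$_} }" for sort { $a <=> $b } keys %by_offset;
```

Every letter sits 32 code points above its capital except the final sigma, U+03C2, which is only 31 away because it shares Σ with the ordinary sigma.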

This is the problem inherent with ASCII mindsets in Unicode worlds. You can’t just take ordinal values and assume their case-mapped partners are at a consistent offset. You might think that you can just handle the special case:

use utf8;
use 5.014;
my $string = 'αβγδεζηθικλμνξοπρςστυφχψω';
$string =~ tr/α-ω/Α-ΡΣΣ-Ω/;
say $string;

Certainly, that works, but only for the same reason that tr/a-z/A-Z/ does: you know something about the order of the code points. You might think that you could build the replacement list by letting the uc operator figure out what each character should become:

use utf8;
use 5.014;

my $lc_string = 'αβγδεζηθικλμνξοπρςστυφχψω';
my $uc_string = uc( $lc_string );

my $string = 'Σίσυφος ΣΊΣΥΦΟΣ';  # Sisyphus, Lowercased and UPPERCASED

$string =~ tr/$lc_string/$uc_string/; # WRONG

say $string;

That doesn’t work because the transliteration operator doesn’t interpolate. The examples in perlop tell you to use string eval (but recall the Item “Know the two different forms of eval”):

# from perlop
eval "tr/$oldlist/$newlist/";
die $@ if $@;

eval "tr/$oldlist/$newlist/, 1" or die $@;
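If you would rather not reach for string eval, one alternative sketch is to build a per-character lookup table from uc and apply it with a substitution. The make_upcaser helper and its variable names are invented for this illustration, and the code uses \x{...} escapes so that it does not depend on the encoding of the source file:

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical helper: given a string of characters to translate,
# return a subroutine that replaces each of them with whatever uc
# says its uppercase form is. Other characters pass through untouched.
sub make_upcaser {
    my ($lc_list) = @_;
    my %upper_for = map { ( $_ => uc $_ ) } split //, $lc_list;
    return sub {
        my ($string) = @_;
        $string =~ s/(.)/ $upper_for{$1} \/\/ $1 /gse;
        return $string;
    };
}

# The small Greek letters, U+03B1 (alpha) through U+03C9 (omega).
my $lc_list = join '', map { chr } 0x03B1 .. 0x03C9;
my $upcase  = make_upcaser($lc_list);

binmode STDOUT, ':encoding(UTF-8)';

# sigma, omicron, phi, omicron, final sigma
say $upcase->("\x{3C3}\x{3BF}\x{3C6}\x{3BF}\x{3C2}");
```

Both sigmas end up as Σ because uc knows the case mapping, so there is no special case to remember. When plain case mapping is all you need, calling uc on the whole string is simpler still; the table only earns its keep when you want to limit the translation to a chosen set of characters.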

Regular expressions

You can also use ranges in regular expression character classes. While specifying a range of characters, you want to only specify those that you want to match. As you have seen, though, ranges might include characters that shouldn’t be there.

use utf8;
use 5.014;

if( '΢' =~ m/[Α-Ωα-ω]/ ) { # U+03A2 is reserved
	say 'Matched reserved character ΢!'
	}

This matches, although you probably didn’t want to match the reserved character U+03A2. Those ranges are really ordinal ranges, not the logical ranges that you intend. How much do you want to have your nose in the Unicode Character Set to verify all of your ranges?

The fix depends on what you are doing. If you just want to match any Greek character, you can look for the Greek script property, \p{Greek}, instead. That doesn’t include the reserved character:

use utf8;
use 5.014;

if( '΢' =~ m/\p{Greek}/ ) {
	say 'Matched reserved character ΢!'
	}
else {
	say q(Didn't match a Greek character.)
	}

The output shows that you don’t match the reserved character:

Didn't match a Greek character.
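Properties can also be combined when the script alone is too broad. As a sketch, the following checks for a single uppercase Greek letter by pairing \p{Greek} with the general category \p{Lu} through a lookahead; is_greek_capital is a made-up name, and this is one of several ways to combine the two tests:

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical helper: true when the string is exactly one uppercase
# letter in the Greek script, judged by properties instead of a range.
sub is_greek_capital {
    my ($char) = @_;
    return $char =~ /\A(?=\p{Greek})\p{Lu}\z/ ? 1 : 0;
}

# Rho, the reserved code point, capital sigma, and small sigma.
for my $code ( 0x03A1, 0x03A2, 0x03A3, 0x03C3 ) {
    printf "U+%04X %s\n", $code,
        is_greek_capital( chr $code ) ? 'matches' : 'does not match';
}
```

The reserved U+03A2 fails the test even though it sits between two capitals that pass, and the small sigma fails because it is not uppercase.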

Things to remember

  • The ranges of ASCII letters are special cases
  • Unicode character ranges might have extra characters that you don’t intend to put in your logical set
  • Character ranges are vestiges of ASCII-thought and don’t work well with Unicode
  • Avoid character ranges if you can find another way to do it