Know your sort orders

Once you leave the world of ASCII, things such as string comparisons and sorting get much tougher. In Effective Perl Programming, we devoted a short chapter to Unicode, but there’s a lot more that we could have covered. We mostly ignored the modern idea of locales and Unicode, but those have big effects on how Perl compares characters, and thus, how it orders them with sort.

To figure out which order to use for string comparisons, Perl (or anything else, including you) needs a collation sequence, which is an arbitrary ordering of characters. Why are your (English) abc‘s in that order? Because someone decided that was the order. If they had decided differently, you would never have noticed the difference.

In old Perl, ASCII was king. The American Standard Code for Information Exchange defined the order of 128 characters starting with the null byte, \x00 and going up to the delete character, \x7F so everyone (in the United States) would agree on what bits meant what. It was part of the Great Society. That order isn’t the same thing as you might expect from your abc‘s. Between the capital letters and the lowercase ones, you’ll find an island of punctuation.

When you use Perl’s sorting on those characters, you get their ASCII order:

use 5.010;
say join " ", sort qw( B u s t e r );
B e r s t u

With UTF-8, you get the same sort because those characters are in the same order in that collation. UTF-8 included ASCII as a subset for an easy transition between the old and the new.

Notice that in the ASCII and UTF-8 collations, the uppercase letters come before the lowercase letters. Why? Because that’s how they did it. Why do you keep asking these impossible questions? The collation has to choose exactly one sequence, and if it were some other sequence, another group of people would complain.

use 5.010;
say join " ", sort qw( a A C c b B );
A B C a b c

And, here’s the rub. The ASCII or UTF-8 collations may not be the order that you want. That ordering doesn’t often make sense in the real world. There are other ways to sequence letters, such as phone book and dictionary sorts. A language, such as German, may decide to sort the same characters differently from another language, and even within a language, different regions, such as Switzerland and Germany, may choose a different sequence. For the most part, Perl doesn’t care about any of those, at least until you tell it to, and even then it might not given you the right answer.

Locales

The locale tells Perl to ignore its built-in collation in favor of one that you specify either in the environment or a particular setting. You can the POSIX module allows you to change the locale within your program:

use POSIX qw(locale_h);

use utf8;
binmode STDOUT, ":utf8";

$default = 'en_US.ASCII';

my @letters = qw( A B C Z a b c z ø ö ä å );

my @locales = qw( # list yours with locale -a
en_US.ISO8859-1
en_US.UTF-8
fr_CA.ISO8859-1
fr_CA.UTF-8
el_GR.ISO8859-7
fr_CH.ISO8859-1
de_CH.ISO8859-1
de_DE.ISO8859-1
sv_SE.ISO8859-1
pt_PT.ISO8859-1
da_DK
da_DK.ISO8859-1
da_DK.ISO8859-15
da_DK.UTF-8
tr_TR.ISO8859-9
);

foreach my $locale ( @locales ) {
	use locale;
	setlocale(LC_COLLATE, $locale);
	printf "%-16s: %s\n", $locale, join $", sort @letters;
	}

The output shows that even in the same encoding (such as ISO8859-1), the collation sequences changes for some languages, such as in da_DK.ISO8859-1 and da_DK.ISO8859-15. Notice how the ö and ø move around, for instance. Those characters come in different orders merely because those locales define them in different orders:

en_US.ISO8859-1 : A ä å ø ö B C Z a b c z
en_US.UTF-8     : A B C Z a b c z ä å ö ø
fr_CA.ISO8859-1 : A ä å ø ö B C Z a b c z
fr_CA.UTF-8     : A B C Z a b c z ä å ö ø
el_GR.ISO8859-7 : A B C Z ö ø ä å a b c z
fr_CH.ISO8859-1 : A ä å ø ö B C Z a b c z
de_CH.ISO8859-1 : A ä å ø ö B C Z a b c z
de_DE.ISO8859-1 : A ä å ø ö B C Z a b c z
sv_SE.ISO8859-1 : A ä å ø ö B C Z a b c z
pt_PT.ISO8859-1 : A ä å ø ö B C Z a b c z
da_DK           : A B C Z a b c z ä å ö ø
da_DK.ISO8859-1 : A ä å ø ö B C Z a b c z
da_DK.ISO8859-15: A ä å ö ø B C Z a b c z
da_DK.UTF-8     : A B C Z a b c z ä å ö ø
tr_TR.ISO8859-9 : A B C Z a b c z ä å ö ø

Fortunately, you don’t get any of these problems unless you specifically use the locale pragma.

Unicode

Even if you get the right collation, how do you sort Ä and Ä? Those are the same you say? Well, they certainly look the same, but one of those is a single character and the other is two characters. Look at the HTML source if you don’t believe me. Okay, I’ll just show you: one is Ä and the other is Ä. Those should look the same in a web browser that not only supports Unicode, but uses a font that knows what to do with a combining character (one of the problems of Unicode support is that not only does your program need to support it, but so does everything else). In Item 77. Work with graphemes instead of characters we told you all about combining characters and that’s what you see here when you try to sort them:

binmode STDOUT, ':utf8';

my @unsorted = (
	qw( A B C Z ),
	chr( 196 ),
	"fi",
	"\x{FB01}",
	"A\x{0308}",
	qw( a b c z ),
	);

# print the characters along with their ordinal value
my @strings =
	map { my $s = $_; $s .= " => " . s/(.)/sprintf "%04x", ord($1)/reg; $s }
	@unsorted;

print join "\n", @strings, '';

The output shows that the decomposed version of Ä sorts right after the unadorned A because the first character in each is the same. The composed version shows up much latter, after the lowercase letters, because it has a higher ordinal value:

A => 0041
Ä => 00410308
B => 0042
C => 0043
Z => 005a
a => 0061
b => 0062
c => 0063
z => 007a
Ä => 00c4

Unicode defines various ways, called normalization forms, to straighten this out. You can read about the details in the Unicode Standard Annex #15: Unicode Normalization Forms if you like, but you can also just use the Unicode::Normalize, which has subroutines to handle it all for you.

Normalization Form D (“D” for canonical decomposition) takes the composed characters and turns them into the equivalent decomposed characters. You can use a Schwartzian Transform (Item 22. Learn the myriad ways of sorting) to get the NFD form and sort on it while still keeping the original string:

binmode STDOUT, ':utf8';

my @unsorted = (
	qw( A B C Z ),
	chr( 196 ),
	"fi",
	"\x{FB01}",
	"A\x{0308}",
	qw( a b c z ),
	);

use Unicode::Normalize;

my @strings =
	map  { $_->[0] }
	sort { $a->[1] cmp $b->[1] }
	map  { [ $_, NFD( $_ ) ] }
	@unsorted;

print join "\n", @strings, '';

The output shows that the two versions of Ä now sort next to each other (but which one is which?):

A
Ä
Ä
B
C
Z
a
b
c
fi
z
fi

Notice, though, that the fi and ? don’t sort next to each other. That second one is a ligature between the f and the i. NFD doesn’t break those apart because its canonical equivalence deals with combinations of characters that form the same abstract character (the ones that look the same when you print them). There’s another form of equivalence, compatibility equivalence, where the same sequence of characters might have a different appearance when combined, as ligatures do. The normalization form, NFKD (“K” for kompatibility decomposition) does what you need in this case:

binmode STDOUT, ':utf8';

my @unsorted = (
	qw( A B C Z ),
	chr( 196 ),
	"fi",
	"\x{FB01}",
	"A\x{0308}",
	qw( a b c z ),
	);

use Unicode::Normalize;

my @strings =
	map  { $_->[0] }
	sort { $a->[1] cmp $b->[1] }
	map  { [ $_, NFKD( $_ ) ] }
	@unsorted;

print join "\n", @strings, '';

Now the fi and ? sort next to each other because the

A
Ä
Ä
B
C
Z
a
b
c
fi
fi
z

The NFKD sorting is probably want you want most of the time if you’re making lists for other people to read, especially if you have to deal with fancy text that a typographer wants to force on you.

However, if you need to a different order, you can use the Unicode::Collate module. You have to know quite a bit about DUCET, the Default Unicode Collation Element Table, which defines the, well, what it says in the name, but with that knowledge, you can redefine the world.

Things to remember

  • Perl sorts according to a collation sequence, whether that’s ASCII, UTF-8, or something locale-specific.
  • The same language may sort differently based on regional differences, or the defintion of its locale.
  • To sort Unicode strings into something a human might expect, you need to decompose them before you compare them.

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit

Leave a comment

0 Comments.

Leave a Reply

You must be logged in to post a comment.