Fold cases properly

You might think that you know how to compare strings regardless of case, and you’re probably wrong. After you read this Item, you’ll be able to do it correctly and without doing any more work than you were doing before. Perl handles all the details for you.

If you grew up in the ASCII world, case insensitivity is a difference of literally one bit, so changing case is setting or unsetting a bit in the octet that represents that character.

If you’ve read the Perl FAQ, you may have seen this quip:

“Perl” is the name of the language. Only the “P” is capitalized. The name of the interpreter (the program which runs the Perl script) is “perl” with a lowercase “p”.

When Larry Wall was asked what the difference between “Perl” and “perl”, he said “One bit”. It’s literally a difference of flipping one bit in the ASCII representation. That’s as complicated as ASCII case folding gets.

The capital letter P has the ordinal value 0b1010000. The small letter p, which shows up later in the ASCII sequence, has the ordinal value 0b1110000. This makes it extremely easy to write routines to change between upper and lower cases:

use v5.10;

say "  U L";
say "-----";

foreach my $char ( qw(p P a b c A B C) ) {
	my $lower = chr( ord($char) | 0b0100000 );
	my $upper = chr( ord($char) & 0b1011111 );
	
	say "$char $upper $lower";
	}

The output shows what you’d expect for the upper and lower cases:

  U L
-----
p P p
P P p
a A a
b B b
c C c
A A a
B B b
C C c

Since bit flipping is easy to do, it’s very easy for even primitive computers to quickly change case (assuming that you’re not so primitive as to not have two cases). But, this only works if you restrict the output to the ASCII letters. If you want to handle non-letters, you have to do a bit more work to ensure that you don’t shift them into other characters:

use v5.10;

say "  U L";
say "-----";

foreach my $char ( qw(p P a b c A B C # !) ) {
	my $upper = uppercase( $char );
	my $lower = lowercase( $char );

	say "$char $upper $lower";
	}
 
 sub lowercase {
 	my $_ = shift;
  	my $ord = ord();

 	return $_ unless $ord >= 0x41 and $ord <= 0x5A;
	return chr( $ord ^ 0b100000 );
	}

 sub uppercase {
 	my $_ = shift;	
 	my $ord = ord();
 	
 	return $_ unless $ord >= 0x61 and $ord <= 0x7A;
	return chr( $ord ^ 0b100000 );
	}

Now the non-letters stay the same character:

  U L
-----
p P p
P P p
a A a
b B b
c C c
A A a
B B b
C C c
# # #
! ! !

This almost works for Latin-* encodings too. When you move out of the ASCII sequence into Unicode, you don't have this luxury, and it's not merely a representational issue.

If you were infected with ASCII early, you've grown up thinking that you can go back and forth between upper and lower cases and always get the same result. Outside of ASCII, that's not necessarily true. Consider the word "Reichwaldstraße", a common street name in Germany. The "straße" has the special character ß (U+00DF LATIN SMALL LETTER SHARP S). which is a ligature of a long s, the fancy ſ (U+017F LATIN SMALL LETTER LONG S) that you may have seen in historical documents, and the familiar short s. Put them together, ſs, and move them close enough and you can see how you would end up with ß once you connect the hanging portion of the long s with the top of the short s. The UCS has an uppercase version (U+1E9E LATIN CAPITAL LETTER SHARP S), although no one uses it aside from saying that no one uses it. U+1E9E lowercases to U+00DF, but U+00DF has no single character uppercase version; it's the two characters SS. The lowercase of SS, however, is ss:

use utf8;

my $string = "Reichwaldstraße";

my $upper = uc( $string );
my $lower = lc( $upper  );

print <<"HERE";
Started with: $string
Upper:        $upper
Lower:        $lower
HERE

The output shows that you don't get back to the original:

Started with: Reichwaldstraße
Upper:        REICHWALDSTRASSE
Lower:        reichwaldstrasse

There's another s that causes problems: the Greek sigma, which comes in two lowercase forms. One appears in the middle of words and the other appears at the end, as in Ὀδυσσεύς (Odysseus), where σ and ς represent the same thing, just in different forms mandated by their position:

use utf8;

my $char = "Ὀδυσσεύς";

my $upper = uc( $char );
my $lower = lc( $upper );

print <<"HERE";
Started with: $char
Upper:        $upper
Lower:        $lower
HERE

Again, the lowercase version at the end is different than what you started with:

Started with: Ὀδυσσεύς
Upper:        ὈΔΥΣΣΕΎΣ
Lower:        ὀδυσσεύσ

This means that you can't merely use lc to normalize text for case insensitive comparison. These won't compare correctly:

lc( "Reichwaldstraße" ) eq lc( "REICHWALDSTRASSE" );  # Nope!
lc( 'Ὀδυσσεύς' ) eq lc( 'ὈΔΥΣΣΕΎΣ' );                         # Nope!

You might object that these are different strings and that they shouldn't be the same, but where did these strings start? Perhaps that REICHWALDSTRASSE was not originally all uppercase, but changed by some stupid filters between you and the original information (and with a name like mine, I know about stupid casing filters). That's part of the ASCII infection.

So, lc is the wrong way. Sadly, we do this incorrectly in Learning Perl, when we show this subroutine we want to sort:

sub case_insensitive { "\L$a" cmp "\L$b" }

The Unicode specification solves this with its case folding rules. In short, it folds characters with different case forms into a common form. There's not a rule for this; they do it by exhaustion, specifying the common form for each fold. The common form is defined in the Unicode Character Database, which the Perl developers have digested into the files you find in the unicore/ directory in your Perl library. Here's a few lines from unicore/CaseFolding.txt:

0050; C; 0070; # LATIN CAPITAL LETTER P
0051; C; 0071; # LATIN CAPITAL LETTER Q
0052; C; 0072; # LATIN CAPITAL LETTER R
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
03A3; C; 03C3; # GREEK CAPITAL LETTER SIGMA
03C2; C; 03C3; # GREEK SMALL LETTER FINAL SIGMA
FB00; F; 0066 0066; # LATIN SMALL LIGATURE FF
FB01; F; 0066 0069; # LATIN SMALL LIGATURE FI
FB02; F; 0066 006C; # LATIN SMALL LIGATURE FL
FB03; F; 0066 0066 0069; # LATIN SMALL LIGATURE FFI
FB04; F; 0066 0066 006C; # LATIN SMALL LIGATURE FFL

The first column is the code number of the original character, the second is the type of folding (explained in the data file and coming up later), and the third column are the code numbers that form the common, folded ("equivalent") version. Essentially, it's a big hash. Notice that some of the folded versions are multiple characters. You're not going to get that with bit fiddling.

Case folding takes the character in the first column and turns them into the characters in the third column, then takes the result and does it again until there are no more folds possible. It keeps doing that until there is nothing to replace. Characters that don't have an entry in this file fold into themselves. You case fold to compare strings, not to normalize strings for storage or other uses. Case folding makes case insensitive comparisons very fast, but it also loses information that you can't recover. You can read the exact rules in Section 5.18, "Case mapping", of the Unicode Standard.

To see how that works, try that with Reichwaldstraße and Ὀδυσσεύς. All characters except two stay the same, and two use the mapping from unicore/CaseFolding.txt:

  • Reichwaldstraße → reichwaldstrasse
  • REICHWALDSTRASSE → reichwaldstrasse
  • Ὀδυσσεύς → ὀδυσσεύσ
  • ὈΔΥΣΣΕΎΣ → ὀδυσσεύσ

To implement these operations, Perl v5.16 adds the fc built-in function. Instead of lc, use that:

use v5.16;
fc( "Reichwaldstraße" ) eq fc( "REICHWALDSTRASSE" );  # Yep!
fc( 'ὈΔΥΣΣΕΎΣ' ) eq fc( 'Ὀδυσσεύς' );                         # Yep!

If you don't have v5.16, you can use the fc front the Unicode::CaseFold module on CPAN.

If you wanted to do this inside a double-quoted string, you can use the \F case shift operator (but be aware of the things we noted in Understand the order of operations in double quoted contexts). Our Learning Perl example could change to:

sub case_insensitive { "\F$a" cmp "\F$b" }

More complicated folds

Looking back at the extract of unicore/CaseFolding.txt, you might remember that I skipped over the second column, the mapping status. Those letters stand for different folding rules:

  • C: common case folding
  • F: full case folding (strings may grow in length)
  • S: simple case folding (map to single characters)
  • T: special case for uppercase I and dotted uppercase I

The "T" status stands in for folds that the general rules can't handle, mostly some characters from Turkish and similar languages.

So far, Perl's fc only handles the "F" status for full case folding. It doesn't handle the special folding you'll find in unicore/SpecialCasing.txt that has the oddball situations, such as multiple source characters folding onto other multiple characters. If you want to handle those, you're on your own, although the Unicode::Casing module on CPAN might help.

Many of the folding rules depend on the source language, so you'll probably want to pay special attention if you are using that language or completely ignore them if you are not.

Besides that, the Universal Character Set gives people much more of a chance to mess up. Suppose that you want to write "β-carotene", that thing you get from carrots. That first character is β (U+03B2 K SMALL LETTER BETA). Some people might think it looks like ß (U+00DF LATIN SMALL LETTER SHARP S), and that's good enough for them. No amount of case folding is going to let you know that someone used an incorrect character. But, this is also one of the benefits of Unicode: characters know what they are.

Another correct way

There's another correct way to check strings regardless of case. You can use the /i flag on the match operator. The Unicode-aware Perl regex engine handles the rest:

use utf8;
use v5.18;

use Set::CrossProduct;

my $string = "Reichwaldstraße";

my $upper = uc( $string );
my $lower = lc( $upper  );

my $sets = Set::CrossProduct->new(
	[
	[ $string, $upper, $lower ],
	[ $string, $upper, $lower ],
	]
	);

foreach my $tuple ( $sets->combinations ) {
	my( $l, $r ) = @$tuple;
	next if $l eq $r;

	say "lc($r) eq lc($l)  ? ", lc($r) eq lc($l) ? "matched" : "failed";
	say "fc($r) eq fc($l)  ? ", fc($r) eq fc($l) ? "matched" : "failed";
	say "$r =~ m/$l/i      ? ", $l =~ m/$r/i ? "matched" : "failed";
	
	say;
	}

In the output, you can see that lc sometimes fails, but that the fc and m//i always works:

lc(REICHWALDSTRASSE) eq lc(Reichwaldstraße)  ? failed
fc(REICHWALDSTRASSE) eq fc(Reichwaldstraße)  ? matched
REICHWALDSTRASSE =~ m/Reichwaldstraße/i      ? matched

lc(reichwaldstrasse) eq lc(Reichwaldstraße)  ? failed
fc(reichwaldstrasse) eq fc(Reichwaldstraße)  ? matched
reichwaldstrasse =~ m/Reichwaldstraße/i      ? matched

lc(Reichwaldstraße) eq lc(REICHWALDSTRASSE)  ? failed
fc(Reichwaldstraße) eq fc(REICHWALDSTRASSE)  ? matched
Reichwaldstraße =~ m/REICHWALDSTRASSE/i      ? matched

lc(reichwaldstrasse) eq lc(REICHWALDSTRASSE)  ? matched
fc(reichwaldstrasse) eq fc(REICHWALDSTRASSE)  ? matched
reichwaldstrasse =~ m/REICHWALDSTRASSE/i      ? matched

lc(Reichwaldstraße) eq lc(reichwaldstrasse)  ? failed
fc(Reichwaldstraße) eq fc(reichwaldstrasse)  ? matched
Reichwaldstraße =~ m/reichwaldstrasse/i      ? matched

lc(REICHWALDSTRASSE) eq lc(reichwaldstrasse)  ? matched
fc(REICHWALDSTRASSE) eq fc(reichwaldstrasse)  ? matched
REICHWALDSTRASSE =~ m/reichwaldstrasse/i      ? matched

The match operator isn't useful for sort though, since you can only tell if the strings are the same.

Things to remember

  • Case-folding is more complicated than merely lowercasing.
  • The fc does proper case folding according to the Unicode standard.
  • The \F case fold operator does full case folding in double-quoted contexts.

4 thoughts on “Fold cases properly”

  1. Hebrew and Arabic are peculiar too:

    These two also have different forms for the “same” letter, position dependent. “Same” in the sense that all forms are called the same, pronounced the same, etc.

    Hebrew has five that are different when at the end of the word. Even that has exceptions, e.g., when spelling foreign words that were adopted into the language.

    Arabic is a lot more complex. Most letters have four forms: as isolated letters, at the beginning of the word, in the middle of it and at the end of it. And there is more! There are different forms to a given letter depending on adjacent characters, different when appearing to the left or to the right.

    And these two languages have a rather large set of diacritics, and a multiple number of these might be attached to a single letter.

    All that makes free text searching a rather complex endeavor…

    Would you consider all these as “case”?

  2. No, fc does not do full and proper foldcasing, it only does full case-folding.

    perldoc -f fc

    > Casefolding is the process of mapping strings to a form where case differences are erased; comparing two strings in their casefolded form is effectively a way of asking if two strings are equal, regardless of case.

    This describes the concept of the operator foldcase, which needs to add a normalization step to the case-folding step.

    case-folding expands certain characters to longer strings, NFD normalization even more, NFC normalization will contract them at last.

    Without normalization you will not be able to compare multi-byte strings properly.
    perl has this bug since 5.16. every other language has this bug also, but doesn’t brag about proper full foldcase’ing in its docs as perl does.

    https://www.w3.org/International/wiki/Case_folding describes the comparison problem properly.

    For performance cperl and safeclib (a C11 libc) implements fc with NFD. (Same as Apple in its HPFS btw).
    Note that not even libc implements a proper wcsfc() or wcsnorm(). You cannot compare multi-byte strings in POSIX, only with gnulib or ICU, but they are massively over-architectured, unusable for e.g. coreutils. E.g. grep would really like to find strings. Only cperl does so properly.
    safeclib is the first library to implement proper foldcasing in C. FreeBSD will take it from there.

Comments are closed.