You might think that you know how to compare strings regardless of case, and you’re probably wrong. After you read this Item, you’ll be able to do it correctly and without doing any more work than you were doing before. Perl handles all the details for you.
If you grew up in the ASCII world, the difference between upper and lower case is literally one bit, so changing case means setting or clearing a bit in the octet that represents that character.
If you’ve read the Perl FAQ, you may have seen this quip:
“Perl” is the name of the language. Only the “P” is capitalized. The name of the interpreter (the program which runs the Perl script) is “perl” with a lowercase “p”.
When Larry Wall was asked what the difference between “Perl” and “perl” is, he said “One bit”. It’s literally a difference of flipping one bit in the ASCII representation. That’s as complicated as ASCII case folding gets.
The capital letter P has the ordinal value 0b1010000. The small letter p, which shows up later in the ASCII sequence, has the ordinal value 0b1110000. This makes it extremely easy to write routines to change between upper and lower cases:
use v5.10;
say " U L";
say "-----";
foreach my $char ( qw(p P a b c A B C) ) {
    my $lower = chr( ord($char) | 0b0100000 ); # set the case bit
    my $upper = chr( ord($char) & 0b1011111 ); # clear the case bit
    say "$char $upper $lower";
}
The output shows what you’d expect for the upper and lower cases:
U L
-----
p P p
P P p
a A a
b B b
c C c
A A a
B B b
C C c
Since bit flipping is easy to do, it’s very easy for even primitive computers to quickly change case (assuming that you’re not so primitive as to not have two cases). But this only works if you restrict the input to the ASCII letters. If you want to handle non-letters, you have to do a bit more work to ensure that you don’t shift them into other characters:
use v5.10;
say " U L";
say "-----";
foreach my $char ( qw(p P a b c A B C # !) ) {
    my $upper = uppercase( $char );
    my $lower = lowercase( $char );
    say "$char $upper $lower";
}
# A lexical $_ (my $_) is no longer valid in current perls, so use a named variable
sub lowercase {
    my $char = shift;
    my $ord = ord( $char );
    return $char unless $ord >= 0x41 and $ord <= 0x5A; # only A to Z change
    return chr( $ord ^ 0b100000 );
}
sub uppercase {
    my $char = shift;
    my $ord = ord( $char );
    return $char unless $ord >= 0x61 and $ord <= 0x7A; # only a to z change
    return chr( $ord ^ 0b100000 );
}
Now the non-letters stay the same character:
U L
-----
p P p
P P p
a A a
b B b
c C c
A A a
B B b
C C c
# # #
! ! !
This almost works for Latin-* encodings too. When you move out of the ASCII sequence into Unicode, you don't have this luxury, and it's not merely a representational issue.
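To see where "almost" breaks down, here is a small sketch that applies the same bit trick to three Latin-1 code numbers and compares the result with what Perl's uc returns. The helper name ascii_style_upper is made up for this illustration, and the three code numbers are chosen by hand.

```perl
use v5.16;          # also turns on Unicode rules for uc on 8-bit strings
use charnames ();   # for charnames::viacode, to print character names

# Hypothetical helper: the ASCII trick, clearing bit 0x20 of a code number.
sub ascii_style_upper {
    my( $ord ) = @_;
    return $ord & 0b1101_1111;
}

foreach my $ord ( 0xE9, 0xF7, 0xFF ) {
    my $tricked = ascii_style_upper( $ord );
    my $real    = ord( uc( chr( $ord ) ) );
    printf "U+%04X %s\n  bit trick: U+%04X %s\n  uc:        U+%04X %s\n",
        $ord,     charnames::viacode( $ord ),
        $tricked, charnames::viacode( $tricked ),
        $real,    charnames::viacode( $real );
}
```

The trick gets é (U+00E9) right, but it turns the division sign (U+00F7) into the multiplication sign (U+00D7) and ÿ (U+00FF) into ß (U+00DF), while uc leaves the division sign alone and maps ÿ to U+0178, which is outside Latin-1 altogether.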
If you were infected with ASCII early, you've grown up thinking that you can go back and forth between upper and lower cases and always get the same result. Outside of ASCII, that's not necessarily true. Consider the word "Reichwaldstraße", a common street name in Germany. The "straße" has the special character ß (U+00DF ʟᴀᴛɪɴ ꜱᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ꜱʜᴀʀᴘ ꜱ), which is a ligature of a long s, the fancy ſ (U+017F ʟᴀᴛɪɴ ꜱᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ʟᴏɴɢ ꜱ) that you may have seen in historical documents, and the familiar short s. Put them together, ſs, and move them close enough and you can see how you would end up with ß once you connect the hanging portion of the long s with the top of the short s.
The UCS has an uppercase version (U+1E9E ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ꜱʜᴀʀᴘ ꜱ), although no one uses it aside from saying that no one uses it. U+1E9E lowercases to U+00DF, but U+00DF has no single-character uppercase version; it uppercases to the two characters SS. The lowercase of SS, however, is ss:
use utf8;
binmode STDOUT, ':encoding(UTF-8)'; # emit UTF-8 so ß prints correctly
my $string = "Reichwaldstraße";
my $upper = uc( $string );
my $lower = lc( $upper );
print <<"HERE";
Started with: $string
Upper: $upper
Lower: $lower
HERE
The output shows that you don't get back to the original:
Started with: Reichwaldstraße
Upper: REICHWALDSTRASSE
Lower: reichwaldstrasse
There's another s that causes problems: the Greek sigma, which comes in two lowercase forms. One appears in the middle of words and the other appears at the end, as in όσος, where σ and ς represent the same thing, just in different forms mandated by their position:
use utf8;
binmode STDOUT, ':encoding(UTF-8)'; # avoid "Wide character" warnings for the Greek text
my $char = "όσος";
my $upper = uc( $char );
my $lower = lc( $upper );
print <<"HERE";
Started with: $char
Upper: $upper
Lower: $lower
HERE
Again, the lowercase version at the end is different than what you started with:
Started with: όσος
Upper: ΌΣΟΣ
Lower: όσοσ
This means that you can't merely use lc to normalize text for case insensitive comparison. These won't compare correctly:
lc( "Reichwaldstraße" ) eq lc( "REICHWALDSTRASSE" ); # Nope!
lc( 'όσος' ) eq lc( 'ΌΣΟΣ' ); # Nope!
You might object that these are different strings and that they shouldn't be the same, but where did these strings start? Perhaps that REICHWALDSTRASSE was not originally all uppercase, but changed by some stupid filters between you and the original information (and with a name like mine, I know about stupid casing filters). That's part of the ASCII infection.
So, lc is the wrong way. Sadly, we do this incorrectly in Learning Perl, when we show this sort subroutine:
sub case_insensitive { "\L$a" cmp "\L$b" }
The Unicode specification solves this with its case folding rules. In short, it folds characters with different case forms into a common form. There's not a rule for this; they do it by exhaustion, specifying the common form for each fold. The common form is defined in the Unicode Character Database, which the Perl developers have digested into the files you find in the unicore/ directory in your Perl library. Here are a few lines from unicore/CaseFolding.txt:
0050; C; 0070; # LATIN CAPITAL LETTER P
0051; C; 0071; # LATIN CAPITAL LETTER Q
0052; C; 0072; # LATIN CAPITAL LETTER R
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
03A3; C; 03C3; # GREEK CAPITAL LETTER SIGMA
03C2; C; 03C3; # GREEK SMALL LETTER FINAL SIGMA
FB00; F; 0066 0066; # LATIN SMALL LIGATURE FF
FB01; F; 0066 0069; # LATIN SMALL LIGATURE FI
FB02; F; 0066 006C; # LATIN SMALL LIGATURE FL
FB03; F; 0066 0066 0069; # LATIN SMALL LIGATURE FFI
FB04; F; 0066 0066 006C; # LATIN SMALL LIGATURE FFL
The first column is the code number of the original character, the second is the type of folding (explained in the data file and coming up later), and the third column holds the code numbers that form the common, folded ("equivalent") version. Essentially, it's a big hash. Notice that some of the folded versions are multiple characters. You're not going to get that with bit fiddling.
Case folding takes each character in the first column and turns it into the characters in the third column, then does the same to the result until no more folds are possible. Characters that don't have an entry in this file fold into themselves. You case fold to compare strings, not to normalize strings for storage or other uses. Case folding makes case-insensitive comparisons very fast, but it also loses information that you can't recover. You can read the exact rules in Section 5.18, "Case Mappings", of the Unicode Standard.
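The "big hash" idea can be sketched in a few lines. This is only a toy: the %FOLD table is a hand-copied handful of entries from the excerpt above, and toy_fold is a name invented for this illustration. It is not how Perl implements folding internally.

```perl
use v5.16;
use utf8;

# Hypothetical, hand-copied subset of unicore/CaseFolding.txt.
# Keys are source characters, values are their folded forms.
my %FOLD = (
    "\x{0050}" => "\x{0070}",            # P  folds to p
    "\x{00DF}" => "\x{0073}\x{0073}",    # ß  folds to ss
    "\x{03A3}" => "\x{03C3}",            # Σ  folds to σ
    "\x{03C2}" => "\x{03C3}",            # ς  folds to σ
    "\x{FB01}" => "\x{0066}\x{0069}",    # ﬁ  folds to fi
);

# Fold one character at a time; characters without an entry
# fold to themselves.
sub toy_fold {
    my( $string ) = @_;
    my $folded = '';
    foreach my $char ( split //, $string ) {
        $folded .= $FOLD{$char} // $char;
    }
    return $folded;
}

my $sharp_s = toy_fold( "straße" );
my $sigmas  = toy_fold( "σς" );
say $sharp_s eq "strasse"        ? "ß folded to ss"        : "unexpected fold";
say $sigmas  eq toy_fold( "ΣΣ" ) ? "both sigmas folded to σ" : "unexpected fold";
```

Because the table only covers five characters, toy_fold leaves most uppercase letters alone. The real data file has an entry for every character that needs one.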
To see how that works, fold Reichwaldstraße and όσος along with their uppercase versions. Characters without an entry in unicore/CaseFolding.txt stay the same, and the rest use the mapping from that file, so each pair ends up in the same folded form:
- Reichwaldstraße → reichwaldstrasse
- REICHWALDSTRASSE → reichwaldstrasse
- όσος → όσοσ
- ΌΣΟΣ → όσοσ
To implement these operations, Perl v5.16 adds the fc built-in function. Instead of lc, use that:
use v5.16; # v5.16 and later enable the fc feature
use utf8; # the literals below contain non-ASCII characters
fc( "Reichwaldstraße" ) eq fc( "REICHWALDSTRASSE" ); # Yep!
fc( 'όσος' ) eq fc( 'ΌΣΟΣ' ); # Yep!
If you don't have v5.16, you can use the fc from the Unicode::CaseFold module on CPAN.
If you want to do this inside a double-quoted string, you can use the \F case shift operator (but be aware of the things we noted in Understand the order of operations in double quoted contexts). Our Learning Perl example could change to:
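If the same code has to run on perls both older and newer than v5.16, one possible approach is to choose the implementation once, at compile time. This is a sketch only. The name fold_case is invented here, and it assumes that Unicode::CaseFold is installed on the older perl and that its fc can be called by its fully qualified name.

```perl
use strict;
use warnings;
use utf8;

# Choose a case folding implementation once, at compile time.
BEGIN {
    if ( $] >= 5.016 ) {
        # A string eval keeps older perls from tripping over CORE::fc
        # while they compile this file.
        my $code = eval 'use feature q(unicode_strings); sub { CORE::fc( $_[0] ) }'
            or die $@;
        *fold_case = $code;
    }
    else {
        # Assumption: Unicode::CaseFold is installed and provides fc
        # under its own package name.
        require Unicode::CaseFold;
        *fold_case = \&Unicode::CaseFold::fc;
    }
}

print fold_case( "Reichwaldstraße" ) eq fold_case( "REICHWALDSTRASSE" )
    ? "same\n"
    : "different\n";
```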
sub case_insensitive { "\F$a" cmp "\F$b" }
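Here is a short sketch of that sort in use, with a hand-picked word list. The sub name sort_ignoring_case is made up for this example. The extra `or $a cmp $b` is a design choice that is not part of the Learning Perl example: it gives strings with identical folded forms a predictable order.

```perl
use v5.16;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: order by folded form, and break ties between
# strings that fold identically with a plain comparison.
sub sort_ignoring_case {
    return sort { fc($a) cmp fc($b) or $a cmp $b } @_;
}

my @words = qw( Zebra apfel STRASSE Apfel straße );
say join ' ', sort_ignoring_case( @words );
# prints: Apfel apfel STRASSE straße Zebra
```

Both STRASSE and straße fold to strasse, so they end up next to each other. With lc instead of fc they would have different sort keys.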
More complicated folds
Looking back at the extract of unicore/CaseFolding.txt, you might remember that I skipped over the second column, the mapping status. Those letters stand for different folding rules:
- C: common case folding
- F: full case folding (strings may grow in length)
- S: simple case folding (map to single characters)
- T: special case for uppercase I and dotted uppercase I
The "T" status stands in for folds that the general rules can't handle, mostly some characters from Turkish and similar languages.
So far, Perl's fc only does full case folding, which combines the "C" and "F" statuses. It doesn't handle the special mappings you'll find in unicore/SpecialCasing.txt that cover the oddball situations, such as multiple source characters folding onto other multiple characters. If you want to handle those, you're on your own, although the Unicode::Casing module on CPAN might help.
Many of the folding rules depend on the source language, so you'll probably want to pay special attention if you are using that language or completely ignore them if you are not.
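As a small illustration of a language-dependent case, this sketch looks at the Turkish letters behind the "T" status. It assumes no locale pragma is in effect, so fc applies only the default full folding. The variable names are chosen just for this example.

```perl
use v5.16;

my $capital_i        = "\x{0049}";   # I, LATIN CAPITAL LETTER I
my $dotless_i        = "\x{0131}";   # ı, LATIN SMALL LETTER DOTLESS I
my $dotted_capital_i = "\x{0130}";   # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE

# Default folding pairs I with i, not with the Turkish dotless ı.
say fc( $capital_i ) eq "i"              ? "I folds to i"            : "I folds elsewhere";
say fc( $capital_i ) eq fc( $dotless_i ) ? "I and ı compare equal"   : "I and ı compare different";

# Full folding turns İ into two characters: i and a combining dot above.
say fc( $dotted_capital_i ) eq "i\x{0307}" ? "İ folds to i + U+0307" : "İ folds elsewhere";
```

A Turkish reader treats I and ı as a case pair, but the default rules do not, so a comparison that is right for English text is wrong for Turkish text.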
Besides that, the Universal Character Set gives people much more of a chance to mess up. Suppose that you want to write "β-carotene", that thing you get from carrots. That first character is β (U+03B2 ɢʀᴇᴇᴋ ꜱᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ʙᴇᴛᴀ). Some people might think it looks like ß (U+00DF ʟᴀᴛɪɴ ꜱᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ꜱʜᴀʀᴘ ꜱ), and that's good enough for them. No amount of case folding is going to let you know that someone used an incorrect character. But, this is also one of the benefits of Unicode: characters know what they are.
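You can ask Perl what a character thinks it is. This sketch uses charnames::viacode from the core charnames module to print the names of the two look-alikes, then shows that case folding keeps them apart.

```perl
use v5.16;
use charnames ();   # load charnames::viacode without importing anything

foreach my $code ( 0x03B2, 0x00DF ) {
    printf "U+%04X %s\n", $code, charnames::viacode( $code );
}

# The folded forms differ, so the mix-up survives a case-insensitive compare.
say fc( "\x{03B2}-carotene" ) eq fc( "\x{00DF}-carotene" )
    ? "same"
    : "different";
```

The first two lines of output are the names GREEK SMALL LETTER BETA and LATIN SMALL LETTER SHARP S, and the last line is "different".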
Another correct way
There's another correct way to check strings regardless of case. You can use the /i flag on the match operator. The Unicode-aware Perl regex engine handles the rest:
use v5.16;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';
use Set::CrossProduct;
my $string = "Reichwaldstraße";
my $upper = uc( $string );
my $lower = lc( $upper );
my $sets = Set::CrossProduct->new(
    [
        [ $string, $upper, $lower ],
        [ $string, $upper, $lower ],
    ]
);
foreach my $tuple ( $sets->combinations ) {
    my( $l, $r ) = @$tuple;
    next if $l eq $r;
    say "lc($r) eq lc($l) → ", lc($r) eq lc($l) ? "matched" : "failed";
    say "fc($r) eq fc($l) → ", fc($r) eq fc($l) ? "matched" : "failed";
    # test the same operands that the label shows
    say "$r =~ m/$l/i → ", $r =~ m/$l/i ? "matched" : "failed";
    say ""; # a bare say would print $_
}
In the output, you can see that lc sometimes fails, but that fc and m//i always work:
lc(REICHWALDSTRASSE) eq lc(Reichwaldstraße) → failed
fc(REICHWALDSTRASSE) eq fc(Reichwaldstraße) → matched
REICHWALDSTRASSE =~ m/Reichwaldstraße/i → matched
lc(reichwaldstrasse) eq lc(Reichwaldstraße) → failed
fc(reichwaldstrasse) eq fc(Reichwaldstraße) → matched
reichwaldstrasse =~ m/Reichwaldstraße/i → matched
lc(Reichwaldstraße) eq lc(REICHWALDSTRASSE) → failed
fc(Reichwaldstraße) eq fc(REICHWALDSTRASSE) → matched
Reichwaldstraße =~ m/REICHWALDSTRASSE/i → matched
lc(reichwaldstrasse) eq lc(REICHWALDSTRASSE) → matched
fc(reichwaldstrasse) eq fc(REICHWALDSTRASSE) → matched
reichwaldstrasse =~ m/REICHWALDSTRASSE/i → matched
lc(Reichwaldstraße) eq lc(reichwaldstrasse) → failed
fc(Reichwaldstraße) eq fc(reichwaldstrasse) → matched
Reichwaldstraße =~ m/reichwaldstrasse/i → matched
lc(REICHWALDSTRASSE) eq lc(reichwaldstrasse) → matched
fc(REICHWALDSTRASSE) eq fc(reichwaldstrasse) → matched
REICHWALDSTRASSE =~ m/reichwaldstrasse/i → matched
The match operator isn't useful for sort, though, since it can only tell you whether the strings match, not which one comes first.
Things to remember
- Case-folding is more complicated than merely lowercasing.
- The fc function does proper case folding according to the Unicode standard.
- The \F case fold operator does full case folding in double-quoted contexts.