Treat Unicode strings as grapheme clusters

If you need to work with Unicode strings, you probably don’t want to use Perl’s built-in string manipulation functions. This might seem a strange thing to say about a language whose main feature is string processing, but it’s a consequence of Perl’s ease in string processing.

Consider what a string is. Think of that for a moment. Write out your definition if you need to. Now, what is a string in Perl? Does it match your definition?

Perl doesn’t have rich types. It’s certainly a typed language, but its type is Scalar, and you are used to putting numbers and non-numbers in there. Perl automatically converts from one to the other as it needs to.

Perl provides many string operations, such as substr, index (and rindex), and other functions that use positions within a string. What are these positions, though?

In ASCII-land, those positions are the byte positions. Each character in the string takes up exactly one byte, with a bit left over. This is the same for many code sets. As such, many people are accustomed to the idea that a byte is a character.

This is an accident of scale. Each position in the ASCII table mapped directly onto storage. That is, if you consider that each ASCII number is a code number, there’s a null mapping onto ASCII storage. If you have the character with ordinal value 0x20, it’s stored as 0x20.

This isn’t true for any of the Unicode encodings (UTF-32 gets close by using a lot of extra 0s). The Unicode Transformation Format turns code numbers into a series of integers (code units) that it then turns into octets for storage. A particular character, or code point, might not be a standalone character, such as a or b or 🐱, but something that combines with another character, such as  ̋ or  ̭. Unicode still calls those characters, since they have their own code points.
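To see the difference between a code number and what ends up in storage, here is a minimal sketch that uses the core Encode module. The octets_for helper is a name made up for this illustration; it is not part of the programs in this post.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode a character string and return its
# octets as space-separated hex pairs.
sub octets_for {
    my( $encoding, $string ) = @_;
    my $octets = encode( $encoding, $string );
    return join ' ', map { sprintf '%02X', ord($_) } split //, $octets;
}

my @samples = (
    [ 'U+0020 SPACE'                  => "\x{0020}"  ],
    [ 'U+00E9 E WITH ACUTE'           => "\x{00E9}"  ],
    [ 'U+0301 COMBINING ACUTE ACCENT' => "\x{0301}"  ],
    [ 'U+1F431 CAT FACE'              => "\x{1F431}" ],
);

foreach my $sample ( @samples ) {
    my( $label, $char ) = @$sample;
    printf "%-30s UTF-8 [%s] UTF-16BE [%s]\n",
        $label,
        octets_for( 'UTF-8',    $char ),
        octets_for( 'UTF-16BE', $char );
}
```

Only the space keeps its ASCII-style mapping in UTF-8, where it is the single octet 20. The other code points turn into two or more octets that don't look like their code numbers.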

If you are used to only the characters that the limited code sets of ASCII, Latin-1, and so on give you, then you are used to a complete glyph, the visual representation of an idea, being wholly contained in a single character. You could work with single items and still get the complete idea.

Unicode, however, lets you create new glyphs by combining characters. For instance, you can write m̋, which should display as an m with a double acute accent. There’s no one-character version for that, but it’s still a single concept. Indeed, this is one of the niceties of Unicode: you can make glyphs that it hasn’t even considered.

This means, however, that you can also create some glyphs in two different ways. The UCS has é as the precomposed Latin-1 carryover, but you can also take a bare e and add the accent yourself. One is a single character and the other has two characters despite being the same glyph. You’re not supposed to care which way you got that é. Indeed, if you cut-and-paste from this blog, you might get something other than what I typed since the intermediate translators might have converted it.

Unicode calls such a group of characters a grapheme cluster, and it’s what you traditionally think of as a single character. The grapheme cluster is everything that goes into making the thing you see, and if you deal with grapheme clusters, the difference between the single- and multiple-character forms doesn’t matter to you.

Core Perl doesn’t care about grapheme clusters. Its string functions still work on characters. Consider this program that reports the length of two strings that look the same:

use 5.010;
use charnames qw(:full);
binmode STDOUT, ':encoding(UTF-8)';

my $short = "\N{LATIN SMALL LETTER E WITH ACUTE}"; # U+00E9
my $long  = 'e' . "\N{COMBINING ACUTE ACCENT}"; # U+0301 

say "Long  [$long] -> ", length $long;
say "Short [$short] -> ", length $short;

The output shows that one string is longer than the other, even though they look the same:

Long  [é] -> 2
Short [é] -> 1

Instead of treating strings as characters, you can deal with them as grapheme clusters by using Unicode::GCString. This program is mostly the same as the previous one:

use 5.010;
use charnames qw(:full);
use Unicode::GCString;

binmode STDOUT, ':encoding(UTF-8)';

my $short = Unicode::GCString->new( "\N{LATIN SMALL LETTER E WITH ACUTE}" ); # U+00E9
my $long  = Unicode::GCString->new( 'e' . "\N{COMBINING ACUTE ACCENT}" ); # U+0301 

say "Long  [$long] -> ", $long->length;
say "Short [$short] -> ", $short->length;

Now the output is what you probably expected:

Long  [é] -> 1
Short [é] -> 1

This isn’t a terribly interesting example, though. Consider the various ways that you can write the word résumé, then take the third character of each of them:

use 5.010;
use charnames qw(:full);

use Set::CrossProduct;
use Unicode::GCString;

binmode STDOUT, ':encoding(UTF-8)';

my $short = "\N{LATIN SMALL LETTER E WITH ACUTE}"; # U+00E9
my $long  = 'e' . "\N{COMBINING ACUTE ACCENT}"; # U+0301 

my $cross = Set::CrossProduct->new( [ ( [ $short, $long ] ) x 2 ] );
my @strings = map { sprintf 'r%ssum%s', @$_ } $cross->combinations;

foreach my $string ( @strings ) {
	my $third = substr $string, 2, 1;
	say sprintf "[%s][2] -> [%s] [0x%04X]", $string, $third, ord $third;
	}

In some cases, the third character is an s and in other cases it’s the bare combining accent. Look closely at the output; it might look like the accent is missing (depending on how your browser displays these sorts of things), but it might really be sitting on the opening square bracket (it’s a combining character, after all). This makes it much more difficult to use simple print statements for debugging:

[résumé][2] -> [s] [0x0073]
[résumé][2] -> [s] [0x0073]
[résumé][2] -> [́] [0x0301]
[résumé][2] -> [́] [0x0301]

To get around the oddity of the seemingly disappearing accent, you can look at a hex dump of the output. Remember, the output is encoded as UTF-8, so you won’t see the code numbers; you’ll see their encoded versions. Knowing that the [ is 0x5B in UTF-8 helps you find where you are. Knowing that COMBINING ACUTE ACCENT is encoded as the two octets 0xCC 0x81 helps you find the sequences 5b cc 81:

$ perl5.14.1 disappearing.pl | hexdump -C
00000000  5b 72 c3 a9 73 75 6d c3  a9 5d 5b 32 5d 20 2d 3e  |[r..sum..][2] ->|
00000010  20 5b 73 5d 20 5b 30 78  30 30 37 33 5d 0a 5b 72  | [s] [0x0073].[r|
00000020  c3 a9 73 75 6d 65 cc 81  5d 5b 32 5d 20 2d 3e 20  |..sume..][2] -> |
00000030  5b 73 5d 20 5b 30 78 30  30 37 33 5d 0a 5b 72 65  |[s] [0x0073].[re|
00000040  cc 81 73 75 6d c3 a9 5d  5b 32 5d 20 2d 3e 20 5b  |..sum..][2] -> [|
00000050  cc 81 5d 20 5b 30 78 30  33 30 31 5d 0a 5b 72 65  |..] [0x0301].[re|
00000060  cc 81 73 75 6d 65 cc 81  5d 5b 32 5d 20 2d 3e 20  |..sume..][2] -> |
00000070  5b cc 81 5d 20 5b 30 78  30 33 30 31 5d 0a        |[..] [0x0301].|
0000007e

As a side note, notice the use of Set::CrossProduct and the list replication operator to easily create the different combinations of composed and decomposed characters for the string.

You might think that you can fix this by normalizing the strings, but you then still have to remember where each new grapheme cluster starts and how many characters you need to grab to get the entire grapheme cluster. Why think that hard?
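Here is a minimal sketch of what normalization does and doesn't give you, using the core Unicode::Normalize module. The lengths_for helper and the sample strings are made up for this illustration. Composing (NFC) makes both spellings of résumé six characters long, but a cluster with no precomposed form, such as m with a double acute accent, stays at two characters in either form.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Hypothetical helper: character counts of the composed (NFC) and
# decomposed (NFD) forms of a string.
sub lengths_for {
    my( $string ) = @_;
    return ( length( NFC($string) ), length( NFD($string) ) );
}

my @samples = (
    [ precomposed => "r\x{E9}sum\x{E9}"       ],  # U+00E9 twice
    [ decomposed  => "re\x{301}sume\x{301}"   ],  # e + U+0301 twice
    [ m_double    => "m\x{30B}"               ],  # m + COMBINING DOUBLE ACUTE ACCENT
);

foreach my $sample ( @samples ) {
    my( $name, $string ) = @$sample;
    printf "%-12s raw length %d, NFC length %d, NFD length %d\n",
        $name, length($string), lengths_for( $string );
}
```

After NFC, substr on either résumé gives the same answer, but the m example shows that character positions and grapheme cluster positions can still disagree.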

There’s another issue with characters. If you want to put strings into columns, how would you do that? The number of characters in a Unicode string isn’t the same as the number of spaces it takes up.

use 5.010;
use charnames qw(:full);

use Set::CrossProduct;
use Unicode::GCString;

binmode STDOUT, ':encoding(UTF-8)';

my $short = "\N{LATIN SMALL LETTER E WITH ACUTE}"; # U+00E9
my $long  = 'e' . "\N{COMBINING ACUTE ACCENT}"; # U+0301 

my $cross = Set::CrossProduct->new( [ ( [ $short, $long ] ) x 2 ] );
my @strings = 
	map { Unicode::GCString->new( sprintf 'r%ssum%s', @$_ ) } 
	$cross->combinations;

foreach my $string ( @strings ) {
	say 
		sprintf "[%s] length [%d] columns [%d]", 
		$string, 
		$string->chars,
		$string->columns;
	}

The number of characters can differ from the number of columns those characters take up:

[résumé] length [6] columns [6]
[résumé] length [7] columns [6]
[résumé] length [7] columns [6]
[résumé] length [8] columns [6]

Unfortunately, working with grapheme clusters isn’t built into Perl, even if that’s how you probably want to think about strings.
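The built-in string functions are character-based, but the regular expression engine does offer \X, which matches one extended grapheme cluster. As a minimal sketch that sticks to core Perl, you can split a string into clusters yourself. The clusters and cluster_at names are made up for this illustration, and this doesn't replace what Unicode::GCString does for columns.

```perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: split a character string into a list of
# extended grapheme clusters.
sub clusters {
    my( $string ) = @_;
    my @clusters = $string =~ /(\X)/g;
    return @clusters;
}

# Hypothetical helper: the Nth grapheme cluster (zero-based), or undef.
sub cluster_at {
    my( $string, $n ) = @_;
    return ( clusters( $string ) )[$n];
}

my @strings = (
    "r\x{E9}sum\x{E9}",        # precomposed U+00E9
    "re\x{301}sume\x{301}",    # e + COMBINING ACUTE ACCENT
);

foreach my $string ( @strings ) {
    my @clusters = clusters( $string );
    printf "characters %d, clusters %d, third cluster [%s]\n",
        length($string), scalar(@clusters), cluster_at( $string, 2 );
}
```

Both strings report six clusters and an s as the third cluster, even though they have six and eight characters.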

Things to Remember

  • Perl’s string functions treat strings as sequences of Unicode characters, not grapheme clusters.
  • A glyph is the visual representation of an idea, and might use several characters.
  • The collection of characters that makes up a glyph is a grapheme cluster.
  • You can use Unicode::GCString to treat strings as grapheme clusters.