Perl v5.22 adds fancy Unicode word boundaries

Perl v5.22’s regexes added three Unicode boundaries to go along with the vanilla “word” boundary, \b, that you’ve been using for years. These new assertions aren’t going to match perfectly with your expectations of human languages (the holy grail of natural language processing), but they do okay-ish. Although these appear in v5.22.0, as a late addition to the language they were partially broken in the initial release. They were fixed for v5.22.1.

Remember that \b in all of its forms is an anchor. It doesn’t match text, but specifies a condition at a position. Since it doesn’t match text, it’s also called a “zero-width assertion”.
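
As a small sketch of what “zero-width” means, this loop records where the plain \b matches and how much text each match consumed. The sample string and the variable names are arbitrary choices for illustration. Every match should report a length of zero:

```perl
use v5.10;

my $string = "fred barney";
my( @positions, @lengths );

# In scalar context, m//g finds one match at a time, and pos()
# reports where that match ended.
while( $string =~ m/\b/g ) {
	push @positions, pos($string);
	push @lengths, length $&;   # $& holds the text that matched
	}

say "positions: @positions";
say "lengths:   @lengths";
```

The positions are the edges of fred and barney, and none of the matches takes any characters from the string.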

Unicode Technical Standard #18 specifies several things a regular expression engine needs to handle for Unicode conformance. These new boundaries add to Perl’s Unicode regex support.

Sentence Boundaries

We don’t have an end-of-sentence character. We use the full stop (.) or other punctuation, but we use those same characters for other things. Unicode Standard Annex #29 (Unicode Text Segmentation) lays out several heuristic rules for breaking text into sentences; the new \b{sb} applies those rules.

This program inserts a bit of text wherever Perl matches a sentence boundary:

use v5.22.1;

$_ = "See Spot. (Spot is a dog.) See Spot run. RunSpot, run!";

s/\b{sb}/#SB#/g;
say $_;

The string #SB# shows up at the guessed sentence boundaries:

#SB#See Spot. #SB#(Spot is a dog.) #SB#See Spot run. #SB#RunSpot, run!#SB#

There are several sentence boundaries:

  • The start of the string
  • The end of the string
  • After closing punctuation and the succeeding whitespace.
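
Since \b{sb} is zero-width, you can also hand it to split to get the sentences as a list. This is a minimal sketch using the same string; it assumes the boundaries fall where the output above shows them. Each piece keeps the whitespace that follows its punctuation, so the sketch trims it:

```perl
use v5.22.1;

my $text = "See Spot. (Spot is a dog.) See Spot run. RunSpot, run!";

# Split at each sentence boundary. A zero-width match at the start of
# the string does not produce a leading empty field.
my @sentences = split /\b{sb}/, $text;

# Each piece still carries its trailing whitespace; trim it for display.
s/\s+\z// for @sentences;

say "[$_]" for @sentences;
```

With those boundaries you should see four bracketed sentences, one per line.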

The guesses aren’t perfect though. Try it with a different string.

use v5.22.1;

$_ = "Welcome to U.S.A. Mr. Smith. How are you?";

s/\b{sb}/#SB#/g;
say $_;

This string has an abbreviation with full stops to separate the words (U.S.A.) and a single-word abbreviation (Mr.). The guessing almost gets it right:

#SB#Welcome to U.S.A. #SB#Mr. #SB#Smith. #SB#How are you?#SB#

The heuristic is smart enough not to break inside a group of uppercase letters separated by full stops. It’s not smart enough to avoid breaking after Mr. though. That’s not a problem with Perl or a problem with Unicode. It’s a problem with natural language representation.
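
If you know which abbreviations your text uses, you can patch up the guesses yourself. This is only a sketch of one possible workaround, not something Perl or Unicode provides: the %abbrev list is hypothetical and you would have to maintain it for your own data. It splits on \b{sb}, then glues a piece back onto the previous one when that previous piece ends in a known abbreviation:

```perl
use v5.22.1;

# Hypothetical list of abbreviations that should not end a sentence.
my %abbrev = map { $_ => 1 } qw( Mr. Mrs. Ms. Dr. );

my $text = "Welcome to U.S.A. Mr. Smith. How are you?";

my @sentences;
foreach my $piece ( split /\b{sb}/, $text ) {
	# Look at the last non-whitespace token of the previous piece.
	if( @sentences and $sentences[-1] =~ m/(\S+)\s*\z/ and $abbrev{$1} ) {
		$sentences[-1] .= $piece;   # rejoin "Mr. " with what follows
		}
	else {
		push @sentences, $piece;
		}
	}

say "[$_]" for @sentences;
```

This keeps Mr. Smith. together, but it is still a heuristic stacked on a heuristic. An abbreviation that really does end a sentence will now be joined to the next one.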

Word Boundaries

Perl’s “word” boundary, \b, is a bit of a misnomer that we have to explain carefully in Learning Perl classes. Perl v5.22 adds a somewhat better word boundary, \b{wb}, based on the heuristics in Unicode Standard Annex #29.

Perl’s idea of a “word” is something that contains only “word” characters. Those are letters, digits, and the underscore. That’s the character class shortcut \w. It’s also locale dependent, and it’s different for ASCII and the Universal Character Set (perlrecharclass explains it). It’s not what we think of as natural words. For instance, consider the perlfaq4 answer to “How do I capitalize all the words on one line?” It’s in the FAQ because misusing \w for this is a common mistake:

use v5.10;
my $string = "fred and barney's lodge";
$string =~ s/(\w+)/\u\L$1/g;
say $string;

This code matches groups of “word” characters and capitalizes them. The s after the apostrophe is also capitalized:

Fred And Barney'S Lodge

Now look for the word boundaries, which are the positions between “word” characters and non-“word” characters:

use v5.10;
my $string = "fred and barney's lodge v2.0";
$string =~ s/\b/#WB#/g;
say $string;

There are plenty of boundaries:

#WB#fred#WB# #WB#and#WB# #WB#barney#WB#'#WB#s#WB# #WB#lodge#WB# #WB#v2#WB#.#WB#0#WB#

You see boundaries at:

  • Start of the string (a virtual non-“word” character)
  • End of the string (a virtual non-“word” character)
  • Between “word” characters (letters, digits, underscore) and punctuation, such as the apostrophe and the full stop
  • Between whitespace and “word” characters
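
One way to convince yourself that the plain \b only looks at \w on either side of a position is to spell out the same condition with lookarounds. This sketch (the variable names are made up for the comparison) marks the boundaries both ways and checks that the results agree:

```perl
use v5.10;

my $string = "fred and barney's lodge v2.0";

# Mark the boundaries with the built-in assertion.
( my $with_b = $string ) =~ s/\b/#WB#/g;

# Mark them again with an explicit test: a "word" character behind and
# none ahead, or none behind and a "word" character ahead.
( my $with_lookaround = $string ) =~ s/(?<=\w)(?!\w)|(?<!\w)(?=\w)/#WB#/g;

say $with_b eq $with_lookaround ? "same" : "different";
```

Both substitutions find the same positions, which is why the apostrophe and the decimal point each get a boundary on both sides.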

Unicode Standard Annex #29 defines some rules to avoid problems with things such as apostrophes and decimal points. Try the same thing with the Unicode rules for word boundaries by using \b{wb}:

use v5.22.1;
my $string = "fred and barney's lodge v2.0";
$string =~ s/\b{wb}/#WB#/g;
say $string;

Now barney's and v2.0 are considered complete words:

#WB#fred#WB# #WB#and#WB# #WB#barney's#WB# #WB#lodge#WB# #WB#v2.0#WB#

Try it in a different way. Here’s a program to pull out each word. It uses the match operator in scalar context to find text between word boundaries:

use v5.10;
my $string = "fred and barney's lodge v2.0";

while( $string =~ m/\b(\w.*?)\b/g ) {
	say $1;
	}

You get some words that we don’t think of as words:

fred
and
barney
s
lodge
v2
0

Change those \b to \b{wb}:

use v5.22.1;
my $string = "fred and barney's lodge v2.0";

while( $string =~ m/\b{wb}(\w.*?)\b{wb}/g ) {
	say $1;
	}

Now barney's and v2.0 stick together:

fred
and
barney's
lodge
v2.0
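
With that in hand, you can revisit the capitalization problem from perlfaq4. This is a sketch of one approach, not the FAQ’s own answer: it uppercases a “word” character only when it sits right after a \b{wb} boundary, so the s after the apostrophe is left alone:

```perl
use v5.22.1;

my $string = "fred and barney's lodge";

# Uppercase the first "word" character after each Unicode word boundary.
( my $capitalized = $string ) =~ s/\b{wb}(\w)/\u$1/g;

say $capitalized;
```

Since there is no \b{wb} boundary inside barney's, this should print Fred And Barney's Lodge. Unlike the FAQ example, it does not lowercase the rest of each word.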

Grapheme Boundaries

The regex dot matches any character except the newline. However, in the Unicode world, a “character” isn’t the same thing we think of as a character.

We humans think of a “character” as a whole idea together with its written representation. The é is a “character” in our minds, but that’s not how the jargon uses the word. Speaking casually about these topics often leads to pain and frustration.

In the bowels of computers, that single idea é can be constructed of two parts: the base e and the accent ´. Those are U+0065 (LATIN SMALL LETTER E) and U+0301 (COMBINING ACUTE ACCENT). Or, it might be a single part: U+00E9 (LATIN SMALL LETTER E WITH ACUTE). We call those forms the “decomposed” and “composed” versions even though they represent the same idea. Some of that is a quirk of history. I write more about that in one of the appendices for Learning Perl.

Everything that you need to build the complete idea is the “grapheme cluster” (See Treat Unicode strings as grapheme clusters). That’s more akin to what we think of as a written “character”. We probably want to match grapheme clusters and don’t care how they are represented or constructed.

If you are trying to break up text, you don’t want to break apart the characters of a grapheme. Instead of matching the dot, you can use \X, introduced in Perl v5.6. It matches an “extended grapheme cluster”, which is roughly (?>\PM\pM*) in the language of regexes: a non-mark character followed by zero or more mark characters.
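
To see the difference between the two ideas of “character”, this sketch counts the same é both ways. The count_clusters helper is a made-up name for illustration. The length built-in counts code points, while counting \X matches counts grapheme clusters:

```perl
use v5.12;

# Hypothetical helper: count the grapheme clusters in a string.
sub count_clusters {
	my( $string ) = @_;
	my $count = () = $string =~ m/\X/g;   # count the matches
	return $count;
	}

my $decomposed = "e\x{301}";   # U+0065 followed by U+0301
my $composed   = "\x{E9}";     # U+00E9

foreach my $string ( $decomposed, $composed ) {
	say "code points: ", length $string,
		", grapheme clusters: ", count_clusters( $string );
	}
```

The decomposed form has two code points and the composed form has one, but both are a single grapheme cluster.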

Perl v5.22.1 goes even further by adding a grapheme cluster boundary, \b{gcb}, that recognizes the position between two grapheme clusters. This is also available as \b{g}. Again, this is probably what we think of as “characters” in casual conversation.

This demonstration takes a bit of setup. You have a literal string that has é and ö. These can be composed or decomposed, and to guard against what editors and other translators might do, you use NFD from Unicode::Normalize to ensure you get the decomposed form. With the composed form there is nothing to see, because each accented letter is already a single code point. Then, you break that string apart in a variety of ways:

#!/Users/brian/bin/perls/perl5.22.1
use v5.22.1;
use utf8;
use open qw(:std :utf8);

use Unicode::Normalize qw(NFD);

my $string = NFD( "The résumé of Ms. Sörensen." );

my @dot_chars = $string =~ m/(.)/g;
say "    .: ", join ' ', @dot_chars;

my @gcb_chars = $string =~ m/(\X)/g;
say "   \\X: ", join ' ', @gcb_chars;

my @split_chars = split /\b{gcb}/, $string;
say "split: ", join ' ', @split_chars;

Matching with the dot distinguishes between the e and the ´ as well as the o and the ¨. It sees those as different characters.

The \X sees the combination of e and ´ as a single grapheme cluster and keeps them together. Splitting on the \b{gcb} also keeps the characters in their proper clusters:

    .: T h e   r e ́ s u m e ́   o f   M s .   S o ̈ r e n s e n .
   \X: T h e   r é s u m é   o f   M s .   S ö r e n s e n .
split: T h e   r é s u m é   o f   M s .   S ö r e n s e n .

Now, consider this example that reverses the order of the matched elements before it prints them:

use v5.22.1;
use utf8;
use open qw(:std :utf8);

use Unicode::Normalize qw(NFD);

my $string = NFD( "The résumé of Ms. Sörensen." );

my @dot_chars = $string =~ m/(.)/g;
say "    .: ", join '', reverse @dot_chars;

my @split_chars = split /\b{gcb}/, $string;
say "split: ", join '', reverse @split_chars;

You can easily see the problem of pulling apart grapheme clusters as the accent characters now attach to the wrong characters. Splitting on the grapheme cluster boundary gets it right:

    .: .nesner̈oS .sM fo ́emuśer ehT
split: .nesneröS .sM fo émusér ehT
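
If you need the same result on a Perl that doesn’t have \b{gcb}, matching with \X gets you the same clusters. This sketch wraps that in a helper; reverse_graphemes is a hypothetical name, not part of any module. It compares against a hand-built string instead of printing the accented text:

```perl
use v5.12;

# Hypothetical helper: reverse a string one grapheme cluster at a time.
sub reverse_graphemes {
	my( $string ) = @_;
	return join '', reverse $string =~ m/\X/g;
	}

# "résumé" in decomposed form: each e is followed by U+0301.
my $decomposed = "re\x{301}sume\x{301}";
my $reversed   = reverse_graphemes( $decomposed );

# Each accent should still follow its own e.
say $reversed eq "e\x{301}muse\x{301}r" ? "clusters intact" : "clusters broken";
```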

Things to Remember

  • Perl v5.22 extends \b with typed boundaries that you select inside braces, \b{...}.
  • The \b{sb} matches a sentence boundary.
  • The \b{wb} matches an improved word boundary.
  • The \b{gcb} matches at a grapheme cluster boundary.