Match Unicode character names with a pattern

Perl has some of the best Unicode support out there, and it keeps getting better. Perl v5.32 supports Unicode 13, and you can now apply patterns to character names. You probably don’t want to do that though.

First, the Unicode Character Database catalogs each character, giving it a code number, a name, and many other properties.

Char  Code number  Name
à     U+00E0       ʟᴀᴛɪɴ sᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ᴀ ᴡɪᴛʜ ɢʀᴀᴠᴇ
á     U+00E1       ʟᴀᴛɪɴ sᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ᴀ ᴡɪᴛʜ ᴀᴄᴜᴛᴇ
🐪    U+1F42A      ᴅʀᴏᴍᴇᴅᴀʀʏ ᴄᴀᴍᴇʟ
🐫    U+1F42B      ʙᴀᴄᴛʀɪᴀɴ ᴄᴀᴍᴇʟ
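
The charnames module that comes with Perl lets you query this catalog at run time. As a minimal sketch (the variable names are illustrative), charnames::viacode turns a code number into a name and charnames::vianame goes the other way:

```perl
use v5.10;
use strict;
use warnings;
use charnames ':full';   # load the name tables and the lookup functions

# Code number to name
my $name = charnames::viacode( 0x1F42A );
say $name;                       # DROMEDARY CAMEL

# Name to code number (an integer), formatted the way Unicode writes it
my $code = charnames::vianame( 'BACTRIAN CAMEL' );
say sprintf 'U+%04X', $code;     # U+1F42B
```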

Start with what Perl can already do. Match characters by their complete name in uppercase (strict Unicode matching). Put the character name in \N{}:

use utf8;
use v5.10;

$_ = "\x{1F42A}"; # Dromedary
say "1: ", /\N{DROMEDARY CAMEL}/ ? 'Matched' : 'Missed';

$_ = "\x{1F42B}"; # Bactrian
say "2: ", /\N{BACTRIAN CAMEL}/ ? 'Matched' : 'Missed';

Both of these match because they use the exact names:

1: Matched
2: Matched

The charnames module, which comes with Perl, allows “loose” matching. Perl then has to compare your name against all the character names to find a match, which can cost up to a 3x performance hit. If you’re willing to accept that, you can use mixed case and insignificant underscores and hyphens:

use utf8;
use v5.10;

use charnames qw(:loose);

$_ = "\x{1F42A}"; # Dromedary
say "1: ", /\N{DROMEDARY CAMEL}/ ? 'Matched' : 'Missed';
say "2: ", /\N{DROMEDARY_CAMEL}/ ? 'Matched' : 'Missed';
say "3: ", /\N{dromedary camel}/ ? 'Matched' : 'Missed';

$_ = "\x{1F42B}"; # Bactrian
say "4: ", /\N{BACTRIAN CAMEL}/ ? 'Matched' : 'Missed';
say "5: ", /\N{BACTRIAN-CAMEL}/ ? 'Matched' : 'Missed';
say "6: ", /\N{bactrian camel}/ ? 'Matched' : 'Missed';

All of these match even though they aren’t the exact names:

1: Matched
2: Matched
3: Matched
4: Matched
5: Matched
6: Matched

You can’t interpolate inside \N{}. This is the literal name $type CAMEL, and there’s no such character even under loose matching:

use utf8;
use v5.10;

$_ = "\x{1F42A}"; # Dromedary
my $type = 'DROMEDARY';
say "1: ", /\N{$type CAMEL}/ ? 'Matched' : 'Missed';

It fails at compile time:

Unknown charname '$type CAMEL' at ...
Execution of ... aborted due to compilation errors.
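
If you only know the name at run time, one workaround is to resolve it yourself and then match the resulting character. This is a sketch, assuming Perl v5.14 or later for charnames::string_vianame; the $char and $result variables are illustrative:

```perl
use v5.14;
use warnings;
use charnames ':full';

$_ = "\x{1F42A}"; # Dromedary
my $type = 'DROMEDARY';

# Resolve the interpolated name at run time; an unknown name gives undef
my $char = charnames::string_vianame( "$type CAMEL" );
die "No character named '$type CAMEL'" unless defined $char;

# Quote the character so it can't be taken as a regex metacharacter
my $result = /\Q$char\E/ ? 'Matched' : 'Missed';
say "1: $result";   # 1: Matched
```

This trades the compile-time check for a run-time one, so a misspelled name shows up only when that code path runs.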

Instead, you can make your own aliases to names and avoid the loose matching:

use utf8;
use v5.10;

use charnames ':alias' => {
	ONE_HUMP_CAMEL => 'DROMEDARY CAMEL',
	TWO_HUMP_CAMEL => 'BACTRIAN CAMEL',
	};

$_ = "\x{1F42A}"; # Dromedary
say "1: ", /\N{ONE_HUMP_CAMEL}/ ? 'Matched' : 'Missed';

What’s new in v5.32

Perl v5.32 allows you to specify a name with the new \p{Name=} syntax. It uses loose Unicode matching and allows interpolation:

use utf8;
use v5.32;

$_ = "\x{1F42A}"; # Dromedary
my $type = 'DROMEDARY';
say "1: ", /\p{Name=DROMEDARY CAMEL}/ ? 'Matched' : 'Missed';
say "2: ", /\p{Name=dromedary camel}/ ? 'Matched' : 'Missed';
say "3: ", /\p{Name=$type CAMEL}/     ? 'Matched' : 'Missed';
say "4: ", /\p{Name=$type camel}/     ? 'Matched' : 'Missed';
say "5: ", /\p{Name=$type c_a_M-e-L}/ ? 'Matched' : 'Missed';

# Vary "Name" too
say "6: ", /\p{name=DROMEDARY CAMEL}/ ? 'Matched' : 'Missed';
say "7: ", /\p{Na=DROMEDARY CAMEL}/   ? 'Matched' : 'Missed';
say "8: ", /\p{na=DROMEDARY CAMEL}/   ? 'Matched' : 'Missed';

It gets better though. In place of a name, you can give a subpattern to match against the names. This is an experimental feature that’s enabled by default, so it warns unless you turn off its warning:

use utf8;
use v5.32;
no warnings qw(experimental::uniprop_wildcards);

$_ = "\x{1F42A}"; # Dromedary

say "1: ", m<\p{Name=/(DROMEDARY|BACTRIAN) CAMEL/}> ? 'Matched' : 'Missed';

Here’s a pattern that finds the Latin Small Letters:

use utf8;
use v5.32;
no warnings qw(experimental::uniprop_wildcards);

my @letters = qw(A b C d E);

foreach my $letter ( @letters ) {
	say "$letter: ",
		$letter =~ m<\p{Name=/LATIN SMALL LETTER [A-Z]/}>
		?
		'Matched' : 'Missed';
	}
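
If you’d rather not depend on an experimental feature, or you’re on a Perl older than v5.32, a rough equivalent is to fetch the name with charnames::viacode and apply an ordinary pattern to that string. Here’s a sketch; the name_matches helper is hypothetical and not part of any module:

```perl
use v5.10;
use strict;
use warnings;
use charnames ();

# Hypothetical helper: true if the character's Unicode name matches $pattern
sub name_matches {
	my( $char, $pattern ) = @_;
	my $name = charnames::viacode( ord $char ) // '';
	return $name =~ $pattern;
	}

my @results;
foreach my $letter ( qw(A b C d E) ) {
	my $result = name_matches( $letter, qr/LATIN SMALL LETTER [A-Z]/ )
		? 'Matched' : 'Missed';
	push @results, "$letter: $result";
	}

say for @results;   # only b and d report Matched
```

Since the pattern here is an ordinary one applied to a string, all the usual flags and quantifiers are available.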

And here’s a program that tests if a character’s name includes WITH, indicating there’s something else with the base character:

use utf8;
use v5.32;
use open qw(:std :utf8);

no warnings qw(experimental::uniprop_wildcards);

use charnames qw();

my @letters = qw(a à á â ã ä å);

foreach my $letter ( @letters ) {
	my $name = charnames::viacode( ord $letter );
	say "$letter ($name):",
		$letter =~ m<\p{Name=/\bWITH\b/}>
		?
		'Matched' : 'Missed';
	}

This matches the accented characters from that list:

a (LATIN SMALL LETTER A):Missed
à (LATIN SMALL LETTER A WITH GRAVE):Matched
á (LATIN SMALL LETTER A WITH ACUTE):Matched
â (LATIN SMALL LETTER A WITH CIRCUMFLEX):Matched
ã (LATIN SMALL LETTER A WITH TILDE):Matched
ä (LATIN SMALL LETTER A WITH DIAERESIS):Matched
å (LATIN SMALL LETTER A WITH RING ABOVE):Matched

Not quite the match operator

That syntax in the \p{} looks like the match operator:

\p{Name=/\bWITH\b/}

It’s not. You can’t use the leading m or set flags. These are compilation errors:

\p{Name=m/\bWITH\b/}
\p{Name=/\bWITH\b/i}

However, you can set flags inside the pattern:

\p{Name=/(?ix)\b w i T h \b /}

The quantifiers are a bit weird. You can’t use the zero-or-more star, and the braces for the generalized quantifier interfere with the \p{} syntax. Possessive (non-backtracking) quantifiers aren’t allowed, but the non-greedy ? is:

\p{Name=/\S+ WITH\b/}      # Yes
\p{Name=/\S* WITH\b/}      # Error
\p{Name=/\S{1,} WITH\b/}   # Error
\p{Name=/\b\S++ITH\b/}     # Error
\p{Name=/\b\S+?ITH\b/}     # Yes
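
Rather than memorize these rules, you can ask your own perl which subpatterns it accepts. This sketch assumes v5.32 or later and makes no claim about the exact error text. It wraps each attempt in eval and records whether it survived. The generalized-quantifier case is left out because its closing brace would end the \p{} early, and the @subpatterns and %status names are illustrative:

```perl
use v5.32;
use warnings;
no warnings qw(experimental::uniprop_wildcards);

my @subpatterns = (
	'/\S+ WITH\b/',
	'/\S* WITH\b/',
	'/\b\S++ITH\b/',
	'/\b\S+?ITH\b/',
	);

my %status;
foreach my $subpattern ( @subpatterns ) {
	# Compile and run the pattern inside eval so an error doesn't stop the program
	my $ok = eval { "\x{E0}" =~ /\p{Name=$subpattern}/; 1 };
	$status{$subpattern} = $ok ? 'accepted' : 'rejected';
	say "$subpattern: $status{$subpattern}";
	}
```

When an attempt is rejected, the reason is in $@ if you want to print it.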

Things to remember

  • You can loosely match Unicode character names, but an alias might be better.
  • You can match the Name property of a character with a limited set of pattern features.
  • Applying patterns to every Unicode character name can be very slow.