Loose match Unicode character names

The charnames module can now handle loose name matching, as outlined in Unicode Standard Annex #44. This tolerates the different ways people type character names, such as varying the case, spacing, and hyphenation.

Consider the character 😻 (U+1F63B SMILING CAT FACE WITH HEART-SHAPED EYES). If you want to interpolate that into a string, you have to use the exact name:

use v5.16;
use open qw(:std :utf8);

say "\N{SMILING CAT FACE WITH HEART-SHAPED EYES}";

Starting with v5.16, a \N{} in a double-quoted string automatically loads charnames as if you had imported :full and :short. There’s a third option, :loose, that you have to import yourself, and it makes lookups a bit slower.

Some people don’t like all uppercase strings, so they might want to type it out as title or lowercase:

use v5.16;
use open qw(:std :utf8);

say "\N{Smiling Cat Face With Heart-Shaped Eyes}";

That doesn’t work and you get an error:

Unknown charname 'Smiling Cat Face With Heart-Shaped Eyes'

Import :loose from charnames and it works:

use v5.16;
use open qw(:std :utf8);
use charnames qw(:loose);

say "\N{Smiling Cat Face With Heart-Shaped Eyes}";

Loose matching ignores three things, and the extra normalization is what makes it slower than an exact match:

  • Case
  • Whitespace
  • “Medial” hyphens (those with a letter on either side)

So all of these work, even the one with consecutive hyphens:

use v5.16;
use open qw(:std :utf8);
use charnames qw(:loose);

say "\N{Smiling Cat Face With Heart Shaped Eyes}";
say "\N{SmilingCatFaceWithHeartShapedEyes}";
say "\N{Smiling-Cat-Face-With-Heart-Shaped-Eyes}";
say "\N{Smiling----Cat-Face-----With-Heart-----Shaped-Eyes}";
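
The \N{} examples above are resolved when the program compiles. For names that only show up at runtime, charnames::string_vianame does the same lookup and honors a :loose import in the same scope. Here is a minimal sketch; the spellings in the list are made-up sample input, not something from a real program:

```perl
use v5.16;
use open qw(:std :utf8);
use charnames qw(:loose);

# Names as they might arrive at runtime, e.g. from user input (illustrative values).
my @spellings = (
    'smiling cat face with heart-shaped eyes',
    'Smiling Cat Face With Heart Shaped Eyes',
    'SmilingCatFaceWithHeartShapedEyes',
);

for my $spelling (@spellings) {
    # string_vianame returns the character, or undef if the name is unknown.
    my $char = charnames::string_vianame($spelling);
    if (defined $char) {
        # viacode maps a code point back to the official Unicode name.
        say "$char  ", charnames::viacode(ord $char);
    }
    else {
        say "No match for '$spelling'";
    }
}
```

Going back through charnames::viacode is a handy way to check which official name a loose spelling actually resolved to.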

Some problematic names

This doesn’t work out well for some names, and Perl developer Karl Williamson raised the issue with the Unicode Consortium in 2010. Some names have a hyphen next to whitespace (so, not a medial hyphen), but if you strip the whitespace first, the hyphen ends up between two letters and looks medial.

Not only that, removing the hyphen can turn one character’s name into the name of a completely different character:

  • U+0F68 TIBETAN LETTER A
  • U+0F60 TIBETAN LETTER -A
  • U+0FB8 TIBETAN SUBJOINED LETTER A
  • U+0FB0 TIBETAN SUBJOINED LETTER -A
  • U+116C HANGUL JUNGSEONG OE
  • U+1180 HANGUL JUNGSEONG O-E
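
To see why the order of the steps matters for the Tibetan names, here is a rough sketch of two ways to fold a name into a lookup key. The subroutines fold_whitespace_first and fold_hyphens_first are hypothetical helpers written for this illustration; they are not the code that charnames uses internally:

```perl
use v5.16;

# Hypothetical helpers for illustration only; not the charnames implementation.

# Naive order: drop whitespace first, then drop hyphens between non-space characters.
sub fold_whitespace_first {
    my ($name) = @_;
    my $key = uc $name;
    $key =~ s/\s+//g;
    $key =~ s/(?<=\S)-(?=\S)//g;
    return $key;
}

# Safer order: drop medial hyphens while the spaces are still there,
# then drop the whitespace.
sub fold_hyphens_first {
    my ($name) = @_;
    my $key = uc $name;
    $key =~ s/(?<=\S)-(?=\S)//g;
    $key =~ s/\s+//g;
    return $key;
}

# Print each name with the key it gets from each folding order.
for my $name ('TIBETAN LETTER A', 'TIBETAN LETTER -A') {
    say join "\t", $name, fold_whitespace_first($name), fold_hyphens_first($name);
}
```

With the whitespace removed first, both Tibetan names fold to the same key. With the hyphens handled first, the hyphen in -A survives because it still has a space in front of it. Neither order helps with HANGUL JUNGSEONG O-E, whose hyphen really is medial; UAX #44 lists that name as an explicit exception to the rule.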