Category Archives: Unicode

Match Unicode property values with a wildcard

Not all characters with a numerical value are “digits” in the Perl sense. You saw that in Match Unicode characters by property value, where you selected characters by a Unicode property value. That showed up again in Perl v5.18 adds character class set operations. Perl v5.30 adds a slightly easier way to get at multiple […]

Match Unicode character names with a pattern

Perl has some of the best Unicode support out there, and it keeps getting better. Perl v5.32 supports Unicode 13, and you can now apply patterns to character names. You probably don’t want to do that though. First, the Unicode Character Database catalogs each character, giving it a code number, a name, and many other […]

Match only the same Unicode script

Earlier this year, this website was the target of some sort of attack in which a bot sent seemingly random data in its requests. The attack wasn’t that big of a deal since I easily blocked it with Cloudflare, but it was interesting. The apparently random data was actually a mix of Latin, Hangul, and […]

Use Unicode 10 in Perl v5.28

Perl v5.28 updates to Unicode 10. There are 8,518 new characters, 7,473 of which are in the CJK extension. There are 56 new emojis, and there's the Bitcoin symbol, ₿. It adds a T. rex, 🦖, but we’re still waiting for a raptor. To Perl they are just characters like any other, so you don’t need anything […]

Find the new emojis in Perl’s Unicode support

Perl v5.26 updates itself to Unicode 9. That’s not normally exciting news, but people have been pretty enthusiastic about the 72 new emojis that come with it. As far as Perl cares, they are just valid code points like all of the other ones.

Look up Unicode properties with an inversion map

Perl comes with extracts of the Unicode character data, but it hasn’t been easy to look up all of the information Perl knows about a character. Perl v5.15.7 adds a way to create an inversion map based on the property that you want to access.

Fold cases properly

You might think that you know how to compare strings regardless of case, and you’re probably wrong. After you read this Item, you’ll be able to do it correctly and without doing any more work than you were doing before. Perl handles all the details for you.

Loose match Unicode character names

The charnames module can now handle loose name matching, as outlined in Unicode Standard Annex #44. This accounts for the various ways people are abusing things. Consider the character 😻 (U+1F63B SMILING CAT FACE WITH HEART-SHAPED EYES). If you want to interpolate that into a string, you have to use the exact name: use v5.16; […]

Normalize your Perl source

Perl has had Unicode support since Perl 5.6, which means that most Perl tutorials have been bending the truth a bit when they tell you that a Perl identifier, the name that you give to variables, starts with [A-Za-z_] and continues with [0-9A-Za-z_]. With Unicode support, you have many more characters available to you, but […]

Know the difference between utf8 and UTF-8

Perl actually has two encodings that get the letters u, t, f, and 8. One will happily let you do bad things, and the other will let you do bad things but with a warning that you can make fatal.