Match Unicode property values with a wildcard

Not all characters with a numerical value are “digits” in the Perl sense. In Match Unicode characters by property value, you selected characters by a Unicode property value, and the idea showed up again in Perl v5.18 adds character class set operations. Perl v5.30 adds a slightly easier way to get at multiple numerical values at the same time: you can now match Unicode property values with wildcards (from Unicode Technical Standard #18), which are sort of like Perl patterns. Don’t get too excited, though, because these can be expensive.

Here’s what you saw before to match a character that has a particular value for a particular Unicode property:

use v5.12;
use open qw(:std :utf8);

use Unicode::UCD;
say "Unicode ", Unicode::UCD::UnicodeVersion();

foreach ( 1 .. 0x10fffd ) {
	next unless chr =~ m/ \p{Numeric_Value: $ARGV[0]} /x;
	printf "%s (U+%04X)\n", chr, $_;
	}

And then, you saw that you can match one of several numerical values by putting the property values in a character class:

use v5.18;
no warnings qw(experimental::regex_sets);

use open qw(:std :utf8);

use Unicode::UCD;
say "Unicode ", Unicode::UCD::UnicodeVersion();

foreach ( 1 .. 0x10fffd ) {
	next unless chr =~ m/ [\p{nv=1}\p{nv=3}\p{nv=7}] /x;
	printf "%s (U+%04X)\n", chr, $_;
	}

That’s not all that great. You have to repeat the \p{nv=} for each value. Well, you had to, and now you don’t. Perl v5.30 allows the property value to be a wildcard. This looks almost like a Perl pattern:

my $regex = qr| \p{nv=/\A[137]\z/} |x;

That’s not really the match operator there, and there are many restrictions on what you can do in the pattern. For example, you can’t use a closing brace in your pattern because Perl thinks that closes the \p{nv=. The wildcard isn’t anchored for you either, so without the \A and \z it would also match values that merely contain one of those digits, such as 10 or 1/2. You can also use an alternate delimiter inside the property so you don’t need to change the delimiter for the overall pattern:

my $regex = qr/ \p{nv=:\A[137]\z:} /x;
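
Putting that together with the earlier programs, here is a minimal sketch of a complete program that uses the wildcard form. It assumes Perl v5.30 or later. The feature is experimental, so the sketch turns off the experimental::uniprop_wildcards warning. The \A and \z anchors are there so that only the exact values 1, 3, and 7 match, and the scan stops at U+2FFF (an arbitrary cutoff chosen for this example) to keep the run short:

```perl
use v5.30;
no warnings qw(experimental::uniprop_wildcards);  # the feature is experimental
use open qw(:std :utf8);

# Compile the pattern once and reuse it for every code point.
# The anchors keep values such as 10 or 1/2 from matching.
my $regex = qr/ \p{nv=:\A[137]\z:} /x;

# Scan only a small range to keep the example quick.
foreach my $code ( 0x20 .. 0x2FFF ) {
	my $char = chr $code;
	next unless $char =~ $regex;
	printf "%s (U+%04X)\n", $char, $code;
	}
```

Widen the range to 0x10FFFF if you want the same coverage as the earlier programs.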

Only a limited subset of Perl pattern syntax is allowed, as summarized in perlunicode:

  • You can’t have match modifiers outside the pattern. No /a b c/ix, but you can have /(?ix) a b c/.
  • Alternate delimiters must be punctuation, but not {} because the braces interfere with \p{}.
  • No zero-or-more quantifiers.
  • No \p{} inside a \p{}.
  • You automatically get insignificant whitespace around the wildcard: \p{nv= :[0-5]:}
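
A couple of those rules are easier to see in code. This sketch, which assumes Perl v5.30 or later and silences the experimental warning, puts the (?x) modifier inside the wildcard instead of after it and pads the wildcard with whitespace. The variable name and the sample characters are only for illustration:

```perl
use v5.30;
no warnings qw(experimental::uniprop_wildcards);
use open qw(:std :utf8);

# (?x) lives inside the wildcard, so the spaces within it are ignored.
# The spaces around the colon-delimited wildcard are ignored too.
my $zero_to_five = qr/ \p{nv= :(?x) \A [0-5] \z : } /x;

# DIGIT THREE, DIGIT EIGHT, ARABIC-INDIC DIGIT FOUR, ROMAN NUMERAL TWELVE
foreach my $char ( '3', '8', "\x{0664}", "\x{216B}" ) {
	my $verdict = $char =~ $zero_to_five ? 'matches' : 'does not match';
	printf "U+%04X %s\n", ord $char, $verdict;
	}
```

The wildcard uses \A and \z so that a value like 12 doesn’t sneak in just because it contains a 1 or a 2.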

All of that is nice, but realize that Perl actually has to try that pattern against every possible value of the property to work out which characters qualify. That can be an expensive operation. The savings in typing might not be worth it.
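
If you want a rough idea of the cost on your own perl, you can time how long each form takes to compile. This is only a sketch: it assumes Perl v5.30 or later, it keeps the patterns in strings so that they compile at run time inside the timer, and a single run like this gives ballpark numbers rather than a real benchmark. The hash names and the U+2FFF cutoff are made up for the example:

```perl
use v5.30;
no warnings qw(experimental::uniprop_wildcards);
use Time::HiRes qw(time);

# Keep the patterns in strings so they compile at run time, inside the timer.
my %source = (
	wildcard => '\p{nv=:\A[137]\z:}',
	class    => '[\p{nv=1}\p{nv=3}\p{nv=7}]',
	);

my %regex;
foreach my $name ( sort keys %source ) {
	my $start = time();
	$regex{$name} = qr/$source{$name}/;
	printf "%-8s compiled in %.6f seconds\n", $name, time() - $start;
	}

# Both forms should pick out the same characters.
my @disagree = grep {
	( chr($_) =~ $regex{wildcard} ) xor ( chr($_) =~ $regex{class} )
	} 0x20 .. 0x2FFF;
say "Code points where the two forms disagree: ", scalar @disagree;
```

For anything you care about, use the Benchmark module and repeat the measurement.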
