Match Unicode characters by property value

A Unicode character has properties; it knows things about itself. Perl v5.10 introduced a way to match a character that has certain properties, and in some cases you could match a particular property value. Now v5.12 lets you match any Unicode property by its value. The newly supported ones include Numeric_Value and Age, for example:

\p{Numeric_Value: 1}
\p{Nv=7}
\p{Age: 3.0}
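
To see one of these in action before scanning every code point, here's a minimal sketch that checks a handful of characters I already know. The has_numeric_value helper and the list of checks are my own names for illustration, not anything Perl provides:

```perl
use v5.12;
use warnings;

# Each pair is a code point and the Numeric_Value I expect it to have.
my @checks = (
    [ 0x0033, 3  ],   # DIGIT THREE
    [ 0x00B3, 3  ],   # SUPERSCRIPT THREE
    [ 0x0663, 3  ],   # ARABIC-INDIC DIGIT THREE
    [ 0x216A, 11 ],   # ROMAN NUMERAL ELEVEN
);

# Hypothetical helper: true if the character has that numeric value.
sub has_numeric_value {
    my( $code_point, $value ) = @_;
    return chr( $code_point ) =~ m/ \A \p{Numeric_Value: $value} \z /x ? 1 : 0;
}

foreach my $check ( @checks ) {
    my( $code_point, $value ) = @$check;
    printf "U+%04X nv=%s: %s\n", $code_point, $value,
        has_numeric_value( $code_point, $value ) ? 'match' : 'no match';
}
```

I use code point numbers instead of literal characters so the source file's encoding doesn't matter.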

Here’s a program to match the numeric value you specify on the command line (interpolation happens first, then regex parsing):

use v5.12;
use open qw(:std :utf8);

use Unicode::UCD;
say "Unicode ", Unicode::UCD::UnicodeVersion();

foreach ( 1 .. 0x10fffd ) {
	next unless chr =~ m/ \p{Numeric_Value: $ARGV[0]} /x;
	printf "%s (U+%04X)\n", chr, $_;
	}

You can find all the characters that have a numeric value of 3. There are 96 matches (in Unicode 5.2.0 at least):

$ perl pnv.pl 3
Unicode 5.2.0
3 (U+0033)
³ (U+00B3)
٣ (U+0663)
۳ (U+06F3)
߃ (U+07C3)
३ (U+0969)
...

You’re not limited to single decimal digits either. Some characters have numeric values greater than 10:

$ perl pnv.pl 11
Unicode 5.2.0
Ⅺ (U+216A)
ⅺ (U+217A)
⑪ (U+246A)
⑾ (U+247E)
⒒ (U+2492)
⓫ (U+24EB)

The highest value I found is 100,000:

$ perl pnv.pl 100000
Unicode 5.2.0
ↈ (U+2188)

If you use a value that isn’t known for that property, you get an error:

$ perl5.12.5 pnv.pl -3
Unicode 5.2.0
Can't find Unicode property definition "Numeric_Value: -3" at ...

You can pre-empt that by constructing the pattern ahead of time and noticing the problem before you go through the code points:

use v5.12;

use open qw(:std :utf8);

use Unicode::UCD;
say "Unicode ", Unicode::UCD::UnicodeVersion();

my $pattern = eval { qr| \p{Numeric_Value: $ARGV[0]} |x };
die "Invalid pattern for <$ARGV[0]>!\n" unless $pattern;

foreach ( 1 .. 0x10fffd ) {
	# block eval, so an error raised during the match is trapped
	next unless eval { chr =~ $pattern };
	printf "%s (U+%04X)\n", chr, $_;
	}
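
Since the command-line argument is interpolated into the pattern before Perl parses it, you might also want to check its shape before you build the regex. This is only a sketch under my own assumptions: looks_like_numeric_value is a hypothetical helper, and the shapes it accepts (integers, fractions such as 1/2, decimals, with an optional leading minus sign) are my guess at what is reasonable to pass through. A string that passes can still name a value that no character has, such as -3, so you still need the eval:

```perl
use v5.12;
use warnings;

# Hypothetical helper: accept only strings shaped like a number,
# such as 3, -1, 1/2, or 0.5. It does not know which values exist.
sub looks_like_numeric_value {
    my( $candidate ) = @_;
    return 0 unless defined $candidate;
    return $candidate =~ m{ \A -? [0-9]+ (?: / [0-9]+ | \. [0-9]+ )? \z }x ? 1 : 0;
}

foreach my $candidate ( '3', '1/2', '-1/2', '3} | .', '' ) {
    printf "%-8s %s\n", "<$candidate>",
        looks_like_numeric_value( $candidate ) ? 'ok to interpolate' : 'rejected';
}
```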

In character classes

Use these property values in a character class to match one of several numeric values:

use v5.12;
use open qw(:std :utf8);

use Unicode::UCD;
say "Unicode ", Unicode::UCD::UnicodeVersion();

foreach ( 1 .. 0x10fffd ) {
	next unless chr =~ m/ [\p{nv=1}\p{nv=3}\p{nv=7}] /x;
	printf "%s (U+%04X)\n", chr, $_;
	}

Now I match the 262 characters with one of those numeric values:

$ perl5.12.5 pnv.pl | more
Unicode 5.2.0
1 (U+0031)
3 (U+0033)
7 (U+0037)
³ (U+00B3)
¹ (U+00B9)
١ (U+0661)
٣ (U+0663)
...
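
If the list of values isn't fixed, you can build the character class from a list. Here's a rough sketch of that idea; the @values list and the variable names are mine, and I check only a few code points rather than the whole range:

```perl
use v5.12;
use warnings;

# The values to accept; in a real program these might come from @ARGV.
my @values = ( 1, 3, 7 );

# Build "\p{nv=1}\p{nv=3}\p{nv=7}" and wrap it in a bracketed class.
# No spaces inside the brackets: /x does not ignore whitespace there.
my $class   = join '', map { "\\p{nv=$_}" } @values;
my $pattern = qr/ \A [$class] \z /x;

foreach my $code_point ( 0x0031, 0x0032, 0x00B9, 0x0663 ) {
    printf "U+%04X %s\n", $code_point,
        chr( $code_point ) =~ $pattern ? 'matches' : 'does not match';
}
```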

Some of these characters have numeric values but aren’t “numbers” in the Perl sense. Try adding the superscript 3 and the superscript 1 and you don’t get superscript 4 (wouldn’t that be nice?):

$ perl -Mutf8 -le 'print "³" + "¹"'
0
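
If you do want the numeric value as a Perl number, one option is the num function in Unicode::UCD. As far as I know it arrived with v5.14, so this sketch assumes at least that version; it is not something the v5.12 programs above can use:

```perl
use v5.14;
use warnings;

# num() is assumed to be available (Unicode::UCD as shipped with v5.14 or later).
use Unicode::UCD qw(num);

my $three = num( "\x{B3}" );   # SUPERSCRIPT THREE
my $one   = num( "\x{B9}" );   # SUPERSCRIPT ONE

# num() returns undef for a string without a usable numeric value
die "No numeric value!\n" unless defined $three and defined $one;

say $three + $one;
```

The sum is an ordinary Perl number, so it still isn't superscript 4.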