Look up Unicode properties with an inversion map

Perl comes with extracts of the Unicode character data, but it hasn’t been easy to look up all of the information Perl knows about a character. Perl v5.15.7 adds a way to create an inversion map based on the property that you want to access.

The Unicode::UCD module gives you access to some of the information about a character:

use Unicode::UCD 'charinfo';
use charnames qw(:full);
use Data::Dumper;

my $charinfo   = charinfo(
	ord( "\N{SMILING CAT FACE WITH OPEN MOUTH}" ) 
	);
print Dumper( $charinfo ); 

The output has many of the properties, but not all of them:

$VAR1 = {
		  'digit' => '',
		  'bidi' => 'ON',
		  'category' => 'So',
		  'code' => '1F63A',
		  'script' => 'Common',
		  'combining' => 0,
		  'upper' => '',
		  'name' => 'SMILING CAT FACE WITH OPEN MOUTH',
		  'unicode10' => '',
		  'decomposition' => '',
		  'comment' => '',
		  'mirrored' => 'N',
		  'lower' => '',
		  'numeric' => '',
		  'decimal' => '',
		  'title' => '',
		  'block' => 'Emoticons'
		};

This doesn’t include the Age of the character, that is, the version of Unicode that first included it. This might seem like a silly thing to know, but it came in handy when we were typesetting Programming Perl. We had problems with some characters, but we couldn’t see a pattern until we looked at the age of all the problem characters. Any character added after Unicode 4.0 didn’t typeset correctly. It took some annoying work to get the age by trying each age in turn until that property matched:

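#!/Users/brian/bin/perls/perl5.15.7

use v5.10;
use utf8;
use open qw(:std :utf8);  # encode the output as UTF-8

use charnames qw(:full);  # for \N{...} on perls before v5.16
use List::Util qw(first);

my @chars =  ( 'a', '→', '⣽', "\N{SMILING CAT FACE WITH OPEN MOUTH}" );

my @ages = qw( 1.1 2.0 2.1 3.0 3.1 3.2 4.0 4.1 5.0 5.1 5.2 6.0 );

foreach my $char ( @chars ) {
	# try each age in turn until the character matches one
	my $age = first { $char =~ /\p{Age=$_}/ } @ages;
	say "$char Age: $age";
	}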

It works, but it’s an unsatisfying kludge:

a Age: 1.1
→ Age: 1.1
⣽ Age: 3.0
😺 Age: 6.0

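Now, Unicode::UCD has prop_invmap to create an inversion map for a property you choose, and _search_invlist to return the index of the range in that map that contains a code point. The leading underscore marks _search_invlist as internal, so it might change: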

#!/Users/brian/bin/perls/perl5.15.7

use 5.15.7;
use utf8;

use charnames qw(:full);
use open qw(:std :utf8);  # encode the output as UTF-8
use Unicode::UCD;

my @chars =  ( 'a', '→', '⣽', "\N{SMILING CAT FACE WITH OPEN MOUTH}" );

foreach my $char ( @chars ) {
	my $age = age_of_char( $char );
	say "$char Age: $age";
	}

sub age_of_char {
	my( $char ) = @_;
	# create the inversion map once; a state variable
	# can only be initialized as a scalar
	state $inv = _make_age_inverted_list();

	# find the index of the range that contains this code point
	my $i = Unicode::UCD::_search_invlist($inv->[0], ord $char);
	return $inv->[1][$i];
	}

# build the inversion map for the Age property, once
sub _make_age_inverted_list {
	state( $list, $map, $format, $default, $init );
	unless( $init++ ) {
		($list, $map, $format, $default) = Unicode::UCD::prop_invmap("Age");
		# "s" means every map entry is a simple scalar
		$format eq "s" || die "wrong format $format";
		}
	return [ $list, $map ];
	}

That looks like a lot of work, but most of it happens only once, to set up the inversion map.
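If the idea of an inversion map is new to you, here is a rough sketch of what the lookup amounts to. This is not the Unicode::UCD implementation, and the data is made up rather than taken from the real Age property. The names @starts, @values, find_range, and value_of are hypothetical ones I chose for the illustration. One array holds the first code point of each range, a parallel array holds the value for that range, and a binary search finds the range that contains the code point you care about:

```perl
use strict;
use warnings;

# Toy inversion map (made-up data, not real Unicode ages).
# $starts[$i] is the first code point of range $i; the range runs up to,
# but does not include, $starts[$i + 1]. The last range is open-ended.
my @starts = ( 0x0000, 0x0100, 0x2800, 0x1F600 );
my @values = ( 'low',  'mid',  'high', 'top' );

# Hypothetical helper: return the index of the range that contains $code,
# that is, the largest $i where $starts->[$i] <= $code.
sub find_range {
    my ( $starts, $code ) = @_;
    return if !@$starts || $code < $starts->[0];    # before the first range

    my ( $lo, $hi ) = ( 0, $#$starts );
    while ( $lo < $hi ) {
        # round up so the search window shrinks on every pass
        my $mid = int( ( $lo + $hi + 1 ) / 2 );
        if   ( $starts->[$mid] <= $code ) { $lo = $mid }
        else                              { $hi = $mid - 1 }
    }
    return $lo;
}

# Look up the value for a code point in the toy map.
sub value_of {
    my ( $code ) = @_;
    my $i = find_range( \@starts, $code );
    return defined $i ? $values[$i] : undef;
}

printf "U+%04X => %s\n", $_, value_of( $_ )
    for ( 0x61, 0x2192, 0x28FD, 0x1F63A );
```

The two parallel arrays play the same roles as the list and map that prop_invmap returns, and find_range stands in for _search_invlist. Recent versions of Unicode::UCD also document a public search_invlist, so check the documentation for your perl before relying on the underscored name.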


3 Comments.

  1. It might be my USA mindset but it sure seems like to do Unicode anywhere near right it is a full time job. Sort of like needing a DBA for database stuff (if you’re doing serious DB stuff).

    Wow.

    • I think it might be like most topics: it seems really complicated and a lot of effort until you get used to it. Pointers in C, regexes in Perl, and so on seem like second nature once you get used to them. Even though Unicode stuff has been around for two decades, no one has really cared until the last couple of years. Virtually no one has had a chance to get used to it yet, and even Tom Christiansen, who’s done the most to teach me about Unicode, discovers new things every day.

  2. It’s useful. I just tried Unicode::UCD with Chinese characters, and it works very well.
