Match Unicode property values with a wildcard

Not all characters with a numerical value are “digits” in the Perl sense. You saw that in Match Unicode characters by property value, where you selected characters by a Unicode property value, and again in Perl v5.18 adds character class set operations. Perl v5.30 adds a slightly easier way to get at multiple numerical values at the same time: you can now match Unicode property values with wildcards (Unicode TR 18), which are sorta like Perl patterns. Don’t get too excited, though, because these can be expensive.

Here’s what you saw before to match a character that has a particular value for a particular Unicode property:

use v5.12;
use open qw(:std :utf8);

use Unicode::UCD;
say "Unicode ", Unicode::UCD::UnicodeVersion();

foreach ( 1 .. 0x10fffd ) {
	next unless chr =~ m/ \p{Numeric_Value: $ARGV[0]} /x;
	printf "%s (U+%04X)\n", chr, $_;
	}

And then, you saw that you can match one of several numerical values by putting the property values in a character class:

use v5.18;
no warnings qw(experimental::regex_sets);

use open qw(:std :utf8);

use Unicode::UCD;
say "Unicode ", Unicode::UCD::UnicodeVersion();

foreach ( 1 .. 0x10fffd ) {
	next unless chr =~ m/ [\p{nv=1}\p{nv=3}\p{nv=7}] /x;
	printf "%s (U+%04X)\n", chr, $_;
	}

That’s not all that great. You have to repeat the \p{nv=} for each value. Well, you had to, and now you don’t. Perl v5.30 allows the property value to be a wildcard. This looks almost like a Perl pattern:

my $regex = qr| \p{nv=/[137]/} |x;

That’s not really the match operator there, and there are many restrictions on what you can do in the pattern. For example, you can’t use a closing brace in your pattern because Perl thinks that it closes the \p{nv=. You can also use an alternate delimiter for the wildcard so you don’t need to change the delimiter for the overall pattern:

my $regex = qr/ \p{nv=:[137]:} /x;

You get only a limited subset of Perl pattern syntax, summarized in perlunicode:

  • You can’t have match modifiers outside the pattern. No /a b c/ix, but you can have /(?ix) a b c/.
  • Alternate delimiters must be punctuation, but not {} because the braces interfere with \p{}.
  • No zero-or-more quantifiers.
  • No \p{} inside a \p{}.
  • You automatically get insignificant whitespace around the wildcard: \p{nv= :[0-5]:}
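Here’s a small sketch that puts several of those restrictions into one pattern. It follows the shape of the example in the Perl documentation: the (?x) sits inside the wildcard, and the wildcard has spaces around it. The \A and \z anchors are my addition to the [137] example. I’m assuming the wildcard is tested against the text of each property value, in which case an unanchored [137] might also accept values such as 13 or 1/3. The sample characters are just ones I picked for illustration.

```perl
use v5.30;
use open qw(:std :utf8);

# Wildcards are experimental in v5.30, so silence that warning
no warnings qw(experimental::uniprop_wildcards);

# The modifier goes inside the wildcard as (?x); \A and \z are there
# on the assumption that the subpattern sees the whole value as text.
my $regex = qr!\p{nv= /(?x) \A [137] \z / }!;

# Hand-picked samples: ASCII digits, ROMAN NUMERAL SEVEN, and
# VULGAR FRACTION ONE HALF
foreach my $char ( '1', '3', '7', '4', "\x{2166}", "\x{00BD}" ) {
	printf "%s (U+%04X) %s\n", $char, ord $char,
		$char =~ $regex ? 'matches' : 'does not match';
	}
```

If you drop the anchors, compare the output before you trust the shorter pattern.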

All of that is nice, but realize that you are actually matching that pattern against all the possible values of that Unicode property. That can be an expensive operation. The savings in typing might not be worth it.

Match Unicode character names with a pattern

Perl has some of the best Unicode support out there, and it keeps getting better. Perl v5.32 supports Unicode 13, and you can now apply patterns to character names. You probably don’t want to do that though.

First, the Unicode Character Database catalogs each character, giving it a code number, a name, and many other properties.
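To see that pairing of code number and name for yourself, here’s a minimal sketch that uses the core charnames module to look up names by code point. It doesn’t use the v5.32 name patterns at all, and the three code points are arbitrary choices of mine.

```perl
use v5.16;
use open qw(:std :utf8);
use charnames ();

# Arbitrary sample code points; viacode returns the character's name
foreach my $code ( 0x41, 0x263A, 0x1F600 ) {
	printf "U+%04X %s %s\n", $code, chr $code, charnames::viacode($code);
	}
```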

» Read more…

Perl 7 is coming

Perl 7 is coming. In short, it’s v5.32 with different defaults. No new features, no fewer features. Different settings. Sawyer X envisions a release within the next year, maybe sooner.

I’ve made an announcement and written a provisional book on what you need to do. Many of the features and tips I’ve already written about here will help you prepare for Perl 7.

Perl 5.32 was just released, so your first step is to get your code working under that version. After that you should be in pretty good shape.


Turn off indirect object notation

Perl v5.32 adds a way to turn off a Perl feature that you shouldn’t use anyway. You can still use this feature, but now there’s a way to take it away from you. And, with the recent Perl 7 announcement, we see why. Eventually Perl wants to get rid of indirect object notation (and I explain that more in Preparing for Perl 7).
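As a rough sketch of what that looks like, you turn off the indirect feature and keep calling methods with the arrow. The Local::Counter class here is a made-up example, not something from the full article.

```perl
use v5.32;
no feature qw(indirect);

# Hypothetical class, only here to have something to construct
package Local::Counter {
	sub new {
		my( $class, %args ) = @_;
		return bless { count => $args{start} // 0 }, $class;
		}
	sub count { $_[0]{count} }
	}

# The direct method call works whether or not the feature is on
my $counter = Local::Counter->new( start => 5 );
say $counter->count;

# With the feature off, Perl no longer treats this as a method call:
# my $other = new Local::Counter( start => 5 );
```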

» Read more…

Chain comparisons to avoid excessive typing

Checking that a value is between two others involves two comparisons, and so far in Perl that’s meant that you’ve had to type one of the values more than once. That gets simpler in v5.32 with chained comparisons. This would make Perl one of the few languages that support the feature. So far it’s implemented in v5.31.10, and until v5.32 is actually released, it isn’t a real feature.
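A minimal sketch of the difference, assuming you’re running a released v5.32 or later. The in_range name and the bounds are mine, chosen only for illustration.

```perl
use v5.32;

# Hypothetical helper: the chained form mentions $n only once
sub in_range {
	my( $n ) = @_;
	return 1 <= $n <= 10;
	}

foreach my $n ( 0, 1, 7, 10, 11 ) {
	# The older spelling types $n twice
	my $old = ( 1 <= $n && $n <= 10 ) ? 'in' : 'out';
	my $new = in_range( $n )          ? 'in' : 'out';
	say "$n: $old / $new";
	}
```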

» Read more…

Use a variable-width lookbehind if it won’t match more than 255 characters

In Ignore part of a substitution’s match, I showed you the match resetting \K—it’s basically a variable-width positive lookbehind assertion. It’s a special feature to work around Perl’s lack of variable-width lookbehinds. However, v5.30 adds an experimental feature to allow a limited version of a variable-width lookbehind.
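Here’s a small side-by-side sketch of the two approaches. The sample string and the \s{0,3} bound are made up for the example. The lookbehind needs v5.30, where it’s still experimental, and it can be variable width only because its longest possible match stays under the 255-character limit.

```perl
use v5.30;
no warnings qw(experimental::vlb);

my $string = 'version=1.2.3; build=77';

# \K keeps everything matched before it out of the replaced portion
( my $bumped = $string ) =~ s/version=\s*\K[\d.]+/2.0.0/;

# The lookbehind can match 8 to 11 characters, so it is variable
# width but bounded well below 255
( my $same = $string ) =~ s/(?<=version=\s{0,3})[\d.]+/2.0.0/;

say $bumped;
say $same;
```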

» Read more…

Perl v5.32 New Features

Perl v5.32 is out and it has some interesting new features. The previous major releases focussed more on finally removing deprecations and shoring up odd cases, and you still find a few of those in this release. Full details, as always, are in the perldelta.

Sawyer X just announced Perl 7 as a major version jump that relabels what is now v5.32. If your code is ready for v5.32, you should be mostly ready for Perl 7.


Use the infix class instance operator

Perl v5.32 adds Paul Evans’s infix isa operator, the “class instance operator”. One of the delightful things to note about this addition is that it is one of the features whose development took place almost entirely through a GitHub issue and pull request. GitHub is now the primary repository for the Perl code, and has been since October 2019. This is a feature that I’ll want to use right away in new production code.
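A quick sketch of the operator in use. In v5.32 you have to enable the isa feature yourself and silence its experimental warning. The Local::Animal and Local::Dog classes are hypothetical stand-ins.

```perl
use v5.32;
use feature qw(isa);
no warnings qw(experimental::isa);

# Hypothetical classes for the demonstration
package Local::Animal { sub new { bless {}, shift } }
package Local::Dog    { our @ISA = ('Local::Animal') }

my $dog = Local::Dog->new;
my $nothing;

say 'the dog is an animal'      if     $dog isa Local::Animal;
say 'a plain string is not'     unless 'Local::Dog' isa Local::Animal;
say 'an undefined value is not' unless $nothing isa Local::Animal;
```

The left side doesn’t have to be an object. Anything that isn’t one simply gives you false, so you don’t need to guard the test with blessed or an eval.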

» Read more…

Perl 5.30 fixes single quoted qr'' with \N{}

The qr// operator allows you to compile a regular expression without applying it to anything. You get the pattern without the match, and you can reuse the pattern as often as you like. Before v5.30, it had an inconsistency with \N{} sequences, but that’s fixed now.

» Read more…

No more false postfix lexical declarations in v5.30

Before Perl v5.10 introduced state variables, people did various things to create persistent lexical variables for a subroutine. With v5.30, one of those constructs is now a fatal error.

Often you want a persistent variable to be scoped and private to a subroutine. But, once you leave that scope, normal lexical variables disappear because their reference count drops to zero. So, no persistence.
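For comparison, here’s a minimal sketch of the v5.10 way to get that persistence with a state variable. The next_id name is made up. As I understand the v5.30 change, the construct that’s now fatal is the old trick of declaring a lexical in a false conditional, such as my $id if 0.

```perl
use v5.10;

# Hypothetical counter; $id is private to the subroutine
sub next_id {
	state $id = 0;   # initialized once, then kept between calls
	return ++$id;
	}

say next_id() for 1 .. 3;
```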

» Read more…