Perl v5.18 adds character class set operations

Perl v5.18 added experimental character class set operations, a requirement for full Unicode support according to Unicode Technical Standard #18, which specifies what a compliant regular expression implementation must support and divides those requirements into three levels.

The perlunicode documentation lists each requirement and its status in Perl. Besides some regular expression anchors handling all forms of line boundaries (which might break older programs), set subtraction and intersection in character classes were the last features Perl needed to be Level 1 compliant.

Perl calls this experimental feature “Extended Bracketed Character Classes” in perlrecharclass. Inside the (?[ ]), a regular expression does character class set operations. Inside the brackets, whitespace is insignificant (as if /x is on). Here’s a simple example to find the character z:

use v5.18;
no warnings qw(experimental::regex_sets);

my $regex = qr/(?[ [z] ])/;

while( <DATA> ) {
	chomp;
	say "[$_] ", /$regex/ ? 'Matched' : 'Missed';
	}

__DATA__
This is a line
This is the next line
And here's another line

None of the input lines have a letter z, so nothing matches:

[This is a line] Missed
[This is the next line] Missed
[And here's another line] Missed

To add more characters to the set, you would traditionally (and still can) add them inside the same set of brackets. If you want to find an x too, you add that next to the z:

use v5.18;
no warnings qw(experimental::regex_sets);

my $regex = qr/(?[ [xz] ])/;

while( <DATA> ) {
	chomp;
	say "[$_] ", /$regex/ ? 'Matched' : 'Missed';
	}

__DATA__
This is a line
This is the next line
And here's another line

And now the middle input line matches:

[This is a line] Missed
[This is the next line] Matched
[And here's another line] Missed

But you can also do this with set math. Since you want either of those to match, you take a union. Inside the (?[ ]), a + is the union operator (the | is also a union operator). Almost everything inside (?[ ]) is a metacharacter, which is why you had to have another set of brackets around the literal characters in the previous example:

use v5.18;
no warnings qw(experimental::regex_sets);

my $regex = qr/(?[ [x] + [z] ])/;

while( <DATA> ) {
	chomp;
	say "[$_] ", /$regex/ ? 'Matched' : 'Missed';
	}

__DATA__
This is a line
This is the next line
And here's another line

The output is the same as before because it’s the same character class:

[This is a line] Missed
[This is the next line] Matched
[And here's another line] Missed
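To check that the three spellings really describe the same set, you can run them side by side. This is a minimal sketch: the %regex hash and the matching_patterns helper are names invented here for illustration, and the lines are held in an array instead of DATA.

```perl
use v5.18;
no warnings qw(experimental::regex_sets);

# Three spellings that should describe the same set of characters
my %regex = (
	plain => qr/[xz]/,
	plus  => qr/(?[ [x] + [z] ])/,
	pipe  => qr/(?[ [x] | [z] ])/,
	);

my @lines = (
	'This is a line',
	'This is the next line',
	"And here's another line",
	);

# Return the names of the patterns that match a line, in sorted order
sub matching_patterns {
	my( $line ) = @_;
	return grep { $line =~ $regex{$_} } sort keys %regex;
	}

foreach my $line ( @lines ) {
	my @names = matching_patterns( $line );
	say "[$line] ", @names ? "Matched by @names" : 'Missed';
	}
```

If the spellings are equivalent, every line is either matched by all three patterns or missed by all three.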

You can also do intersections with the &. In this example, you have two separate character classes. Each has at least one character that appears in every input line, but the two classes have only one character in common:

use v5.18;
no warnings qw(experimental::regex_sets);

my $regex = qr/(?[ [sxy] & [exw] ])/;


while( <DATA> ) {
	chomp;
	say "[$_] ", /$regex/ ? 'Matched' : 'Missed';
	}

__DATA__
This is a line
This is the next line
And here's another line

Their intersection is only x, so only that character matches and you get the same output again:

[This is a line] Missed
[This is the next line] Matched
[And here's another line] Missed
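One way to see which characters end up in a set is to test each candidate character against the pattern. The following is a sketch under that idea; the members helper is a hypothetical name, and it only looks at the lowercase ASCII letters. It lists the intersection of the two classes next to their union for comparison:

```perl
use v5.18;
no warnings qw(experimental::regex_sets);

my $intersection = qr/(?[ [sxy] & [exw] ])/;
my $union        = qr/(?[ [sxy] + [exw] ])/;

# Return the lowercase ASCII letters that a pattern matches, as one string
sub members {
	my( $regex ) = @_;
	return join '', grep { $_ =~ $regex } 'a' .. 'z';
	}

say 'Intersection: ', members( $intersection );
say 'Union:        ', members( $union );
```

The intersection should contain only x, while the union holds every letter from either class.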

The - is the set subtraction operator. In this example, the first character class is the Perl word characters. You subtract from that the ASCII alphabetical characters, leaving the digits, the underscore, and any non-ASCII word characters:

use v5.18;
no warnings qw(experimental::regex_sets);

my $regex = qr/(?[ [\w] - [a-zA-Z] ])/;


while( <DATA> ) {
	chomp;
	say "[$_] ", /$regex/ ? 'Matched' : 'Missed';
	}

__DATA__
This is 1 line
This is the next line
And here's another line

Only the first line has a digit, so only it matches:

[This is 1 line] Matched
[This is the next line] Missed
[And here's another line] Missed
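A quick way to see what survives the subtraction is to test a few code points individually. This is a minimal sketch; the sample code points and the survives_subtraction helper are chosen here for illustration and are not part of the original example. Since \w follows Unicode rules inside (?[ ]), it should also cover non-ASCII word characters such as U+00E9 (a Latin letter with an accent) and U+0661 (an Arabic-Indic digit):

```perl
use v5.18;
no warnings qw(experimental::regex_sets);

my $regex = qr/(?[ [\w] - [a-zA-Z] ])/;

# Candidate characters, given as code points so the source stays ASCII
my @code_points = ( ord('a'), ord('Z'), ord('7'), ord('_'), 0xE9, 0x661 );

# True if the character for this code point is still in the set
sub survives_subtraction {
	my( $code_point ) = @_;
	return chr( $code_point ) =~ $regex ? 1 : 0;
	}

foreach my $code_point ( @code_points ) {
	printf "U+%04X %s\n", $code_point,
		survives_subtraction( $code_point ) ? 'Matched' : 'Missed';
	}
```

The ASCII letters should be missed and everything else in the list matched, so if you want only ASCII digits and the underscore you would need to intersect with \p{ASCII} as well.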

This gets more interesting with named properties, the only Level 2 feature Perl supports so far (see perluniprops). Some character classes may be easier to construct, read, and maintain without listing their literal characters. Suppose you want to get just the Eastern Arabic digits, perhaps because you’re in a country that uses Arabic, as I am as I write this. You can take the intersection of the Arabic property and the Digit property. The Universal Character Set has this wonderful feature of assigning many labels to its characters, so you can identify subsets of a particular script:

use v5.18;
use utf8;
use open qw(:std :utf8);

no warnings qw(experimental::regex_sets);

my $regex = qr/(?[ \p{Arabic} & \p{Digit} ])/;

foreach my $ord ( 0 .. 0x10fffd ) {
	my $char = chr( $ord );
	say $char if $char =~ m/$regex/;
	}

Now you see just the digits from that script:

۰
۱
۲
۳
۴
۵
۶
۷
۸
۹
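If your terminal can’t display these characters, you can print their code points and names instead. This sketch uses the core charnames module; the describe_if_matched helper is a hypothetical name, and restricting the scan to the Arabic block (U+0600 to U+06FF) is an assumption made here to keep the run short. Which characters \p{Arabic} covers can differ between Perl versions, so the exact list may vary:

```perl
use v5.18;
use charnames ();

no warnings qw(experimental::regex_sets);

my $regex = qr/(?[ \p{Arabic} & \p{Digit} ])/;

# Describe a code point as "U+XXXX NAME" if its character is in the set;
# return nothing otherwise
sub describe_if_matched {
	my( $ord ) = @_;
	return unless chr( $ord ) =~ $regex;
	my $name = charnames::viacode( $ord ) // '<unnamed>';
	return sprintf 'U+%04X %s', $ord, $name;
	}

# Only scan the Arabic block (an assumption to keep the run short)
foreach my $ord ( 0x0600 .. 0x06FF ) {
	my $description = describe_if_matched( $ord );
	say $description if defined $description;
	}
```

The output is plain ASCII, so it survives any terminal or log file.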

You can get more complicated. Suppose you want the Western Arabic digits too (what we normally call just “Arabic numerals”). Although part of this problem is easy without set operations, that doesn’t show off the operations. In this example, you have two separate intersections that are joined in a union:

use v5.18;
use utf8;
use open qw(:std :utf8);

no warnings qw(experimental::regex_sets);

my $regex = qr/(?[ 
	( \p{Arabic} & \p{Digit} ) 
		+ 
	( \p{ASCII}  & \p{Digit} ) 
	])/;

foreach my $ord ( 0 .. 0x10fffd ) {
	my $char = chr( $ord );
	say $char if $char =~ m/$regex/;
	}

Now you see two sets of numerals:

0
1
2
3
4
5
6
7
8
9
۰
۱
۲
۳
۴
۵
۶
۷
۸
۹

There is one more character class set operator, the ^, which acts like an exclusive-or (the xor bit operator uses the same character). This operator takes the union of the two character classes then subtracts their intersection. That is, the resulting set has all the characters that are in either class except for the ones that are in both.

In this example, you have two intersections to extract the hex digits and digits from ASCII. That’s important since other scripts in the UCS have characters with these properties. From those intersections, you use the ^ to get the set that only contains the characters that show up in exactly one set.

use v5.18;
use utf8;
use open qw(:std :utf8);

no warnings qw(experimental::regex_sets);

my $regex = qr/(?[ 
	( \p{ASCII} & \p{HexDigit} )
		^ 
	( \p{ASCII} & \p{Digit} )
	])/;

foreach my $ord ( 0 .. 0x10fffd ) {
	my $char = chr( $ord );
	say $char if $char =~ m/$regex/;
	}

In this case, it’s the uppercase and lowercase letters A through F:

A
B
C
D
E
F
a
b
c
d
e
f
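The description of ^ as a union minus an intersection can be checked directly. In this sketch, the literal classes [0-9A-Fa-f] and [0-9] stand in for the ASCII hex digits and ASCII digits (an assumption made to keep the example short), and ascii_matches is a hypothetical helper. Explicit parentheses avoid relying on operator precedence:

```perl
use v5.18;
no warnings qw(experimental::regex_sets);

# The symmetric difference written with the ^ operator
my $xor = qr/(?[ [0-9A-Fa-f] ^ [0-9] ])/;

# The same set spelled out as (union) minus (intersection)
my $by_hand = qr/(?[
	( [0-9A-Fa-f] + [0-9] )
		-
	( [0-9A-Fa-f] & [0-9] )
	])/;

# Return the ASCII characters that a pattern matches
sub ascii_matches {
	my( $regex ) = @_;
	return grep { $_ =~ $regex } map { chr } 0 .. 0x7F;
	}

my @from_xor     = ascii_matches( $xor );
my @from_by_hand = ascii_matches( $by_hand );

say "xor:     @from_xor";
say "by hand: @from_by_hand";
say "@from_xor" eq "@from_by_hand" ? 'Same set' : 'Different sets';
```

Both patterns should list the same twelve letters, since the digits are in both classes and drop out.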

Things to remember

  • Regular expression character class set operations satisfy UTS #18 Level 1 requirements.
  • You can compose character classes from other classes with unions, intersections, and subtractions.
  • Inside the (?[ ]), whitespace is insignificant.
  • regex_sets is an experimental feature.

Don’t use named lexical subroutines

Perl v5.18 allows you to define named subroutines that exist only in the current lexical scope. These act (almost) just like the regular named subroutines that you already know about from Learning Perl, but also like the lexical variables that have limited effect. The problem is that the feature is almost irredeemably broken, which you’ll see at the end of this Item. » Read more…

Enforce ASCII semantics when you only want ASCII

When Perl made regexes more Unicode aware, starting in v5.6, some of the character class definitions and match modifiers changed. What you expect \d, \s, or \w to match is more expansive now (Know your character classes under different semantics). Most of us probably didn’t notice because the range of our inputs is limited. » Read more…

Perl v5.16 now sets proper magic on lexical $_

Perl v5.10 introduced given and the lexical $_. That use of $_, which everyone has assumed is a global variable, turned out to be a huge mistake. The various bookkeeping on the global version didn’t happen with the lexical version, so strange things happened. » Read more…

Use a computed label with loop controllers

Not sure which loop you want to break out of? Perl v5.18 makes that easy with computed labels. The value you give next, last, and redo no longer has to be a literal. You could already do this with goto, but now you can give the loop controllers an expression. » Read more…

Perl v5.20 fixes taint problems with locale

Perl v5.20 fixes taint checking in regular expressions that might use the locale in its pattern, even if that part of the pattern isn’t a successful part of the match. The perlsec documentation has noted that taint-checking did that, but until v5.20, it didn’t.

The only approved way to untaint a variable is through a successful pattern match with captures: » Read more…

Use postfix dereferencing

Perl v5.20 offers an experimental form of dereferencing. Instead of the complicated way I’ll explain in a moment, the new postfix form turns a reference into its contents. Since this is a new feature, you need to pull it in with the feature pragma (although this feature is undocumented in the pragma docs) (Item 2. Enable new Perl features when you need them.) and turn off the experimental warnings: » Read more…

Perl v5.20 combines multiple my() statements

Perl v5.20 continues to clean up and optimize its internals. Now perl optimizes a series of lexical variable declarations into a single list declaration. » Read more…

In v5.20, -F implies -a implies -n

Perl was once known for its one-liners in its sysadmin heyday. People would pass around lists of these one-liners, many of which replaced complicated pipelines that glued together various Unix utilities to do some impressive system maintenance. » Read more…

Perl 5.20 introduces “Key/Value Slices”

Perl v5.20 adds the “Key/Value Slice”, which extracts multiple keys and their corresponding values from a container (hash or array). It uses the %, which is new, legal syntax for a variable name with subscripts after it: » Read more…
