Insignificant whitespace in brace constructs

This is a chapter in Perl New Features, a book from Perl School that you can buy on LeanPub or Amazon. Your support helps me to produce more content.



Perl’s coterie of brace constructs becomes a bit more lenient in v5.34. These things appear in double-quotish constructs, such as \N{CHARNAME} to specify a character by name. Patterns also count as double-quoted constructs (unless you use ' as the delimiter), so these new rules apply to brace constructs such as \k{} (for named backreferences) and the general quantifier, {n,m}.


Specifying characters

These constructs work in double-quotish contexts to specify a character by its code point or name:

Construct      Description        Item
\N{CHARNAME}   Character name     item
\o{177}        Octal code point   item
\x{ABCD}       Hex code point

There are already loose names for \N{} that ignore whitespace (item), but this feature is a bit different. It ignores horizontal whitespace around a value (but not inside a value):

use v5.34;
use open qw(:std :utf8);

say <<~"HERE";
	Spade suit: \N{ BLACK SPADE SUIT }
	Octal:      \o{ 23140 }
	Hex:        \x{ 2660 }
	HERE

This outputs the character we expect:

$ perl5.34.0 whitespace.pl
Spade suit: ♠
Octal:      ♠
Hex:        ♠
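
As a quick check, here's a minimal sketch that compares each padded form against the unpadded \x{2660}. The variable names are mine and not part of the original program, and it assumes perl v5.34 or later:

```perl
use v5.34;

# Hypothetical variable names, chosen for illustration
my $plain  = "\x{2660}";
my $padded = "\x{ 2660 }";
my $octal  = "\o{ 23140 }";
my $named  = "\N{ BLACK SPADE SUIT }";

# Every padded form should be the same single character as the plain one
foreach my $string ( $padded, $octal, $named ) {
	say $string eq $plain ? 'same' : 'different';
	}
```

If the surrounding spaces really are insignificant, each line of output says "same".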

If you add space within the value, you don't get the character you want (the \N{} will actually fail):

use v5.34;
use open qw(:std :utf8);

say <<~"HERE";
	Octal:    \o{ 231 40 }
	Hex:      \x{ 26 60 }
	HERE

Perl stops reading the number at the first character that isn't a valid digit and discards the rest (just like Perl's string-to-number conversions). The code above is effectively:

use v5.34;
use open qw(:std :utf8);

say <<~"HERE";
	Octal:    \o{ 231 }
	Hex:      \x{ 26 }
	HERE
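
The comparison to string-to-number conversion can be sketched on its own. This is only an analogy for how the digits get truncated, not the code path that \o{} and \x{} use, and the variable names are made up for the example:

```perl
use v5.10;
no warnings 'numeric';   # don't complain about the non-numeric tail

# Hypothetical example value with a space in the middle
my $string = '231 40';

# Numeric conversion reads the leading digits and stops at the space
my $number = $string + 0;

say $number;   # 231
```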

It's even worse. You can add extra nonsense after the code number and v5.34 will ignore it. Although these have illegal digits (along with the internal space), they still work:

use v5.34;
use open qw(:std :utf8);
use warnings;

say <<~"HERE";
	Octal:    \o{ 231 abc }
	Hex:      \x{ 26 xyz }
	HERE

With trailing tabs or spaces, warnings tells you that Perl ignores the cruft and uses what it has read so far:

Non-octal character ' ' terminates \o early.  Resolved as "\o{231}" at ...

With leading tabs or spaces, earlier Perls give up right away and use the null character. The warning from v5.32 is this:

Non-octal character ' ' terminates \o early.  Resolved as "\o{000}" at...

Finally, the whitespace can't be vertical space or other double-quote escapes (it's just literal tabs or spaces). These don't work:

\o{\t231}
\o{
	231 }

In regular expressions, this fails before Perl interprets the pattern, which is where the /x would have been able to handle the vertical whitespace. This would match a null byte because the string-to-number parsing stops at the first newline, returning \000:

m/\o{
	231
	}/x;

In regular expressions

These constructs are regular expression features, and you don't need the /x flag to get this new, insignificant whitespace:

Construct      Description              Chapter
\b{TYPE}       Word boundary            Item
\g{N}          Numbered backreference   Item 31 (book)
\g{NAME}       Named backreference      Item 31 (book)
\k{NAME}       Named backreference      Item 31 (book)
\p{PROPNAME}   Unicode property name
\P{PROPNAME}   Unicode property name
\x{ABCD}       Hex code point
{n,m}          General quantifier

The rules for these are the same as those from the previous section. Perl ignores the tabs or spaces at the beginning or the end, but not in the middle (aside from around the , in {n,m}). For example, these all work:

use v5.34;
use open qw(:std :utf8);
use warnings;

$_ = 'aa';

# Each pattern should match 'aa', even with spaces inside the braces
my @patterns = (
	qr/(.)\g{ -1 }/,
	qr/(?<first>.)\g{ first }/,
	qr/(?<first>.)\k{ first }/,
	qr/\b{ sb }(.)/,
	qr/(\o{ 141 })\g{ -1 }/,
	qr/(\p{Letter})\g{ -1 }/,
	qr/(.)\g{ -1 }/,
	qr/(\x{ 61 })\g{ -1 }/,
	);

foreach my $pattern ( @patterns ) {
	say /$pattern/
	};
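
The list above doesn't exercise the general quantifier, so here's a small sketch of that case. It assumes perl v5.34 or later, and the pattern and variable names are made up for the example. It compares a padded {n,m} against the unpadded form on a few strings:

```perl
use v5.34;

# Hypothetical patterns: the same quantifier with and without padding
my $padded = qr/\Aa{ 2, 3 }\z/;
my $plain  = qr/\Aa{2,3}\z/;

my @agree;
foreach my $string ( qw(a aa aaa aaaa) ) {
	my $padded_matched = $string =~ $padded ? 1 : 0;
	my $plain_matched  = $string =~ $plain  ? 1 : 0;
	push @agree, $padded_matched == $plain_matched ? 1 : 0;
	say "$string: padded=$padded_matched plain=$plain_matched";
	}
```

If the spaces are insignificant, the two columns agree on every line.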

“Up to N” matches




Perl’s general regex quantifier, {n,m}, takes a minimum and maximum number of matches. If you leave out the maximum number, like {n,}, you have to match the preceding thing at least n times, but it can match as many more times as it is able to: the maximum is unbounded.
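
Here's a minimal sketch of those two existing forms, with variable names invented for the example. Both quantifiers are greedy, so each takes as many characters as its upper bound allows:

```perl
use v5.10;

# Hypothetical sample string
my $string = 'aaaaa';

my( $bounded )   = $string =~ /(a{2,3})/;   # at least 2, at most 3
my( $unbounded ) = $string =~ /(a{2,})/;    # at least 2, no upper limit

say length $bounded;     # 3
say length $unbounded;   # 5
```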


Specify octal numbers with the 0o prefix

Perl v5.34 allows you to specify octal literals with the 0o prefix, as in 0o123_456. This is consistent with the existing constructs to specify hexadecimal literal 0xddddd and binary literal 0bddddd. The builtin oct() function accepts any of these forms.

Previously, you specified octal with just a leading zero:

chmod 0644, $file;
mkdir $dir, 0755;

Now you can do that with an extra character that specifies the base:

chmod 0o644, $file;
mkdir $dir, 0o755;

This makes it consistent with 0b for binary and 0x for hexadecimal. See “Scalar value constructors” in perldata.
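
To see that the forms agree, here's a small sketch that assumes perl v5.34 or later. The variable names and the sample strings are made up for the example:

```perl
use v5.34;

# The same permission value, with and without the new prefix
my $legacy = 0644;
my $modern = 0o644;
say $legacy == $modern ? 'same' : 'different';

# oct() should turn each of these strings into the same number
my @strings = qw( 0o644 0644 644 0x1A4 0b110100100 );
my @values  = map { oct($_) } @strings;
say "@values";
```

Each value comes out as 420, the decimal form of octal 644.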

And, remember that v5.14 added the \o{NNN} notation to specify characters by their octal number. We’re still waiting for octal floating point values (we got the hex version in v5.22), but don’t hold your breath.

Perhaps we’ll get 0d sometime so that all the bases are covered.

Match Unicode property values with a wildcard

Not all characters with a numerical value are “digits” in the Perl sense. You saw this in “Match Unicode characters by property value”, where you selected characters by a Unicode property value. That showed up again in “Perl v5.18 adds character class set operations”. Perl v5.30 adds a slightly easier way to get at multiple numerical values at the same time. Now you can match Unicode property values with wildcards (Unicode TR 18), which are sorta like Perl patterns. Don’t get too excited, though, because these can be expensive.


Match Unicode character names with a pattern

Perl has some of the best Unicode support out there, and it keeps getting better. Perl v5.32 supports Unicode 13, and you can now apply patterns to character names. You probably don’t want to do that though.

First, the Unicode Character Database catalogs each character, giving it a code number, a name, and many other properties.


Turn off indirect object notation

Perl v5.32 adds a way to turn off a Perl feature that you shouldn’t use anyway. You can still use this feature, but now there’s a way to take it away from you. And, with the recent Perl 7 announcement, we see why. Eventually Perl wants to get rid of indirect object notation (and I explain that more in Preparing for Perl 7).


Chain comparisons to avoid excessive typing

Checking that a value is between two others involves two comparisons, and so far in Perl that’s meant that you’ve had to type one of the values more than once. That gets simpler in v5.32 with chained comparisons. This would make Perl one of the few languages that support the feature. So far it’s implemented in v5.31.10, and until v5.32 is actually released, it isn’t a real feature.
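
As a rough sketch of the difference, assuming a released perl v5.32 or later, this collects the numbers that fall inside a range both ways. The range and variable names are made up for the example:

```perl
use v5.32;

my( @chained, @spelled_out );
foreach my $n ( 0, 5, 10, 11 ) {
	# The chained form mentions $n once
	push @chained, $n if 1 <= $n <= 10;

	# The older form has to repeat $n
	push @spelled_out, $n if 1 <= $n && $n <= 10;
	}

say "@chained";       # 5 10
say "@spelled_out";   # 5 10
```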


Use a variable-width lookbehind if it won’t match more than 255 characters

In Ignore part of a substitution’s match, I showed you the match resetting \K—it’s basically a variable-width positive lookbehind assertion. It’s a special feature to work around Perl’s lack of variable-width lookbehinds. However, v5.30 adds an experimental feature to allow a limited version of a variable-width lookbehind.
