Understand the order of operations in double quoted contexts

Perl’s powerful string manipulation tools include case-shifting operators that change parts of a double-quoted string. Many other things happen in a double-quoted string too, so you need to know how these operators fit in with them.

A double-quoted string has three features:

  • Variable interpolation
  • Escaped and logical characters
  • Case shift operators

You might have missed this because the documentation doesn’t emphasize it. There is a single sentence in perlop, and it appears in the discussion of the regular expression operators and \Q:

For double-quoted strings, the quoting from \Q is applied after interpolation and escapes are processed.

If you don’t pay attention to the order of these operations, you’ll get results that you might not expect. The problem is that the order of operations isn’t the same in all double-quoted contexts.

In strings, the order of operations is the same as listed earlier:

  • Variable interpolation
  • Escaped and logical characters
  • Case shift operators

Variable interpolation

You already know about variable interpolation. This is one of Perl’s greatest features, and the one I miss the most when I have to use a different language:

my $cat = 'Roscoe';
my $string = "Buster $cat Mimi";

In a double quoted context, Perl substitutes the value of $cat. You end up with Buster Roscoe Mimi.

Case-shift operators

The case-shift operators change parts of a double-quoted string. Although we call them “case shift”, not all of them change the case.

Operator     Effect                             Function equivalent
\U           Uppercase everything following     uc
\u           Uppercase the next character       ucfirst
\L           Lowercase everything following     lc
\l           Lowercase the next character       lcfirst
\F (v5.16)   Case-fold everything following     fc
\Q           Quote metacharacters               quotemeta
\E           Stop whatever you were doing       (none)

The \F and fc are new for the yet unreleased Perl v5.16. Those will show up in a different Item. Notice there’s no \f for a fcfirst. That double-quoted sequence already means “form feed”, the instruction to printers to stop the current page and start a new page.
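If you want to check the function equivalents for yourself, here is a minimal sketch that compares each double-quoted form with its function call. The sample string and variable names are made up for illustration, and \F is left out because it needs v5.16:

```perl
use v5.14;
use warnings;

my $name = 'bUSTER mimi';   # sample value, chosen only for illustration

# label, double-quoted form, function form
my @checks = (
    [ '\U and uc',        "\U$name", uc($name)        ],
    [ '\u and ucfirst',   "\u$name", ucfirst($name)   ],
    [ '\L and lc',        "\L$name", lc($name)        ],
    [ '\l and lcfirst',   "\l$name", lcfirst($name)   ],
    [ '\Q and quotemeta', "\Q$name", quotemeta($name) ],
);

# Report whether the two forms produced the same string.
for my $check (@checks) {
    my( $label, $quoted, $called ) = @$check;
    say "$label: ", ( $quoted eq $called ? 'same' : 'different' );
}
```

Every line should report that the two forms are the same.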

Look at some examples using these in a double-quoted string:

% perl -e 'print "\ubuster\n"'
Buster
% perl -e 'print "\LBUSTER\n"'
buster
% perl -e 'print "\Ubuster\n"'
BUSTER
% perl -e 'print "\Ubus\Eter\n"'
BUSter
% perl -e 'print "\LBUST\EER\n"'
bustER
% perl -e 'print "\QP*rl\n"'
P\*rl\

That last one is a bit odd. It looks like it ends with a \. It doesn’t really end like that because there’s a newline that \Q quoted:

% perl -e 'print "\QP*rl\n"' | hexdump -C
00000000  50 5c 2a 72 6c 5c 0a                 |P\*rl\.|
00000007

Perl handled the “\n” before it handled the \Q, but the meta-character quoter thinks the newline is a special character so it escapes it. An escaped newline is just a newline, though.
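If you don’t want the newline quoted, you can stop the \Q with a \E before it. This is a small sketch with made-up variable names that compares the two strings by length:

```perl
use v5.14;
use warnings;

my $quoted_all  = "\QP*rl\n";     # \Q reaches the newline too
my $quoted_part = "\QP*rl\E\n";   # \E stops the quoting before the newline

say length $quoted_all;    # 7: P \ * r l \ newline
say length $quoted_part;   # 6: P \ * r l newline
```

The second string has one character fewer because the newline no longer has a backslash in front of it.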

Now, combine these with variable interpolation. Perl handles the variables first then does the case shifting:

use 5.14.1;

my $cat = 'Buster';

say "Roscoe $cat Mimi";
say "Roscoe \U$cat Mimi";
say "Roscoe \U$cat\E Mimi";

The results are probably not surprising. The first line is just interpolation, the second line uppercases everything from \U to the end, and the third line uppercases only the parts between the \U and the \E:

Roscoe Buster Mimi
Roscoe BUSTER MIMI
Roscoe BUSTER Mimi

If the case shift happens after interpolation, you might think that you could interpolate a case shift:

use 5.14.1;

my $cat = '\UBuster'; # no case shift in a single quote!

say "Roscoe $cat Mimi";

That doesn’t work though. The intended case shift operator shows up as literal characters because Perl doesn’t do double processing:

Roscoe \UBuster Mimi
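If the value has to be shifted but the \U can’t be in the string itself, you can call the function instead. This is only a sketch of two possibilities, with variable names invented for the example:

```perl
use v5.14;
use warnings;

my $cat = 'Buster';

# Call the function and join the pieces yourself...
my $joined = 'Roscoe ' . uc($cat) . ' Mimi';

# ...or interpolate the result of the call through an anonymous array.
my $inline = "Roscoe @{[ uc $cat ]} Mimi";

say $joined;   # Roscoe BUSTER Mimi
say $inline;   # Roscoe BUSTER Mimi
```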

A \U inside the string doesn’t bother the escaped characters because Perl has already processed those:

use 5.14.1;

my $cat = 'Buster';

say "Roscoe \U$cat\a\n Mimi";

The “\n” is still a newline and the “\a” is still the bell, and everything after the \U is uppercased (if it has an uppercase equivalent).
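You can’t see a bell in the output, so here is a sketch that checks the result against a string built from character codes. It assumes an ASCII-based platform, where the bell is character 7 and the newline is character 10:

```perl
use v5.14;
use warnings;

my $cat    = 'Buster';
my $string = "Roscoe \U$cat\a\n Mimi";

# Build the expected value without any escapes or case shifts.
my $expected = 'Roscoe BUSTER' . chr(7) . chr(10) . ' MIMI';

say $string eq $expected ? 'same' : 'different';   # same
```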

That seems simple enough. It’s variable interpolation followed by character escapes followed by case shifting. But this is Perl, so it can’t be that easy.

Regular expression double quoting

The regular expression operators (qr//, m//, and s///) handle the double quote operations differently. From perlop:

For the pattern of regex operators (qr//, m// and s///), the quoting from \Q is applied after interpolation is processed, but before escapes are processed.

Now the order of operations is:

  • Variable interpolation
  • Case shift operators
  • Escaped and logical characters

You can see this when you print the stringified forms of the patterns:

% perl -le 'print qr/\Q\n/'
(?-xism:\\n)
% perl -le 'print qr/\U\n/'
(?-xism:\N)

You probably expect all of these to match, but not all of them do:

% perl -le 'print "\n" =~ qr/\n/ ? "Yes" : "No"'
Yes
% perl -le 'print "\n" =~ qr/\Q\n/ ? "Yes" : "No"'
No
% perl -le 'print "\n" =~ qr/\U\n/ ? "Yes" : "No"'
No
% perl -le 'print "\n" =~ qr/\l\n/ ? "Yes" : "No"'
Yes
% perl -le 'print "\n" =~ qr/\L\n/ ? "Yes" : "No"'
Yes

The last two are curious. The \l and \L leave the n as a lowercase n, so in the last step the \n is still a newline. Those two tests still match.
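To see what the \Q version does match, try it against the two literal characters. In this sketch, with variable names chosen for the example, the single-quoted string holds a backslash followed by an n:

```perl
use v5.14;
use warnings;

my $newline   = "\n";   # one character: a newline
my $two_chars = '\n';   # two characters: a backslash and an n

# In a pattern, \Q quotes the backslash before \n can become a newline.
my $pattern = qr/\Q\n/;

say $newline   =~ $pattern ? 'Yes' : 'No';   # No
say $two_chars =~ $pattern ? 'Yes' : 'No';   # Yes
```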

This means that you can construct a string and a pattern with the same sequence of characters, but they might not match:

% perl -le 'print "\Q\n" =~ qr/\Q\n/ ? "Yes" : "No"'
No
% perl -le 'print "\U\n" =~ qr/\U\n/ ? "Yes" : "No"'
No
% perl -le 'print "\L\n" =~ qr/\L\n/ ? "Yes" : "No"'
Yes
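One way around the mismatch, sketched here with made-up variable names, is to build the string first and then interpolate the finished value into the pattern behind a \Q. By the time the pattern sees the variable, the escapes in it are already processed, so \Q quotes the actual newline rather than the characters you typed in the source:

```perl
use v5.14;
use warnings;

# In a string the escape is processed first, so this is a plain newline.
my $string = "\U\n";

# Quote the finished value, not the source text that produced it.
my $pattern = qr/\A\Q$string\E\z/;

say $string =~ $pattern ? 'Yes' : 'No';   # Yes
```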

It’s even worse. What does the \N mean? It depends on the Perl version:

% perl5.10.1 -le 'print "\n" =~ qr/\N/ ? "Yes" : "No"'
Missing braces on \N{} in regex; marked by <-- HERE in m/\N <-- HERE / at -e line 1.
% perl5.12.1 -le 'print "\n" =~ qr/\N/ ? "Yes" : "No"'
No
% perl5.14.1 -le 'print "\n" =~ qr/\N/ ? "Yes" : "No"'
No

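Perl v5.12 added \N as "not a newline", a replacement for the . that works no matter which regex switches are in effect. Before that, \N could only start a \N{CHARNAME} sequence, which is why Perl v5.10 thinks you have an incomplete one. Put a \L in front and the pattern matches a newline, because the case shift turns the \N into \n in the middle of the process (although v5.8 complains before it gets that far):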

% perl5.10.1 -le 'print "\n" =~ qr/\L\N/ ? "Yes" : "No"'
Yes
% perl5.14.1 -le 'print "\n" =~ qr/\L\N/ ? "Yes" : "No"'
Yes
% perl5.8.9 -le 'print "\n" =~ qr/\L\N/ ? "Yes" : "No"'
Missing braces on \N{} at -e line 1, near "\L"
Execution of -e aborted due to compilation errors.

With the \N{CHARNAME} syntax, you can match characters by their name in the Universal Character Set. Here you match an uppercase A:

% perl -Mcharnames=:full -le 'print "A" =~ qr/\N{LATIN CAPITAL LETTER A}/ ? "Yes" : "No"'
Yes

If you put a \L in front of that, you might think it would match the lowercase version of the named letter. There's no such luck because the \L affects the pattern before the \N{CHARNAME}:

% perl -Mcharnames=:full -le 'print "A" =~ qr/\L\N{LATIN CAPITAL LETTER A}/ ? "Yes" : "No"'
No
% perl -Mcharnames=:full -le 'print "a" =~ qr/\L\N{LATIN CAPITAL LETTER A}/ ? "Yes" : "No"'
No

The \N turns into a newline and the braces are now for a quantifier with a non-number in it:

% perl -Mcharnames=:full -le 'print qr/\L\N{LATIN CAPITAL LETTER A}/'
(?-xism:\n{u+41})

You might think that this would match anything, since the non-number might turn into 0 and give you \n{0}, just like a non-numeric array index turns into an integer. The perlre section on "Quantifiers" doesn't say what should happen, but if the braces don't contain a number, they become literals. Here's a simple demonstration that those braces are literals:

% perl -le 'print "\n{a}" =~ qr/\n{a}/ ? "Yes" : "No"'
Yes
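For contrast, here is a small sketch that puts a numeric quantifier next to braces that you want to match literally. The sample strings are invented for illustration. The literal braces are escaped here because later Perl releases are stricter about unescaped literal braces in some positions, so the escaped spelling is the safer one to copy:

```perl
use v5.14;
use warnings;

# With a number inside, the braces are a quantifier: exactly two a's.
my $is_quantifier = 'aa' =~ /\Aa{2}\z/;

# Escaped braces are always literal characters.
my $is_literal = 'a{b}' =~ /\Aa\{b\}\z/;

say $is_quantifier ? 'Yes' : 'No';   # Yes
say $is_literal    ? 'Yes' : 'No';   # Yes
```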

Here's the pattern you created before, and that you want to match now:

% perl -Mcharnames=:full -le 'print qr/\L\N{LATIN CAPITAL LETTER A}/'
(?^u:\n{u+41})

It doesn't match a lowercase a:

% perl -Mcharnames=:full -le 'print "a" =~ qr/\L\N{LATIN CAPITAL LETTER A}/ ? "Yes" : "No"'
No

It doesn't match a newline either. The pattern is \n{u+41}, and that's not a quantifier. There are more characters after the \n, so the target string doesn't have enough characters to match:

% perl -Mcharnames=:full -le 'print "\n" =~ qr/\L\N{LATIN CAPITAL LETTER A}/ ? "Yes" : "No"'
No

Using the regular expression text doesn't work either, which you might miss on the first pass:

% perl -Mcharnames=:full -le 'print "\n{u+41}" =~ qr/\L\N{LATIN CAPITAL LETTER A}/ ? "Yes" : "No"'
No

Of course! That + is a quantifier, so it isn't a literal character that should show up in the string. So this works:

% perl -Mcharnames=:full -le 'print "\n{u41}" =~ qr/\L\N{LATIN CAPITAL LETTER A}/ ? "Yes" : "No"'
Yes

This works too because you can have one or more of u:

% perl -Mcharnames=:full -le 'print "\n{uuuuu41}" =~ qr/\L\N{LATIN CAPITAL LETTER A}/ ? "Yes" : "No"'
Yes

If you don't want the \L to extend into the character name sequence, you can use the \E to limit its effect:

% perl -Mcharnames=:full -le 'print "bar" =~ qr/\LB\E\N{LATIN SMALL LETTER A}r/ ? "Yes" : "No"'
Yes
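If what you really wanted was the lowercase version of the named character, one option is to do the case shift outside the pattern and interpolate the result. This is a sketch of that idea, with variable names chosen for the example, not the only way to do it:

```perl
use v5.14;
use warnings;
use charnames ':full';

# In a string the named character is resolved first, so lc sees a real "A".
my $lower = lc "\N{LATIN CAPITAL LETTER A}";

# Interpolate the finished character into the pattern.
my $pattern = qr/\A\Q$lower\E\z/;

say 'a' =~ $pattern ? 'Yes' : 'No';   # Yes
say 'A' =~ $pattern ? 'Yes' : 'No';   # No
```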

Things to remember

  • The double quote string constructor handles variable interpolation, special characters, and case shift operators in that order.
  • The regular expression operators handle variable interpolation, case shift operators, and special characters in that order.
  • Double-quoted interpolation in a match operator happens before regular expression compilation.
  • The min-max quantifier is only a quantifier if you give it numbers. Otherwise, it's literal characters.
