Slurp a file from the command line with -g

This is a chapter in Perl New Features, a book from Perl School that you can buy on LeanPub or Amazon. Your support helps me to produce more content.


Perl v5.36 adds the -g switch as a shortcut for -0777, which undefines the input record separator so you can read an entire file as a single string. This is often called “slurping”, and is useful when you need to process text that spans several lines.

The input record separator

The input record separator is the character (or characters) that Perl’s line-input operator uses to determine when a line has ended. By default, that’s a newline (U+000A), but you can use any string you like by setting $/ ($INPUT_RECORD_SEPARATOR). Sometimes the form feed is a useful separator for multiline records:

$/ = "\f";
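Inside a program, that assignment usually wants a local so the rest of the code still reads newline-terminated lines. Here’s a minimal sketch of that idea; the sample data and the in-memory filehandle are my own stand-ins so that the example runs without any files:

```perl
use strict;
use warnings;

# Sample data (made up for this sketch): three multiline records,
# each terminated by a form feed.
my $data = "one\ntwo\n\fA\nB\n\f9\n10\n\f";

# Read from the string as if it were a file.
open my $fh, '<', \$data or die "Could not open in-memory file: $!";

my @records;
{
	# The change to $/ disappears at the end of this block.
	local $/ = "\f";
	while( my $record = <$fh> ) {
		chomp $record;    # chomp removes whatever $/ currently is
		push @records, $record;
		}
}
close $fh;

printf "Got %d records\n", scalar @records;
```

Each element of @records keeps its internal newlines; only the form feed goes away.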

On the command line, the -0 switch is a quick way to set the value for $/. Without a value, it uses the null byte, which is sometimes useful as a separator:

% perl -MO=Deparse -0 -e 1
BEGIN { $/ = "\000"; $\ = undef; }
'???';
-e syntax OK

A number in octal or hexadecimal sets $/ to some other single character:

% perl -MO=Deparse -0014 -e 1
BEGIN { $/ = "\f"; $\ = undef; }
'???';
-e syntax OK
% perl -MO=Deparse -0xC -e 1
BEGIN { $/ = "\f"; $\ = undef; }
'???';
-e syntax OK

Any number above 0377 octal (more than 255 decimal) sets $/ to undef:

% perl -MO=Deparse -0377 -e 1
BEGIN { $/ = "\377"; $\ = undef; }
'???';
-e syntax OK
% perl -MO=Deparse -0400 -e 1
BEGIN { $/ = undef; $\ = undef; }
'???';
-e syntax OK

Conventionally, though, the Perl documentation has used 777 as the value to get undef, probably because it’s easier to remember:

% perl -MO=Deparse -0777 -e 1
BEGIN { $/ = undef; $\ = undef; }
'???';
-e syntax OK

The new -g is a short synonym for -0777, so it does the same thing. Compare the output without and with the switch:

% perl -MO=Deparse -e 1
'???';
-e syntax OK
% perl -MO=Deparse -g -e 1
BEGIN { $/ = undef; $\ = undef; }
'???';
-e syntax OK
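The same thing inside a program is the familiar local $/ idiom. This is only a sketch with made-up data read from an in-memory filehandle, but it shows what an undefined $/ does to a single read:

```perl
use strict;
use warnings;

# Made-up data standing in for a file on disk.
my $data = "Newfoundland\nGolden Retriever\nBoxer\n";
open my $fh, '<', \$data or die "Could not open in-memory file: $!";

# "local $/" with no value leaves $/ undefined inside the do block,
# so one read returns everything that is left in the file.
my $contents = do { local $/; <$fh> };
close $fh;

# Count the newlines to show that the lines all arrived in one string.
my $line_count = () = $contents =~ /\n/g;
print "Read ", length($contents), " characters in $line_count lines\n";
```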

Far more than you ever wanted to know

The -0 switch has a few other interesting behaviors. Since I’m already writing about this feature, I might as well keep going.

Single-character line ending

You can use an octal or hexadecimal number after the -0 to choose the single character that you want to use as the line ending. I’ve often used the form feed (U+000C) to separate multi-line records. The particular character doesn’t matter as long as it doesn’t appear in the data (so the null byte might be useful too):

% perl -le "print qq(one\ntwo\nthree\n\fA\nB\nC\n\f9\n10\n11\n)" > formfeed.txt

When you read formfeed.txt by lines with no change to the input record separator, you see the three records separated by “blank” lines, which are really the form feed:

% perl -ne 'print' formfeed.txt
one
two
three

A
B
C

9
10
11

You can see that more easily when you replace the invisible characters with their ordinal values, which you do in octal here:

% perl -pe 's/(\P{Print})/sprintf(q(%03o),ord($1)) . "\n"/eg' formfeed.txt
one012
two012
three012
014
A012
B012
C012
014
9012
10012
11012
012

When you use the octal value of the form feed for the number after the -0 switch and output lines surrounded by angle brackets, you get three lines (with the newlines and line-ending form feed intact):

% perl -014 -ne 'print qq(<$_>)' formfeed.txt
<one
two
three

><A
B
C

><9
10
11

>

You could have also specified this with three digits, -0014, or as hexadecimal with a leading x, like -0xC. The hexadecimal version is valuable when you need to specify a character past the largest single octet value you can get out of three octal digits, which is 0377.

There’s a catch though. If you want to set the input record separator to a wide character, you need to ensure that you read the input correctly. For the ☃ (U+2603 SNOWMAN) to be the separator, which takes up three octets in UTF-8, you need to read the input as UTF-8 too. The -C is one way to do that:

% perl -0x2603 -C -ne 'print qq(<$_>)' snowmen.txt

You aren’t able to specify multiple characters as a line separator since perl thinks the extra characters are a file for input:

% perl -MO=Deparse -0x0100x2603 -e
No Perl script found in input

Slurping an entire file

If you specify an octal value 400 or higher, which is more than 8 bits, Perl sets the input record separator to undef. With no defined value for $/, Perl slurps the entire input. But, this is different than setting the empty string (a defined value), which I write about in the next section.

You’ve probably seen -0777, perhaps the most common use of -0:

% perl -0777 -ne 'print qq(<$_>)' dog.txt
<Newfoundland
Golden Retreiver
Boxer
>

That dog.txt is actually read through the ARGV filehandle, which does some trickery to make it look like all the input is coming from one source. However, the line input operator can’t read across the command line files; perl figures out when it has reached the end of one file, closes it, then opens the next file. So, each file appears to be its own line:

% perl -0777 -ne 'print qq(<$_>)' dog.txt cat.txt lizard.txt
<Newfoundland
Golden Retreiver
Boxer
><Tabby
Marmalade
Tiger
><Monitor
Iguana
Godzilla
>
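You can see the same one-read-per-file behavior from inside a program. This sketch writes two small temporary files (the names and contents are invented for the example), points @ARGV at them, and counts how many reads it takes to exhaust the diamond operator with $/ undefined:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Create a couple of throwaway files so the sketch is self-contained.
my $dir = tempdir( CLEANUP => 1 );
my %contents = (
	'cat.txt' => "Tabby\nTiger\n",
	'dog.txt' => "Newfoundland\nBoxer\n",
	);

my @files;
foreach my $name ( sort keys %contents ) {
	my $path = "$dir/$name";
	open my $out, '>', $path or die "Could not write $path: $!";
	print {$out} $contents{$name};
	close $out or die "Could not close $path: $!";
	push @files, $path;
	}

my @chunks;
{
	local @ARGV = @files;   # the files that <> will read
	local $/;               # undef, the same value -0777 sets
	while( my $chunk = <> ) {
		push @chunks, $chunk;   # each read stops at the end of a file
		}
}

printf "%d files, %d reads\n", scalar @files, scalar @chunks;
```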

If you wanted all the files to be one line, route them through standard input before they get to perl. This only looks like a useless use of cat:

% cat dog.txt cat.txt lizard.txt | perl -0777 -ne 'print qq(<$_>)'

Paragraph mode

“Paragraph mode” is a special case. The -00 sets the input record separator to the empty string. That’s different than the undefined value even though both are false:

% perl -MO=Deparse -00 -e 1
BEGIN { $/ = ""; $\ = undef; }
'???';
-e syntax OK

When the input record separator is the empty string, perl treats one or more blank lines as the end of the record. This has the same effect as if the input record separator were the pattern \n\n+. Not only that, but it collapses the multiple newlines to exactly two newlines:

% perl -00 -ne 'print qq(<$_>)' paras.txt
<First line first para
Second line first para
Third line first para

><After first blank line
Second line after first blank line
Third line after first blank line

><After 2nd blank line
2nd line after 2nd blank line
3rd line after 2nd blank line
>
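Here’s a small sketch of paragraph mode from inside a program. The data is made up and has three blank lines between the paragraphs so you can see the collapsing:

```perl
use strict;
use warnings;

# Two paragraphs with three blank lines between them (made-up data).
my $data = "a\nb\n\n\n\nc\nd\n";
open my $fh, '<', \$data or die "Could not open in-memory file: $!";

my @paragraphs;
{
	local $/ = '';   # the empty string, not undef: paragraph mode
	while( my $paragraph = <$fh> ) {
		push @paragraphs, $paragraph;
		}
}
close $fh;

# Make the newlines visible so the collapsing shows up in the output.
foreach my $paragraph ( @paragraphs ) {
	( my $visible = $paragraph ) =~ s/\n/\\n/g;
	print "<$visible>\n";
	}
```

The first record should end with exactly two newlines no matter how many blank lines followed it in the input.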

Summary

Here’s a quick summary of the various incantations of the -0 switch:

Switch    Input record separator         Note
-0        \000                           null byte
-00       empty string, but “\n\n+”      paragraph mode
-0014     8-bit character, in octal      form feed
-0xC      8-bit character, in hex        form feed
-0400     undef, above 8-bit             slurp
-0777     undef, idiomatic               slurp
-g        undef                          slurp, new in v5.36
-0x1FF    \777 character, include -C     actual \777
-0x2603   wide character, include -C     snowman
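The octal rows of that summary can be modeled in a few lines. To be clear, separator_for is a hypothetical helper I made up for illustration; it isn’t how perl implements the switch, and it ignores the hexadecimal forms and -g entirely:

```perl
use strict;
use warnings;

# Hypothetical helper: given an octal -0 switch as text, return the
# value that $/ ends up with according to the summary above.
sub separator_for {
	my( $switch ) = @_;   # for example '-0', '-00', '-0014', '-0777'

	my( $digits ) = $switch =~ /\A-(0[0-7]{0,3})\z/
		or die "Not an octal -0 switch: $switch\n";
	my $value = oct $digits;

	return undef if $value > 0377;                           # more than 8 bits: slurp
	return ''    if $value == 0 and length($digits) >= 2;    # -00: paragraph mode
	return chr $value;                                       # a single character
	}

foreach my $switch ( qw( -0 -00 -0014 -0377 -0400 -0777 ) ) {
	my $separator = separator_for( $switch );
	my $shown = ! defined $separator ? 'undef'
	          : $separator eq ''     ? 'empty string'
	          :                        sprintf '\\%03o', ord $separator;
	printf "%-6s => %s\n", $switch, $shown;
	}
```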

From the Perl documentation

Iterate over multiple elements at the same time

This feature was promoted to a stable version in v5.40.

Perl v5.36 adds experimental support that allows a foreach (or for) loop to iterate over multiple values at the same time by specifying multiple control variables. This is incredibly cool:

use v5.36;
use experimental qw(for_list);

my @animals = qw( Buster Mimi Ginger Nikki );
foreach my( $s, $t ) ( @animals ) {
	say "$s ^^^ $t";
	}

The output shows two iterations of the loop, each of which grabbed two values from the list:

Buster ^^^ Mimi
Ginger ^^^ Nikki

Add another parameter; the list now doesn’t divide evenly between the parameters, so any parameter that can’t match with a list item gets undef, just like normal list assignment:

use v5.36;
use experimental qw(for_list);

my @animals = qw( Buster Mimi Ginger Nikki );
foreach my( $s, $t, $u ) ( @animals ) {
	say "$s ^^^ $t ^^^ $u";
	}

Since use v5.36 also turns on warnings, you get those “uninitialized” warnings for free when you use those undef values:

Buster ^^^ Mimi ^^^ Ginger
Nikki ^^^  ^^^
Use of uninitialized value ...
Use of uninitialized value ...

Another interesting use combines the new builtin::indexed feature that gets you the index and value at the same time:

use v5.36;
use experimental qw(for_list builtin);
use builtin qw(indexed);

my @animals = qw( Buster Mimi Ginger Nikki );
foreach my( $i, $value ) ( indexed(@animals) ) {
	say "$i: $value";
	}

That’s a bit nicer than going through the indices to access the value in an additional statement:

foreach my $i ( 0 .. $#animals ) {
	my $value = $animals[$i];
	say "$i: $value";
	}

No placeholders (yet)

So far, this new syntax doesn’t have a way to skip values. In a normal list assignment, you discard a value coming from the right hand list with a literal undef:

my( $s, undef, $t ) = @animals;

Try that in the for list and you get a syntax error:

foreach my( $s, undef, $u ) ( @animals ) {  # ERROR!
	say "$s ^^^ $u";
	}

Hash keys and values

I’m tempted to use this for hashes, although each inside a while is still probably better since it doesn’t have to build the entire input list in one go:

use v5.36;
use experimental qw(for_list);

my %animals = (
	cats => [ qw( Buster Mimi Ginger ) ],
	dogs => [ qw( Nikki ) ],
	);

foreach my( $k, $v ) ( %animals ) {
	say "$k ^^^ @$v";
	}
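For comparison, here’s a sketch of the each-inside-a-while version with the same hash. It collects the lines and sorts them before printing only because hash order isn’t predictable:

```perl
use strict;
use warnings;

my %animals = (
	cats => [ qw( Buster Mimi Ginger ) ],
	dogs => [ qw( Nikki ) ],
	);

my @lines;
# each hands back one key-value pair at a time, so Perl never has to
# flatten the whole hash into a list first.
while( my( $k, $v ) = each %animals ) {
	push @lines, "$k ^^^ @$v";
	}

print "$_\n" for sort @lines;
```

The trade-off is that each uses the hash’s internal iterator. If you leave the loop early, call keys on the hash to reset that iterator before you use each on it again.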

Since those hash values are array refs, it would be helpful if this feature could use the refaliasing and declared_refs features (see “Mix assignment and reference aliasing with declared_refs”):

use experimental qw(for_list);
use experimental qw(refaliasing declared_refs);

my %animals = (
	cats => [ qw( Buster Mimi Ginger ) ],
	dogs => [ qw( Nikki ) ],
	);

foreach my( $k, \@v ) ( %animals ) {
	say "$k ^^^ @v";
	}

Sadly, the parser doesn’t expect the reference operator inside that for list:

syntax error ... near ", \"

Doing this before v5.36

Prior to builtin multiple iteration, the best way to do the same thing was probably the List::MoreUtils module (not part of core). Its natatime function, which I wish were named n_at_a_time, returns an iterator that grabs the number of elements that you specify each time you call it. Since the iterator returns a list instead of an array reference, it’s easy to use with a while:

use List::MoreUtils qw(natatime);

my @x = ('a' .. 'g');
my $iterator = natatime 3, @x;

while( my @vals = $iterator->() ) {
	print "@vals\n";
	}

Another approach uses splice. The easiest thing might be to do it destructively since that requires no index fiddling:

my @x = 'a' .. 'g';
my @temp = @x;

while( my @vals = splice @temp, 0, 3, () ) {
	print "@vals\n";
	}

Here’s an example from the splice documentation that does the same thing:

sub nary_print {
  my $n = shift;
  while (my @next_n = splice @_, 0, $n) {
	say join q{ -- }, @next_n;
  }
}

nary_print(3, qw(a b c d e f g h));
# prints:
#   a -- b -- c
#   d -- e -- f
#   g -- h

Playing with the array indices can get this done, but it comes with a lot of baggage. First, an array slice doesn’t return an empty list, so you can’t use that as a condition in the while as in the previous examples. Since it fills in the missing elements with undef, outputting the values possibly comes with warnings. Even if you want to accept those annoyances, you still have to manage the end of array condition ($#x) yourself:

my @x = 'a' .. 'g';

my $start = 0;
my $n     = 3;

while( $start <= $#x ) {
	no warnings qw(uninitialized);
	my @vals = @x[$start .. $start + $n - 1];
	print "@vals\n";
	$start += $n;
	}
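If you’re stuck with indices, one way around the undef padding is to clamp the end of the slice to the last index. This is a sketch of that variation, not something from the original example; min comes from the core List::Util module:

```perl
use strict;
use warnings;
use List::Util qw(min);

my @x = 'a' .. 'g';
my $n = 3;

my @groups;
for( my $start = 0; $start <= $#x; $start += $n ) {
	# Never ask for an index past the end of the array, so the final
	# slice is short instead of padded with undef.
	my $end = min( $start + $n - 1, $#x );
	push @groups, [ @x[ $start .. $end ] ];
	}

print "@$_\n" for @groups;
```

That gets rid of the warnings without turning them off, but it’s still more bookkeeping than the splice versions.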

So yeah, having a multiple iterator feature built into Perl is a huge win.

Summary

The experimental for_list feature lets you take multiple elements of the list in each iteration. This doesn't yet handle many of the list assignment features that would make this as useful as people will want it to be.

From the Perl documentation

  1. perlsyn

Insignificant leading or trailing whitespace in brace constructs


Perl’s coterie of brace constructs becomes a bit more lenient in v5.34. These things appear in double-quotish constructs, such as \N{CHARNAME} to specify a character by name. And, patterns count as a double-quoted construct (unless you use ' as the delimiter), so these new rules apply to brace constructs such as \k{} (for named backreferences) and the general quantifier, {n,m}.

Specifying characters

These constructs apply to double-quotish interpretation to specify a character by its codepoint or name:

Construct       Description
\N{CHARNAME}    Character name
\o{177}         Octal code point
\x{ABCD}        Hex code point

There are already loose names for \N{} that ignore whitespace, but this feature is a bit different. It ignores horizontal whitespace around a value (but not inside a value):

use v5.10;
use open qw(:std :utf8);

say <<~"HERE";
	Spade suit: \N{ BLACK SPADE SUIT }
	Octal:      \o{ 23140 }
	Hex:        \x{ 2660 }
	HERE

This outputs the character we expect:

$ perl5.34.0 whitespace.pl
Spade suit: ♠
Octal:      ♠
Hex:        ♠

If you add space within the value, you don't get the character you want (the \N{} will actually fail):

use v5.34;
use open qw(:std :utf8);

say <<~"HERE";
	Octal:    \o{ 231 40 }
	Hex:      \x{ 26 60 }
	HERE

This discards the cruft once it encounters non-digit characters (just like Perl's string-to-number conversions). This is effectively:

use v5.34;
use open qw(:std :utf8);

say <<~"HERE";
	Octal:    \o{ 231 }
	Hex:      \x{ 26 }
	HERE

It's even worse. You can add extra nonsense after the code number and v5.34 will ignore it. Although these have illegal digits (along with the internal space), they still work:

use v5.34;
use open qw(:std :utf8);
use warnings;

say <<~"HERE";
	Octal:    \o{ 231 abc }
	Hex:      \x{ 26 xyz }
	HERE

With trailing tabs or spaces, warnings says that it ignores the cruft and uses what it received so far:

Non-octal character ' ' terminates \o early.  Resolved as "\o{231}" at ...

With leading tabs or spaces, earlier Perls give up right away and use the null character. The warning from v5.32 is this:

Non-octal character ' ' terminates \o early.  Resolved as "\o{000}" at...

Finally, the whitespace can't be vertical space or other double-quote escapes (it's just literal tabs or spaces). These don't work:

\o{\t231}
\o{
	231 }

In regular expressions, this fails before Perl interprets the pattern, where the /x would be able to handle the vertical whitespace. This would match a null byte because the string-to-number parsing stops at the first newline, returning \000:

m/\o{
	231
	}/x;

In regular expressions

And these constructs apply to regular expression features, and you don't need the /x flag to get this new, insignificant whitespace:

Construct       Description               Chapter
\b{TYPE}        Word boundary
\g{N}           Numbered backreference    Item 31 (book)
\g{NAME}        Named backreference       Item 31 (book)
\k{NAME}        Named backreference       Item 31 (book)
\p{PROPNAME}    Unicode property name
\P{PROPNAME}    Unicode property name
\x{ABCD}        Hex code point
{n,m}           General quantifier

The rules for these are the same as those from the previous section. Perl ignores the tabs or spaces at the beginning or the end, but not in the middle (aside from around the , in {n,m}). For example, these all work:

use v5.10;
use open qw(:std :utf8);
use warnings;

$_ = 'aa';

my @patterns = (
	qr/(.)\g{ -1 }/,
	qr/(?<first>.)\g{ first }/,
	qr/(?<first>.)\k{ first }/,
	qr/\b{ sb }(.)/,
	qr/(\o{ 141 })\g{ -1 }/,
	qr/(\p{Letter})\g{ -1 }/,
	qr/(.)\g{ -1 }/,
	qr/(\x{ 61 })\g{ -1 }/,
	);

foreach my $pattern ( @patterns ) {
	say /$pattern/
	};

Specify octal numbers with the 0o prefix

Perl v5.34 allows you to specify octal literals with the 0o prefix, as in 0o123_456. This is consistent with the existing constructs to specify hexadecimal literal 0xddddd and binary literal 0bddddd. The builtin oct() function accepts any of these forms.

Previously, you specified octal with just a leading zero:

chmod 0644, $file;
mkdir $dir, 0755;

Now you can do that with an extra character that specifies the base:

chmod 0o644, $file;
mkdir $dir, 0o755;

This makes it consistent with 0b for binary and 0x for hexadecimal. See “Scalar value constructors” in perldata.
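Since oct() handles all of these prefixes when the number arrives as a string, here’s a short sketch that converts the same value written three or four ways. The version check is there because oct() only understands the 0o prefix from v5.34 on:

```perl
use strict;
use warnings;

# The same number (420 decimal) as text in different bases.
my @strings = qw( 0644 0x1A4 0b110100100 );
push @strings, '0o644' if $] >= 5.034;   # new-style octal prefix

# oct() looks at the prefix to decide which base to use.
my %value_for = map { $_ => oct $_ } @strings;

printf "%-12s => %d\n", $_, $value_for{$_} for @strings;
```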

And, remember that v5.14 added the \o{NNN} notation to specify characters by their octal number. We’re still waiting for octal floating point values (we got the hex version in v5.22), but don’t hold your breath.

Perhaps we’ll get 0d sometime so that all the bases are covered.

Undef a scalar to release its memory

When you store a large string in a scalar, perl allocates the memory to store that string and associates it with the scalar. It uses the same memory even if you assign a much shorter value to the same scalar. Use the functional form of undef to let perl reuse that memory for something else. This is important when you want to reuse the variable or the lifetime of the variable is very long.


Perl v5.26 now recognizes version control conflict markers

Perl v5.26 can now detect and warn you about version control conflict markers in your code. In prior versions, the compiler would try to interpret those as code and would complain about a syntax error. Your program still fails to compile but you get a better error message. Maybe some future Perl will bifurcate the program, run both versions, and compare the results (don’t hold your breath).


In-place editing gets safer in v5.28

In-place editing is getting much safer in v5.28. Before that, in rare circumstances it could lose data. You may have never noticed the problem, and even with all the times I’ve explained it in a Perl class, I haven’t really thought about it. This was first reported as early as December 2002, and after we get v5.28 it won’t be a problem anymore.

Beware of the removal of when in Perl v5.28

[Although I haven’t seen an official notice besides a git commit that reverts the changes, by popular outcry these changes won’t be in v5.28. It’s not that they won’t happen but they won’t be in v5.28. People who depend on Perl should stay vigilant. My advice in the first paragraph stands—change is coming and we don’t know what it is yet.]

Perl v5.28 might do away with when—v5.27.7 already has. Don’t upgrade to v5.28 until you know you won’t be affected by this! This change doesn’t follow the normal Perl deprecation or experimental feature policy. If you are using given-when, stop doing that. If you aren’t using it, don’t start. And everyone should consider if a major change like this on such short notice is comfortable for them. It’s not a democracy but you can still let the core developers know which way you want your favorite language to go.
