The \R generic line ending – The Effective Perler

Perl v5.10 adds a regular expression shortcut \R that matches anything the Unicode specification thinks is a line ending. It looks like a character class shortcut, but it isn’t one. It can match the two-character carriage-return/line-feed sequence, and a character class only ever matches a single character, never a sequence.

Some History

When the advanced civilizations take over the world and study our history, the fractured notion of line endings will be one of the things they joke about.

You’ve probably cursed the differences already. Windows does it one way, Unix does it another way, and the old Mac Classic thought differently. Then, with the Universal Character Set, we have even more possibilities. Consider what we have now:

Unix (from Multics) uses a line feed
Windows (from DOS from CP/M) uses a carriage return / line feed pair
Some internet protocols mandate a carriage return / line feed pair
Some systems try to translate one to the other based on what they think you are doing (e.g. FTP’s text mode). Thanks Jon Postel!
Unicode adds characters for line ending semantics
Vestiges of Mac Classic still use bare carriage returns

Much of this is related to the physical control of computer hardware. A carriage return moved the bit holding the paper (a thin sheet of pressed plant or synthetic fibers usually cut into a rectangular shape) back to column one (the horizontal start). A line feed moved that paper vertically, to the next line.

Physical typewriters used by typists moved the paper in both dimensions because humans typed sequentially. Maybe you’ve seen it in movies.

There’s a story in Sockets, Shellcode, Porting, and Coding that the Teletype Model 33 needed two-tenths of a second to do a carriage return and would lose any character that showed up during that time. They added the line feed character after the carriage return to take up some of that lag time while buffering additional input.

On top of those ideas, stack the various social conventions we’ve added. What is a line in a computer file? Larry Wall has said that he can program anything in one line of Perl given a sufficiently long line. What does a line from a file mean when we can set the input record separator ($/)? Is a line a division of display, storage, semantics, or something else?

The Unicode standard specifies what it thinks a newline is (Chapter 5, “Implementation Guidelines”, section 8 in The Unicode Standard, Version 8.0).

The generic linefeed

You’re just a lowly programmer who doesn’t care about any of this. You merely want to read a document line-by-line without caring about how you know where the end of a line is. Alas, we’re not all the way there because we still can’t put a regex in the input record separator ($/). However, we can recognize where a line “ends” in a big hunk of text we already have.

Perl defines \R as a carriage-return/line-feed pair or vertical whitespace in a non-backtracking subpattern:

(?>\r\n|\v)

You can expand that to this alternation in a non-backtracking subpattern:

(?>
	  \r\n     # carriage return / line feed pair
	| \n       # line feed (LF)
	| \r       # carriage return (CR)
	| \x0b     # vertical tab (VT)
	| \f       # form feed
	| \x85     # next line (NEL)
	| \x{2028} # line separator
	| \x{2029} # paragraph separator
)
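The non-backtracking part matters more than it might seem. Here is a minimal sketch (the variable names are only for illustration) that tries the same pattern twice, once with \R and once with the same alternation in an ordinary group that is allowed to backtrack:

```perl
use v5.10;
use strict;
use warnings;

my $crlf = "\r\n";

# \R grabs the whole CRLF pair and refuses to give the \n back,
# so there is nothing left for the trailing \n in the pattern.
my $atomic = $crlf =~ /\A\R\n\z/ ? 'match' : 'no match';

# An ordinary group can backtrack: \v matches the \r alone and
# leaves the \n for the rest of the pattern.
my $backtracking = $crlf =~ /\A(?:\r\n|\v)\n\z/ ? 'match' : 'no match';

say "atomic \\R:         $atomic";
say "backtracking group: $backtracking";
```

The \R version reports no match and the ordinary group reports a match. Once \R has taken a carriage-return/line-feed pair, it treats that pair as inseparable.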

Also see Item 30, “Use more precise whitespace character classes”; “Know your character classes under different semantics”; and “The vertical tab is part of \s in Perl 5.18”.

Here’s a program that inserts a literal \R where Perl thinks there is a generic line ending. To make it easier to read the output, I also replace literal characters such as a newline with literal \n:

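use v5.10;

# Each label is followed by the line ending that it names.
my $string = "New line\ncarriage return\rCRLF\r\nLFCR\n\rvertical tab\x{0b}form feed\flast";

# Put a literal \R after everything that \R matches.
(my $slashR = $string) =~ s/(\R)/$1\\R/g;
say show_escapes( $slashR );

# Turn the control characters into visible escape sequences.
sub show_escapes {
	local $_ = shift;
	s/\n/\\n/g;
	s/\r/\\r/g;
	s/\f/\\f/g;
	s/\x0b/\\v/g;
	$_;
	}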

You can see where the generic line endings are. Notice that in the LFCR substring, there’s a generic line ending after the \n and another one after the \r. They count separately because only the opposite order, \r\n, counts as a single unit:

New line\n\Rcarriage return\r\RCRLF\r\n\RLFCR\n\R\r\Rvertical tab\v\Rform feed\f\Rlast

You might use this new feature to replace all line endings with the one that you want:

$text =~ s/\R/\n/g;
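Here is that substitution wrapped up as a small sketch. The normalize_newlines name and the sample text are made up for this example; the subroutine works on a copy so the caller’s string is left alone:

```perl
use v5.10;
use strict;
use warnings;

# Hypothetical helper: return a copy of the text with every generic
# line ending replaced by a single line feed.
sub normalize_newlines {
	my( $text ) = @_;
	$text =~ s/\R/\n/g;
	return $text;
	}

my $mixed = "unix\nwindows\r\nclassic mac\rlast";
my $clean = normalize_newlines( $mixed );

# Count the line feeds in the cleaned-up copy.
my $count = () = $clean =~ /\n/g;
say "line feeds: $count";
say 'carriage returns left: ', ( $clean =~ /\r/ ? 'yes' : 'no' );
```

Remember that a line feed followed by a carriage return is two generic line endings, so that reversed pair turns into two line feeds.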

Or to split up text:

my @lines = split /\R/, $text;
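The usual split rules still apply, so it helps to know what happens with empty fields. This sketch uses a made-up string with a CRLF pair, an LFCR pair, and two trailing line feeds:

```perl
use v5.10;
use strict;
use warnings;

my $text = "one\r\ntwo\n\rthree\n\n";

# The CRLF pair is one separator; the LFCR pair is two, so an empty
# field appears between "two" and "three". By default, split drops
# the empty fields at the end.
my @lines = split /\R/, $text;

# A negative limit tells split to keep trailing empty fields too.
my @all = split /\R/, $text, -1;

say scalar(@lines), ' fields by default';
say scalar(@all),   ' fields with a limit of -1';
```

If blank lines matter to you, use the negative limit; if they don’t, you can filter the empty fields out afterward with grep.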

Things to remember

The \R matches the Unicode notion of a generic newline.
A generic newline is (?>\r\n|\v).
The \R is not a character class shortcut.