Use the \N regex character class to get “not a newline”

Perl 5.12 introduced an experimental regex character class, \N, that stands in for every character except one, the newline.

In prior versions of Perl, the . metacharacter did the same job. That is, it did as long as no one added the /s flag to the match or substitution operator or the regex quoting operator, or inserted the option inside the pattern:

m/.+/;       # stops at the next newline in the string
m/.+/s;      # now matches a newline, so the rest of the string
m/(?s:.+)/;  # same effect, with the option inside the pattern

s/.+//;      # only replaces the next line in the string
s/.+//s;     # now replaces the rest of the string

qr/$regex/s;

I’ve encountered some customer programs that suddenly broke because a young Perl cowboy, fresh from reading Perl Best Practices, blindly applied the recommended /xism to the end of every match operator, thus breaking some of the patterns. It’s either happened to you or will happen to you. The trick is to code defensively so when the new PBP zealot shows up, you give him fewer opportunities to break things.

This odd double-duty of . is a wart on Perl’s regular expressions since it’s one of the few areas where the pattern can change drastically to something that you didn’t intend based on the options you (or someone else) apply, even dynamically at runtime. To get around this, you have had to use the ugly negated character class, [^\n], which takes more typing but is the safer thing to do. You should always avoid ambiguity and future breakage by writing as specific and as immutable a pattern as you can stomach. That thought deserves its own, future Item.

Instead of that [^\n], Perl 5.12 gives you the much nicer looking \N.
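Here is a minimal sketch of the difference, assuming Perl 5.12 or later. The sample text and the %matched hash are made up for illustration; the point is only that the . result changes when /s shows up while the [^\n] and \N results do not:

```perl
use strict;
use warnings;
use 5.012;   # \N as "not a newline" needs Perl 5.12 or later

# Hypothetical sample text with two lines
my $text = "first line\nsecond line\n";

my %matched;
( $matched{dot}     ) = $text =~ m/(.+)/;        # . stops at the newline
( $matched{dot_s}   ) = $text =~ m/(.+)/s;       # /s lets . cross the newline
( $matched{class_s} ) = $text =~ m/([^\n]+)/s;   # [^\n] ignores /s
( $matched{N_s}     ) = $text =~ m/(\N+)/s;      # \N ignores /s too

for my $key ( qw(dot dot_s class_s N_s) ) {
	( my $shown = $matched{$key} ) =~ s/\n/\\n/g;  # make newlines visible
	print "$key: [$shown]\n";
}
```

Only the dot_s entry should run past the first line.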

There are a couple of gotchas with \N though. You can’t use it in a bracketed character class like you can with other character class shortcuts, like [\d\s] or even the idiom to match all characters, [\d\D]. That’s just a tiny gotcha, though. The bigger gotcha has to do with something else that already uses \N.
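The smaller of those gotchas is easy to see for yourself. This sketch builds the pattern from a string at runtime so that the failure is catchable with eval instead of stopping the whole program at compile time. The variable names are only for illustration:

```perl
use strict;
use warnings;
use 5.012;

# A bare \N inside brackets; single quotes keep the backslash intact
my $bad_pattern = '[\N\t]';

# The pattern is compiled at runtime, so eval can trap the failure
my $compiled = eval { qr/$bad_pattern/ };
my $error    = $@;

print defined $compiled
	? "Compiled fine\n"
	: "Perl refused the pattern: $error";

# Inside brackets you still have to spell things out the old way,
# here as "anything except a newline or a tab"
my $not_newline_or_tab = qr/[^\n\t]/;
```

Inside the brackets, the old negated class is still the tool you have.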

Remember “experimental”?

Aren’t you glad the Perl developers figured that one out? How much would you pay for a nicer regex feature like that? $100? $40? $20? What if I told you that you could get it for free, as long as you can stand the new problem it presents? The \N sequence is also the start of the \N{character name} sequence to specify a Unicode character by name in a double-quoted string. What if you want to build a regex with interpolation:

use charnames ':full';

my $string = <<"HERE";
this is a line
this is another line
abcBuster Mimixyz
this is the last line
HERE

my $stuff = $ARGV[0];

$string =~ m/abc(\N{$stuff}).*xyz/; # doesn't work

print "Matched [$1]\n";

Is that set of braces around a number, a pair of numbers, or a character name? What's in $stuff? Are you going to get a quantifier or a Unicode character? Actually, it doesn't matter. You can't interpolate in the \N{...} inside a regex: you can only have what looks like quantifier characters or name characters. The sigil in a Perl variable isn't a legal character for either of those, so Perl shuts you down before you get the chance to mess up.

Unknown charname '$stuff' at /usr/local/perls/perl-5.12.2/lib/5.12.2/unicore/Name.pl line 14
Deprecated character(s) in \N{...} starting at '$stuff' at test line 14.
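If you really do need a variable in that position, one way around the problem is to stay away from \N{...} altogether. This is only a sketch of a workaround, not something the \N feature provides: it assumes you know ahead of time whether the value is a count or a character name, and $count and $name are hypothetical stand-ins for whatever arrives in $ARGV[0]. For a count it falls back to [^\n] with an ordinary interpolated quantifier, and for a name it looks up the character with charnames::vianame before the pattern is compiled:

```perl
use strict;
use warnings;
use charnames ':full';

my $string = "abcdefxyz\n";

# Case 1: the value is a repeat count. Validate it, then use the
# old negated class so the braces are an ordinary quantifier.
my $count = 3;   # hypothetical input
die "Not a number: $count\n" unless $count =~ /\A[0-9]+\z/;
my( $middle ) = $string =~ m/abc([^\n]{$count})xyz/;
print "Matched [$middle]\n" if defined $middle;

# Case 2: the value is a character name. Resolve it to a character
# first, then interpolate that character (quoted) into the pattern.
my $name = 'SKULL AND CROSSBONES';   # hypothetical input
my $code = charnames::vianame( $name );
die "Unknown character name: $name\n" unless defined $code;
my $char  = chr $code;
my $found = "abc${char}xyz" =~ m/abc\Q$char\Exyz/;
print "Found the named character\n" if $found;
```

Either way, the pattern that Perl finally compiles has no \N{...} left in it to argue about.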

Problem solved? Not so fast. As you saw in Item 74, “Specify Unicode characters by code point or name”, you can create aliases for Unicode character names:

use charnames ':full', ':alias' => {
	LONGEST     => 'ARABIC LIGATURE ...',
	OMG_PIRATES => 'SKULL AND CROSSBONES',
	RQUOTE      => 'RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK',
	LQUOTE      => 'LEFT-POINTING DOUBLE ANGLE QUOTATION MARK',
	};

What if you want to use a number as your alias, even for something completely unrelated to your regular expression?

use charnames ':full', ':alias' => {
	3     => 'DIGIT THREE',
	};

That alias means that in a normal double-quoted string, "\N{3}" ends up as just 3. But which one is \N{3} in a regular expression, where you might want those braces to be a quantifier instead? It turns out that it is always the quantifier. You can use digits to create aliases, but those aliases don't apply wherever Perl knows it is compiling a pattern, such as in m//, s///, or qr//. They do still apply in ordinary strings that you later interpolate into a pattern. These are different:

use charnames ':full', ':alias' => {
	3     => 'DIGIT THREE',
	};

my $pattern = "abc\N{3}xyz";
$string =~ m/$pattern/;    # \N{3} pre-interpolated, so 'abc3xyz'

$string =~ m/abc\N{3}xyz/; # no interpolation, so "abc\N\N\Nxyz"

Since there are different rules for interpolation in the strings and the regex, you have to be careful about constructing your regular expressions.
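You can check the two sets of rules without defining a digit alias at all. This sketch, assuming Perl 5.12 or later, uses the full name DIGIT THREE on the string side and the bare \N{3} on the regex side. The variable names are only for illustration:

```perl
use strict;
use warnings;
use 5.012;
use charnames ':full';

# In a double-quoted string, \N{...} is a named character
my $as_string = "abc\N{DIGIT THREE}xyz";    # the string 'abc3xyz'

# The string is already built, so the pattern sees a literal 3
my $from_string = 'abc3xyz' =~ m/\A$as_string\z/;

# In a pattern, \N{3} is "not a newline" three times
my $three_chars = 'abcdefxyz' =~ m/\Aabc\N{3}xyz\z/;
my $one_char    = 'abc3xyz'   =~ m/\Aabc\N{3}xyz\z/;

print "string form matches abc3xyz\n"       if $from_string;
print "regex form matches abcdefxyz\n"      if $three_chars;
print "regex form does not match abc3xyz\n" unless $one_char;
```

The same five characters, \N{3}, end up meaning two different things depending on where you typed them.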

Things to remember

  • The \N in a regex is anything not a newline, independent of the absence or presence of a /s flag.
  • The \N{...} where the ... is a digit is only "not a newline" followed by a quantifier, despite the normal double-quoted rules and user-defined aliases.
  • Don't use digits as character name aliases.


2 Comments.

  1. Should the comment in line 8 of your final snippet comment replace ‘\n’ with either ‘[^\n]‘ or ‘\N’?
    If not, I think I’m confused.

    > $string =~ m/abc\N{3]xyz/; # no interpolation, so “abc\n\n\nxyz”

    $string =~ m/abc\N{3]xyz/; # no interpolation, so “abc[^\n][^\n][^\n]xyz”
    aka
    $string =~ m/abc\N{3]xyz/; # no interpolation, so “abc\N\N\Nxyz”
