Know split’s special cases – The Effective Perler

Perl’s split has some special cases and some perhaps surprising cases. The empty pattern, a zero-width match, the special argument ' ', and the pattern /^/ all act differently than you might expect from the general rule.

The empty pattern, //

The empty pattern is a special case that’s designed to give you a list of characters. This pattern specifically has nothing in it and is different than a pattern that matches an empty string (that’s next). For this, split returns a list of characters:

use utf8;

use Data::Printer;
my @characters = split //, 'Büster';

p( @characters );

The output shows a list of characters:

[
    [0] "B",
    [1] "ü",
    [2] "s",
    [3] "t",
    [4] "e",
    [5] "r"
]

This is specifically characters, not grapheme clusters. Depending on the normalization of your source code or input, you can get different results:

use utf8;

use Data::Printer;
use Unicode::Normalize qw(NFD);

my @characters = split //, NFD( 'Büster' );

p( @characters );

Now the grapheme cluster ü is actually two characters, the u (U+0075 ʟᴀᴛɪɴ ꜱᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ᴜ) and the ¨ (U+0308 ᴄᴏᴍʙɪɴɪɴɢ ᴅɪᴀᴇʀᴇꜱɪꜱ), instead of the single ü (U+00FC ʟᴀᴛɪɴ ꜱᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ᴜ ᴡɪᴛʜ ᴅɪᴀᴇʀᴇꜱɪꜱ):

[
    [0] "B",
    [1] "u",
    [2] "¨",
    [3] "s",
    [4] "t",
    [5] "e",
    [6] "r"
]

You can review grapheme clusters in Treat Unicode strings as grapheme clusters.
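
If you want grapheme clusters rather than characters, one option is a global match with \X instead of split. This is only a minimal sketch: the variable names are illustrative, and the string is spelled with an escape so the result doesn't depend on how the source file is encoded:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# "B\x{FC}ster" is 'Büster' with a precomposed ü (U+00FC)
my $decomposed = NFD( "B\x{FC}ster" );

# split // gives one element per character (code point)
my @characters = split //, $decomposed;

# \X matches one extended grapheme cluster at a time
my @clusters = $decomposed =~ /\X/g;

printf "%d characters, %d grapheme clusters\n",
    scalar @characters, scalar @clusters;
```

This prints "7 characters, 6 grapheme clusters": the u and its combining mark stay together in the second list.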

Matching the empty string

Successfully matching no characters isn’t really a special case, but people are sometimes surprised about it because it seems special. Your split pattern might match an empty string. This is different from the empty pattern because you actually have a pattern, even if it might match zero characters. This is also distinct from a pattern that doesn’t match:

use v5.10;

local $_ = 'Mimi';  # a lexical "my $_" was removed in Perl 5.24

say "Matched empty pattern" if //;
say "Matched optional whitespace" if /\s*/;
say "Matched zero width assertion" if /(?=\w+)/;
say "How did Buster match?" if /Buster/;

The first three of these patterns match successfully but match zero characters, while the fourth fails:

Matched empty pattern
Matched optional whitespace
Matched zero width assertion
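
One caveat about the empty pattern in the match operator: it only behaves as a genuinely empty pattern when no earlier match has succeeded. After a successful match, m// reuses that last successful pattern, while split // does not. A small sketch of the difference, with made-up variable names:

```perl
use strict;
use warnings;

my $name = 'Mimi';

# A successful match earlier in the same scope
my $earlier = 'Buster' =~ /Bus/ ? 1 : 0;

# In the match operator, the empty pattern reuses the last
# successful pattern (/Bus/ here), so this match fails
my $reused = $name =~ // ? 1 : 0;

# split does not do that: // still means "between every character"
my @characters = split //, $name;

print "earlier=$earlier reused=$reused fields=", scalar @characters, "\n";
```

This prints "earlier=1 reused=0 fields=4".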

It’s easy to construct a pattern that matches successfully yet matches zero characters. The ? (zero or one) and * (zero or more) quantifiers do that quite nicely. Zero-width assertions, such as the boundaries and lookarounds, do that too. If the pattern can match zero characters at every position, Perl splits into characters:

use Data::Printer;
my @characters = split /\s*/, 'Buster';

p( @characters );

[
    [0] "B",
    [1] "u",
    [2] "s",
    [3] "t",
    [4] "e",
    [5] "r"
]

The pattern doesn’t have to match zero characters for all separators.

use Data::Printer;
my @characters = split /\s*/, 'Buster and Mimi';

p( @characters );

Notice that there are no spaces in @characters, since split matched those as separator characters:

[
    [0]  "B",
    [1]  "u",
    [2]  "s",
    [3]  "t",
    [4]  "e",
    [5]  "r",
    [6]  "a",
    [7]  "n",
    [8]  "d",
    [9]  "M",
    [10] "i",
    [11] "m",
    [12] "i"
]
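
A zero-width pattern splits wherever it matches, so you get single characters only when it can match between every pair of characters. As a sketch with a made-up string, a lookahead that matches only in front of a capital letter splits there and nowhere else:

```perl
use strict;
use warnings;

# The lookahead matches zero characters, but only in front of a capital letter
my @names = split /(?=[A-Z])/, 'BusterMimiRoscoe';

# The zero-width match at the very start does not create an empty leading field
print join( ', ', @names ), "\n";
```

This prints "Buster, Mimi, Roscoe".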

The single space, ' '

The single space in quotes, single or double, is a special case. It splits on runs of whitespace, but unlike the pattern / / that matches a single space, the one in quotes discards empty leading fields:

use Data::Printer;
my @characters = split ' ', '  Buster and Mimi';

p( @characters );

You get just the non-whitespace with no empty fields:

[
    [0] "Buster",
    [1] "and",
    [2] "Mimi"
]
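
The closest ordinary pattern is /\s+/, but it still keeps one empty field in front of leading whitespace. A quick comparison, using an illustrative string that mixes spaces, a tab, and a newline:

```perl
use strict;
use warnings;

my $string = "  Buster\tand \n Mimi";

# The special case: runs of any whitespace separate fields,
# and leading whitespace is skipped
my @special = split ' ', $string;

# An ordinary pattern keeps the empty field before the leading whitespace
my @ordinary = split /\s+/, $string;

print scalar(@special), " fields versus ", scalar(@ordinary), "\n";
```

This prints "3 fields versus 4"; the extra field from /\s+/ is the empty string at index 0.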

This behavior comes from awk:

#!/usr/bin/awk -f
BEGIN {
    string="  Buster Mimi Roscoe";
    search=" ";
    n=split(string,array," ");
    print("[");
    for (i=1;i<=n;i++) {
        printf("    [%d] \"%s\"\n",i,array[i]);
    }
    print("]");
    exit;
}

You end up with almost the same output, although the indices are one greater:

[
    [1] "Buster"
    [2] "Mimi"
    [3] "Roscoe"
]

Back in Perl, if you tried that with the normal match operator delimiters, you get a different result:

use Data::Printer;
my @characters = split / /, '  Buster and Mimi';

p( @characters );

This time you kept the empty leading fields:

[
    [0] "",
    [1] "",
    [2] "Buster",
    [3] "and",
    [4] "Mimi"
]

If you include the m in front of the quotes though, you lose the special magic:

use Data::Printer;
my @characters = split m' ', '  Buster and Mimi';

p( @characters );

The empty leading fields are back:

[
    [0] "",
    [1] "",
    [2] "Buster",
    [3] "and",
    [4] "Mimi"
]

Splitting lines

The special pattern of just the beginning-of-line anchor, even without the /m flag, breaks a multi-line string into lines:

use Data::Printer;

my $string = <<'HERE';
Line one
Line two
Line three
HERE

my @lines = split /^/, $string;

p( @lines );

Even without the /m you get separate lines:

[
    [0] "Line one
",
    [1] "Line two
",
    [2] "Line three
"
]
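
If you would rather discard the line endings, splitting on the newline itself is the usual alternative. A short sketch comparing the two, with the string written inline instead of as a here-doc:

```perl
use strict;
use warnings;

my $string = "Line one\nLine two\nLine three\n";

# /^/ splits in front of each line, so every field keeps its newline
my @with_newlines = split /^/, $string;

# Splitting on the newline consumes it as the separator;
# the trailing empty field is dropped by default
my @without_newlines = split /\n/, $string;

print scalar(@with_newlines), " lines, ", scalar(@without_newlines), " lines\n";
```

Both lists have three elements; only the first keeps the "\n" on the end of each one.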

This only works if the pattern is exactly /^/. If you put anything else in the pattern, you don't get the special behavior, even if it's a zero width match:

...; # same as before

my @lines = split /^(?=Line)/, $string;  # Oops

p( @lines );

Now there's only one field:

[
    [0] "Line one
Line two
Line three
"
]
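
Once the pattern is more than a bare ^, you have to ask for multi-line behavior yourself with the /m flag. A minimal sketch with the same three lines written inline:

```perl
use strict;
use warnings;

my $string = "Line one\nLine two\nLine three\n";

# Without /m, ^ only matches at the start of the string: one field
my @one_field = split /^(?=Line)/, $string;

# With /m, ^ also matches after each embedded newline: three fields
my @lines = split /^(?=Line)/m, $string;

print scalar(@one_field), " field, ", scalar(@lines), " lines\n";
```

This prints "1 field, 3 lines".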

Things to remember

The empty pattern // splits on characters, but not grapheme clusters
A pattern that can successfully match zero characters at every position splits on characters too
The single space in quotes splits on whitespace and discards leading empty fields
The ^ anchor by itself splits into lines, even without the /m

2 thoughts on “Know split’s special cases”

google.com/accounts/o8… says:

November 28, 2011 at 12:14 pm

Careful there, split may not actually have a // special case, but match *does*.

google.com/accounts/o8… says:

July 3, 2012 at 5:45 pm

Thanks, didn’t know about the split lines case. I usually want to discard my line endings but this would come in handy when I want to keep them.
