Use lookarounds to eliminate special cases in split

The split built-in takes a string and turns it into a list, discarding the separators that you specify as a pattern. This is easy when the separator is simple, but seems hard if the separator gets more tricky.

For a simple example, you can split an entry from /etc/passwd (although the getpw* functions will do that for you):

root:*:0:0:System Administrator:/var/root:/bin/sh

The colons separate the fields, so you split on a colon:

my @fields = split /:/, $passwd_line;

That works just fine because the separator is a single character, that character is the same between each field, and the separator character doesn’t appear in any of the data.

A slightly trickier example has a character from the separator also show up in the data. Consider comma-separated values, which can also allow a comma in the data. If you really have to parse these, you should use a module (Item 115. Don’t use regular expressions for comma-separated values). However, this is a good task to illustrate some of the tricks in this Item. You might see this data stored in many ways. You are likely to see all the fields quoted if any one of them has a comma:

"Buster","Roscoe, Cat","Mimi"

You can split on ",", which separates all the fields:

my $string = q("Buster","Roscoe, Cat","Mimi");

my @fields = split /","/, $string;

$" = "\n";
print "@fields\n";

However, the first and last fields have remnants of the quoting:

"Buster
Roscoe, Cat
Mimi"

In this case, the simple split failed because it only removes text between the fields and doesn’t care at all about text at the beginning of the string or the end of the string.

You might think that you can make special cases to handle the bits at the beginning and end of the string. Creating special cases is almost always what you want to avoid: they make the code more complicated and they make you think about more than you really need to think about. Still, you can do that with alternations in the pattern:

my $string = q("Buster","Roscoe, Cat","Mimi");

my @fields = split /\A"|","|"\z/, $string;

$" = "\n";
print "@fields\n";

And it doesn’t work. The split keeps leading empty fields, so you get an extra, empty field at the start:


Buster
Roscoe, Cat
Mimi
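You can check how split treats empty fields at each end by splitting two small strings. This is a minimal sketch with made-up sample strings that are not part of the example above:

```perl
use strict;
use warnings;

# Hypothetical sample strings: one starts with a separator, one ends with it
my @leading  = split /,/, ',a,b';
my @trailing = split /,/, 'a,b,';

# split keeps an empty leading field but drops empty trailing fields
print scalar(@leading),  " fields from ',a,b'\n";   # 3 fields
print scalar(@trailing), " fields from 'a,b,'\n";   # 2 fields
```

That asymmetry is why the alternation leaves an extra field at the start of the list but not at the end.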

You could handle that by removing the first element, but that’s more duct tape and spit over the other kludge. Not only do you have two special cases in the pattern, but you have a special case in the output.

You don’t have to remove the quotes right away though. You can reduce all the special cases by not matching the quote characters in the split pattern. You can use a lookaround to find the commas surrounded by quotes:

my $string = q("Buster","Roscoe, Cat","Mimi");

my @fields = split /(?<="),(?=")/, $string;

$" = "\n";
print "@fields\n";

The positive lookbehind, (?<=...), is a zero-width assertion. It matches a pattern that exists (hence positive) but doesn't consume the characters it matches. You already know about other zero-width assertions, such as \b and ^. Like those, a lookbehind merely asserts a condition, in this case about the part of the string just before the current position. The positive lookahead, (?=...), is the same thing, but looks forward from the current position.

Now all of the fields retain their quotes because the lookarounds do not consume the characters they match, even though they assert those characters must be there:

"Buster"
"Roscoe, Cat"
"Mimi"
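To see the zero-width behavior directly, you can match the separator pattern once and look at what it actually consumed. This is only a sketch; the variable names $start and $matched are made up for illustration, and it relies on the built-in @- and @+ arrays for the match offsets:

```perl
use strict;
use warnings;

my $string = q("Buster","Roscoe, Cat","Mimi");

# Find the first separator using the same lookarounds as the split pattern
my( $start, $matched );
if( $string =~ /(?<="),(?=")/ ) {
	$start   = $-[0];                                  # offset where the match begins
	$matched = substr $string, $-[0], $+[0] - $-[0];   # the text the match consumed
	print "Matched [$matched] at offset $start\n";     # Matched [,] at offset 8
}
```

The match is a single comma. The quotes on either side had to be there for the match to succeed, but they are not part of the matched text, so split leaves them in the fields.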

You can easily strip off the quotes, handling every element returned by split in the same way:

use v5.14;
my $string = q("Buster","Roscoe, Cat","Mimi");

my @fields =
	map { s/\A"|"\z//gr }
	split /(?<="),(?=")/, $string;

$" = "\n";
print "@fields\n";

The pattern has no special cases, and the output from split has no special cases. Eliminating special cases reduces the number of things you have to remember and reduces the likelihood that you'll mess up one of the cases. Every field comes out clean:

Buster
Roscoe, Cat
Mimi

What if the data were even more complex, with a literal quote mark inside a field? If you allow that, you can imagine a quote character next to a comma inside the field:

"Buster","Roscoe "","" Cat","Mimi"

Now you want to split on a comma with quotes around it, but only if it doesn't have two consecutive quotes on either side. You can combine the positive lookarounds with negative lookarounds. The negative versions act the same, but assert that the condition cannot match, just like a \B asserts that the position is not a word boundary:

use v5.14;
my $string = q("Buster","Roscoe "","" Cat","Mimi");

my @fields =
	map { s/"(?=")//gr }
	map { s/\A"|"\z//gr }
	split /(?<!"")(?<="),(?=")(?!"")/, $string;

$" = "\n";
print "@fields\n";

To process the "", you use another positive lookahead, "(?="), which removes a quote only when another quote follows it. That unescapes the doubled double quote character:

Buster
Roscoe "," Cat
Mimi
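If you want to see which commas each pattern accepts as a separator, you can collect the match offsets. This is a sketch for illustration only; the helper separator_offsets is a made-up name and is not part of the code above:

```perl
use strict;
use warnings;

my $string = q("Buster","Roscoe "","" Cat","Mimi");

# Hypothetical helper: the offset of every comma a pattern treats as a separator
sub separator_offsets {
	my( $pattern, $text ) = @_;
	my @offsets;
	push @offsets, $-[0] while $text =~ /$pattern/g;
	return @offsets;
}

my @positive_only = separator_offsets( qr/(?<="),(?=")/, $string );
my @with_negative = separator_offsets( qr/(?<!"")(?<="),(?=")(?!"")/, $string );

print "Positive only: @positive_only\n";   # Positive only: 8 19 27
print "With negative: @with_negative\n";   # With negative: 8 27
```

The comma at offset 19 is the one inside the data. The positive lookarounds alone accept it because it has a quote on each side, while the negative lookarounds reject it because those quotes are doubled.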

As a final example, instead of quoted fields, you might see the non-separator comma as an escaped character:

Buster,Roscoe\, Cat,Mimi

In this case, you only want to split on a comma that does not have an escape character before it. You can't use a positive lookbehind because there is no character that must appear before the comma. Instead, you want a negative lookbehind because you want to assert that a particular character, the backslash, does not appear before the comma. Instead of a =, you use a !, giving (?<!...):

use v5.14;
my $string = q(Buster,Roscoe\\, Cat,Mimi);

my @fields =
	map { s/\\(?=,)//gr }
	split /(?<!\\),/, $string;

$" = "\n";
print "@fields\n";

Again, you use another positive lookahead, (?=,), in the s/// so your substitution pattern does not match the character that you don't want to replace. Otherwise, you'd have to type the comma twice:

s/\\,/,/gr
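Both forms of the substitution should give the same fields. The following sketch runs them side by side on the same string; the array names are made up for illustration:

```perl
use v5.14;

my $string = q(Buster,Roscoe\\, Cat,Mimi);

# Same split both times; only the unescaping substitution differs
my @with_lookahead = map { s/\\(?=,)//gr } split /(?<!\\),/, $string;
my @with_literal   = map { s/\\,/,/gr }    split /(?<!\\),/, $string;

say for @with_lookahead;
say "Same result" if "@with_lookahead" eq "@with_literal";
```

With either substitution, the fields are Buster, then Roscoe, Cat with its backslash removed, then Mimi.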

You can go even further with these examples, creating much uglier and more complex patterns to handle additional levels of quoting. This should naturally lead you to believe that regular expressions (or at least a single regular expression) aren't the best tool for this.
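For comparison, here is a sketch that hands the job to a module. It uses the core Text::ParseWords module only because it ships with Perl; that choice is an assumption of this example, not a recommendation from this Item, and a dedicated CSV parser from CPAN, such as Text::CSV, is likely the better fit for real data. Note that parse_line also treats single quotes as quoting characters and does not understand the doubled "" escape:

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

my $quoted  = q("Buster","Roscoe, Cat","Mimi");
my $escaped = q(Buster,Roscoe\\, Cat,Mimi);

# A false second argument tells parse_line to drop the quotes and backslashes
my @from_quoted  = parse_line( ',', 0, $quoted );
my @from_escaped = parse_line( ',', 0, $escaped );

print join( "\n", @from_quoted ), "\n";
```

Both calls should produce the same three fields without any lookarounds or cleanup substitutions in your own code.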

Things to remember

  • If you really have to parse comma-separated values, use a module instead of writing your own patterns
  • Lookarounds assert a condition in the string without consuming any characters
  • The positive lookarounds assert their patterns must match
  • The negative lookarounds assert their pattern must not match
  • Use the lookarounds to eliminate special cases in complex split patterns