Use lookarounds to split to avoid special cases

There are some regular expression tricks that can help you deal with balanced delimiters in a string. The split function takes a pattern, removes the parts of a string that match that pattern, and gives you a list of the parts of the string between those separators. Said another way, split works when the parts you don’t need are between the values.

Single character separators are easy:

use v5.10;

my @letters = split /:/, 'a:b:c:d:e';
say "@letters";

The list comes out just as you expect:

a b c d e

Even multiple or variable width patterns are fine:

use v5.10;

my @cats = split /\s+/, 'Buster 
	Mimi     Roscoe';
say "@cats";

Each run of mixed whitespace, including the newline and the tab, counts as a single separator:

Buster Mimi Roscoe

It gets trickier when you have balanced delimiters, where something marks both the start and the end of each value. The problem is that there is something in front of the first element and something after the last element. You can’t split on the pattern of characters between the values because that doesn’t remove everything:

use v5.10;

my @cats = split /></, '<Buster><Mimi><Roscoe>';
say "@cats";

The first and last delimiter characters are still attached to their values:

<Buster Mimi Roscoe>

You might be tempted to live with that and process those values after the split:

use v5.10;

my @cats = split /></, '<Buster><Mimi><Roscoe>';
$cats[0] =~ s/<//;
$cats[-1] =~ s/>//;
say "@cats";

Some people might be satisfied with that, and it does work, but it’s much better to remove the special cases. You could extend the pattern to match the outer delimiters too, but matching only the characters that you want to remove has its own problem: you get an empty leading field when the pattern matches the first delimiter character:

use v5.10;

my @cats = split /><|\A<|>\z/, '<Buster><Mimi><Roscoe>';

say "@cats";

There’s a space at the beginning of the output because there’s an empty leading field, but the list at least doesn’t have any of the delimiter characters:

 Buster Mimi Roscoe
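The leading space in that output is easy to miss. As a small sketch, wrapping each element in brackets (a display choice made only for this illustration) makes the empty field visible. It also shows that split drops the empty trailing field that the >\z alternative produces, while it keeps the empty leading one:

```perl
use v5.10;

my @cats = split /><|\A<|>\z/, '<Buster><Mimi><Roscoe>';

# Bracket each element so that an empty string shows up as []
my $shown = join ' ', map { "[$_]" } @cats;

say scalar @cats;   # 4
say $shown;         # [] [Buster] [Mimi] [Roscoe]
```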

To fix this, you still need to handle the leading field, perhaps by shifting it off. Again, this works, even if it’s unsightly:

use v5.10;

my @cats = split /><|\A<|>\z/, '<Buster><Mimi><Roscoe>';
shift @cats;

say "@cats";

The special processing isn’t as bad as before, but you still have to remember to handle that one element.

Instead of matching characters, you can use lookarounds to split in the middle of the balanced delimiters. A lookaround is a zero-width assertion: it tests a condition at a position in the string but does not consume any characters.

If you use a lookbehind next to a lookahead, you can split on the position in the string where both conditions match. You want to match in the middle of a >< so the > ends up with the preceding element and the < stays with the succeeding element.

The positive lookbehind has the general form (?<=PATTERN). That pattern, which must be fixed-width, must match before the position. In this case, you want to match a > before the position, so the assertion is (?<=>).

The positive lookahead is almost the same thing, with the form (?=PATTERN). You want to match a < after the position, so your assertion is (?=<).
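Before combining the two, it may help to confirm that an assertion really matches a position and nothing else. This minimal sketch uses a lookahead on its own and compares the start and end offsets of the match from the @- and @+ arrays. The sample string and the variable names are assumptions for the illustration:

```perl
use v5.10;

my $string = '<Buster><Mimi>';

my( $start, $end );

# Succeed at the first position that has <M right after it
if ( $string =~ /(?=<M)/ ) {
	# $-[0] is where the match starts, $+[0] is where it ends
	( $start, $end ) = ( $-[0], $+[0] );
}

say "match starts at $start and ends at $end";   # both are 8
```

The start and end offsets are the same, so the match has no width. The characters on either side are still in the string for something else to use.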

Putting them together, with the lookbehind next to the lookahead, splits the values:

use v5.10;

my @cats = split /(?<=>)(?=<)/, '<Buster><Mimi><Roscoe>';

say "@cats";

The output list still has the delimiter characters, but now each element needs the same processing, so there are no special cases:

<Buster> <Mimi> <Roscoe>
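To see where those split points fall, you can run the same pattern in a global match and record the position after each zero-width match. This is only a sketch to inspect the pattern, not something you need for the split itself:

```perl
use v5.10;

my $string = '<Buster><Mimi><Roscoe>';

# Collect every position that has > before it and < after it
my @positions;
while ( $string =~ /(?<=>)(?=<)/g ) {
	push @positions, pos($string);
}

say "@positions";   # 8 14
```

There are two positions for three values, and neither is at the start or the end of the string. That is why the split produces no empty leading field.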

Once you have the values in their own elements, you can remove the delimiters:

use v5.14;

my @cats = 
	map { s/\A<|>\z//rg }    # return the modified value
	split /(?<=>)(?=<)/, 
	'<Buster><Mimi><Roscoe>';

say "@cats";

That might seem a bit silly, but we’re only using a simple example to illustrate the point.

Consider a slightly more complicated case, where the fields are quoted, but then separated by commas. Unless you’re learning to re-invent the wheel (a valid exercise to sharpen your skills), you should probably use a module (Item 115. Don’t use regular expressions for comma-separated values). For this example, you’ll do it yourself:

use v5.10;

my @cats = 
	split /(?<="),(?=")/, 
	'"Buster","Mimi","Roscoe"';

say "@cats";

This removes the commas, as long as they are between quotes. However, you leave the quotes in place so you don't treat the first and last values specially:

"Buster" "Mimi" "Roscoe"

To get rid of the quotes, you process each item in the same way:

use v5.14;

my @cats = 
	map { s/\A"|"\z//rg }       # return the modified value
	split /(?<="),(?=")/, 
	'"Buster","Mimi","Roscoe"';

say "@cats";
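Because the pattern splits only on a comma that has a quote on both sides, a comma inside a value stays put. The sketch below uses a made-up value, "Buster, Jr.", to show that. It makes no attempt to handle escaped or embedded quotes, which is one more reason to prefer a module for real data:

```perl
use v5.14;

my @cats =
	map { s/\A"|"\z//rg }      # strip the quotes from each value
	split /(?<="),(?=")/,      # split only on commas between quotes
	'"Buster, Jr.","Mimi","Roscoe"';

say scalar @cats;        # 3
say join ' | ', @cats;   # Buster, Jr. | Mimi | Roscoe
```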

You might try to construct a more complicated regular expression to also remove the quotes, but that's going to be harder to read and maintain than doing it in two simple steps.
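If you need the two steps in more than one place, one option is to wrap them in a subroutine. The split_balanced name and its argument order are hypothetical, made up for this sketch. It assumes that the values sit back to back with nothing between them and that the delimiters are fixed strings, which keeps the lookbehind fixed-width:

```perl
use v5.14;

# Hypothetical helper: split back-to-back delimited values,
# then strip the delimiters from each one
sub split_balanced {
	my( $open, $close, $string ) = @_;

	# Escape the delimiters so characters such as ( or [ are literal
	my( $o, $c ) = map { quotemeta($_) } $open, $close;

	return
		map { s/\A$o|$c\z//rg }          # step 2: strip delimiters
		split /(?<=$c)(?=$o)/, $string;  # step 1: split between them
}

my @cats = split_balanced( '<', '>', '<Buster><Mimi><Roscoe>' );
say "@cats";   # Buster Mimi Roscoe
```

The subroutine keeps the same two simple steps. It only hides them behind a name.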

Things to remember

  • You don't have to remove delimiters in one step
  • You can use a lookbehind next to a lookahead to specify a position in a string