Use /gc and \G in matches to split alternations into separate, smaller patterns

Perl keeps track of the last position in a string where it had a successful global match (using the /g flag). You can access this position with the pos operator. With Perl 5.10, you can use the /p switch to get the per-match variable ${^MATCH} instead of the performance-dampening $&:

use 5.010;
my $string = 'The quick brown fox jumped over the lazy dog.';
$string =~ /quick/gp;

say "Matched [${^MATCH}] at ", pos( $string );

Since each string has its own last match position, you tell pos() which string you want to use. The output shows you what is matched and where it left off in the string:

Matched [quick] at 9
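
To see that the position really is stored per string, here is a minimal sketch with two strings matched independently. The variable names $first and $second are only for illustration:

```perl
use 5.010;

my $first  = 'The quick brown fox';
my $second = 'jumped over the lazy dog.';

$first  =~ /quick/g;   # leaves the position of $first just past "quick"
$second =~ /over/g;    # leaves the position of $second just past "over"

# each string reports its own position
say "first:  ", pos( $first );
say "second: ", pos( $second );
```

The two calls to pos() report different numbers because the global match on one string does not disturb the position stored with the other.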

Devel::Peek shows you that Perl added some “magic” to $string to keep track of the match position, which you see in the MG_LEN entry:

SV = PVMG(0x80c4ac) at 0x80f6a0
  REFCNT = 1
  FLAGS = (PADMY,SMG,POK,pPOK)
  IV = 0
  NV = 0
  PV = 0x207190 "The quick brown fox jumped over the lazy dog."\0
  CUR = 45
  LEN = 48
  MAGIC = 0x207d40
    MG_VIRTUAL = &PL_vtbl_mglob
    MG_TYPE = PERL_MAGIC_regex_global(g)
    MG_LEN = 9

Since you matched in scalar context, Perl made one match and stopped. The next global match picks up where the last one left off:

# continued from the previous program
$string =~ /e/g;

say "Matched 'e' at ", pos( $string );

The output shows that you matched the first e after quick, not the first e in the string. The position from pos() is the character where the next match will start, so that’s one character after the last one that you matched:

Matched 'e' at 25

Usually when a global match fails, Perl resets the position to undef, so the next global match starts at the beginning of the string (position 0).

However, if you use the /c match operator flag, you can try a match without resetting the match position if it fails:

$string =~ /x/gc;
say "Did not match 'x' after " . pos( $string );

The output shows that the position of the match stayed the same as it was after you matched the e at position 24:

Did not match 'x' after 25
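
A small side-by-side sketch may make the difference clearer. It assumes the same sentence as before and uses /zebra/ as a pattern that cannot match; the $after_* variable names are only for illustration:

```perl
use 5.010;

my $string = 'The quick brown fox jumped over the lazy dog.';

$string =~ /quick/g;                     # succeeds
my $after_quick = pos( $string );

$string =~ /zebra/gc;                    # fails, but /c keeps the position
my $after_failed_gc = pos( $string );

$string =~ /zebra/g;                     # fails without /c, so the position is reset
my $after_failed_g = pos( $string );

say "After /quick/g:         ", $after_quick;
say "After failed /zebra/gc: ", $after_failed_gc;
say "After failed /zebra/g:  ", $after_failed_g // 'undef';
```

The first two lines of output report 9, and the last reports undef, since the failed match without /c throws away the stored position.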

As with any other match operation, unless you anchor your regex, the match operator tries the pattern at successive positions, moving from left to right through the string, until it finds a match. A global match has a special anchor, the \G, that constrains the next global pattern to start at the current match position. If you wanted to check if there is whitespace right after the match position, you’d anchor the next part of the pattern with \G:

if( $string =~ /\G\s+/gc ) {
     say "Matched whitespace right after position " . pos( $string );
     }
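
As a rough illustration of what the anchor changes, this sketch tries the same word with and without \G from the same starting position. The variable names are made up for the example:

```perl
use 5.010;

my $string = 'The quick brown fox jumped over the lazy dog.';

$string =~ /quick/g;                     # position is now 9, just before a space

# With \G, "fox" would have to start exactly at position 9, so this fails;
# the /c flag leaves the position where it was.
my $anchored     = $string =~ /\Gfox/gc ? 'matched' : 'failed';
my $pos_anchored = pos( $string );

# Without \G, the match scans forward from position 9 and finds "fox".
my $unanchored     = $string =~ /fox/gc ? 'matched' : 'failed';
my $pos_unanchored = pos( $string );

say "With \\G:    $anchored, position $pos_anchored";
say "Without \\G: $unanchored, position $pos_unanchored";
```

The anchored attempt fails and leaves the position at 9, while the unanchored one skips over “ brown ” and moves the position to 19.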

Now that you know that you can try a global match without resetting the match position, you can try several different patterns to find the one that works for the next part of your string. If one match fails, you simply move on to another. This is extremely useful because you don’t have to construct a single pattern to match all possible cases, which would likely lead to long chains of ugly alternations.

Suppose, for instance, that you wanted to write a simple sentence tokenizer. For actual work, this is a really hard problem, but ignore that. Define several regular expressions, give them names, and store them in @tuples as pairs. Remember that /c is a match operator flag and not a pattern flag, so you don’t use it with qr//:

use 5.010;

my @tuples = (
	[ qr/\G\v+/,          'vspace'      ],
	[ qr/\G\h+/,          'hspace'      ],
	[ qr/\G[[:punct:]]+/, 'punctuation' ],
	[ qr/\G(dog|fox)\b/i, 'animal'      ],
	[ qr/\G[a-z]+/i,      'letters'     ],
	);

Then, loop through each pair until you find one that matches. If you find a match, push the matched portion onto @tokens and restart the loop with redo. When you restart the loop, you’ll go through all the patterns again but at the new match position. Eventually, you’ll go through all of the patterns without finding a match, either because you don’t recognize part of the string or you reached the end of the string. Once you’ve tried every pattern unsuccessfully, you merely exit the loop.

If you wanted to tokenize the sentence from the previous example, that process would look something like this:

use 5.010;

my $string = 'The quick brown fox jumped over the lazy dog.';
my @tokens;

LOOP:
	{
	TUPLE: foreach my $tuple ( @tuples )
		{
		my( $pattern, $type ) = @$tuple;
		next TUPLE unless $string =~ m/$pattern/pgc;
		push @tokens, [ ${^MATCH}, $type ];
		redo LOOP;
		}
	}

foreach my $token ( @tokens )
	{
	printf "%-10s %-10s\n", @$token;
	}

Notice how the tokenization code does not care about the number of patterns. You can add as many patterns as you like without touching that bit of code. You merely have to get the patterns in the right order.

The output shows that you’ve broken the string into tokens, recognizing the various pieces of the sentence as you described them (although you don’t “see” the whitespace):

The        letters   
           hspace    
quick      letters   
           hspace    
brown      letters   
           hspace    
fox        animal   
           hspace    
jumped     letters   
           hspace    
over       letters   
           hspace    
the        letters   
           hspace    
lazy       letters   
           hspace    
dog        animal   
.          punctuation
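
The loop stops both at the end of the string and at input that no pattern recognizes. One way to tell the two apart is to compare the final match position with the length of the string. This is only a sketch under some assumptions: tokenize() is a hypothetical helper name, and the pattern list is cut down to two entries so that the digits in the sample input go unrecognized:

```perl
use 5.010;

my @tuples = (
	[ qr/\G\h+/,     'hspace'  ],
	[ qr/\G[a-z]+/i, 'letters' ],
	);

# tokenize() is a hypothetical helper, not part of the original program
sub tokenize
	{
	my( $string ) = @_;
	my @tokens;

	LOOP:
		{
		foreach my $tuple ( @tuples )
			{
			my( $pattern, $type ) = @$tuple;
			next unless $string =~ m/$pattern/pgc;
			push @tokens, [ ${^MATCH}, $type ];
			redo LOOP;
			}
		}

	# With /c, failed matches leave pos() where the last success ended.
	# It is still undef if nothing ever matched, so treat that as 0.
	my $stopped_at = pos( $string ) // 0;
	return ( \@tokens, $stopped_at );
	}

my $input = 'lazy dog 42 times';
my( $tokens, $stopped_at ) = tokenize( $input );

if( $stopped_at < length $input )
	{
	say "Unrecognized input at position $stopped_at: [",
		substr( $input, $stopped_at ), "]";
	}
else
	{
	say "Tokenized the whole string";
	}
```

Here the tokenizer collects four tokens and then stops at position 9, in front of the digits, which none of the two patterns can match.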

Usually, the next step after (or even during) tokenization is to figure out how the tokens relate to and operate on each other. That’s a story for a different item, though.

Things to remember

  • In scalar context, a global match starts where you left off in that string
  • The /c flag prevents the match operator from resetting the match position on a failure
  • Using /gc allows you to split alternations into separate regexes instead of one big and ugly regex