Perl v5.20 fixes taint problems with locale

Perl v5.20 fixes taint checking in regular expressions that might use the locale in their patterns, even if that part of the pattern isn’t a successful part of the match. The perlsec documentation has long said that taint checking worked that way, but until v5.20, it didn’t.

The only approved way to untaint a variable is through a successful pattern match with captures:

my $tainted = ...;

my $untainted;
if( $tainted =~ m/\A (\w+) \z/x ) {
    $untainted = $1;   # only trust $1 after a successful match
}

The problem is \w. Which characters does that match? I discussed this in Know your character classes under different semantics, although I didn’t focus on \w.
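To see how much the answer can vary, here is a small sketch that matches the same one-character string under different rules. It uses the /u and /a regex flags (v5.14 and later) instead of a locale, because locale results depend on the machine you run it on. The variable names are only for illustration.

```perl
use v5.14;
use warnings;

# U+00E9, LATIN SMALL LETTER E WITH ACUTE
my $char = "\x{E9}";

# Unicode rules: this character counts as a word character
my $unicode_rules = ( $char =~ /\A\w\z/u ) ? 'match' : 'no match';

# ASCII-restricted rules: \w is limited to A-Z, a-z, 0-9, and _
my $ascii_rules = ( $char =~ /\A\w\z/a ) ? 'match' : 'no match';

# Explicit class: no outside setting changes what this means
my $explicit = ( $char =~ /\A[A-Za-z0-9_]\z/ ) ? 'match' : 'no match';

say "/u:             $unicode_rules";
say "/a:             $ascii_rules";
say "explicit class: $explicit";
```

The same \w gives two different answers depending on a flag, and a locale adds yet another set of rules that you don’t control from inside the pattern.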

With use locale, settings from outside the program decide what \w matches. If you don’t set the locale yourself, someone is setting it for you, possibly letting character classes match what you didn’t intend (or even know about). That’s contrary to the spirit of the advice in perlsec:

you must be exceedingly careful with your patterns.

If something doesn’t have a knowable meaning and you use it, you aren’t being “exceedingly careful”. perlsec recommends using no locale to fix this. It’s better to write a pattern that doesn’t go anywhere near locale issues.

I cover this in greater detail in the “Secure Programming Techniques” chapter of Mastering Perl.

The perlsec example is exceedingly non-careful with recent versions of Perl. Here is the example from the v5.20 docs, which has been the example since at least v5.003 (released in 1996):

    if ($data =~ /^([-\@\w.]+)$/) {
        $data = $1;                     # $data now untainted
    } else {
        die "Bad data in '$data'";      # log this somewhere
    }

You already know the problem with the \w. What are ^ and $? If you took my Learning Perl class, you know they are the beginning- and end-of-line anchors. Without any flags, the target string is one line. With the /m flag, they can match after or before a newline, respectively. They operate differently depending on the flags, and you might not get the behavior that you want.
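A short sketch of the difference, using made-up input strings and a plain [a-z] class so that only the anchors are in play:

```perl
use strict;
use warnings;

# Hypothetical inputs: one with a trailing newline, one with a second line
my $trailing = "buster\n";
my $two_line = "buster\nsomething else";

# $ can match just before a newline at the end of the string
my $dollar_trailing = ( $trailing =~ /^[a-z]+$/ ) ? 1 : 0;

# \z matches only at the very end of the string
my $z_trailing = ( $trailing =~ /\A[a-z]+\z/ ) ? 1 : 0;

# With /m, ^ and $ anchor to any line, so the first line alone is enough
my $dollar_multiline = ( $two_line =~ /^[a-z]+$/m ) ? 1 : 0;

# \A and \z mean the same thing with or without /m
my $z_multiline = ( $two_line =~ /\A[a-z]+\z/m ) ? 1 : 0;

print "\$ with trailing newline:  $dollar_trailing\n";
print "\\z with trailing newline: $z_trailing\n";
print "\$ with /m, two lines:     $dollar_multiline\n";
print "\\z with /m, two lines:    $z_multiline\n";
```

The ^ and $ versions accept both strings even though neither is exactly a run of lowercase letters; the \A and \z versions reject both.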

Prior to v5.14 (more correctly, the version of re that came with it), you had to put those flags with the qr// operator to compile the pattern or the operator that uses the pattern (see Know the difference between regex and match operator flags).

For v5.14 and later, the re module allows you to set default flags that apply to all patterns in its lexical scope (see Set default regular expression modifiers). Those flags might turn on something you didn’t intend to use, and they don’t show up near the code you care about. They might not even be there when you write the code; someone might add them later in a fit of “modern Perl” or Perl Best Practices fever.
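Here is a minimal sketch of that action at a distance. The three blocks stand in for code in different files or scopes, and the input string is invented for the example. The first two patterns are character-for-character identical; only the surrounding scope differs.

```perl
use v5.14;
use warnings;

my $input = "buster\nsecond line";

# No default flags: $ cannot match in the middle of the string
my $plain = do {
    $input =~ /^([a-z]+)$/ ? $1 : undef;
};

# Same pattern, but a default /m is in effect for this scope
my $with_default = do {
    use re '/m';    # imagine this sits far away, at the top of the file
    $input =~ /^([a-z]+)$/ ? $1 : undef;
};

# \A and \z are not affected by the default /m
my $absolute = do {
    use re '/m';
    $input =~ /\A([a-z]+)\z/ ? $1 : undef;
};

say "no default flags: ", $plain        // 'no match';
say "default /m:       ", $with_default // 'no match';
say "\\A and \\z:        ", $absolute     // 'no match';
```

With the default /m in place, the unchanged pattern starts accepting the first line of a multi-line string.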

My guiding principle, though, is that any string which doesn’t exactly match what you expect isn’t safe enough to be untainted. If you mean the absolute beginning or end of the string, use the anchors that can only mean that, \A and \z (Item 35: Use zero-width assertions to match positions in a string).
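Applying both ideas to the perlsec example might look something like this. It is only a sketch: the subroutine name is hypothetical, and the explicit ASCII class is my assumption about what the original \w was meant to allow, so adjust it to whatever your data should really contain.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the captured value, or undef when the
# data is not exactly what we expect
sub untaint_address_like {
    my ($data) = @_;
    return unless defined $data;

    # \A and \z instead of ^ and $, and an explicit class instead of \w
    if ( $data =~ /\A([-\@A-Za-z0-9_.]+)\z/ ) {
        return $1;
    }

    return;    # log the bad data somewhere
}

my $good = untaint_address_like('buster@example.com');
my $bad  = untaint_address_like("buster\@example.com\n");

print "good: ", ( defined $good ? $good : 'rejected' ), "\n";
print "bad:  ", ( defined $bad  ? $bad  : 'rejected' ), "\n";
```

The second call fails because of the trailing newline, which the original ^ and $ version would have let through.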

Things to remember

  • The locale can change the meaning of character classes
  • Default regex flags change behavior from a distance
  • Don’t use character classes in regular expressions you use to untaint values
  • Use \A and \z for the absolute beginning and end of string


  1. Is there something like Perl Critic that will help with security and taint coverage?

    It wasn’t immediately obvious to me where the “Item 35” reference comes from, and it’s not a link. (I’ve found it in indexes to EPP.)
