Normalize your Perl source

Perl has had Unicode support since Perl 5.6, which means that most Perl tutorials have been bending the truth a bit when they tell you that a Perl identifier, the name that you give to a variable, starts with [A-Za-z_] and continues with [0-9A-Za-z_]. With Unicode support, you have many more characters available to you, but moving outside the ASCII range has some problems. You can’t always tell what a variable name is just by looking at it (and this is a design bug in Perl: RT 96814). For instance, you don’t really know what this variable is:

use utf8;

my $résumé = 'http://www.example.com/resume.html';

If you wanted to use that variable later in your program, what would you type? It seems simple, but Unicode has two ways to represent the é glyph. It has the composed version, a single character (U+00E9 LATIN SMALL LETTER E WITH ACUTE), and the decomposed version of two characters, (U+0065 LATIN SMALL LETTER E) followed by (U+0301 COMBINING ACUTE ACCENT). Depending on your editor setup, you might not even get the form that you think you typed.
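Here's a minimal sketch of the difference between the two forms. It builds both spellings with escapes (so nothing can renormalize them on the way to you) and compares them before and after normalization. The code_points helper is a name I made up for this illustration; NFC comes from Unicode::Normalize:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Build both spellings with escapes so the source file's own
# normalization cannot change what the program sees.
my $composed   = "\x{E9}";    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
my $decomposed = "e\x{301}";  # U+0065, then U+0301 COMBINING ACUTE ACCENT

# Hypothetical helper: list each code point in a string as U+XXXX.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $string;
}

printf "composed:   %s (length %d)\n",
    code_points($composed), length($composed);
printf "decomposed: %s (length %d)\n",
    code_points($decomposed), length($decomposed);

# The raw strings differ, but their NFC forms are the same.
print "raw: ", ( $composed eq $decomposed           ? 'equal' : 'different' ), "\n";
print "NFC: ", ( NFC($composed) eq NFC($decomposed) ? 'equal' : 'different' ), "\n";
```

The two strings look identical on screen, but a plain eq sees different code points, and that is the same kind of comparison that decides whether two identifiers are the same variable.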

None of this would be a problem if Perl normalized the variable names for you. Every time that you typed é, no matter how you created that glyph, you would get the same representation in the source code. However, as of Perl 5.14, Perl does not do this for you. So, it’s a problem.

Consider how the next programmer knows what your variable name is. How many variables are in this script, and do you get a warning? What is the output of this simple program?

[It looks like I can’t post this correctly because various things normalize it along the way!]

use utf8;
use 5.010;

my $é = 'abc';
my $é = '123';

$é = 'XYZ';

say "One char = ", $é;
say "Two char = ", $é;

Now, how about this program?

use utf8;
use 5.010;

my $é = 'abc';
my $é = '123';

$é = 'XYZ';

say "One char = ", $é;
say "Two char = ", $é;

There are two possible programs because there are two possible variables at line 7. You can’t tell just by looking at the source in your editor. Depending on which variable gets the XYZ assignment, you get different outputs:

One char = XYZ
Two char = abc

One char = 123
Two char = XYZ

There’s danger in this Item since you are reading it on the web and various things might have happened to the text as it made its way through databases and web servers and web browsers, any of which may have changed the source. Here’s the program that generates the two possible programs, depending on what time it is:

use 5.010;
use utf8;
use charnames qw(:full);

my $var = time % 2 ? 
	"e\N{COMBINING ACUTE ACCENT}"
	:
	"\N{LATIN SMALL LETTER E WITH ACUTE}"; 


binmode STDOUT, ':encoding(UTF-8)';
print <<"PERL";
use utf8;
use 5.010;

my \$e\N{COMBINING ACUTE ACCENT} = 'abc';
my \$\N{LATIN SMALL LETTER E WITH ACUTE} = '123';

\$$var = 'XYZ';

say "One char = ", \$\N{LATIN SMALL LETTER E WITH ACUTE};
say "Two char = ", \$e\N{COMBINING ACUTE ACCENT};
PERL

The source is encoded as UTF-8, but it's unnormalized, meaning that the same glyph can show up as different sequences of characters. If someone uses the form that you didn't, they actually use a different variable. Cutting and pasting may not even be safe because that process might normalize it one way or the other. Your editor may normalize what you type (while leaving other parts of the file alone). The whole program needs to use the same normalization form.
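If you just want to know whether a file already has a problem, you can compare each line to its normalized form. This is only a sketch: unnormalized_lines is a name I made up, it assumes that you want NFC, and it assumes the file is UTF-8:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical helper: return the 1-based numbers of the lines in
# $text that would change if you normalized them to NFC.
sub unnormalized_lines {
    my ($text) = @_;
    my @lines = split /\n/, $text, -1;
    return grep { $lines[ $_ - 1 ] ne NFC( $lines[ $_ - 1 ] ) } 1 .. @lines;
}

# Check a UTF-8 file named on the command line, if there is one.
if (@ARGV) {
    my $file = shift @ARGV;
    open my $fh, '<:encoding(UTF-8)', $file
        or die "Could not open $file: $!";
    my $text = do { local $/; <$fh> };
    close $fh;

    my @bad = unnormalized_lines($text);
    print @bad
        ? "$file: not NFC on line(s) @bad\n"
        : "$file: already NFC\n";
}
```

A check like this only reports the lines that differ from NFC. It doesn't change anything, so it's safe to run on files that contain strings you deliberately keep in another form.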

The simplest thing is making your editor handle it for you automatically, but if you can't do that, you might have to do it manually.

To change the normalization of a file, you can use the programs that come with Unicode::Tussle:

$ nfc program.pl > program-nfc.pl
$ nfd program.pl > program-nfd.pl

You could also make some Perl one-liners (in bash, in this case):

alias nfc="perl5.14.1 -MUnicode::Normalize -CSD -ne 'print NFC(\$_)'"
alias nfd="perl5.14.1 -MUnicode::Normalize -CSD -ne 'print NFD(\$_)'"
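If you'd rather have a small script than a shell alias, something like this might do. Treat it as a sketch under a few assumptions: normalize_file is a name I made up, the input is UTF-8, and it writes to a new file instead of changing the original:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Hypothetical helper: copy $in to $out, normalizing every line to the
# requested form ('NFD', or NFC for anything else).
sub normalize_file {
    my ( $in, $out, $form ) = @_;
    my $normalize = $form eq 'NFD' ? \&NFD : \&NFC;

    open my $in_fh, '<:encoding(UTF-8)', $in
        or die "Could not read $in: $!";
    open my $out_fh, '>:encoding(UTF-8)', $out
        or die "Could not write $out: $!";

    while ( my $line = <$in_fh> ) {
        print {$out_fh} $normalize->($line);
    }

    close $out_fh or die "Could not close $out: $!";
    close $in_fh;
    return;
}

# Usage: perl normalize.pl program.pl program-nfc.pl NFC
normalize_file(@ARGV) if @ARGV == 3;
```

Writing to a separate file lets you diff the result against the original before you replace anything.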

Beware, though. If you have parts of your file that need to be a particular normalization form, normalizing the entire file might change that. If you expect a string to be in NFD, perhaps to test a Unicode feature, changing the normalization will cause problems:

my $nfd_test_string = 'résumé'; # should be NFD.

However, if it's actually important for you to have a string in a particular form, you should enforce that explicitly instead of relying on the way you (or someone else) dealt with the file. You can force the normalization with Unicode::Normalize's subroutines:

use Unicode::Normalize qw(NFD);
my $nfd_test_string = NFD( 'résumé' ); # should be NFD.
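To see why the explicit call is safer, here's a short sketch that starts from both spellings of the word (built with escapes, so they are exactly what the comments say) and shows that the forced forms come out the same either way:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed   = "r\x{E9}sum\x{E9}";      # 6 code points
my $decomposed = "re\x{301}sume\x{301}";  # 8 code points

# Whatever form the literal arrived in, the forced forms are stable.
for my $input ( $composed, $decomposed ) {
    printf "raw %d, NFC %d, NFD %d\n",
        length($input), length( NFC($input) ), length( NFD($input) );
}

# Prints:
# raw 6, NFC 6, NFD 8
# raw 8, NFC 6, NFD 8
```

Only the raw length depends on how the string was spelled in the source. That is the part you stop relying on when you force the form yourself.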

Ideally, you'd handle this as part of your build process from your distribution directory so you don't have to think about it, but it's actually not simple to do that. There are two modules involved: ExtUtils::Install and ExtUtils::Manifest. The first copies files into blib in preparation for testing and installation. The second copies files listed in MANIFEST to a distribution directory. You want to have the right version in both cases, but if you don't have normalized files to start with, you have some work to do. That's a much longer discussion that's beyond the scope of this Item, although I might cover it later.

Things to remember

  • Perl doesn't normalize variable names. It's a bug.
  • Normalize your Perl source one way or the other.
  • If you depend on a particular normalization in a string, force it explicitly.

One thought on “Normalize your Perl source”

  1. I wonder if using non-ASCII variable names wouldn’t be more costly, in terms of fewer people being able to understand what your variables mean, than the gain you might get from being clearer for the local developers?

    Paamaim nekudotaim anyone?
