Posted by brian d foy on September 11, 2011
Perl has had Unicode support since Perl 5.6, which means that most Perl tutorials have been bending the truth a bit when they tell you that a Perl identifier, the name that you give to variables, starts with [A-Za-z_] and continues with [0-9A-Za-z_]. With Unicode support, you have many more characters available to you, but [...]
Posted by brian d foy on August 28, 2011
Perl actually has two encodings that get the letters u, t, f, and 8. One will happily let you do bad things, and the other will let you do bad things but with a warning that you can make fatal. There’s an encoding layer with the name :utf8 and there’s the encoding name UTF-8 that [...]
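As a minimal sketch of the distinction, assuming the core Encode module, the two spellings can be looked up and compared. The sample bytes (an encoded surrogate, U+D800) and the variable names are illustrative choices, not taken from the post.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding FB_CROAK LEAVE_SRC);

# The two spellings resolve to different encoding objects.
my $lax_name    = find_encoding('utf8')->name;    # the lax encoding
my $strict_name = find_encoding('UTF-8')->name;   # the strict encoding

# Illustrative input: the bytes that would encode the surrogate U+D800,
# which strict UTF-8 does not allow.
my $bytes = "\xED\xA0\x80";

# With FB_CROAK, the strict encoding dies on this input instead of
# passing it through. LEAVE_SRC keeps $bytes unmodified.
my $strict_ok = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 };

print "utf8  resolves to $lax_name\n";
print "UTF-8 resolves to $strict_name\n";
print "strict decode ", ($strict_ok ? "succeeded" : "failed"), "\n";
```

The `FB_CROAK` check is one way to make the strict encoding's complaint fatal; other check values from Encode trade that for warnings or substitution characters.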
Posted by brian d foy on August 21, 2011
Normally, you shouldn’t have to care about a string’s encoding. Indeed, the abstract string has no encoding. It exists as an idea without a representation and it’s not until you want to put it on disk, send it down a pipe, or otherwise force it to exist as electrical pulses, magnetic pole orientation, and so [...]
Posted by brian d foy on July 20, 2011
If you are playing with Unicode, you’re probably going to want to convert to the various normalization forms. There are some programs to do this in the Unicode::Tussle distribution, but you can also create some one-liners to do it (Item 120. Use Perl one-liners to create mini programs). If you want to read [...]
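Here is a small sketch of converting between two of the normalization forms, assuming the core Unicode::Normalize module rather than the Unicode::Tussle programs. The sample character and variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Illustrative input: e with acute accent as a single code point.
my $composed   = "\x{E9}";
my $decomposed = NFD($composed);    # base letter plus combining accent
my $recomposed = NFC($decomposed);  # back to the single code point

# Show each string as a list of code points.
for my $pair ( [ NFD => $decomposed ], [ NFC => $recomposed ] ) {
    my ($form, $string) = @$pair;
    my @points = map { sprintf 'U+%04X', ord } split //, $string;
    print "$form: @points\n";
}
```

The same function works as a one-liner, for example `perl -CS -MUnicode::Normalize -ne 'print NFD($_)'`, where `-CS` is assumed to be the right I/O setting for UTF-8 input and output on your system.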
Posted by brian d foy on July 17, 2011
The perl interpreter is getting much better with its Unicode support, but that doesn’t mean everything just works, because most of the code you probably care about is in modules, which might not have kept up. Some of this becomes apparent when you give another module some Unicode strings for it to output. For instance, [...]
Posted by brian d foy on July 10, 2011
Unicode character ranges have the same gotchas as the ASCII character ranges, although they become more apparent and more important. You’re probably used to creating a range for all the letters, like the character classes [A-Z] or [a-z], the range ‘a’ .. ‘z’, or the range in a transliteration, and not having a problem. If [...]
Posted by brian d foy on June 12, 2011
If you need to work with Unicode strings, you probably don’t want to use Perl’s built-in string manipulation functions. This might seem a strange thing to say about a language whose main feature is string processing, but it’s a consequence of Perl’s ease in string processing. Consider what a string is. Think of that for [...]
Posted by brian d foy on January 31, 2011
Once you leave the world of ASCII, things such as string comparisons and sorting get much tougher. In Effective Perl Programming, we devoted a short chapter to Unicode, but there’s a lot more that we could have covered. We mostly ignored the modern idea of locales and Unicode, but those have big effects on how [...]
Posted by brian d foy on January 16, 2011
Now that Perl is Unicode aware (and it’s been that way for a long time even if you haven’t been), you might have to be more careful in your regular expressions. Some of the character classes are much more inclusive than the ASCIIphile might imagine. In ASCIIland, character and byte semantics are the same thing. [...]
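To see how inclusive a character class can be, here is a minimal sketch comparing `\d` with an explicit ASCII range. The sample character (ARABIC-INDIC DIGIT THREE) is an illustrative choice, not one from the post.

```perl
use strict;
use warnings;

# Illustrative input: a digit from outside ASCII.
my $arabic_three = "\x{0663}";    # ARABIC-INDIC DIGIT THREE

# \d matches any Unicode decimal digit; [0-9] matches only the ASCII ones.
my $matches_d     = $arabic_three =~ /\A\d\z/    ? 1 : 0;
my $matches_ascii = $arabic_three =~ /\A[0-9]\z/ ? 1 : 0;

print "\\d matched:    $matches_d\n";
print "[0-9] matched: $matches_ascii\n";
```

If you only want the ten ASCII digits, spelling out the range is one way to say so.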
Posted by brian d foy on January 9, 2011
Perl 5.12 introduced an experimental regex character class to stand in for every character except one, the newline. The \N character class is everything but the newline. In prior versions of Perl, this is the same thing as the . meta character. That is, it’s the same as long as someone doesn’t add the /s [...]
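A short sketch of that difference, assuming Perl 5.12 or later; the sample string and variable names are illustrative.

```perl
use strict;
use warnings;
use 5.012;    # \N as a character class needs Perl 5.12 or later

my $text = "first\nsecond";

# Without /s, both . and \N stop at the newline.
my ($dot)    = $text =~ /(.+)/;
my ($not_nl) = $text =~ /(\N+)/;

# With /s, . also matches the newline, but \N still does not.
my ($dot_s)    = $text =~ /(.+)/s;
my ($not_nl_s) = $text =~ /(\N+)/s;

say "dot matched the whole string under /s" if $dot_s eq $text;
say "\\N stopped at the newline under /s"    if $not_nl_s eq 'first';
```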