Category Archives: Unicode

Know the difference between character strings and UTF-8 strings

Normally, you shouldn’t have to care about a string’s encoding. Indeed, the abstract string has no encoding. It exists as an idea without a representation and it’s not until you want to put it on disk, send it down a pipe, or otherwise force it to exist as electrical pulses, magnetic pole orientation, and so […]

Some special Unicode shell aliases to normalize strings

If you are playing with Unicode, you’re probably going to want to convert to the various normalization forms. There are some programs to do this in the Unicode::Tussle distribution, but you can also create some one-liners to do this as well (Item 120. Use Perl one-liners to create mini programs).

Fix Test::Builder’s Unicode issue

The perl interpreter is getting much better with its Unicode support, but that doesn’t mean everything just works because most of the code you probably are about is in modules, which might not have kept up. Some of this becomes apparent when you give another module some Unicode strings for it to output.

Be careful with Unicode character ranges

Unicode character ranges have the same gotchas as the ASCII character ranges, although they become more apparent and more important. You’re probably used to creating a range for all the letters, like the character classes [A-Z] or [a-z], the range ‘a’ .. ‘z’, or the range in a transliteration, and not having a problem. If […]

Treat Unicode strings as grapheme clusters

If you need to work with Unicode strings, you probably don’t want to use Perl’s built-in string manipulation functions. This might seem a strange thing to say about a lnaguage whose main feature is string processing, but it’s a consequence of Perl’s ease in string processing. Consider what a string is. Think of that for […]

The \R generic line ending

Perl v5.10 adds a regular expression shortcut \R that matches anything the Unicode specification thinks is a line ending. It looks similar to a character class shortcut but it’s not. It can match the sequence of carriage-return line-feed, but character classes don’t match sequence.

Know your sort orders

Once you leave the world of ASCII, things such as string comparisons and sorting get much tougher. In Effective Perl Programming, we devoted a short chapter to Unicode, but there’s a lot more that we could have covered. We mostly ignored the modern idea of locales and Unicode, but those have big effects on how […]

Know your character classes under different semantics

Now the Perl is Unicode aware (and, it’s been that way for a long time even if you haven’t been), you might have to be more careful in your regular expressions. Some of the character classes are much more inclusive than the ASCIIphile might imagine. In ASCIIland, character and byte semantics are the same thing. […]

Use the \N regex character class to get “not a newline”

Perl 5.12 introduced an experimental regex character class to stand in for every character except one, the newline. The \N character class is everything but the newline. In prior versions of Perl, this is the same thing as the . meta character. That is, it’s the same as long as someone doesn’t add the /s […]

Specify any character by its octal ordinal value.

Perl 5.14 gives you some new ways to represent characters so you can avoid some annoying and ambiguous interpolations. Not only that, the new syntax unifies the different ordinal representations so you can specify characters using the same syntax even if you want to use different bases. This feature was added in Perl 5.13.3, in […]