Know the difference between character strings and UTF-8 strings

Normally, you shouldn’t have to care about a string’s encoding. Indeed, the abstract string has no encoding. It exists as an idea without a representation and it’s not until you want to put it on disk, send it down a pipe, or otherwise force it to exist as electrical pulses, magnetic pole orientation, and so on that you need to think of it in concrete terms. All stored data, even ASCII, has an encoding. Until you force it to have a bit pattern to live in the tangible world, you shouldn’t have to worry about anything like an encoding. Continue reading “Know the difference between character strings and UTF-8 strings”
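To make the distinction concrete, here is a minimal sketch that uses the core Encode module. The sample word and the variable names are only illustrative; the point is that the same string reports a different length once it has been given a UTF-8 representation.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A character string: four abstract characters, written with an escape
# so the example does not depend on this file's own encoding.
my $characters = "caf\x{E9}";

# Encoding gives the string a concrete representation as UTF-8 octets.
my $octets = encode('UTF-8', $characters);

my $character_count = length $characters;   # counts characters
my $octet_count     = length $octets;       # counts octets

# Decoding turns the octets back into a character string.
my $decoded = decode('UTF-8', $octets);

print "characters: $character_count\n";
print "octets:     $octet_count\n";
print "round trip: ", ($decoded eq $characters ? "same" : "different"), "\n";
```

The e-acute occupies two octets in UTF-8, which is why the two lengths differ.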

Some special Unicode shell aliases to normalize strings

If you are playing with Unicode, you’re probably going to want to convert to the various normalization forms. There are some programs to do this in the Unicode::Tussle distribution, but you can also write one-liners for the job (Item 120. Use Perl one-liners to create mini programs). Continue reading “Some special Unicode shell aliases to normalize strings”
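As a rough illustration of what such a conversion does, the sketch below uses the core Unicode::Normalize module. It is not the set of aliases from the full article; the one-liner in the comment is just one plausible form of the idea.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# A comparable one-liner (an assumption, not the article's exact alias):
#   perl -CS -MUnicode::Normalize -pe '$_ = NFD($_)'

# LATIN SMALL LETTER E WITH ACUTE as a single, precomposed code point.
my $composed = "\x{E9}";

# NFD splits it into a base letter plus a combining mark.
my $decomposed = NFD($composed);

# NFC puts the pieces back together.
my $recomposed = NFC($decomposed);

printf "NFD: %s\n",
    join ' ', map { sprintf 'U+%04X', ord } split //, $decomposed;
printf "NFC: %s\n",
    join ' ', map { sprintf 'U+%04X', ord } split //, $recomposed;
```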

Fix Test::Builder’s Unicode issue

The perl interpreter is getting much better with its Unicode support, but that doesn’t mean everything just works, because most of the code you probably care about is in modules, which might not have kept up. Some of this becomes apparent when you give another module some Unicode strings for it to output. Continue reading “Fix Test::Builder’s Unicode issue”

Be careful with Unicode character ranges

Unicode character ranges have the same gotchas as the ASCII character ranges, although they become more apparent and more important. You’re probably used to creating a range for all the letters, like the character classes [A-Z] or [a-z], the range 'a' .. 'z', or the range in a transliteration, and not having a problem. If you look at the ASCII sequence, you see that there is an unbroken series of letters in those ranges. Continue reading “Be careful with Unicode character ranges”

Treat Unicode strings as grapheme clusters

If you need to work with Unicode strings, you probably don’t want to use Perl’s built-in string manipulation functions. This might seem a strange thing to say about a language whose main feature is string processing, but it’s a consequence of Perl’s ease in string processing.

Consider what a string is. Think of that for a moment. Write out your definition if you need to. Now, what is a string in Perl? Does it match your definition? Continue reading “Treat Unicode strings as grapheme clusters”
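One way to see the gap between the two definitions is to compare what length counts with what the \X regex construct matches. This is a small sketch with an illustrative sample string; it assumes nothing beyond core Perl.

```perl
use strict;
use warnings;

# "resume" with two accents, written as base letters plus combining marks.
my $string = "re\x{301}sume\x{301}";

# length counts code points, including the combining marks.
my $code_point_count = length $string;

# \X matches one extended grapheme cluster: a base plus its marks.
my @clusters      = $string =~ /(\X)/g;
my $cluster_count = @clusters;

# Reversing by code point detaches each accent from its letter;
# reversing the list of clusters keeps each pair together.
my $naive_reverse   = scalar reverse $string;
my $cluster_reverse = join '', reverse @clusters;

print "code points: $code_point_count\n";
print "clusters:    $cluster_count\n";
```

Here `$naive_reverse` starts with a stray combining mark, while `$cluster_reverse` starts with the accented letter as a unit.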

The \R generic line ending

Perl v5.10 adds a regular expression shortcut \R that matches anything the Unicode specification thinks is a line ending. It looks similar to a character class shortcut but it’s not. It can match the sequence of carriage-return line-feed, but character classes don’t match sequences. Continue reading “The \R generic line ending”
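A short sketch of that behavior, using a sample string that mixes several line endings. The test strings are illustrative only.

```perl
use v5.10;
use strict;
use warnings;

# Four different line endings: CRLF, LF, CR, and U+2028 LINE SEPARATOR.
my $text = "one\r\ntwo\nthree\rfour\x{2028}five";

# \R treats the CRLF pair as one line ending, so no empty field appears.
my @lines = split /\R/, $text;

# \R can consume the two-character sequence as a unit ...
my $r_matches_crlf = "\r\n" =~ /\A\R\z/ ? 1 : 0;

# ... while a character class matches exactly one character.
my $class_matches_crlf = "\r\n" =~ /\A[\r\n]\z/ ? 1 : 0;

say scalar(@lines), " lines";
say "\\R matches CRLF: $r_matches_crlf";
say "[\\r\\n] matches CRLF: $class_matches_crlf";
```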

Know your sort orders

Once you leave the world of ASCII, things such as string comparisons and sorting get much tougher. In Effective Perl Programming, we devoted a short chapter to Unicode, but there’s a lot more that we could have covered. We mostly ignored the modern idea of locales and Unicode, but those have big effects on how Perl compares characters, and thus, how it orders them with sort. Continue reading “Know your sort orders”
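To illustrate how much the comparison rules matter, the sketch below sorts the same list twice: once with Perl's default sort and once with the core Unicode::Collate module using its default settings. The word list is made up for the example, and a locale-specific collation could order things differently again.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Escapes keep the example independent of the source file's encoding.
my @words = ("zebra", "\x{E9}clair", "Apple", "eclair", "apple");

# The default sort compares ordinal values, so the accented word
# lands after everything in the ASCII range.
my @by_ordinal = sort @words;

# Unicode::Collate applies the Unicode Collation Algorithm, which
# considers the base letters before accents and case.
my $collator = Unicode::Collate->new;
my @collated = $collator->sort(@words);

# Show the non-ASCII character as an escape to avoid wide-character output.
for my $pair ([ordinal => \@by_ordinal], [collated => \@collated]) {
    my ($label, $list) = @$pair;
    my @shown = map { my $w = $_; $w =~ s/\x{E9}/\\x{E9}/g; $w } @$list;
    print "$label: @shown\n";
}
```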

Know your character classes under different semantics

Now that Perl is Unicode aware (and it’s been that way for a long time even if you haven’t been), you might have to be more careful in your regular expressions. Some of the character classes are much more inclusive than the ASCIIphile might imagine. In ASCIIland, character and byte semantics are the same thing. No matter which way you treat your strings you get the same answer. With Unicode, however, Perl might now treat certain sequences of bytes as one character. The character and byte semantics have diverged. If you let Perl treat your data as character data when it really isn’t, you can run into problems. If you aren’t already doing something special, you’re probably using character semantics. Continue reading “Know your character classes under different semantics”
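As one example of a class being more inclusive than expected, the sketch below checks a non-ASCII digit against \d. It assumes Perl 5.14 or later for the /a flag, and the chosen character is just one of many digits outside ASCII.

```perl
use v5.14;
use warnings;

# ARABIC-INDIC DIGIT THREE, a decimal digit outside the ASCII range.
my $digit = "\x{663}";

# Under Unicode semantics, \d matches any decimal digit character.
my $unicode_match = $digit =~ /\A\d\z/ ? 1 : 0;

# The /a flag (Perl 5.14+) restricts \d to the ASCII digits.
my $ascii_match = $digit =~ /\A\d\z/a ? 1 : 0;

# An explicit range is ASCII-only regardless of flags.
my $range_match = $digit =~ /\A[0-9]\z/ ? 1 : 0;

say "\\d:      $unicode_match";
say "\\d with /a: $ascii_match";
say "[0-9]:   $range_match";
```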

Use the \N regex character class to get “not a newline”

Perl 5.12 introduced an experimental regex character class to stand in for every character except one, the newline. The \N character class is everything but the newline.

In prior versions of Perl, this is the same thing as the . meta character. That is, it’s the same as long as someone doesn’t add the /s to the match or substitution operator or the regex quoting operator, or doesn’t insert the option inside the pattern: Continue reading “Use the \N regex character class to get “not a newline””
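A minimal sketch of the difference, assuming Perl 5.12 or later. Each test asks whether a lone newline matches the pattern.

```perl
use v5.12;
use warnings;

my $newline = "\n";

# Without /s, the dot refuses to match a newline.
my $dot_plain = $newline =~ /\A.\z/ ? 1 : 0;

# With /s, the dot matches a newline too.
my $dot_with_s = $newline =~ /\A.\z/s ? 1 : 0;

# The same option can be switched on inside the pattern.
my $dot_inline_s = $newline =~ /\A(?s:.)\z/ ? 1 : 0;

# \N ignores /s and never matches a newline.
my $not_newline_with_s = $newline =~ /\A\N\z/s ? 1 : 0;

say ".       : $dot_plain";
say ". with /s: $dot_with_s";
say "(?s:.)  : $dot_inline_s";
say "\\N with /s: $not_newline_with_s";
```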

Specify any character by its octal ordinal value.

Perl 5.14 gives you some new ways to represent characters so you can avoid some annoying and ambiguous interpolations. Not only that, the new syntax unifies the different ordinal representations so you can specify characters using the same syntax even if you want to use different bases. This feature was added in Perl 5.13.3, in the development branch leading to the next stable version. Continue reading “Specify any character by its octal ordinal value.”
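A brief sketch of the braced octal form next to its hexadecimal and code point counterparts, assuming Perl 5.14 or later. The characters chosen are arbitrary examples.

```perl
use v5.14;
use warnings;

# The same character, specified in three notations with similar syntax.
my $from_octal      = "\o{101}";     # octal 101 is decimal 65
my $from_hex        = "\x{41}";      # hexadecimal 41 is decimal 65
my $from_code_point = "\N{U+41}";    # Unicode code point U+0041

# The braces mark where the number ends, so a following digit is
# never absorbed into the escape.
my $followed_by_digit = "\o{101}2";

# Values above the ASCII range work the same way.
my $e_acute = "\o{351}";             # octal 351 is decimal 233

say "octal:      $from_octal";
say "hex:        $from_hex";
say "code point: $from_code_point";
say "with digit: $followed_by_digit";
say "ordinal of \\o{351}: ", ord $e_acute;
```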