Category Archives: Unicode

Know your character classes under different semantics

Now that Perl is Unicode aware (and it’s been that way for a long time, even if you haven’t been), you might have to be more careful in your regular expressions. Some of the character classes are much more inclusive than the ASCIIphile might imagine. In ASCIIland, character and byte semantics are the same thing. […]
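
As a minimal sketch of how inclusive a class can be, here \d is tried against a digit from outside ASCII. The variable names are only for illustration, and the /a flag assumes Perl 5.14 or later.

```perl
use v5.14;
use warnings;

# ARABIC-INDIC DIGIT THREE: a decimal digit, but not an ASCII one
my $arabic_digit = "\x{0663}";

# Under Unicode semantics, \d matches any decimal digit character
my $unicode_match = $arabic_digit =~ /\A\d\z/ ? 1 : 0;

# The /a flag (Perl 5.14+) restricts \d to the ASCII digits 0-9
my $ascii_only_match = $arabic_digit =~ /\A\d\z/a ? 1 : 0;

say "Unicode \\d: $unicode_match, ASCII-only \\d: $ascii_only_match";
```

If you only ever mean 0 through 9, spelling out [0-9] works on any version of Perl.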

Use the \N regex character class to get “not a newline”

Perl 5.12 introduced an experimental regex character class, \N, that matches every character except the newline. In prior versions of Perl, the . metacharacter does the same job. That is, it’s the same as long as someone doesn’t add the /s […]
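
Here is a small sketch of the difference, assuming Perl 5.12 or later. The sample text and variable names are made up for the demonstration.

```perl
use v5.12;
use warnings;

my $text = "first line\nsecond line";

# Without /s, the dot stops at the newline
my ($dot) = $text =~ /\A(.+)/;

# With /s, the dot matches the newline too and runs to the end
my ($dot_s) = $text =~ /\A(.+)/s;

# \N refuses to match the newline even when /s is in effect
my ($not_newline) = $text =~ /\A(\N+)/s;

say "dot:    [$dot]";
say "dot /s: [$dot_s]";
say "\\N /s:  [$not_newline]";
```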

Specify any character by its octal ordinal value

Perl 5.14 gives you some new ways to represent characters so you can avoid some annoying and ambiguous interpolations. Not only that, the new syntax unifies the different ordinal representations so you can specify characters using the same syntax even if you want to use different bases. This feature was added in Perl 5.13.3, in […]
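
A short sketch, assuming Perl 5.14 or later, shows the braced octal form next to its hexadecimal sibling. The variable names are only for illustration.

```perl
use v5.14;
use warnings;

# The same character, LATIN CAPITAL LETTER A, in two bases
my $octal = "\o{101}";   # octal 101 is decimal 65
my $hex   = "\x{41}";    # hexadecimal 41 is decimal 65

# The braces mark where the number ends, so a following digit is just a digit
my $with_digit = "\o{101}7";

# The braced form is not limited to three octal digits
my $wide = "\o{400}";    # octal 400 is decimal 256

say "octal: $octal, hex: $hex, with digit: $with_digit, wide ordinal: ", ord($wide);
```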

Slides for “Effective Perl: Unicode” at Frozen Perl 2010

At Frozen Perl I did a quick presentation about Unicode and Perl. I had to do some work on the slides before releasing them publicly, but here they are… Be sure to look at the author notes if you want more detailed information.

Watch out for disappearing strings when you decode

In the Effective Perl class I gave at Frozen Perl last week, I got a question I didn’t have the quick answer to. What happens to the strings when Encode’s decode function only partially decodes the string? The default behavior for decode always decodes the entire string, although it uses a substitution character (0xFFFD, which may […]
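
As a rough sketch of both behaviors, the code below decodes a byte string that contains one malformed UTF-8 byte. The sample bytes and variable names are invented for the demonstration, and the FB_QUIET part relies on Encode's documented handling of its CHECK argument.

```perl
use strict;
use warnings;
use Encode qw(decode);

# \xE9 is Latin-1 for e-acute, which is malformed as UTF-8 on its own
my $bytes = "caf\xE9 ok";

# Default behavior: the whole string is decoded, and the bad byte
# is replaced with the substitution character U+FFFD
my $string = decode('UTF-8', $bytes);

# With FB_QUIET, decode stops at the first error and returns what it
# has so far. It also overwrites its argument with the unprocessed
# remainder, so work on a copy if you still need the original bytes.
my $copy    = $bytes;
my $partial = decode('UTF-8', $copy, Encode::FB_QUIET);

printf "default: %d characters, partial: %d characters, left over: %d bytes\n",
    length $string, length $partial, length $copy;
```

Adding Encode::LEAVE_SRC to the CHECK value asks decode to leave the source alone.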