Match Unicode characters by property value

A Unicode character has properties; it knows things about itself. Perl v5.10 introduced a way to match characters by the properties that it supports, and in some cases you can match a particular property value. With v5.12, you can match any Unicode property by its value. The newly supported ones include Numeric_Value and Age, for example:

\p{Numeric_Value: 1}
\p{Nv=7}
\p{Age: 3.0}
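
Here's a small sketch of how those patterns might look in a program, assuming Perl v5.12 or later. The sample characters are my own choices for illustration: three sevens from different scripts, a plain letter, and a Cherokee letter that arrived in Unicode 3.0.

```perl
use v5.12;
use warnings;

binmode STDOUT, ':utf8';

# Illustrative samples only: name => character
my %samples = (
	'DIGIT SEVEN'              => "\x{0037}",
	'ARABIC-INDIC DIGIT SEVEN' => "\x{0667}",
	'ROMAN NUMERAL SEVEN'      => "\x{2166}",
	'LATIN CAPITAL LETTER A'   => "\x{0041}",
	'CHEROKEE LETTER A'        => "\x{13A0}",
	);

foreach my $name ( sort keys %samples )
	{
	my $char = $samples{$name};

	# Collect the property values that this character matches
	my @hits;
	push @hits, 'Nv=7'     if $char =~ /\p{Nv=7}/;
	push @hits, 'Age: 3.0' if $char =~ /\p{Age: 3.0}/;

	say "$name: ", @hits ? join( ', ', @hits ) : 'no match';
	}
```

The Age property matches only the characters introduced in exactly that version of Unicode, so the plain A doesn't match Age: 3.0 even though it was certainly around by then.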


Watch out for disappearing strings when you decode

In the Effective Perl class I gave at Frozen Perl last week, I got a question I didn’t have the quick answer to. What happens to the strings when Encode's decode function only partially decodes the string?

By default, decode always decodes the entire string, although it uses the substitution character (U+FFFD, which may look like ? on the screen) anywhere that it finds an error in the encoding.
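
As a minimal sketch of that default behavior, you can decode a short octet string with a stray "\xCC" in the middle and look at the code points that come back. The three-octet input and the variable names here are only for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode);

# "a", a lead octet with no continuation octet after it, then "a" again
my $octets  = "\x61\xCC\x61";

# No third argument, so decode uses its default handling of bad input
my $decoded = decode( 'utf8', $octets );

# Show each decoded character as a hex code point
my $hex = join ':', map { sprintf "%X", ord } split //, $decoded;
print "$hex\n";   # 61:FFFD:61
```

The bad octet comes back as U+FFFD, the characters around it decode normally, and $octets is untouched.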

You can change how decode handles problems by supplying a third argument to it, using one of the constants FB_DEFAULT, FB_CROAK, FB_WARN, or FB_QUIET. The FB_DEFAULT uses the substitution character and the FB_CROAK just dies. It’s the other two that are interesting. They stop decoding, either with a warning or without one. Try it yourself:

use 5.010;
use strict;
use warnings;
use Encode qw(decode :fallbacks);

binmode STDOUT, ":utf8";

foreach my $fallback ( qw( FB_DEFAULT FB_CROAK FB_WARN FB_QUIET ) )
	{
	my $fallback_value =  do { no strict 'refs'; &{"$fallback"} };
	
	my $octets  = do { use bytes; "\x41\x42\x43\x61\xCC\x61\x41\x42\x43" };
	my $decoded = eval { decode( 'utf8', $octets, $fallback_value ) };
	say "$fallback: ", show_chars( $decoded ), " [$octets]";
	}


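# Show each character of the argument as a hex code point, joined by colons.
# This needs character semantics so that U+FFFD shows up as one value.
sub show_chars {
	defined $_[0] ?
		join( ':', map { sprintf "%X", ord } split //, $_[0] )
			:
		'undefined';
	}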

The string you’re using is "\x41\x42\x43\x61\xCC\x61\x41\x42\x43". It’s “ABCa.aABC” where that "\xCC" in the middle is an error. It’s the first octet of a two-octet UTF-8 sequence (the ones that start with "\xCC" encode combining characters), but it doesn’t have a valid continuation octet following it. When you print it, it looks a bit odd (ABCaÃŒaABC) because Perl is treating it as bytes since you used use bytes; in the scope where you created it.

The output shows the fallback type, the characters (in hex separated by colons), and in the square brackets, the value of $octets after the operation:

FB_DEFAULT: 41:42:43:61:FFFD:61:41:42:43 [ABCaÃŒaABC]
FB_CROAK: undefined [ABCaÃŒaABC]
FB_WARN: 41:42:43:61 [ÃŒaABC]
FB_QUIET: 41:42:43:61 [ÃŒaABC]
utf8 "\xCC" does not map to Unicode at ...

In the FB_DEFAULT case, the \xCC turned into the substitution character, \x{FFFD}. Notice that the split // worked on characters, so the substitution character shows up as one four-digit hex value instead of as separate octets.

In the FB_CROAK case, the decode dies, the return value is undef, and $octets stays the same. decode doesn’t mess with the argument at all.

Both FB_WARN and FB_QUIET do the same thing, although FB_WARN whines about it. They each tell decode to handle as much of the string as it can. When it finds an error, it returns what it had so far (represented by 41:42:43:61, which is ABCa). However, it also removes that part from the input string, leaving only the part of the string from the error onward. This gives you a chance to examine the string where decode left off so you can decide what to do on your own. You might strip the offending octets and start the processing again.
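
One way to restart the processing might look like this sketch. The decode_skipping_errors name and its policy of dropping exactly one octet per error are my own assumptions for illustration, not something Encode provides. You might prefer to log the bad octets or insert your own replacement instead.

```perl
use strict;
use warnings;
use Encode qw(decode FB_QUIET);

# Hypothetical helper: decode what you can, drop one bad octet, try again.
# Returns the decoded string and a reference to the list of dropped octets.
sub decode_skipping_errors {
	my( $octets ) = @_;   # a copy, so the caller's string is safe
	my $decoded = '';
	my @skipped;

	while( length $octets )
		{
		# FB_QUIET returns the good part and removes it from $octets
		$decoded .= decode( 'utf8', $octets, FB_QUIET );
		last unless length $octets;

		# Whatever is left starts with the octet that decode choked on
		push @skipped, substr( $octets, 0, 1, '' );
		}

	return( $decoded, \@skipped );
	}

my( $string, $skipped ) =
	decode_skipping_errors( "\x41\x42\x43\x61\xCC\x61\x41\x42\x43" );

printf "decoded [%s] after skipping %d octet(s)\n",
	$string, scalar @$skipped;
```

Since every pass through the loop either finishes the string or removes at least one octet, the loop always ends.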

It’s documented that decode changes its input, but not right next to the main documentation for that function. You have to read the “Handling Malformed Data” section later in the Encode docs.

You might notice the problem if you try to decode a string literal:

use Encode qw(decode :fallbacks);
my $decoded = decode( 'utf8', "\x61\xCC\x61", FB_WARN );

You get the error about modifying a read-only value:

Modification of a read-only value attempted ...

If you don’t want decode to mess with your argument, you can use a bitmask to adjust the fallback value. decode looks for the LEAVE_SRC bit to be set (and it only matters for FB_WARN and FB_QUIET), so just OR it in:

use Encode qw(decode :fallbacks LEAVE_SRC);
my $decoded = decode( 'utf8', "\x61\xCC\x61", FB_WARN | LEAVE_SRC );

If you want to keep the original octet sequence, save a copy before you pass it to decode.
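
Keeping a copy also lets you work out where the problem was. This is only a sketch with variable names of my own choosing. It compares the length of the saved copy with what decode left behind to find the offset of the first bad octet:

```perl
use strict;
use warnings;
use Encode qw(decode FB_QUIET);

my $octets   = "\x61\xCC\x61";
my $original = $octets;    # save a copy before decode can change $octets

my $decoded  = decode( 'utf8', $octets, FB_QUIET );

# decode removed everything it handled, so the difference in lengths
# is the position in the original where it stopped
my $error_offset = length( $original ) - length( $octets );

printf "decode stopped at offset %d, at octet 0x%02X\n",
	$error_offset, ord substr( $original, $error_offset, 1 );
```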