Know the difference between utf8 and UTF-8

Perl has two encodings whose names use the letters u, t, f, and 8. One happily lets you do bad things; the other also lets you do bad things, but with a warning that you can make fatal.

There’s an encoding layer with the name :utf8 and there’s the encoding name UTF-8 that you use with :encoding:

binmode $fh, ':utf8';
binmode $fh, ':encoding(UTF-8)';

You can even use the non-hyphen version with :encoding, although Encode treats that name as the loose utf8 encoding rather than the strict one:

binmode $fh, ':encoding(UTF8)';

These aren’t the same thing. The :utf8 layer comes from Perl 5.6, the first version of Perl that had even rudimentary Unicode support. It encodes any characters in the range from 0 to 0xFFFF_FFFF. That is, it allows for a 32-bit encoding space. You have no problem with this code:

use 5.014;
use strict;
use warnings;

my $string = "invalid -> \x{110000}";
my $output;

{
	open my $string_fh, '>:utf8', \$output;
	print $string_fh $string;
}

{
	open my $string_fh, '<:raw', \$output;
	my @values = map { sprintf '%X', ord } split //, readline( $string_fh );
	say join ' ', @values;
}

This code writes to a string filehandle using the loose :utf8 layer, then opens a read filehandle on the same string with the :raw layer so you can see the bytes without any processing. In the output, the F4 90 80 80 represents the invalid character:

69 6E 76 61 6C 69 64 20 2D 3E 20 F4 90 80 80
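If you want to see where F4 90 80 80 comes from, here is a minimal sketch of the four-byte UTF-8 bit layout. The four_byte_form subroutine is a hypothetical helper written for this illustration, not something Perl provides, and it only makes sense for code points from 0x10000 through 0x1FFFFF:

```perl
use strict;
use warnings;

# Hypothetical helper for illustration: lay out a code point in the
# four-byte pattern 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
# Only meaningful for code points 0x10000 through 0x1FFFFF.
sub four_byte_form {
	my( $code_point ) = @_;
	return map { sprintf '%02X', $_ } (
		0xF0 | (   $code_point >> 18 ),          # top 3 bits
		0x80 | ( ( $code_point >> 12 ) & 0x3F ), # next 6 bits
		0x80 | ( ( $code_point >>  6 ) & 0x3F ), # next 6 bits
		0x80 | (   $code_point         & 0x3F ), # last 6 bits
	);
}

print join( ' ', four_byte_form( 0x10FFFF ) ), "\n"; # F4 8F BF BF
print join( ' ', four_byte_form( 0x110000 ) ), "\n"; # F4 90 80 80
```

The bit pattern has room for values past 0x10FFFF, which is why the loose layer has no trouble producing a byte sequence for 0x110000.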

Going the other way, reading in the file with the same encoding, doesn't cause any problems either:

use 5.014;
use strict;
use warnings;

my $string = "invalid -> \x{110000}";
my $output;

{
	open my $string_fh, '>:utf8', \$output;
	print $string_fh $string;
}

{
	open my $string_fh, '<:utf8', \$output;
	my @values = map { sprintf '%X', ord } split //, readline( $string_fh );
	say join ' ', @values;
}

When you use the same layer to read the data, you get the same characters you started with. Instead of F4 90 80 80 you get 110000:

69 6E 76 61 6C 69 64 20 2D 3E 20 110000

However, the highest valid code point in the Universal Character Set is 0x10FFFF, and even some of the code points inside that range aren't valid in UTF-8, such as the surrogates in the range 0xD800 to 0xDFFF, which UTF-16 uses to encode characters in the supplementary planes. If none of that makes sense, just remember that UTF-16 comes from the time when we thought the UCS would be a 16-bit encoding space and that two bytes would be enough for everyone (and how often has that not been true in history?). The "characters" in the surrogate range aren't characters. They are an ugly hack to let an ancient 16-bit system deal with a 21-bit system. You shouldn't be able to successfully read those characters.
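To make the surrogate hack concrete, here is a small sketch of the UTF-16 arithmetic. The surrogate_pair subroutine is a hypothetical helper for this illustration only, and U+1F600 is just a sample supplementary-plane code point:

```perl
use strict;
use warnings;

# Hypothetical helper for illustration: split a supplementary-plane
# code point (0x10000 through 0x10FFFF) into its UTF-16 surrogate pair.
sub surrogate_pair {
	my( $code_point ) = @_;
	my $offset = $code_point - 0x10000;        # 20 bits remain
	my $high   = 0xD800 + ( $offset >> 10 );   # top 10 bits
	my $low    = 0xDC00 + ( $offset & 0x3FF ); # bottom 10 bits
	return( $high, $low );
}

printf "%X %X\n", surrogate_pair( 0x1F600 ); # D83D DE00
```

Neither half means anything on its own, which is why a lone D800 has no business showing up in UTF-8 data.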

use 5.014;
use strict;
use warnings;

my $string = "invalid -> \x{D800}";
my $output;

{
	open my $string_fh, '>:utf8', \$output;
	print $string_fh $string;
}

{
	open my $string_fh, '<:utf8', \$output;
	my @values = map { sprintf '%X', ord } split //, readline( $string_fh );
	say join ' ', @values;
}

This code at least emits a warning:

Unicode surrogate U+D800 is illegal in UTF-8 at invalide.pl line 9.
69 6E 76 61 6C 69 64 20 2D 3E 20 D800

In Perl 5.10 and 5.12, you get this warning only if you turn on warnings; in Perl 5.14, you get it even without them. Either way, the code still works.

Try any of this with the actual UTF-8 encoding though, and odd things ensue:

use 5.010;
use strict;
use warnings;

my $string = "invalid -> \x{D800}";
my $output;

{
	open my $string_fh, '>:utf8', \$output;
	print $string_fh $string;
}

{
	open my $string_fh, '<:encoding(UTF-8)', \$output;
	my @values = map { sprintf '%X', ord } split //, readline( $string_fh );
	say join ' ', @values;
}

The output gives two different warnings, and some odd output:

Unicode surrogate U+D800 is illegal in UTF-8 at invalide.pl line 10.
utf8 "\xD800" does not map to Unicode at invalide.pl line 15.
69 6E 76 61 6C 69 64 20 2D 3E 20 5C 78 7B 44 38 30 30 7D

That output is much longer than the previous output. Now you get 5C 78 7B 44 38 30 30 7D. If you know your code points, you'll recognize that as the literal characters \x{D800}.
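If you don't know your code points, you can have Perl translate them. This is only a quick sketch that turns each hex value from that output back into its character:

```perl
use strict;
use warnings;

# The extra code points from the output, as hex strings
my @hex = qw(5C 78 7B 44 38 30 30 7D);

# hex() converts each string to a number, chr() to a character
my $text = join '', map { chr hex } @hex;

print "$text\n"; # \x{D800}
```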

You can convince yourself that this happens by creating the encoded string directly:

use 5.010;
use strict;
use warnings;

my $string = pack 'C*', map { hex } split /\s/, 
	'69 6E 76 61 6C 69 64 20 2D 3E 20 ED A0 80';
say $string;

open my $string_fh, '<:encoding(UTF-8)', \$string;
my $read = readline( $string_fh );
say $read;
my @values = map { sprintf '%X', ord } split //, $read;
say join ' ', @values;

You get the same output, still with a warning:

invalid -> ̆Ä
utf8 "\xD800" does not map to Unicode at invalide.pl line 10.
invalid -> \x{D800}
69 6E 76 61 6C 69 64 20 2D 3E 20 5C 78 7B 44 38 30 30 7D

This is a problem. The data you get aren't the data that are in the file. Writing the data with UTF-8 changes them too, although you at least get a warning:

use 5.010;
use strict;
use warnings;

my $string = "invalid -> \x{D800}";
my $output;

{
	open my $string_fh, '>:encoding(UTF-8)', \$output;
	print $string_fh $string;
}

{
	open my $string_fh, '<:raw', \$output;
	my @values = map { sprintf '%X', ord } split //, readline( $string_fh );
	say join ' ', @values;
}

The output is:

"\x{d800}" does not map to utf8 at invalide.pl line 9.
69 6E 76 61 6C 69 64 20 2D 3E 20 5C 78 7B 44 38 30 30 7D

Huh? Perl will happily write the data, changing it on the way out. That's no good. Why is this happening?

There are several ways that Perl can deal with bad data as it encodes. That's not to say that any of them is how Perl should deal with those data, but that's not the point. In this case, the Encode module is using its perlqq mode: when it finds an invalid character, it turns it into its code number and puts \x{} around it. If you use the Encode module directly, you control how it handles those invalid characters:

use 5.010;
use strict;
use warnings;

use Encode qw(encode :fallbacks);

my $string = "invalid -> \x{D800}";

$string = encode( 'UTF-8', $string, FB_PERLQQ ); # what you already have

say 'The string is now[ ', $string, ']';

The output is what you got before (but without a warning because you asked for that handling explicitly):

The string is now[ invalid -> \x{D800}]

The other constants give different results:

Constant     Effect                                    String
FB_PERLQQ    Replace with Perl entity                  Convert to \x{HHHH} (hexadecimal)
FB_XMLCREF   Replace with XML entity                   Convert to &#xHHHH; (hexadecimal)
FB_HTMLCREF  Replace with HTML entity                  Convert to &#NNNN; (decimal)
FB_DEFAULT   Replace with the substitution character   Convert to �
FB_CROAK     Die
FB_QUIET     Stop encoding, with no warning
FB_WARN      Stop encoding, with a warning
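As a sketch of the strictest choice in that table, this passes FB_CROAK and catches the exception with eval. The copy is there because encode may modify its source string when you give it a check value. The exact wording of the error message can differ between Encode versions, so the example prints whatever it gets:

```perl
use strict;
use warnings;

use Encode qw(encode :fallbacks);

my $string = "invalid -> \x{D800}";

# encode() may modify its source when you pass a check value,
# so work on a copy and keep the original intact
my $copy  = $string;
my $bytes = eval { encode( 'UTF-8', $copy, FB_CROAK ) };

if( defined $bytes ) { print "Encoded without complaint\n" }
else                 { print "Encoding failed: $@" }
```

With FB_CROAK, nothing invalid gets out the door, and you decide what to do about the failure.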

In most cases, you probably don't want to handle everything at that level. If you have invalid data, you need to fix them before they get out to the world. You do have the warning, though, and that means you can make the operation fatal without going through Encode:

use warnings qw(FATAL utf8);
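Here is a minimal sketch of what that pragma should do to the earlier read example, assuming the warning from the :encoding(UTF-8) layer belongs to the utf8 category, as the warning text earlier suggests. The eval is only there so the example can report the failure instead of stopping:

```perl
use strict;
use warnings;
use warnings qw(FATAL utf8);

# The bytes ED A0 80 are what the loose layer wrote for U+D800
my $bytes = "invalid -> \xED\xA0\x80";

open my $string_fh, '<:encoding(UTF-8)', \$bytes
	or die "Could not open string filehandle: $!";

# With the warning promoted to an exception, readline should
# die instead of handing back substituted data
my $line = eval { readline( $string_fh ) };

if( defined $line ) { print "Read the line without complaint\n" }
else                { print "Reading failed: $@" }
```

In a real program you would likely skip the eval and let the program die so the bad data never travel any further.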

Things to remember

  • The :utf8 layer, and the utf8 encoding name in any variation without a hyphen, is Perl's looser encoding.
  • The UTF-8 encoding name, in any case and with either a hyphen or an underscore, is the strict, valid encoding, and it gives a warning for invalid sequences.
  • Use only :encoding(UTF-8), and make its warnings fatal.
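If you can't remember which spelling is which, you can ask Encode. This short sketch uses find_encoding to show the canonical name behind each variation; the list of spellings is just a sample:

```perl
use strict;
use warnings;

use Encode qw(find_encoding);

# Spellings with a hyphen or underscore resolve to the strict
# encoding; spellings without one resolve to the loose encoding
foreach my $name ( qw(UTF-8 utf-8 utf_8 UTF8 utf8) ) {
	printf "%-6s => %s\n", $name, find_encoding( $name )->name;
}
```

The strict encoding reports itself as utf-8-strict, and the loose one as utf8.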