Know the difference between character strings and UTF-8 strings

Normally, you shouldn’t have to care about a string’s encoding. Indeed, the abstract string has no encoding. It exists as an idea without a representation until you want to put it on disk, send it down a pipe, or otherwise force it to exist as electrical pulses, magnetic pole orientations, and so on. All stored data, even ASCII, has an encoding. Until you force a string to have a bit pattern so it can live in the tangible world, you shouldn’t have to worry about anything like an encoding.

An abstract character string is one where Perl can recognize each grapheme cluster as a unit, and there is no encoding involved at the user level. Perl has to store these, but you don’t (shouldn’t) play with the string at that level.

A UTF-8–encoded string is one where the octets in the string are the same as in the UTF-8 representation. Perl sees a string of octets and cannot recognize grapheme clusters.

Consider this example:

use v5.14;
use utf8;

# # # Abstract character string
my $char_string = 'Büster';

say "Length of char string is ", length $char_string; #6
say join " ", map { sprintf '%X', ord } split //, $char_string;

# # # UTF-8–encoded octet string
open my $fh, '>:utf8', \my $utf8_string;
print $fh $char_string;
close $fh;

say "Length of utf8 string is ", length $utf8_string; # 7
say join " ", map { sprintf '%X', ord } split //, $utf8_string;

The output shows that the two are different things because one is a string of characters and the other is a string of octets. In the character string, the ü shows up as the single character with code number 0xFC. In the UTF-8 version, the code number 0xFC is represented as the two octets 0xC3 0xBC. Since this is just a string of octets, Perl thinks that this version is one character longer:

Length of char string is 6
42 FC 73 74 65 72
Length of utf8 string is 7
42 C3 BC 73 74 65 72
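
You can go in both directions with the Encode module as well. The following is a minimal sketch, not part of the original example, that encodes the same string to UTF-8 octets and decodes it back again; the variable names are only illustrative:

```perl
use v5.14;
use utf8;
use Encode qw(encode decode);

my $char_string = 'Büster';

# Character string -> UTF-8 octets
my $octets = encode( 'UTF-8', $char_string );

# UTF-8 octets -> character string
my $round_trip = decode( 'UTF-8', $octets );

my $octet_length = length $octets;        # counts octets
my $char_length  = length $round_trip;    # counts characters

say "octets: $octet_length, characters: $char_length";
say "round trip matches" if $round_trip eq $char_string;
```

The decoded string compares equal to the original character string, while the encoded version is one octet longer.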

For most of your programming, you shouldn’t have to care about encoding. You want to have character data with no representation and operate on abstract characters. You don’t care at all about the encoding and how many bytes a character turns into. That’s merely a storage issue. Virtually no one can tell you, off the top of their heads, what the UTF-8 representation of a string is because no one thinks in UTF-8. No one wants to do that during string manipulation, either.

The problem is that some interfaces want the encoded data instead of the abstract character string. These interfaces usually expect that you’re giving them data directly from another source without first turning it into a Perl character string. If you need to review these concepts, check out the “Unicode” chapter in Effective Perl Programming.

Consider the JSON module, whose decode_json function expects a UTF-8–encoded string on the assumption that you’re going to take it directly from an HTTP response. This item is not about using this module correctly, but it’s a convenient example for the general idea.

This works just fine because the value in $json_data is a UTF-8–encoded string instead of an abstract character string:

use JSON;
use LWP::Simple qw(get);

my $json_data = get( 'http://www.example.com/data.json' );

my $perl_hash = decode_json( $json_data );

The decode_json function doesn’t expect you to do anything with the data that you get from the website before you hand it over. Its job is both to decode the UTF-8 and to convert the data from JSON to Perl, and it’s documented this way. Instead of making you decode the response yourself, it uses the data just as you would get it in the message body of the HTTP response.

If you are doing extra processing, however, you can get in trouble. For instance, the HTTP::Response object can decode the message body for you, turning UTF-8 data into an abstract character string. If you call decoded_content and pass the result to decode_json, it fails:

use Encode;
use JSON;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

my $response = $ua->get( 'http://www252.pair.com/~comdog/for/data.json' );

my $content = $response->content;
print "Length content is ", length $content, "\n";

my $decoded_content = $response->decoded_content;
print "Length decoded content is ", length $decoded_content, "\n";

# this is fine
my $perl_hash = decode_json( $content );

# this is not fine
my $decoded_hash = decode_json( $decoded_content );

If you have your input string as an abstract character string, decode_json might fail. If all of its characters are in the ASCII range, it doesn’t matter because the UTF-8 representation is the same as the ASCII representation:

use utf8;
use JSON;

my $json_data = q( { "cat" : "Buster" } );

my $perl_hash = decode_json( $json_data );
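
To see why the ASCII case is harmless, you can compare a string with its UTF-8 encoding. This is a small sketch using Encode’s encode_utf8; the variable names are made up for the illustration:

```perl
use v5.14;
use Encode qw(encode_utf8);

my $ascii_json = q( { "cat" : "Buster" } );

# Encoding an ASCII-only string to UTF-8 leaves every octet unchanged
my $octets = encode_utf8( $ascii_json );

my $same_content = $octets eq $ascii_json;
my $same_length  = length( $octets ) == length( $ascii_json );

say $same_content && $same_length ? "identical" : "different";
```

Because the two strings are identical, an interface that wants UTF-8 octets can’t tell that you gave it a character string.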

Give it something outside the ASCII range, and things go wrong:

use utf8;
use JSON;

my $json_data = qq( { "cat" : "Büster" } );

my $perl_hash = decode_json( $json_data );

The error says it has a malformed UTF-8 character. In an abstract character string, the ü is 0xFC, which isn’t a valid UTF-8 sequence:

malformed UTF-8 character in JSON string, at character offset 13 (before "\x{33d25ca2} } ") at string.pl line 6.
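
You can reproduce the underlying problem without JSON at all. The sketch below, which assumes only the core Encode module, asks for a strict UTF-8 decode of a lone 0xFC and then shows what the properly encoded character looks like:

```perl
use v5.14;
use Encode qw(encode decode);

# The single character ü (U+00FC) as it sits in a character string
my $char = "\x{FC}";

# Pretend that lone 0xFC is UTF-8 data. Work on a copy because decode
# with a CHECK argument may modify the string you pass to it.
my $bogus   = $char;
my $decoded = eval { decode( 'UTF-8', $bogus, Encode::FB_CROAK ) };
my $strict_decode_failed = ! defined $decoded;

# Properly encoded, the same character becomes two octets
my $hex = join ' ', map { sprintf '%X', ord } split //, encode( 'UTF-8', $char );

say "strict decode of a lone 0xFC failed" if $strict_decode_failed;
say "UTF-8 octets for U+00FC: $hex";
```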

In this case, you need to turn your abstract character string into a UTF-8–encoded string, just as it would look if you had stored it in a file. You can encode it (going from the abstract character string to the UTF-8 version) with the Encode module (Item 75. Convert octet strings to character strings.):

use utf8;
use Encode qw(encode_utf8);
use JSON;

my $json_data = qq( { "cat" : "Büster" } );
$json_data = encode_utf8( $json_data );

my $perl_hash = decode_json( $json_data );

You can also print to a scalar reference, using the encoding that you need (Item 54. Open filehandles to and from strings):

use utf8;
use JSON;

my $json_data = qq( { "cat" : "Büster" } );

# print through the :utf8 layer so the scalar ends up with octets
open my $fh, '>:utf8', \my $utf8_string;
print $fh $json_data;
close $fh;

my $perl_hash = decode_json( $utf8_string );

If you already have the text in a file and need it still encoded, you can read it with the :raw layer so that perl does not decode it for you (which might otherwise happen through default layers set far away in your program):

use JSON;

# $file holds the path to your UTF-8–encoded JSON file
open my $fh, '<:raw', $file or die "Could not open $file: $!";
my $json_data = do { local $/; <$fh> };
close $fh;

my $perl_hash = decode_json( $json_data );
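
Here is a self-contained sketch of the same idea that you can run without a data file on hand. It writes a temporary file first, then reads it back both ways. It uses the core modules File::Temp and JSON::PP as stand-ins (JSON::PP exports a decode_json with the same calling convention), and the file contents are made up for the illustration:

```perl
use v5.14;
use utf8;
use File::Temp qw(tempfile);
use JSON::PP qw(decode_json);

# Write a small UTF-8 encoded JSON file to work with (hypothetical data)
my ( $out, $file ) = tempfile( UNLINK => 1 );
binmode $out, ':encoding(UTF-8)';
print $out qq( { "cat" : "Büster" } );
close $out;

# Read it back without decoding, so you get octets
open my $raw_fh, '<:raw', $file or die "Could not open $file: $!";
my $octets = do { local $/; <$raw_fh> };
close $raw_fh;

# Read it again through a decoding layer, so you get characters
open my $char_fh, '<:encoding(UTF-8)', $file or die "Could not open $file: $!";
my $chars = do { local $/; <$char_fh> };
close $char_fh;

# The octet version is what decode_json wants
my $perl_hash = decode_json( $octets );

say "octets: ", length $octets, ", characters: ", length $chars;
```

The raw read is one unit longer than the decoded read because the ü occupies two octets in the file.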

Doing it differently in JSON

You don’t have to use JSON’s decode_json function. Using the object interface, you can tell the decoder what you’re giving it. If you want to give it a UTF-8–encoded string, you tell it to expect UTF-8:

use JSON;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

my $response = $ua->get( 'http://www.example.com/data.json' );

my $content = $response->content;

my $perl_hash = JSON->new->utf8->decode( $content );

If you want to give it character data, you don’t tell the object to expect UTF-8:

use JSON;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

my $response = $ua->get( 'http://www.example.com/data.json' );

my $decoded_content = $response->decoded_content;

my $perl_hash = JSON->new->decode( $decoded_content ); # no ->utf8
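
You can check that the two forms agree without making a network request. This sketch uses a literal string instead of an HTTP response and the core JSON::PP module as a stand-in for JSON, since it offers the same new, utf8, and decode methods:

```perl
use v5.14;
use utf8;
use Encode qw(encode_utf8);
use JSON::PP;

my $chars  = qq( { "cat" : "Büster" } );   # abstract character string
my $octets = encode_utf8( $chars );         # UTF-8 encoded octets

# Octets in: tell the decoder to expect UTF-8
my $from_octets = JSON::PP->new->utf8->decode( $octets );

# Characters in: leave ->utf8 off
my $from_chars = JSON::PP->new->decode( $chars );

my $same = $from_octets->{cat} eq $from_chars->{cat};
say $same ? "both decoders agree" : "decoders disagree";
```

Either way you end up with the same character string in the Perl data structure. The only difference is which kind of input you promised the decoder.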

Things to remember

  • Character strings have no encoding, and Perl can recognize their grapheme clusters
  • An encoded string is a series of octets in which Perl doesn’t recognize grapheme clusters
  • Check your interface to see which one you should use
