Use the > and < pack modifiers to specify the architecture

Byte-order modifiers are one of the Perl 5.10 features listed farther along in perl5100delta, after the really big features. To many of the pack formats, you can append a < or a > to specify that the format is little-endian or big-endian, respectively. This lets you handle endianness in the formats that don’t already have specific versions for each byte order, and it lets you apply endianness to groups.

Before you think about the < and > modifiers, consider the formats that already specify the endianness. The n and N formats specify an unsigned short or long in “network order”, which is big-endian. The v and V formats specify the same things, but in “VAX order”, which is little-endian.
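To see the two orders side by side, you can pack one number with each of these formats and dump the resulting bytes. This is a minimal sketch; the bytes_as_hex helper is a name made up for this illustration:

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper: pack a number with the given format and
# return the resulting bytes as a lowercase hex string.
sub bytes_as_hex {
	my( $format, $number ) = @_;
	return unpack( 'H*', pack( $format, $number ) );
	}

foreach my $format ( qw(n v N V) ) {
	# n and v are 16-bit formats, N and V are 32-bit formats
	my $number = $format =~ /[nv]/ ? 0xAABB : 0xAABBCCDD;
	printf "%s packs 0x%X as %s\n",
		$format, $number, bytes_as_hex( $format, $number );
	}
```

The n and N lines show the bytes in the order you wrote them (aabb and aabbccdd), while v and V show them reversed (bbaa and ddccbbaa). Because these four formats fix the byte order, the output is the same on any machine.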

Here’s a test program that takes some bytes, which you specify in a string using the hex representation of each character (just like pack would). Once you have the string, you unpack it with both N and V to find out which one matches your system. The L format always uses the local architecture:

use 5.010;

my $string = "\xAA\xBB\xCC\xDD";

foreach my $format ( qw(N V) ) {
	my $number = unpack $format, $string;
	say sprintf "%s is 0x%X", $format, $number;
	say "Your native format is $format" if $number == unpack 'L', $string;
	}

The output shows that the little-endian order switches the bytes around, and that this program ran on a little-endian machine (in this case, a MacBook Air, which uses Intel processors):

N is 0xAABBCCDD
V is 0xDDCCBBAA
Your native format is V

To choose between those formats, you need to know which order the data use, either by knowing the architecture that produced them or by getting the producer of the data to tell you the format. For instance, UTF-16 text files can start with a byte order mark, 0xFEFF, which is a short integer (two bytes). If you read that short in the same byte order that the writer used, you get 0xFEFF. If you read it in the opposite order, you get 0xFFFE because the bytes are switched around, as you saw before.
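The byte order mark check can be sketched with the n format, which reads a big-endian short no matter which machine you are on. The utf16_byte_order helper and the sample strings are made up for this illustration, and a real reader would also need to handle files that have no mark at all:

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper: guess the byte order of UTF-16 data from its
# first two bytes. Returns 'big-endian', 'little-endian', or undef
# when the data don't start with a byte order mark.
sub utf16_byte_order {
	my( $bytes ) = @_;
	return unless length( $bytes ) >= 2;

	# Always read the mark as a big-endian short
	my $mark = unpack( 'n', $bytes );

	return 'big-endian'    if $mark == 0xFEFF;
	return 'little-endian' if $mark == 0xFFFE;
	return;
	}

# The letter A in UTF-16, written in each byte order after the mark
say utf16_byte_order( "\xFE\xFF\x00\x41" ) // 'no byte order mark';
say utf16_byte_order( "\xFF\xFE\x41\x00" ) // 'no byte order mark';
```

The first line of output is big-endian and the second is little-endian. Reading the mark with a fixed-order format means the answer describes the data, not the machine that runs the program.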

The other pack formats use the native format, so you haven’t had a way to specify the order in which to interpret the bytes. These formats have always used the native architecture (meaning they get it wrong for data from the other architecture):

Format  Description
s, S    signed and unsigned shorts (two bytes)
i, I    signed and unsigned integers (at least four bytes)
l, L    signed and unsigned longs
q, Q    signed and unsigned quads (if you have a 64-bit perl)
j, J    signed and unsigned Perl internal integers
f       single-precision floating-point value
d       double-precision floating-point value
F       Perl internal floating-point value
D       long-double-precision floating-point value
p, P    pointers to a null-terminated string and a structure
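If you aren't sure which order these native formats use on your machine, one way to check is to compare the bytes from the native L format against the fixed-order N and V formats. This is only a sketch, and native_byte_order is a hypothetical name chosen for this example:

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper: report which fixed-order format produces the
# same bytes as the native L format on this machine.
sub native_byte_order {
	my $native = pack( 'L', 1 );

	return 'little-endian' if $native eq pack( 'V', 1 );
	return 'big-endian'    if $native eq pack( 'N', 1 );
	return 'unknown';  # mixed-endian machines are rare, but possible
	}

say 'Native formats on this machine are ', native_byte_order();
```

The comparison works because L, N, and V are all 32 bits wide, so only the byte order can differ.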

Perl 5.10 lets you specify the architecture these formats should use. You can use big-endian values even if you are using a little-endian machine. Suppose you have π encoded as a single-precision floating-point value in big-endian order even though you have a little-endian machine. The native format reads the bytes in the wrong order, so you try each modifier too:

use 5.010;

my $pi_string = "\x40\x49\x0F\xDA"; # 3.14159250259399 in big-endian

foreach my $format ( qw(f f< f>) ) {
	my $number = unpack $format, $pi_string;
	say sprintf "%s is %f", $format, $number;
	}

The f and f< formats give the non-π results. On this little-endian machine, f assumes the native little-endian format while f< makes it explicit. The f> specifies big-endian format despite the native architecture, and it gets the right value (with normal floating-point rounding error):

f is -10082865224089600.000000
f< is -10082865224089600.000000
f> is 3.141593
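The modifiers work for pack as well as unpack. Here is a small sketch that packs 1.0 in each order and dumps the bytes, assuming your perl uses IEEE 754 single-precision floats, as nearly every platform does. The float_bytes helper is a made-up name:

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper: pack a number as a single-precision float
# with the given format and return the bytes as a hex string.
sub float_bytes {
	my( $format, $number ) = @_;
	return unpack( 'H*', pack( $format, $number ) );
	}

# Assuming IEEE 754 floats, 1.0 is 0x3F800000
say 'f> packs 1.0 as ', float_bytes( 'f>', 1.0 );  # 3f800000
say 'f< packs 1.0 as ', float_bytes( 'f<', 1.0 );  # 0000803f

# A round trip through the same format gives the number back
say unpack( 'f>', pack( 'f>', 1.0 ) );             # 1
```

Since both formats name their byte order explicitly, the output doesn't depend on the machine that runs the program.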

You can also apply these modifiers to groups so that they affect all of the modifiable formats in that group. This example tries combinations of unsigned shorts in either format:

use 5.010;

my $string = "\xAA\xBB\xCC\xDD";

foreach my $format ( qw| SS S<S> S>S< (SS)> (SS)< | ) {
	my( $first, $second ) = unpack $format, $string;
	say sprintf "%5s is 0x%X 0x%X", $format, $first, $second;
	}

The output shows you how the S format changes based on which architecture you tell unpack to use:

   SS is 0xBBAA 0xDDCC
 S<S> is 0xBBAA 0xCCDD
 S>S< is 0xAABB 0xDDCC
(SS)> is 0xAABB 0xCCDD
(SS)< is 0xBBAA 0xDDCC

You still have to know which architecture your data are in, but at least you can tell Perl which format you want.
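A group modifier is handy when a whole record uses one byte order. The sketch below unpacks a made-up eight-byte header (two shorts and a long, all big-endian) with a single modifier on the group, then again with the older n and N formats to show that the two agree. The header layout is invented for this example and isn't from any real file format:

```perl
use 5.010;
use strict;
use warnings;

# A made-up record header: a 16-bit type, a 16-bit flags field,
# and a 32-bit length, all stored big-endian.
my $header = "\x00\x01\x80\x00\x00\x00\x01\x00";

# One modifier on the group covers both S formats and the L
my @from_group = unpack( '(SSL)>', $header );

# The same fields with the fixed-order formats
my @from_fixed = unpack( 'nnN', $header );

say "group: @from_group";  # group: 1 32768 256
say "fixed: @from_fixed";  # fixed: 1 32768 256
```

For unsigned shorts and longs the group buys you nothing over n and N, but the same trick works for formats that have no fixed-order version, such as the signed s and l or the floating-point f and d.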

Things to remember

  • Most pack formats rely on the native architecture
  • Perl 5.10 introduces the < and > modifiers so you can specify the architecture
  • The < specifies little-endian because the little side touches the specifier
  • The > specifies big-endian because the big side touches the specifier