Experimental features now warn (reaching back to v5.10)

Perl 5.18 provides a new way to introduce experimental features in a program, augmenting the feature pragma that v5.10 added. This change marks certain broken v5.10 features as experimental with an eye toward possible removal from the language.

Smart matching in v5.10 led to several broken and conflated features. The given used a lexical version of $_, which broke many other common uses of that variable inside the given. I explain that in Use for() instead of given(), and you can see it in given/when and lexical $_ ….

Under v5.18, when you use given, when, or ~~, you get a warning, even if there is no smart match involved:

# given_warning.pl
use v5.10; # earliest occurrence of feature
for( 'Buster' ) {
	when( 1 == 1 ) { say "Hello" }
	}

These warnings might cause test suites to fail when people try to install modules on the new perl, as they do for Unicode::Tussle.

% perl5.10.1 given_warning.pl
Hello
% perl5.18.0 given_warning.pl
when is experimental at given_warning.pl line 4.
Hello

Using the diagnostics shows the sort of warning it is:

% perl5.18.0 -Mdiagnostics given_warning.pl
when is experimental at -e line 1 (#1)
    (S experimental::smartmatch) when depends on smartmatch, which is
    experimental.  Additionally, it has several special cases that may
    not be immediately obvious, and their behavior may change or
    even be removed in any future release of perl.
    See the explanation under "Experimental Details on given and when"
    in perlsyn.

Hello

To get rid of this warning, you do the same thing you do with other warnings. Take the category of the warning and turn it off with no (Item 100: Use lexical warnings to selectively turn on or off complaints):

# given_warning.pl
use v5.10; # earliest occurrence of feature
no warnings 'experimental::smartmatch';
for( 'Buster' ) {
	when( 1 == 1 ) { say "Hello" }
	}

The lexical $_ is another broken feature that’s now marked as experimental.

# lexical_.pl
use v5.10;

sub cat { my $_ }

Any use in v5.18 gives a warning:

% perl5.18.0 lexical_.pl
Use of my $_ is experimental at lexical_.pl line 4.

The category is different:

% perl5.18.0 -Mdiagnostics lexical_.pl
Use of my $_ is experimental at lexical_.pl line 4 (#1)
    (S experimental::lexical_topic) Lexical $_ is an experimental
    feature and its behavior may change or even be removed in any
    future release of perl. See the explanation under "$_" in perlvar.

That takes care of the two retro features. Perl v5.18 introduces two new experimental features, set logic in character classes (for complete Unicode Level 1 regular expression compliance), and lexical subroutines, which I’ll cover in other items.

# regex.pl
use v5.18;

print "Match" if 'foo' =~ /(?[ \p{Thai} & \p{Digit} ])/;

Without turning off the warning, perl knows about the feature and points it out:

% perl5.18.0 regex.pl
The regex_sets feature is experimental in regex; marked by <-- HERE in m/(?[ <-- HERE  \p{Thai} & \p{Digit} ])/ at regex.pl line 4.

In this case, diagnostics is not any help:

% perl5.18.0 -Mdiagnostics regex.pl
The regex_sets feature is experimental in regex; marked by <-- HERE in m/(?[
        <-- HERE  \p{Thai} & \p{Digit} ])/ at regex.pl line 3 (#1)
The regex_sets feature is experimental in regex; marked by <-- HERE in m/(?[ <-- HERE  \p{Thai} & \p{Digit} ])/ at regex.pl line 3.

For lexical named subroutines, you have to explicitly enable the feature, and then you still have to explicitly turn off its warnings.

# lexical_sub.pl
use v5.18;
no warnings 'experimental::lexical_subs';
use feature "lexical_subs";

my sub foo { say "Hello" }

Handling older perls

In v5.18, that's all fine and good, but older versions don't understand those warnings categories and will stop your program.

% perl5.10.1 -e 'no warnings qw(smartmatch)'
Unknown warnings category 'smartmatch' at -e line 1
BEGIN failed--compilation aborted at -e line 1.

Instead of using warnings, you can use the non-core experimental module that handles that for you:

use experimental qw(smartmatch);

For versions without that warning category, nothing happens. For versions that have it, the module turns off the warning.
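
If you can't install the experimental module, one rough alternative is to check whether the running perl knows a warnings category before you turn it off. This is only a sketch: it assumes that warnings.pm keeps its category names in the %warnings::Offsets hash, which is an internal detail rather than a documented interface, and has_warning_category is a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: true if this perl knows the warnings category.
# Assumption: warnings.pm keeps its category names in %warnings::Offsets,
# which is an internal detail and not a documented interface.
sub has_warning_category {
	my( $category ) = @_;
	return exists $warnings::Offsets{$category} ? 1 : 0;
	}

BEGIN {
	# Intended to act like "no warnings ...", but skipped on perls
	# that do not know the category.
	warnings->unimport( 'experimental::smartmatch' )
		if has_warning_category( 'experimental::smartmatch' );
	}

foreach my $category ( qw(void experimental::smartmatch experimental::regex_sets) ) {
	printf "%-28s %s\n", $category,
		has_warning_category( $category ) ? 'known' : 'unknown';
	}
```

The answers for the experimental categories depend on the perl that runs the program, which is the point: the same file loads on old and new versions.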

Summary

This table summarizes the new experimental warnings categories and the features they affect.

Category                      Features
experimental::smartmatch      given, when, ~~
experimental::lexical_topic   my $_
experimental::regex_sets      (?[ ])
experimental::lexical_subs    my sub NAME {}, our sub NAME {}

Things to remember

  • Some v5.10 features now warn under v5.18
  • Some new experimental features must be explicitly enabled
  • Even explicitly enabled features still warn
  • The experimental module is version safe


Perl 5.18 new features

Perl 5.18 is out and there are some major changes that you should know about before you upgrade. Most notably, some features from v5.10 are now marked experimental. If you use those features, you get warnings.



The vertical tab is part of \s in Perl 5.18

Up to v5.18, the vertical tab wasn’t part of the \s character class shortcut for ASCII whitespace. No one really knows why. It was curious trivia that I pointed out in Know your character classes under different semantics. Whitespace in ASCII, POSIX, and Unicode represented different sets. Perl whitespace was different from POSIX whitespace by only the exclusion of the vertical tab. Now that little oversight is fixed.

I had this program to mark which sets matched which characters. I required v5.10 because that’s the first appearance of the \h and \v shortcuts for horizontal and vertical whitespace.

use 5.010;

use charnames qw(:full);

print <<"LEGEND";
s   matches \\s, matches Perl whitespace
h   matches \\h, horizontal whitespace
v   matches \\v, vertical whitespace
p   matches [[:space:]], POSIX whitespace
all characters match Unicode whitespace, \\p{Space}

LEGEND

printf qq(%s %s %s %s  %-7s --> %s\n),
	qw( s h v p  Ordinal  Name );
print '-' x 50, "\n";

foreach my $ord ( 0 .. 0x10ffff ) {
	next unless chr($ord) =~ /\p{Space}/;
	my( $s, $h, $v, $posix ) =
		map { chr($ord) =~ m/$_/ ? 'x' : ' ' }
			( qr/\s/, qr/\h/, qr/\v/, qr/[[:space:]]/ );
	printf qq(%s %s %s %s  0x%04X  --> %s\n),
		$s, $h, $v, $posix,
		$ord, charnames::viacode($ord);
	}

Under v5.10, the top of the output showed that \s did not include the vertical tab, which the UCS names LINE TABULATION.

$ perl5.10.1 spaces
s   matches \s, matches Perl whitespace
h   matches \h, horizontal whitespace
v   matches \v, vertical whitespace
p   matches [[:space:]], POSIX whitespace
all characters match Unicode whitespace, \p{Space}

s h v p  Ordinal --> Name
--------------------------------------------------
x x   x  0x0009  --> CHARACTER TABULATION
x   x x  0x000A  --> LINE FEED
    x x  0x000B  --> LINE TABULATION
x   x x  0x000C  --> FORM FEED
x   x x  0x000D  --> CARRIAGE RETURN
x x   x  0x0020  --> SPACE

Run under v5.18, the output changes slightly to have another x in the third row (line 12).

$ perl5.18.0 spaces
s   matches \s, matches Perl whitespace
h   matches \h, horizontal whitespace
v   matches \v, vertical whitespace
p   matches [[:space:]], POSIX whitespace
all characters match Unicode whitespace, \p{Space}

s h v p  Ordinal --> Name
--------------------------------------------------
x x   x  0x0009  --> CHARACTER TABULATION
x   x x  0x000A  --> LINE FEED
x   x x  0x000B  --> LINE TABULATION
x   x x  0x000C  --> FORM FEED
x   x x  0x000D  --> CARRIAGE RETURN
x x   x  0x0020  --> SPACE

I don’t foresee this breaking anything since the vertical tab seems to be a rare character, although in ETL I liked using it as a separator since I figured no one else would be using it.
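
If you want to know how your own perl treats the vertical tab without running the whole table, a smaller check is enough. This is a minimal sketch; vertical_tab_matches is a hypothetical helper name, and the answer for \s depends on the perl version that runs it:

```perl
use v5.10;
use strict;
use warnings;

# Report which whitespace shortcuts match the vertical tab (U+000B).
sub vertical_tab_matches {
	my $vt = "\x0B";
	return (
		'\s'          => ( $vt =~ /\s/          ? 1 : 0 ),
		'\v'          => ( $vt =~ /\v/          ? 1 : 0 ),
		'[[:space:]]' => ( $vt =~ /[[:space:]]/ ? 1 : 0 ),
		);
	}

my %matches = vertical_tab_matches();
printf "perl %vd\n", $^V;
foreach my $shortcut ( sort keys %matches ) {
	say "$shortcut ", $matches{$shortcut} ? 'matches' : 'does not match';
	}
```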


Effective Perler discounts during OSCON

I’ll be at OSCON on Tuesday, July 17, but you don’t have to find me to get up to 37% off Effective Perl Programming. That’s a slightly lower price than Amazon. To get that discount, you have to buy the book at Pearson’s booth in the exhibition hall. You’ll need to track me down on Tuesday afternoon or evening if you want me to sign your book.

If you can’t make it to OSCON, you can still get 35% off the cover price by ordering directly from the InformIT discount link or using the OSCON2012 discount code when you check out. Instead of navigating their site, you can go directly to our book.

If you’re not sure you want the book, you can look at a free sample chapter, which is also 35% off during OSCON.


Declare your Pod encoding

Pod::Simple 3.21 changed its behavior when it encountered a non-ASCII character in Pod without an encoding. Instead of handling it quietly, it now gives a warning. That’s not so bad, but Test::Pod uses Pod::Simple, and whenever it sees a warning, pod_ok fails, as it did in my Mac::Errors module:


#   Failed test 'POD test for blib/lib/Mac/Errors.pm'
#   at .../Test/Pod.pm line 182.
# blib/lib/Mac/Errors.pm (2776): Non-ASCII character seen before =encoding in 'donÍt'. Assuming ISO8859-1
# Looks like you failed 1 test of 2.
t/pod.t ...........
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/2 subtests

Unfortunately, the Pod tests are the sort that shouldn’t stop an installation, which is why many developers have a separate area for author tests (which I’ll cover in an upcoming Item). Outside of that, you have to fix the Pod.

There are two things here. First, I have a genuine error. The module is auto generated from other source files and the “donÍt” is a mistake; it should be “don’t” (with a smart quote) or, even better, "don't" (with a plain ASCII apostrophe). Test::Pod didn’t catch this before. So, that’s not bad.

More importantly, telling Perl that the source code is UTF-8 isn’t enough. When you use the utf8 pragma, the perl interpreter reads the source as UTF-8:

use utf8;

However, a Pod parser ignores all the code. It looks for Pod sections and never sees that pragma, nor does it care. You have to tell the pod which encoding you have if you want to use something outside of ASCII:

=encoding utf8

I hadn’t used that in Mac::Errors, or any of my other modules, although in some of them I had used genuine UTF-8 sequences. Now any person using Test::Pod with the latest Pod::Simple won’t be able to install those modules normally. That is, until I fix them.

I could use other encodings, such as ISO-8859-1, as long as I declare the right thing and save the file correctly.
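
Put together, a module that uses non-ASCII characters in both its code and its documentation needs both declarations. Here's a minimal sketch of such a file, assuming it is saved as UTF-8; Local::Cafe is a made-up module name:

```perl
package Local::Cafe;   # hypothetical module name
use strict;
use warnings;
use utf8;   # tells perl that the code is UTF-8

sub greeting { return 'Bienvenue au café' }

1;

=encoding utf8

=head1 NAME

Local::Cafe - a tiny example with non-ASCII text in the Pod

=head1 DESCRIPTION

The =encoding line comes before the first non-ASCII character, so a
Pod parser knows how to read “café”.

=cut
```

The pragma and the directive are independent. Leaving out either one affects only the part of the file that it describes.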

Things to remember

  • The utf8 pragma doesn’t affect the Pod
  • Pod::Simple assumes ASCII unless you tell it otherwise
  • Declare your Pod encoding with the =encoding directive


Hide namespaces from PAUSE

The Perl Authors Upload Server (PAUSE) is responsible for analyzing distributions on their way to CPAN. PAUSE indexes the distributions to discover the package names that it contains so it can add them to the data files that many of the CPAN clients use to figure out what to download to install the module that you request. It also compares the package names that it finds to a list of permissions it maintains.

The mldistwatch program is responsible for this bit. It tries two things to find the packages in the distribution. If it can, it reads the META.yml (or META.json) to get the data. Otherwise, it examines the files directly to look for package declarations. Sometimes these come up with the wrong answers.

First, there’s an easy fix to package statements in code. After ignoring text in Pod or after __END__ or __DATA__, PAUSE looks for package statements appearing on a single line:

# from PAUSE::pmfile::packages_per_pmfile()
           $pline =~ m{
                      (.*)
                      \bpackage\s+
                      ([\w\:\']+)
                      \s*
                      (?: $ | [\}\;] | ($version::STRICT) )
                    }x

If your complete package statement isn’t on a single line, then that won’t match it. Since Perl has insignificant whitespace, including vertical whitespace, you could do this:

package
    hide::this::package;

You might even leave yourself (or other developers) a note about the importance of that whitespace now:

package # hide from pause
    hide::this::package;

Indeed, if you grep CPAN, you’ll find hide from pause in many distributions.
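
You can try the single-line rule yourself. The sketch below uses a simplified stand-in for the PAUSE pattern, not the real thing: it leaves out the $version::STRICT alternative and the leading capture, and package_on_line is a hypothetical helper name:

```perl
use v5.10;
use strict;
use warnings;

# A simplified stand-in for the PAUSE pattern: it leaves out the
# $version::STRICT alternative and the leading capture.
my $package_re = qr/ \b package \s+ ( [\w:']+ ) \s* (?: \z | [};] ) /x;

# Hypothetical helper: the package name found on one line, or undef.
sub package_on_line {
	my( $line ) = @_;
	return $line =~ $package_re ? $1 : undef;
	}

my @lines = (
	'package Cats::Buster;',
	'package # hide from pause',
	'    hide::this::package;',
	);

foreach my $line ( @lines ) {
	my $found = package_on_line( $line );
	say "[$line] ", defined $found ? "declares $found" : 'declares nothing';
	}
```

Only the first line declares a package as far as this pattern is concerned. Neither half of the split statement matches by itself.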

That’s the easy way, although it’s kludgey and relies on a special case in the PAUSE code. Other indexers might not honor it. There’s a better way for you to explicitly tell an indexer what namespaces you want to advertise. You add them to the provides section of META.yml:

provides:
  Cats::Buster:
    file: lib/Cats/Buster.pm
    version: 0.01

The format of this data comes from the META-spec. Module::Build will automatically create these entries in META.yml for you. The indexer can use these to know what’s in the distribution without directly examining module files.

If there are multiple package declarations, all of them show up in META.yml:

provides:
  Cats::Buster:
    file: lib/Test/Provides.pm
    version: 0.01
  Cats::Mimi:
    file: lib/Test/Provides.pm
    version: 0.02
  version:
    file: lib/Test/Provides.pm
    version: 0.01

Notice that version shows up in that list. You may have included it in your module to extend or override parts of that core module, but you don't want people who try to install the real version to get your module instead. You might only declare that package in your module as a temporary workaround and not intend it to be a permanent part of the work. They probably wouldn't be able to do that anyway, since PAUSE would recognize that you included a package for which you do not have permissions and would not index it. A site such as CPAN Search might mark your otherwise good distribution as "UNAUTHORIZED". Module::Build doesn't know to exclude version, at least not by default.

To hide that package from indexers, you can specify it in no_index. In Build.PL, you can use meta_add to specify those parts of the META-spec not already supported by other arguments to new:

use Module::Build;

my $builder = Module::Build->new(
	...,
	meta_add => {
		no_index => {
			package   => [ qw( version Local ) ],
			directory => [ qw( t/inc inc ) ],
			file      => [ qw( t/lib/test.pm ) ],
			namespace => [ qw( Local ) ],
			},
		},
);

The directory and file keys tell the indexer to ignore those parts of the distribution. The package tells the indexer to ignore exactly those packages. The curious one is namespace, which tells the indexer to ignore namespaces under that namespace.

Likewise, you can do the same in Makefile.PL with a recent enough version:

use ExtUtils::MakeMaker 6.48;

WriteMakefile(
	...,
	META_ADD => {
		no_index => {
			package   => [ qw( version Local ) ],
			directory => [ qw( t/inc inc ) ],
			file      => [ qw( t/lib/test.pm ) ],
			namespace => [ qw( Local ) ],
			},
		},
	);

If the META file has no provides section, the indexer examines the module files to find package statements, but it does that without running the code.

But, what if provides and no_index have conflicting instructions? The META-spec doesn't give any guidance for indexers in those cases. PAUSE filters on no_index last. This means that PAUSE and other indexers might leave out files you specify in provides but then exclude in no_index.

Things to remember

  • Spread the package statement over two or more lines to hide it from PAUSE.
  • Use provides to advertise the namespaces a distribution comprises.
  • Use no_index to limit what an indexer sees or reports.


Don’t use auto-dereferencing with each or keys

Perl 5.14 added an auto-dereferencing feature to the hash and array operators, and I wrote about those in Use array references with the array operators. I’ve never particularly liked that feature, but I don’t have to like everything. Additionally, Perl 5.12 expanded the job of keys and values to also work on arrays.

chromatic has explicated a problem with each, which is both an array and hash operator. He details it in Inadvertent Inconsistencies: each in Perl 5.12 and Inadvertent Inconsistencies: each versus Autoderef. In short, if you use it with a reference, Perl doesn’t know until it actually executes the each if it’s going to use its array or hash behavior (and in some cases, blow up with either). However, as the programmer, I probably know which behavior I want:

while( my( $index, $value ) = each $ref ) { my $elem = $other_array->[$index]; } # I want array behavior
while( my( $key, $value ) = each $ref ) { ... } # I want hash behavior

The problem isn’t when it blows up, which is easy to catch (it blows up). If you get the wrong sort of reference, you’ll get nonsensical indices or keys. If you have an array reference, you’ll get numbers with the first return value. If you have a hash reference, you’ll get strings. If you get strings but treat them as array indices, you’ll likely always get array index 0, unless the key is a number. You might even get an odd index. If the key is 123Buster, you’ll get array index 123 due to Perl’s numification. Going the other way, using an array reference when you expected a hash, you’ll have to find keys that are whole numbers.
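
The numification part is easy to see on its own, without any auto-dereferencing. This is a small sketch with made-up data; element_for_key is a hypothetical helper that uses a hash-style key as an array index, the way the mistaken code would:

```perl
use v5.10;
use strict;
use warnings;

my @cats = qw( Buster Mimi Roscoe Ellie );

# Hypothetical helper: use a hash-style key as an array index, the way
# the mistaken code would. Perl numifies the key first.
sub element_for_key {
	my( $array, $key ) = @_;
	no warnings 'numeric';   # silence "isn't numeric" for this demo
	return $array->[$key];
	}

foreach my $key ( qw( name 2 3Mimi ) ) {
	say "$key => ", element_for_key( \@cats, $key );
	}
```

The key name turns into index 0, and 3Mimi turns into index 3. With warnings on you would at least get a complaint about the non-numeric strings, but the program keeps going with the wrong element.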

Effective programs reduce ambiguity in their code, but this new feature increases it. It’s easy to fix; you dereference them yourself. If you have the wrong reference type, you’ll find out right away:

while( my( $index, $value ) = each @$ref ) { my $elem = $other_array->[$index]; } # I want array behavior
while( my( $key, $value ) = each %$ref ) { ... } # I want hash behavior
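
Here's a short sketch of that failure mode, wrapped in an eval so you can see the error instead of dying. The try_array_each name is hypothetical:

```perl
use v5.12;
use strict;
use warnings;

# Hypothetical helper: walk a reference as an array and report what happened.
sub try_array_each {
	my( $ref ) = @_;
	my $count = 0;
	my $ok = eval {
		while( my( $index, $value ) = each @$ref ) { $count++ }
		1;
		};
	return $ok ? "walked $count elements" : "died: $@";
	}

say   try_array_each( [ qw( a b c ) ] );
print try_array_each( { Buster => 'cat' } );
```

The array reference walks normally. The hash reference stops at the explicit dereference with a "Not an ARRAY reference" error, before the loop body ever sees a bogus index.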

If you really wanted to keep the auto-dereferencing feature, you could check the reference type before you use it, but what’s the point of saving a character with the auto-dereferencing if you have to wrap the whole thing in a guard condition?

if( ref $ref eq ref [] ) {
    while( my( $index, $value ) = each @$ref ) { ... }
    }

Now keys has the same problem. You can use that either with an array or a hash, but at some point you’re probably going to have to know what sort of reference you have so you can use the key to dereference it. At that point, you effectively declare what sort of reference it should have been. If you have the wrong sort of reference, your script dies:

my $ref = [ ... ];
foreach my $key ( keys $ref ) {
    my $elem = $ref->{$key}; # Big error!
    }

This problem is the unintended consequence of letting the other array and hash operators take a scalar variable as an argument and letting the parser automatically add the bits to dereference. David Golden wanted more magic syntax and the patch wasn’t so tough. To get the nicer syntax in some cases you end up dealing with more special cases. I noted this at the time David proposed it, but his enthusiasm for the interesting parts of the problem steamrolled over the bad parts.


Look up Unicode properties with an inversion map

Perl comes with extracts of the Unicode character data, but it hasn’t been easy to look up all of the information Perl knows about a character. Perl v5.15.7 adds a way to create an inversion map based on the property that you want to access.

The Unicode::UCD module gives you access to some of the information about a character:

use Unicode::UCD 'charinfo';
use charnames qw(:full);
use Data::Dumper;

my $charinfo   = charinfo(
	ord( "\N{SMILING CAT FACE WITH OPEN MOUTH}" )
	);
print Dumper( $charinfo );

The output has many of the properties, but not all of them:

$VAR1 = {
		  'digit' => '',
		  'bidi' => 'ON',
		  'category' => 'So',
		  'code' => '1F63A',
		  'script' => 'Common',
		  'combining' => 0,
		  'upper' => '',
		  'name' => 'SMILING CAT FACE WITH OPEN MOUTH',
		  'unicode10' => '',
		  'decomposition' => '',
		  'comment' => '',
		  'mirrored' => 'N',
		  'lower' => '',
		  'numeric' => '',
		  'decimal' => '',
		  'title' => '',
		  'block' => 'Emoticons'
		};

This doesn’t include the Age of the character, that is, when the character was added to Unicode. This might seem like a silly thing to know, but it came in handy typesetting Programming Perl. We had problems with some characters but we couldn’t see a pattern until we looked at the age of all the problem characters. Any character added after Unicode 4.0 didn’t typeset correctly. It took some annoying work to get the age by scanning through each age until that property matched:

#!/Users/brian/bin/perls/perl5.15.7

use v5.10;
use utf8;

use List::Util qw(first);

my @chars =  ( 'a', '→', '⣽', "\N{SMILING CAT FACE WITH OPEN MOUTH}" );

my @ages = qw( 1.1 2.1 2.0 3.0 3.1 3.2 4.0 4.1 5.0 5.1 5.2 6.0 );

foreach my $char ( @chars ) {
	my $age = first { $char =~ /\p{Age=$_}/ } @ages;
	say "$char Age: $age";
	}

It works, but it’s an unsatisfying kludge:

a Age: 1.1
→ Age: 1.1
⣽ Age: 3.0
😺 Age: 6.0

Now, Unicode::UCD has a prop_invmap to create an index based on a property you choose and a _search_invlist to return the offset in the map:

#!/Users/brian/bin/perls/perl5.15.7

use 5.15.7;
use utf8;

use charnames qw(:full);
use List::Util qw(first);
use Unicode::UCD;

my @chars =  ( 'a', '→', '⣽', "\N{SMILING CAT FACE WITH OPEN MOUTH}" );

my @ages = qw( 1.1 2.1 2.0 3.0 3.1 3.2 4.0 4.1 5.0 5.1 5.2 6.0 );

foreach my $char ( @chars ) {
	my $age = age_of_char( $char );
	say "$char Age: $age";
	}

sub age_of_char {
	my( $char ) = @_;
	# create the inverted list, once
	# can only initialize as scalar
	state $inv = _make_age_inverted_list();

	my $i = Unicode::UCD::_search_invlist($inv->[0], ord $char);
	return $inv->[1][$i];
	}

# create the inverted list, once
sub _make_age_inverted_list {
	state( $list, $map, $format, $default, $init );
	unless( $init++ ) {
		($list, $map, $format, $default) = Unicode::UCD::prop_invmap("Age");
		$format eq "s" || die "wrong format $format";
		}
	return [ $list, $map ];
	}

That looks like a lot of work, but most of it happens once to set up the inversion map.
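
If the two parallel arrays seem mysterious, a toy version might help. The sketch below doesn't use Unicode::UCD at all: the data are made up and cover only ASCII, and range_offset is a hypothetical stand-in for the search step. It assumes only what the program above already relies on, which is one array of range starts and one array of values at the same offsets:

```perl
use v5.10;
use strict;
use warnings;

# Hypothetical toy data, not real Unicode data. Each entry in @starts
# begins a range that runs up to the next entry; @labels holds the value
# for the range at the same offset.
my @starts = ( 0x00,    0x41,    0x5B,    0x61,    0x7B    );
my @labels = ( 'other', 'upper', 'other', 'lower', 'other' );

# Binary search for the offset of the last range start <= $code.
sub range_offset {
	my( $starts, $code ) = @_;
	my( $low, $high ) = ( 0, $#$starts );
	while( $low < $high ) {
		my $middle = int( ( $low + $high + 1 ) / 2 );
		if( $starts->[$middle] <= $code ) { $low  = $middle     }
		else                              { $high = $middle - 1 }
		}
	return $low;
	}

foreach my $char ( qw( P p 5 ) ) {
	my $offset = range_offset( \@starts, ord $char );
	say "$char is $labels[$offset]";
	}
```

Five entries describe all 128 ASCII code points, and the lookup is a binary search, which is why this layout suits properties that cover long runs of characters.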


Fold cases properly

You might think that you know how to compare strings regardless of case, and you’re probably wrong. After you read this Item, you’ll be able to do it correctly and without doing any more work than you were doing before. Perl handles all the details for you.

If you grew up in the ASCII world, case insensitivity is a difference of literally one bit, so changing case is setting or unsetting a bit in the octet that represents that character.

If you’ve read the Perl FAQ, you may have seen this quip:

“Perl” is the name of the language. Only the “P” is capitalized. The name of the interpreter (the program which runs the Perl script) is “perl” with a lowercase “p”.

When Larry Wall was asked what the difference between “Perl” and “perl” is, he said “One bit”. It’s literally a difference of flipping one bit in the ASCII representation. That’s as complicated as ASCII case folding gets.

The capital letter P has the ordinal value 0b1010000. The small letter p, which shows up later in the ASCII sequence, has the ordinal value 0b1110000. This makes it extremely easy to write routines to change between upper and lower cases:

use v5.10;

say "  U L";
say "-----";

foreach my $char ( qw(p P a b c A B C) ) {
	my $lower = chr( ord($char) | 0b0100000 );
	my $upper = chr( ord($char) & 0b1011111 );

	say "$char $upper $lower";
	}

The output shows what you’d expect for the upper and lower cases:

  U L
-----
p P p
P P p
a A a
b B b
c C c
A A a
B B b
C C c

Since bit flipping is easy to do, it’s very easy for even primitive computers to quickly change case (assuming that you’re not so primitive as to not have two cases). But, this only works if you restrict the output to the ASCII letters. If you want to handle non-letters, you have to do a bit more work to ensure that you don’t shift them into other characters:

use v5.10;

say "  U L";
say "-----";

foreach my $char ( qw(p P a b c A B C # !) ) {
	my $upper = uppercase( $char );
	my $lower = lowercase( $char );

	say "$char $upper $lower";
	}

sub lowercase {
	my $char = shift;
	my $ord  = ord( $char );

	# only flip the case bit for A through Z
	return $char unless $ord >= 0x41 and $ord <= 0x5A;
	return chr( $ord ^ 0b100000 );
	}

sub uppercase {
	my $char = shift;
	my $ord  = ord( $char );

	# only flip the case bit for a through z
	return $char unless $ord >= 0x61 and $ord <= 0x7A;
	return chr( $ord ^ 0b100000 );
	}

Now the non-letters stay the same character:

  U L
-----
p P p
P P p
a A a
b B b
c C c
A A a
B B b
C C c
# # #
! ! !

This almost works for Latin-* encodings too. When you move out of the ASCII sequence into Unicode, you don't have this luxury, and it's not merely a representational issue.

If you were infected with ASCII early, you've grown up thinking that you can go back and forth between upper and lower cases and always get the same result. Outside of ASCII, that's not necessarily true. Consider the word "Reichwaldstraße", a common street name in Germany. The "straße" has the special character ß (U+00DF ʟᴀᴛɪɴ ꜱᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ꜱʜᴀʀᴘ ꜱ), which is a ligature of a long s, the fancy ſ (U+017F ʟᴀᴛɪɴ ꜱᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ʟᴏɴɢ ꜱ) that you may have seen in historical documents, and the familiar short s. Put them together, ſs, and move them close enough and you can see how you would end up with ß once you connect the hanging portion of the long s with the top of the short s. The UCS has an uppercase version (U+1E9E ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ꜱʜᴀʀᴘ ꜱ), although no one uses it aside from saying that no one uses it. U+1E9E lowercases to U+00DF, but U+00DF has no single character uppercase version; it's the two characters SS. The lowercase of SS, however, is ss:

use utf8;

my $string = "Reichwaldstraße";

my $upper = uc( $string );
my $lower = lc( $upper  );

print <<"HERE";
Started with: $string
Upper:        $upper
Lower:        $lower
HERE

The output shows that you don't get back to the original:

Started with: Reichwaldstraße
Upper:        REICHWALDSTRASSE
Lower:        reichwaldstrasse

There's another s that causes problems: the Greek sigma, which comes in two lowercase forms. One appears in the middle of words and the other appears at the end, as in όσος, where σ and ς represent the same thing, just in different forms mandated by their position:

use utf8;

my $char = "όσος";

my $upper = uc( $char );
my $lower = lc( $upper );

print <<"HERE";
Started with: $char
Upper:        $upper
Lower:        $lower
HERE

Again, the lowercase version at the end is different than what you started with:

Started with: όσος
Upper:        ΌΣΟΣ
Lower:        όσοσ

This means that you can't merely use lc to normalize text for case insensitive comparison. These won't compare correctly:

lc( "Reichwaldstraße" ) eq lc( "REICHWALDSTRASSE" );  # Nope!
lc( 'όσος' ) eq lc( 'ΌΣΟΣ' );                         # Nope!

You might object that these are different strings and that they shouldn't be the same, but where did these strings start? Perhaps that REICHWALDSTRASSE was not originally all uppercase, but changed by some stupid filters between you and the original information (and with a name like mine, I know about stupid casing filters). That's part of the ASCII infection.

So, lc is the wrong way. Sadly, we do this incorrectly in Learning Perl, when we show this sort subroutine:

sub case_insensitive { "\L$a" cmp "\L$b" }

The Unicode specification solves this with its case folding rules. In short, it folds characters with different case forms into a common form. There's not a rule for this; they do it by exhaustion, specifying the common form for each fold. The common form is defined in the Unicode Character Database, which the Perl developers have digested into the files you find in the unicore/ directory in your Perl library. Here are a few lines from unicore/CaseFolding.txt:

0050; C; 0070; # LATIN CAPITAL LETTER P
0051; C; 0071; # LATIN CAPITAL LETTER Q
0052; C; 0072; # LATIN CAPITAL LETTER R
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
03A3; C; 03C3; # GREEK CAPITAL LETTER SIGMA
03C2; C; 03C3; # GREEK SMALL LETTER FINAL SIGMA
FB00; F; 0066 0066; # LATIN SMALL LIGATURE FF
FB01; F; 0066 0069; # LATIN SMALL LIGATURE FI
FB02; F; 0066 006C; # LATIN SMALL LIGATURE FL
FB03; F; 0066 0066 0069; # LATIN SMALL LIGATURE FFI
FB04; F; 0066 0066 006C; # LATIN SMALL LIGATURE FFL

The first column is the code number of the original character, the second is the type of folding (explained in the data file and coming up later), and the third column holds the code numbers that form the common, folded ("equivalent") version. Essentially, it's a big hash. Notice that some of the folded versions are multiple characters. You're not going to get that with bit fiddling.

Case folding takes each character in the first column and turns it into the characters in the third column, then takes the result and does it again until there are no more folds possible. Characters that don't have an entry in this file fold into themselves. You case fold to compare strings, not to normalize strings for storage or other uses. Case folding makes case insensitive comparisons very fast, but it also loses information that you can't recover. You can read the exact rules in Section 5.18, "Case Mappings", of the Unicode Standard.

To see how that works, try that with Reichwaldstraße and όσος. Most characters simply fold to their lowercase forms, but ß and ς use the special mappings from unicore/CaseFolding.txt, so both spellings of each word end up with the same folded form:

  • Reichwaldstraße → reichwaldstrasse
  • REICHWALDSTRASSE → reichwaldstrasse
  • όσος → όσοσ
  • ΌΣΟΣ → όσοσ
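
You can mimic the "big hash" idea by hand to see why those folds agree. This is a toy sketch, not how Perl implements folding: the table has only the three entries this example needs, it falls back to lc for everything else, and toy_fold is a hypothetical name. The strings use escapes so that the file itself stays ASCII:

```perl
use v5.10;
use strict;
use warnings;

# A toy folding table with only the entries this example needs.
my %fold = (
	"\x{DF}"  => "ss",        # LATIN SMALL LETTER SHARP S
	"\x{3A3}" => "\x{3C3}",   # GREEK CAPITAL LETTER SIGMA
	"\x{3C2}" => "\x{3C3}",   # GREEK SMALL LETTER FINAL SIGMA
	);

# Hypothetical helper: look up each character in the table and fall
# back to lc for everything else.
sub toy_fold {
	my( $string ) = @_;
	return join '', map { $fold{$_} // lc( $_ ) } split //, $string;
	}

my @pairs = (
	# Reichwaldstrasse with a sharp s, and its uppercase form
	[ "Reichwaldstra\x{DF}e",         "REICHWALDSTRASSE"             ],
	# the Greek word with a final sigma, and its uppercase form
	[ "\x{3CC}\x{3C3}\x{3BF}\x{3C2}", "\x{38C}\x{3A3}\x{39F}\x{3A3}" ],
	);

foreach my $pair ( @pairs ) {
	my( $left, $right ) = @$pair;
	my $verdict = toy_fold( $left ) eq toy_fold( $right )
		? 'same fold' : 'different fold';
	say $verdict;
	}
```

Both pairs come out as the same fold. A table this small is only good for illustration; the real data file has far more entries than you'd want to maintain yourself.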

To implement these operations, Perl v5.16 adds the fc built-in function. Instead of lc, use that:

use v5.16;  # enables the fc feature
fc( "Reichwaldstraße" ) eq fc( "REICHWALDSTRASSE" );  # Yep!
fc( 'όσος' ) eq fc( 'ΌΣΟΣ' );                         # Yep!

If you don't have v5.16, you can use the fc from the Unicode::CaseFold module on CPAN.

If you wanted to do this inside a double-quoted string, you can use the \F case shift operator (but be aware of the things we noted in Understand the order of operations in double quoted contexts). Our Learning Perl example could change to:

sub case_insensitive { "\F$a" cmp "\F$b" }
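
For a longer list, you might not want to fold the same strings over and over inside the comparison. Here's one possible sketch that folds each string once and breaks ties with a plain comparison. The sort_folded name is hypothetical, and it assumes v5.16 or later for fc:

```perl
use v5.16;   # enables fc
use warnings;

# Hypothetical helper: sort by folded form, computing each fold once,
# and break ties with a plain string comparison.
sub sort_folded {
	my %folded = map { ( $_, fc( $_ ) ) } @_;
	return sort { $folded{$a} cmp $folded{$b} or $a cmp $b } @_;
	}

my @sorted = sort_folded( qw( banana Apple cherry apple ) );
say "@sorted";   # Apple apple banana cherry
```

The tie breaker keeps the order stable when two strings fold to the same thing. Without it, Apple and apple could come out in either order.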

More complicated folds

Looking back at the extract of unicore/CaseFolding.txt, you might remember that I skipped over the second column, the mapping status. Those letters stand for different folding rules:

  • C: common case folding
  • F: full case folding (strings may grow in length)
  • S: simple case folding (map to single characters)
  • T: special case for uppercase I and dotted uppercase I

The "T" status stands in for folds that the general rules can't handle, mostly some characters from Turkish and similar languages.

So far, Perl's fc only handles the "F" status for full case folding. It doesn't handle the special folding you'll find in unicore/SpecialCasing.txt that has the oddball situations, such as multiple source characters folding onto other multiple characters. If you want to handle those, you're on your own, although the Unicode::Casing module on CPAN might help.

Many of the folding rules depend on the source language, so you'll probably want to pay special attention if you are using that language or completely ignore them if you are not.

Besides that, the Universal Character Set gives people much more of a chance to mess up. Suppose that you want to write "β-carotene", that thing you get from carrots. That first character is β (U+03B2 ɢʀᴇᴇᴋ ꜱᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ʙᴇᴛᴀ). Some people might think it looks like ß (U+00DF ʟᴀᴛɪɴ ꜱᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ꜱʜᴀʀᴘ ꜱ), and that's good enough for them. No amount of case folding is going to let you know that someone used an incorrect character. But, this is also one of the benefits of Unicode: characters know what they are.

Another correct way

There's another correct way to check strings regardless of case. You can use the /i flag on the match operator. The Unicode-aware Perl regex engine handles the rest:

use utf8;
use v5.16;
use open qw(:std :utf8);  # encode the output as UTF-8

use Set::CrossProduct;

my $string = "Reichwaldstraße";

my $upper = uc( $string );
my $lower = lc( $upper  );

my $sets = Set::CrossProduct->new(
	[
	[ $string, $upper, $lower ],
	[ $string, $upper, $lower ],
	]
	);

foreach my $tuple ( $sets->combinations ) {
	my( $l, $r ) = @$tuple;
	next if $l eq $r;

	say "lc($r) eq lc($l)  → ", lc($r) eq lc($l) ? "matched" : "failed";
	say "fc($r) eq fc($l)  → ", fc($r) eq fc($l) ? "matched" : "failed";
	say "$r =~ m/$l/i      → ", $r =~ m/$l/i ? "matched" : "failed";

	say;
	}

In the output, you can see that lc sometimes fails, but fc and m//i always work:

lc(REICHWALDSTRASSE) eq lc(Reichwaldstraße)  → failed
fc(REICHWALDSTRASSE) eq fc(Reichwaldstraße)  → matched
REICHWALDSTRASSE =~ m/Reichwaldstraße/i      → matched

lc(reichwaldstrasse) eq lc(Reichwaldstraße)  → failed
fc(reichwaldstrasse) eq fc(Reichwaldstraße)  → matched
reichwaldstrasse =~ m/Reichwaldstraße/i      → matched

lc(Reichwaldstraße) eq lc(REICHWALDSTRASSE)  → failed
fc(Reichwaldstraße) eq fc(REICHWALDSTRASSE)  → matched
Reichwaldstraße =~ m/REICHWALDSTRASSE/i      → matched

lc(reichwaldstrasse) eq lc(REICHWALDSTRASSE)  → matched
fc(reichwaldstrasse) eq fc(REICHWALDSTRASSE)  → matched
reichwaldstrasse =~ m/REICHWALDSTRASSE/i      → matched

lc(Reichwaldstraße) eq lc(reichwaldstrasse)  → failed
fc(Reichwaldstraße) eq fc(reichwaldstrasse)  → matched
Reichwaldstraße =~ m/reichwaldstrasse/i      → matched

lc(REICHWALDSTRASSE) eq lc(reichwaldstrasse)  → matched
fc(REICHWALDSTRASSE) eq fc(reichwaldstrasse)  → matched
REICHWALDSTRASSE =~ m/reichwaldstrasse/i      → matched

The match operator isn't useful for sort, though, since it can only tell you whether the strings match, not which one should come first.
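For sorting, you still want fc with cmp. Here's a short sketch, assuming v5.16 or later; the list of street names and the tie-breaking second comparison are only for illustration:

```perl
use v5.16;
use utf8;
use open qw(:std :utf8);

my @streets = qw(
	reichwaldstrasse
	Reichwaldstraße
	Bahnhofstraße
	REICHWALDSTRASSE
	);

# Compare the folded forms first, then break ties on the original strings
my @sorted = sort { fc($a) cmp fc($b) or $a cmp $b } @streets;

say for @sorted;
```

The three spellings of the same street end up next to each other, after Bahnhofstraße, because their folded forms are identical.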

Things to remember

  • Case-folding is more complicated than merely lowercasing.
  • The fc function does proper case folding according to the Unicode standard.
  • The \F case fold operator does full case folding in double-quoted contexts.


Use __SUB__ to get a reference to the current subroutine

What if you want to write a recursive subroutine but you don’t know the name of the current subroutine? Since Perl is a dynamic language and code references are first class objects, you might not know the name of the code reference, if it even has a name. Perl 5.16 introduces __SUB__ as a special sequence to return a reference to the current subroutine. You could almost do the same thing without the new feature, but each of the alternatives has drawbacks you might want to avoid.

Although __SUB__ looks like __FILE__, __LINE__, and __PACKAGE__, each of which is a compile-time directive, __SUB__ is resolved at run time, so you can use it with subroutines you define later.
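A quick way to convince yourself of what __SUB__ gives you is to compare its value to a reference you already have. This is only a sketch, assuming v5.16 or later; the names whoami and $anon are made up for the example:

```perl
use v5.16;

# A named subroutine that returns a reference to itself
sub whoami { return __SUB__ }

say whoami() == \&whoami ? 'same reference' : 'different';   # same reference

# It works the same way when there is no name at all
my $anon = sub { return __SUB__ };

say $anon->() == $anon ? 'same reference' : 'different';     # same reference
```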

First, consider how you’d try to do this without the __SUB__ feature. You could declare a variable to hold a subroutine reference then in a later statement define the subroutine. Since you’ve already declared the variable, you can use it in the definition. Perl won’t de-reference it until you actually run the subroutine, so it doesn’t matter that it’s not a reference yet:

use v5.10;

my $sub;

$sub = sub {
	state $count = 10;
	say $count;
	return if --$count < 0;
	$sub->();
	};

$sub->();

Your output is a countdown:

10
9
8
7
6
5
4
3
2
1
0

To do that, there are two requirements: the code reference must be stored in a variable, and the variable must already be declared. That’s not always convenient. Not only that, your anonymous subroutine contains a reference to itself, so you’d either have to play games with weak references or just let the reference live forever. Neither of those is attractive.
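To see what those games with weak references look like, here's one way you might do it with weaken from Scalar::Util. This is a sketch and not the only arrangement; make_countdown, $self, and $weak_self are names invented for the example. The closure only holds the weakened copy, so it doesn't keep itself alive:

```perl
use v5.10;
use Scalar::Util qw(weaken);

sub make_countdown {
	my( $count ) = @_;
	my @seen;

	my $weak_self;
	my $self = sub {
		push @seen, $count;
		return @seen if --$count < 0;
		$weak_self->();            # recurse through the weak copy
		};

	# The closure captures only $weak_self, so there is no strong cycle
	weaken( $weak_self = $self );

	return $self;
	}

my @numbers = make_countdown(3)->();
say "@numbers";   # 3 2 1 0
```

That's a lot of extra work just to let a subroutine call itself.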

Rafaël Garcia-Suarez solved these problems by creating Sub::Current to give you a ROUTINE function that returns a reference to the current subroutine, even if it is a named subroutine:

use v5.10;
use Sub::Current;

sub countdown {
	state $count = 10;
	say $count;
	return if --$count < 0;
	ROUTINE->();
	};

countdown();

You might want to define these code references as a single statement, even if you don’t need to. This is useful for inline subroutines where you want to define the code reference in the parameter list:

use v5.10;
use Sub::Current;

sub run { $_[0]->() };

run( sub {
		state $count = 10;
		say $count;
		return if --$count < 0;
		ROUTINE->();
		}
	);

You may want to define the subroutine in one statement as a return value:

use v5.10;
use Sub::Current;

sub factory {
	my $start = shift;
	sub {
		state $count = $start;
		say $count;
		return if --$count < 0;
		ROUTINE->();
		}
	};

factory(4)->();

Using this module has the disadvantage of a CPAN dependency, although a very light one because it’s self-contained. There’s another module, Devel::Caller, from Richard Clamp that can get a code reference from any level in the call stack, including the current level:

use v5.10;
use Devel::Caller qw(caller_cv);

sub factory {
	my $start = shift;
	sub {
		state $count = $start;
		say $count;
		return if --$count < 0;
		caller_cv(0)->();
		}
	};

factory(7)->();

Perl 5.16 lets you do the same thing without the CPAN module:

use v5.16;

sub factory {
	my $start = shift;
	sub {
		state $count = $start;
		say $count;
		return if --$count < 0;
		__SUB__->();
		}
	};

factory(7)->();

As with many new features added since Perl v5.10, you can enable __SUB__ with a use VERSION statement, as you see in the previous example, or with the feature pragma and the current_sub import:

use feature qw(say state current_sub);

sub factory {
	my $start = shift;
	sub {
		state $count = $start;
		say $count;
		return if --$count < 0;
		__SUB__->();
		}
	};

factory(7)->();
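The countdown examples only print as they go, but __SUB__ works just as well when the recursion returns a value. Here's a small sketch with an anonymous factorial subroutine; the name $factorial is only for illustration, and the code inside never refers to it:

```perl
use v5.16;

my $factorial = sub {
	my( $n ) = @_;
	return 1 if $n <= 1;
	return $n * __SUB__->( $n - 1 );
	};

say $factorial->(5);   # 120
```

Since the subroutine doesn't mention the variable that holds it, you can copy the reference into another variable or pass it as an argument and it still recurses correctly.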

Things to remember

  • Perl v5.16 provides the __SUB__ directive to return a reference to the currently running subroutine.
  • Import this new feature by requiring the Perl version or through the feature pragma.
  • Prior to Perl v5.16, you can do the same thing with Sub::Current.
