Some special Unicode shell aliases to normalize strings

If you are playing with Unicode, you’re probably going to want to convert strings to the various normalization forms. The Unicode::Tussle distribution has some programs for this, but you can also write your own one-liners (Item 120. Use Perl one-liners to create mini programs).

If you want to read and write lines, you can use the -n switch to wrap a while loop around your tiny program, and the -CS switch to treat the standard filehandles as UTF-8. In this case, those tiny programs just call a normalization function from Unicode::Normalize. Here are the bash aliases:

alias nfc="perl5.14.1 -MUnicode::Normalize -CS -ne 'print NFC(\$_)'"
alias nfd="perl5.14.1 -MUnicode::Normalize -CS -ne 'print NFD(\$_)'"
alias nfkd="perl5.14.1 -MUnicode::Normalize -CS -ne 'print NFKD(\$_)'"

You can run these as if they were programs with those names. Here nfkd reads standard input and converts the ligature characters ﬁ (U+FB01) and ﬂ (U+FB02) to their compatibility equivalents, the two-character sequences fi and fl:

$ nfkd
Let's ﬁnd that ﬂying squirrel!
Let's find that flying squirrel!

To normalize command-line arguments instead of lines from files or standard input, you need only a couple of small changes. Add the A flag to the -C switch so perl decodes the command-line arguments as UTF-8 (otherwise you would have to decode them yourself), and use say, which the -E switch enables, to add the newline to the output:

alias nfc="perl5.14.1 -MUnicode::Normalize -CSA -E 'say NFC( qq(@ARGV) )'"
alias nfd="perl5.14.1 -MUnicode::Normalize -CSA -E 'say NFD( qq(@ARGV) )'"
alias nfkd="perl5.14.1 -MUnicode::Normalize -CSA -E 'say NFKD( qq(@ARGV) )'"

The nfkd alias decomposes the ligatures just as before:

$ nfkd "Let's ﬁnd that ﬂying squirrel."
Let's find that flying squirrel.

You can read more about these program features in Item 73. Tell Perl which encoding to use and Item 77. Work with graphemes instead of characters.

3 Comments.

  1. My host uses “perl, v5.8.8 built for x86_64-linux-thread-multi”, and whether they upgrade is out of my hands.

    Any chance you could make those aliases work with that version of perl? Right now it complains:

    Unrecognized switch: -E (-h will show valid options).
    0000000

  2. By the way, a good way to check whether these commands are doing anything is to convert the output string to hex (the shell will show no difference when changing from NFC to NFD or vice versa):

    nfd "tést" | od -xc
