If you are playing with Unicode, you’re probably going to want to convert to the various normalization forms. There are some programs to do this in the Unicode::Tussle distribution, but you can also write one-liners for the job (Item 120. Use Perl one-liners to create mini programs).
If you want to read and write lines, you can use the -n switch to wrap a while loop around your tiny program. In this case, those tiny programs just call a normalization function from Unicode::Normalize. Here are the bash aliases:
alias nfc="perl5.14.1 -MUnicode::Normalize -CS -ne 'print NFC(\$_)'"
alias nfd="perl5.14.1 -MUnicode::Normalize -CS -ne 'print NFD(\$_)'"
alias nfkd="perl5.14.1 -MUnicode::Normalize -CS -ne 'print NFKD(\$_)'"
You can run these as if they were programs with those names. Here nfkd reads standard input and converts the ligature characters ﬁ (U+FB01) and ﬂ (U+FB02) to their compatibility two-character forms, fi and fl:
$ nfkd
Let's ﬁnd that ﬂying squirrel!
Let's find that flying squirrel!
If you want to normalize command-line arguments as strings instead of reading files, it takes a couple of small changes. Add the A flag to the -C switch to interpret the command-line arguments as UTF-8 (unless you want to decode them yourself), and switch from -ne to -E so you can use say to add the newline to the output:
alias nfc="perl5.14.1 -MUnicode::Normalize -CSA -E 'say NFC( qq(@ARGV) )'"
alias nfd="perl5.14.1 -MUnicode::Normalize -CSA -E 'say NFD( qq(@ARGV) )'"
alias nfkd="perl5.14.1 -MUnicode::Normalize -CSA -E 'say NFKD( qq(@ARGV) )'"
The output decomposes the ligatures just as before:
$ nfkd "Let's ﬁnd that ﬂying squirrel."
Let's find that flying squirrel.
You can read more about these program features in Item 73. Tell Perl which encoding to use and Item 77. Work with graphemes instead of characters.