Match only the same Unicode script

Earlier this year, this website was the target of some sort of attack in which a bot sent seemingly random data in its requests. The attack wasn’t a big deal since I easily blocked it with Cloudflare, but it was interesting. The apparently random data was actually a mix of Latin, Hangul, and Cyrillic characters. “Domain hacks with unusual Unicode characters” shows some of these exploits. Curiously, Perl v5.28 added a regex feature that deals with this sort of nonsense.


Unicode Scripts

Unicode has the notion of scripts: sets of graphical symbols that represent a particular writing system. For example, the characters in Γεια σου κόσμε are all in the Greek script, and the characters in ᎣᏏᏲ ᎡᎶᎯ are all in the Cherokee script. All of those characters are also alphabetic, as well as identifier characters, so \w matches all of them:

use v5.10;  # for say
use utf8;
use open qw(:std :utf8);

# http://helloworldcollection.de/#Human
$_ = q(ᎣᏏᏲᎡᎶᎯΓειασουκόσμε);

if( /(\w+)/ ) {
	say "Found: $1";
	}

The capture matches the entire string because every character is a word character:

Found: ᎣᏏᏲᎡᎶᎯΓειασουκόσμε

Want to find out which scripts a string has? Unicode::UCD (Unicode Character Database) can do that:

use v5.10;  # for say
use utf8;
use open qw(:std :utf8);

use Unicode::UCD qw(charscript);

my $string = 'ᎣᏏᏲᎡᎶᎯΓειασουκόσμε';
my @chars = split //, $string;

foreach my $char ( @chars ) {
	say "$char: ", charscript( ord $char );
	}

There’s your mix of Cherokee and Greek:

Ꭳ: Cherokee
Ꮟ: Cherokee
Ᏺ: Cherokee
Ꭱ: Cherokee
Ꮆ: Cherokee
Ꭿ: Cherokee
Γ: Greek
ε: Greek
ι: Greek
α: Greek
σ: Greek
ο: Greek
υ: Greek
κ: Greek
ό: Greek
σ: Greek
μ: Greek
ε: Greek
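
If you only care which scripts show up, and not where, a per-script tally is easier to read than one line per character. This is a minimal sketch built on the same charscript call; the script_counts helper is a hypothetical name of mine, not part of Unicode::UCD, and the 'Unknown' fallback is there in case charscript returns nothing for a code point:

```perl
use v5.10;
use utf8;
use open qw(:std :utf8);

use Unicode::UCD qw(charscript);

# Hypothetical helper: count how many characters belong to each script
sub script_counts {
	my( $string ) = @_;

	my %count;
	foreach my $char ( split //, $string ) {
		# fall back to 'Unknown' if charscript has no answer
		my $script = charscript( ord $char ) // 'Unknown';
		$count{$script}++;
		}

	return \%count;
	}

my $counts = script_counts( 'ᎣᏏᏲᎡᎶᎯΓειασουκόσμε' );

foreach my $script ( sort keys %$counts ) {
	say "$script: $counts->{$script}";
	}
```

For the sample string this should report just two scripts, Cherokee and Greek. A string that is supposed to be in a single script but reports more than one is worth a closer look.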

Numbers

Here’s another example, with three sets of digits from different scripts:

use v5.10;  # for say
use utf8;
use open qw(:std :utf8);

$_ = q(۵۲۸528੫੨੮);

if( /(\d+)/ ) {
	say "Found: $1";
	}

This again finds the entire string because every one of the characters is a digit:

Found: ۵۲۸528੫੨੮

Character class tangent

You may have learned Perl when the character classes were simpler, but now they have different, more expansive semantics: \d matches a decimal digit from any script, not just 0 through 9. The /a flag makes the character class shortcuts match with ASCII semantics, or you could use [0-9] instead. You can also make specific character classes for the digits of any script that you like. But what if you want to match any run of digits, as long as they are all in the same script, without knowing that script ahead of time?
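
Here is a short sketch of those options against the same mixed-digit string. The /a flag and [0-9] are straight from the paragraph above; the lookahead with \p{Script=Gurmukhi} is just one way I might restrict \d to a single known script, not the only one:

```perl
use v5.14;  # the /a flag needs v5.14 or later
use utf8;
use open qw(:std :utf8);

$_ = q(۵۲۸528੫੨੮);

# /a restricts \d to the ASCII digits
my( $ascii ) = /(\d+)/a;
say "With /a: $ascii";

# an explicit range does the same job without the flag
my( $range ) = /([0-9]+)/;
say "With [0-9]: $range";

# digits from one script that you name ahead of time:
# each character must be in the script and also be a digit
my( $gurmukhi ) = /((?:(?=\p{Script=Gurmukhi})\d)+)/;
say "Gurmukhi only: $gurmukhi";
```

All three approaches require you to decide on the script before you write the pattern.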

Script runs

This is where the new script_run feature comes in. It’s an alpha assertion (also new in v5.28). The feature is experimental in v5.28, so the example turns off its warning:

use v5.28;

use utf8;
use open qw(:std :utf8);

no warnings qw(experimental::script_run);

$_ = q(۵۲۸528੫੨੮);

if( /(*script_run:(\d+))/ ) {
	say "Found: $1";
	}

This only finds the digits that belong to the first script it encounters (Arabic in this case):

Found: ۵۲۸

Add the /g flag to find all the runs of digits:

use v5.28;

use utf8;
use open qw(:std :utf8);

no warnings qw(experimental::script_run);

$_ = q(۵۲۸528੫੨੮);

my @script_runs = /(*script_run:(\d+))/g;
$" = "\n";
say "@script_runs";

Now the continuous line of digits is broken into groups in which all the digits come from the same script:

۵۲۸
528
੫੨੮
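
To see what each run actually is, you can combine the script run match with charscript from earlier. This is only a sketch; it labels each run by the script of its first character, which is a simplification of mine:

```perl
use v5.28;
use utf8;
use open qw(:std :utf8);

no warnings qw(experimental::script_run);

use Unicode::UCD qw(charscript);

my $digits = q(۵۲۸528੫੨੮);

my @labeled;
while( $digits =~ /(*script_run:(\d+))/g ) {
	my $run = $1;
	# label the run by the script of its first character
	push @labeled, [ $run, charscript( ord $run ) ];
	}

foreach my $pair ( @labeled ) {
	say "$pair->[0]: $pair->[1]";
	}
```

I would expect the three runs to come out as Arabic, Common, and Gurmukhi. The ASCII digits are in the Common script, yet they still form their own run: perlre says that all the decimal digits in a script run must come from the same set of ten.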

The Common script

There’s a special script named “Common” whose characters can be part of any script run. The full stop (U+002E) is one of them, along with thousands of other characters. This program lists and counts them:

use v5.28;

use utf8;
use open qw(:std :utf8);

my $count = 0;
foreach my $code_number ( 0 .. 0x10FFFD ) {
	next unless chr( $code_number ) =~ m/\p{Common}/;
	say chr( $code_number );
	$count++;
	}
say "There are $count Common Characters";
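
If you want only the count, you don't have to test every code point. Unicode::UCD's prop_invlist returns an inversion list for a property, and you can add up the ranges. This is a sketch under one assumption of mine: that Script_Extensions=Common is the property \p{Common} stands for on v5.26 and later. On an older perl the two counts may differ:

```perl
use v5.28;

use Unicode::UCD qw(prop_invlist);

# In an inversion list, the even-indexed entries start a range that
# is in the set, and the odd-indexed entries start the next range
# that is not.
my @invlist = prop_invlist( 'Script_Extensions=Common' );

my $count = 0;
for( my $i = 0; $i < @invlist; $i += 2 ) {
	# a missing end means the range runs past the last code point
	my $end = $invlist[$i + 1] // 0x110000;
	$count += $end - $invlist[$i];
	}

say "There are $count Common characters";
```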

Any of those characters can be part of the match in a script run without interrupting it:

use v5.28;

use utf8;
use open qw(:std :utf8);

no warnings qw(experimental::script_run);

$_ = 'The number is ५.२८';

say /(*script_run:([\d.]+))/ ? "Found $1" : "No match";

The full stop shows up with the digits from the Devanagari script:

Found ५.२८

Look-alike spoofing

So, where’s the problem? Can you tell the difference between example.com and ехамрӏе.com? Assuming that something didn’t mess with my original source, the first is ASCII, and the second is a mix of ASCII and similar-looking Cyrillic letters. Both .coms are in ASCII, but the domains are different. Indeed, the attack against this site was sending this sort of data. Suppose I want to match a domain name:

use v5.28;

use utf8;
use open qw(:std :utf8);

no warnings qw(experimental::script_run);

my $expected = 'example.com';  # All Latin
my $spoofer  = 'ехамрӏе.com';  # some Cyrillic

my @domains = ( $expected, $spoofer );

my @naive_match = map { /\A[\w.]+\z/ } @domains;
say 'Matched ' . @naive_match . ' domains';

my @run_match = map { /\A(*script_run:[\w.]+)\z/ } @domains;
say 'Matched ' . @run_match . ' domains';

The naïve match lets both domains through, while the script run match accepts only example.com. You need the anchors: without them, the script run would match the run of the first script it encounters and still succeed, just as it did in the numeric example. Here’s the output:

Matched 2 domains
Matched 1 domains
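
To make the point about the anchors concrete, here is a sketch that wraps the anchored pattern in a helper and also shows what the unanchored version captures. The is_single_script name is hypothetical, and I rebuilt the spoofed domain from code point escapes as my best guess at the characters, so it may not be byte-for-byte the string shown earlier:

```perl
use v5.28;
use utf8;
use open qw(:std :utf8);

no warnings qw(experimental::script_run);

# Hypothetical helper: true only if the entire string is one script run
sub is_single_script {
	my( $string ) = @_;
	return $string =~ /\A(*script_run:[\w.]+)\z/ ? 1 : 0;
	}

# Cyrillic look-alikes followed by an ASCII ".com" (my reconstruction)
my $spoofer = "\x{435}\x{445}\x{430}\x{43C}\x{440}\x{4CF}\x{435}.com";

foreach my $domain ( 'example.com', $spoofer ) {
	# without anchors, the match settles for the first run it can find
	my( $unanchored ) = $domain =~ /(*script_run:([\w.]+))/;

	say "$domain";
	say "\tanchored:   ", is_single_script( $domain ) ? 'ok' : 'rejected';
	say "\tunanchored: matched <$unanchored>";
	}
```

For the spoofed domain, the unanchored pattern should back off to the Cyrillic letters plus the full stop and report success, which is exactly the false positive that the anchors prevent.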