PWC 166 › Hexadecimal Words

This post is part of a series on Mohammad Anwar’s excellent Weekly Challenge, where hackers submit solutions in Perl, Raku, or any other language, to two different challenges every week. (It’s a lot of fun, if you’re into that sort of thing.)

This is another one of my suggested tasks, and another time you’re going to read about a personal story that connects me to a task. For many years, right back to very first days I started hacking filesystems and writing kernel code, I’ve often needed to come up with 16- or 32-bit identifiers for things, usually represented in hexadecimal. And to this day, my repertoire of clever identifiers is more or less limited to 0xdeadbeef and 0xc0dedbad, and a few others. Let’s change that!

Contents

The simple version

A very simple solution simply reads the lines, filters out anything that isn’t the right length, or contains characters that can’t be substituted, and prints out the whole works:

use File::Slurper qw< read_lines >;

my $dict = $ARGV[0] // '../../../data/dictionary.txt';

say for map {     y/olist/01157/r    }
       grep { /^[0-9a-folist]{2,8}$/ } read_lines($dict);

That already gives me some great results, like 0xd15abled (“disabled”), and is nice and short. However, we didn’t come here for half measures. This version only returns single words, and doesn’t care how many substitutions it has done, so words like 0x57111e57 (“stillest”‽) are fair game.

ch-1-short.pl full source

The deluxe version

The feature creeps have been at it again. They demand:

Command line options for:
- Dictionary path (--dict)
- Required length (--length)
- Maximum substitution ratio per word (--max-sub). See below.
- Pretty print results (disable with --nopretty)
Final output consisting of words or phrases with a total length of --length

The --max-sub option limits the ratio of substitutions to word length. The default--max-sub=0.2, means 1..4 letter words can not have any substitutions, while 5..8 letter words can have one.

First, we ingest the %words as a hash of original to hex word. For example, $words{fillet} = 'f111e7'. We’ll also get a sorted word list, for convenience later. This will look somewhat familiar:

my %words =  map {            @$_          }
            grep {          filter         }
             map { [$_ => y/olist/01157/r] }
            grep {  /^[0-9a-folist]{2,}$/  } read_lines($o{dict});

my @words   = sort keys %words;

The filter() sub is new. It’s used to filter out words we don’t want:

sub filter(_) {
    my ($orig, $hex) = @{$_[0]};

    # Count number of substitutions in the word
    my $subs =()= ($orig ^ $hex) =~ /[^\0]/g;
    return if $subs > length($hex)*$o{'max-sub'};

    return if length($hex) > $o{length};
    return if length($hex) < $o{'min-length'};

    return 1; # pass
}

The highlighted line might make you scratch your head a bit. There are two tricks, here. The first is the =()= pseudo-operator. It forces whatever is to the right of it to be evaluated in array context. This is necessary, because we want $subs to contain the count of matches, rather than the first match.

Bitwise String XOR

Next up is the string-XOR trick. ($orig ^ $hex) does a bitwise string exclusive-or of $orig and $hex. I’ll explain how that works. If you already know, you can safely skip to the next section. Each character is converted to binary and XOR’d with the corresponding character from the other string. Recall that exclusive-or means “A or B but not both” are true. The truth table for bitwise A ^ B is:

A	B	A^B
0	0	0
0	1	1
1	0	1
1	1	0

Bitwise XOR

That’s two bits. What happens if we compare two characters? Let’s compare a with 3. Converted to ASCII, a is 0b01100001 and 3 is 0b00110011.

Note that since we already filtered anything not matching /[^0-9a-folist]/, our input contains only ASCII strings. If we had to deal with wide characters (e.g., UTF-8), additional care would need to be taken.

When Perl XORs two characters, every bit in the first char is XOR’d with the same bit on the other char. a ^ 3 would look like this:

  0b01100001 (a)
^ 0b00110011 (3)
----------------
  0b01010010 (R)

So 'a' ^ '3' gives us R. Neat. How does this help us? Well, the useful part comes when we XOR two identical characters. Let’s try b and b:

  0b01100010 (b)
^ 0b01100010 (b)
--------------
  0b00000000 (\0)

Thus when we XOR a character with itself, we get \0. Now consider what happens if we XOR the original word with its “hex” equivalent. For example: "abides" ^ "ab1de5" would give us: \0\0X\0\0F. Now, if we simply count how many occurrences of anything that isn’t a \0 (that’s where the /[^\0]/g comes in), we have our number of substitutions. In this example, there were two non-\0 characters, so two substitutions.

Building phrases

Now that the filter()ing is out of the way, we have a bunch of %words and their hex equivalent. How do we combine them into phrases? It’s possible to build up a list of phrases with nested loops, but it is less flexible and requires more code than a recursive solution:

sub get_phrases {
    my @phrases;

    sub {
        my $len = sum map { length } @_;

                                return if $len >  $o{length};
        push @phrases, [@_] and return if $len == $o{length};

        __SUB__->(@_, $_) for grep { $_ ge $_[-1] } @words;
    }->();

    @phrases;
}

This get_phrases() takes no input. The actual recursion is done via a closure around @phrases. Note the bare anonymous sub { ... }->() calls itself to initiate the recursion. Recursive calls are done with the __SUB__ feature available since Perl 5.16.

There are two base cases:

Line 7: If the length is already past our desired $o{length}, return
Line 8: Adds our words as a new phrase if the length equals our desired length. This could be modified to allow phrases shorter than $o{length}, but I wanted exact length results.

If neither of those base cases trigger, we recurse on every word in @words that is lexicographically greater than or equal to the last word in our phrase. This keeps our phrase in alphabetical order, and that eliminates reflexive results. I didn’t want Feed Beef if I already had Beef Feed.

To Pretty Print, or Not?

We now have everything we need; it just needs to be formatted.

my @phrases = get_phrases();

$o{pretty} ? pretty_print(@phrases) 
           : say join '', map { $words{$_} } @$_ for @phrases;

Pretty printing is enabled by default, but can be turned off with --nopretty. Without it, we simply concatenate the words for each phrase, one phrase per line. With pretty printing, though, we’ll defer to the pretty_print(@phrases) sub:

sub pretty_print {
    my $spaces = -1 + max map { 0+@$_ } @_;
    for (@_) {
        my $phrase = join ' ', map ucfirst, @$_;
        my $hexphrase = join '', map { $words{$_} } @$_;
        printf "%@{[$o{length}+$spaces]}s => %$o{length}s\n",
                $phrase, $hexphrase;
    }
}

This one’s pretty simple. It just prints the word(s) on the left, separated by spaces, and the hex word on the right. Example output:

$ ./ch-1.pl --length=12 | head -20
  Abate Acceded => aba7eacceded
  Abate Accedes => aba7eaccede5
 Abate Ace Aced => aba7eaceaced
 Abate Ace Babe => aba7eacebabe
 Abate Ace Bade => aba7eacebade
 Abate Ace Bead => aba7eacebead
 Abate Ace Beef => aba7eacebeef
 Abate Ace Cede => aba7eacecede
 Abate Ace Dead => aba7eacedead
 Abate Ace Deaf => aba7eacedeaf

ch-1.pl full source

Performance

Assuming word length is invariant, the simple version runs in linear time on the number of words.

For the more complex version, let’s define W = $o{length}/$o{'min-length'}. W is therefore the maximum number of words that will fit in our target length.

In the trivial case of W = 1, we’re still linear, since we’re only looking through the word list once. For W = 2, every word has to be paired with every other word, so that’s $O(n^2)$. In fact, we’re exponential! The asymptotic complexity is $O(n^W)$.

With n ≅ 39,000 words in my dictionary, increasing that exponent gets expensive very quickly. Some sample runtimes:

`--length` (M)	Time (sec)	# of phrases
6	0.052	131
8	0.067	618
10	0.119	2 719
12	0.381	14 039
14	1.527	65 168
16	6.773	267 847

Length vs wall time

There are some optimizations that would help, such as divide and conquer and building up partial phrases, but since we are still essentially building a Cartesian product of words, it’s never going to scale particularly well.

I didn’t spend any time optimizing, as even at --length=16, we’re at a quarter million phrases. I don’t know about you, but I’d have trouble sifting through more than a few thousand lines for the next best thing to 0xdeadbeef.

I hope you had fun with this one!