This post is part of a series on Mohammad Anwar’s excellent Weekly Challenge, where hackers submit solutions in Perl, Raku, or any other language, to two different challenges every week. (It’s a lot of fun, if you’re into that sort of thing.)
Task #2 this week was one of my devising:
It is thought that the following sequence will always reach 1:
\(Collatz(n) = \begin{cases}
n \div 2 & n \text{ is even} \\
3n + 1 & n \text{ is odd}
\end{cases}
\)
For example, if we start at 23, we get the following sequence:
23 → 70 → 35 → 106 → 53 → 160 → 80 → 40 → 20 → 10 → 5 → 16 → 8 → 4 → 2 → 1
Write a function that finds the Collatz sequence for any positive integer. Notice how the sequence itself may go far above the original starting number.
Extra Credit
Have your script calculate the sequence length for all starting numbers up to 1000000 (1e6), and output the starting number and sequence length for the longest 20 sequences.
I’ve always liked the Collatz conjecture. It is simple enough for schoolchildren to play with, yet the math to prove the conjecture is still beyond our greatest mathematicians.
Here is how I solved this task.
Raku
Finding a sequence for an arbitrary starting number, $n
, is easy:
#| Non extra-credit Collatz sequence
sub collatz( Int $n is copy ) {
my @r = $n;
while ( $n ≠ 1 ) {
$n = $n %% 2 ?? ($n / 2).Int !! (3*$n + 1).Int;
@r.push: $n;
}
@r;
}
Extra Credit
This is where it gets interesting. As Raku is quite slow, I wanted to make sure that the Extra Credit would be solvable in a reasonable amount of time. (Say, less than a minute.) In order to get the runtime down, we need two things:
- A fast way to calculate the sequence length, and
- A fast way to maintain the top 20 top starting numbers
1. Memoization
For #1, we’ll use memoization. Since all sequences (certainly all sequences below our limit) end up at 1, and tend to get smaller, we’ll see repeat numbers as we go through the starting numbers. So storing the sequence length for all numbers we’ve seen will result in a huge speed boost:
my $top-n = 20; # Number of top sequences to list
my $limit = 1e6; # Highest starting number
my $mintop = 0; # Minimum value in @top (efficiency/convenience)
my @top = 0 => 0, 1 => 1; # Top N list (start => seq-len)
my @memo = (0, 1); # Memoization (@memo[start] = seq-len)
# Iterate through all starting numbers
for 3..$limit -> $start {
my Int $n = $start;
my Int $len = 0;
# Keep going through the sequence until we hit a memoized value
while (!@memo[$n]) {
$len += 1 + $n % 2;
$n = $n %% 2 ?? ($n / 2).Int !! ((3*$n + 1) / 2).Int;
}
$len += @memo[$n];
@memo[$start] = $len if $start < $limit * 2;
}
You’ll note there is a little optimization in there: if $n
is odd, I do (3n + 1) / 2, and increment $len
by an extra 1.
By itself, that code would fill up @memo
with the sequence length for all starting numbers up to a million. On my virtual machine, that runs in about 32 seconds.
2. O(n) insertions
We haven’t done anything with @top
yet. We could do something simple like this:
# Definitely NOT O(n) inserts!
@top.push: $start => $len if $len > $top[*-1].value;
@top.push: $n * 2 => @memo[$n] + 1 if $n <= $limit / 2 and @memo[$n] > $top[*-1].value;
@top = @top.sort( -*.value );
@top.pop while @top < $top-n;
However, all of that sorting starts to really add up, even though we are only sorting 20 elements at a time. With the above block at the end of the loop, the script’s run time goes from 32 seconds to 3 minutes 30 seconds. Ouch. Fortunately, we can avoid the sort altogether, and insert into @top
in linear time with the following routine:
# Sorted insert [ $n, $len ] to @top, keep @top to $top length
sub top {
my ($n, $len) = @_;
my $idx = first { $top[$_][1] < $len } 0..$#top;
splice @top, $idx, 0, [ $n, $len ];
pop @top if @top > $top;
$mintop = $top[-1][1];
}
Then, we just exchange the slow code, above, for this:
top($start, $len) if $len > $mintop and $start ≤ $limit;
top($n * 2, @memo[$n] + 1) if $n ≤ $limit / 2 and @memo[$n] > $mintop;
Et voilà! The run time plummets to 34 seconds, which is only a couple of seconds slower than my baseline code, having no top-n list at all.
Where did all of my CPU cycles go?
The version with sort
is doing O(n log n) sorts. 20 log 20 is still constant time in the context of the overall program, so it would be easy to overlook, especially when 20 log 20 is only about three times larger than the O(n) = 20. So, while I’m not surprised my O(n) insert is faster, I am a bit surprised it is 60 times faster. Remember, my O(n) insert added ~2 seconds to the run time. The sort
version added ~180 seconds to the runtime.
My hunch (or conjecture, if you will) is that the closure in sort( -*.value )
may be the culprit, but I don’t have time to do more detailed profiling right now.
Perl
The Perl code is quite similar to the Raku code. I did not bother with the sequence printing; I went straight for the extra credit!
my @seqlen = (-1,1); # Memoize sequence length
my $top = 20; # Report this many of the top sequences
my @top = [ -1,-1 ]; # Top $top sequences
my $upper = 1e6; # Upper limit starting term
my $mintop = 0; # Lowest value in @top
GetOptions('top=i' => \$top, 'upper=i' => \$upper);
# Run through the upper limit
for (my $start = 3; $start < $upper; $start += 2) {
my ($n, $len) = ($start, 0);
while (! defined $seqlen[$n]) {
$len += 1 + $n % 2;
$n = $n % 2 ? (3*$n + 1)/2 : $n / 2;
}
$len += $seqlen[$n];
$seqlen[$start] = $len if $start < $upper * 2; # Cache
top($start => $len) if $len > $mintop and $start <= $upper;
top($n * 2 => $seqlen[$n] + 1) if $n < $upper/2 and $seqlen[$n] > $mintop;
}
# Report top sequences
printf "Collatz(%5d) has sequence length of %3d steps\n", @$_ for @top;
# Sorted insert [ $n, $len ] to @top, keep @top to $top length
sub top {
my ($n, $len) = @_;
my $idx = first { $top[$_][1] < $len } 0..$#top;
splice @top, $idx, 0, [ $n, $len ];
pop @top if @top > $top;
$mintop = $top[-1][1];
}