Coding elegant, simple and fast algorithms

jueves, 28 de junio de 2012

More natural sorting

I got an email this morning from somebody telling my that my Perl module Sort::Key::Natural was buggy because it was not sorting strings with embedded floating point numbers correctly.

Well, it's true, but that's not a bug, it is a feature!

The real issue is that there is not just one way to do natural sorting. On the contrary, there are several possibilities and depending on the problem you want to solve you will have to choose one or another.

Having said that, in my experience, when somebody asks for floating point number support, it's common that he would be just (ab)using natural sorting for sorting structured data that can be better sorted in other ways.

In any case, implementing a natural sorting function with float support is quite simple, here is how to do it using the key generation approach I explained on my previous post about natural sorting.

What is a float?

Before getting into the implementation details, first we have to define precisely what a floating point number ios going to be in our context.

The usual floating point representation is composed of an optional sign, followed by an integer part optionally followed by a dot and a fractional part and optionally followed by an exponent part. For instance, +12.3e2, 12.3E+2, 0.123E5, 12300, +12300.00 are all representations of the same number.

The first thing we are going to do is to get rid of the exponent part. We just do not support floating point numbers with an exponent part.

The problem here is not really the nuisance introduced by handling all this possible representations of a number but the ambiguity introduced by the letter e, that can be both part of a number and of a word.

Our second requirement is that the integer part can not be empty. For instance, .12 is not going to be recognized as 0.12 but as 12. That's because the dot is a common token separator as in foo.12.

In sumary, the floating point numbers we are going to support are those that match the regular expression /^[+\-]?\d+(\.\d*)$/

mkkey_natural

We start from the key generation function for natural sort with natural numbers:

  sub mkkey_natural {
    my $nat = shift;
    my @parts = $nat =~ /\d+|\p{IsAlpha}+/g;
    for (@parts) {
      if (/^\d/) {
        s/^0+//;
        my $len = length;
        my $nines = int ($len / 9);
        my $rest = $len - 9 * $nines;
        $_ = ('9' x $nines) . $rest . $_;
      }
    }
    return join("\0", @parts);
  }

The tokenizer

The first thing we should do is to modify the tokenizer to accept negative floating point numbers, using the following regular expression:

$nat =~ /[+\-]?\d+(?:\.\d*)?|\p{IsAlpha}+/g;

The fractional part

Now, we go for the fractional part. Handling it is really simple, we only have to remove any leading zeros from its right side as, for instance, 0.1, 0.10 and 0.100 should be equivalent; then, append it to the key generated for the integer part.

It results on:

  if (my ($sign, $number, $dec) = /^([+-]?)(\d+)(?:\.(\d*))?$/) {
    $number =~ s/^0+//;

    $dec = '' unless defined $dec;
    $dec =~ s/0+$//;
    my $len = length $number;
    my $nines = int ($len / 9);
    my $rest = $len - 9 * $nines;
    $_ = ('9' x $nines) . $rest . $number . $dec;
}

Note how two keys generated in this way are compared: the integer part is first compared by length, then by its digits and then the two fractional parts are compared by its digits from left to right.

Negative numbers

Negative numbers should always be ordered before positive integers, so we just prepend a minus sign to the key in case of a negative number (in ASCII, the minus character is placed before the digits).

Finally, in order to get negative numbers sorted in ascending order, we invert their digits replacing 0 by 9, 1 by 8, 2 by 7, etc. In Perl, we can use the tr operator for that:

tr/0-9/9876543210/

Negative zero

There is still an special case we have to handle: 0 and -0 are the same number.

We add some conditional code to do it.

mkkey_natural_with_floats

And that's it...

sub mkkey_natural_with_floats {
    my $nat = @_ ? shift : $_;
    my @parts = $nat =~ /[+\-]?\d+(?:\.\d*)?|\p{IsAlpha}+/g;
    for (@parts) {
      if (my ($sign, $number, $dec) = /^([+-]?)(\d+)(?:\.(\d*))?$/) {
        $number =~ s/^0+//;
        $dec = '' unless defined $dec;
        $dec =~ s/0+$//;
        my $len = length $number;
        my $nines = int ($len / 9);
        my $rest = $len - 9 * $nines;
        $_ = ('9' x $nines) . $rest . $number . $dec;
        if ($sign eq '-' and $_ ne '0') {
          tr/0123456789/9876543210/;
          $_ = "-$_";
        }
      }
    }
    return join("\0", @parts);
}

lunes, 25 de junio de 2012

Coprime numbers

Given two random numbers from the interval [1,n], what is the probability of both numbers being coprimes when n→∞?

My intuition was telling me that the probability should be approaching 0, but it was completely wrong as it seems to converge to some value around 0.608!

Given the two integers a and b, they are not coprimes if they are both divisible by two or if some of them is not divisible by two but they are both divisible by three or if some of them is not divisible by two, neither by three but they are both divisible by five and so on.

Numerically, the probability of being divisible both by two is p₂ = (1/2)²; the probability of otherwise being divisible by three is p₃ = (1 - p₂)/(1/3)²; the probability of otherwise being divisible by five is p₅ = (1 - p₂ - p₃)/(1/5)²; and so on.

The probability of the two numbers being coprime is q = 1 - p₂ - p₃ - p₅ - p₇ - ...

My maths are two rusty to be able to demonstrate that this series converges. But calculating its value for the list of first prime numbers available from Wikipedia seems to indicate that it does so towards 0.6080...

#!/usr/bin/perl

use strict;
use warnings;

my @p = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53,
         59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113,
         127, 131, 137, 139, 149, 151, 157, 163, 167, 173, 179, 181,
         191, 193, 197, 199, 211, 223, 227, 229, 233, 239, 241, 251,
         257, 263, 269, 271, 277, 281, 283, 293, 307, 311, 313, 317,
         331, 337, 347, 349, 353, 359, 367, 373, 379, 383, 389, 397,
         401, 409, 419, 421, 431, 433, 439, 443, 449, 457, 461, 463,
         467, 479, 487, 491, 499, 503, 509, 521, 523, 541, 547, 557,
         563, 569, 571, 577, 587, 593, 599, 601, 607, 613, 617, 619,
         631, 641, 643, 647, 653, 659, 661, 673, 677, 683, 691, 701,
         709, 719, 727, 733, 739, 743, 751, 757, 761, 769, 773, 787,
         797, 809, 811, 821, 823, 827, 829, 839, 853, 857, 859, 863,
         877, 881, 883, 887, 907, 911, 919, 929, 937, 941, 947, 953,
         967, 971, 977, 983, 991, 997);

my $p = 1;
for (@p) {
    $p -= $p/($_*$_);
    print "$p\n";
}

sábado, 23 de junio de 2012

Room assignment

I am always delighted to see how problems like this are so beautifully solved using Prolog and CLP!

Bit manipulation trick

Given a 32bit integer number d and a bit position x, find an efficient way to shift to the left the bits of d on the positions from x+1 to 32 while leaving the bits on the positions from 0 to x untouched.

For instance, given d = 0b01001110 and x = 3, the return value should be 0b10010110.

Think about it before looking at the response here.

Cross polytopes

How to enumerate the surface points of a cross polytope laying on a multidimensional grid:

#!/usr/bin/perl

sub enumerate {
    my ($d, $r) = @_;

    if ($d == 1) {
        return ($r ? ([-$r], [$r]) : [0])
    }
    else {
        my @r;
        for my $i (0..$r) {
            for my $s (enumerate($d - 1, $r - $i)) {
                for my $j ($i ? (-$i, $i) : 0) {
                    push @r, [@$s, $j]
                }
            }
        }
        return @r;
    }
}

@ARGV == 2 or die "Usage:\n $0 dimension radius\n\n";
my ($d, $r) = @ARGV;

my @r = enumerate($d, $r);
print "[", join(',', @$_), "]\n" for @r;

This comes from this StackOverflow question. Initially I overlooked the paragraph where it states that the Manhattan distance is going to be used, so I solved it for the hypersphere defined using the euclidean distance.

BTW, the cross polytope is the euclidean-space equivalent of the n-sphere on the Manhattan-space.

This script enumerates the grid points in the n-ball (not just the surface of the n-sphere):

#!/usr/bin/perl

use strict;
use warnings;
use POSIX 'floor';

sub enumerate {
    my ($d, $r2) = @_;
    my $l = floor sqrt $r2;
    if ($d == 1) {
        return map [$_], -$l..$l;
    }
    else {
        my @r;
        for my $i (-$l..$l) {
            my @s = enumerate($d - 1, $r2 - $i * $i);
            push @$_, $i for @s;
            push @r, @s;
        }
        return @r;
    }
}

@ARGV == 2 or die "Usage:\n $0 dimension radius\n\n";
my ($d, $r) = @ARGV;

my @r = enumerate($d, $r * $r);
print "[", join(',', @$_), "]\n" for @r;

Well, the most interesting thing about that is that I have never noticed before that the cross-polytope is a generalization of the square into higher dimensions. Until now, my mind had always followed the sequence segment, square, cube, hypercube.

But, segment, square, octahedron, cross-polytope is also possible!!!

viernes, 22 de junio de 2012

Hexagonal puzzles, dynamic programming and polyominos.

I don't know why, but it seems that dynamic programming is not a popular technique among programmers.

In example, a couple of months ago, Carl Mäsak, an active member of the Perl and specially Perl 6 community, blogged about some puzzle played on an hexagonal grid and how to count the number of ways it could be filled with pieces of length two.

His approach was considering it an exact cover problem and using the dancing links algorithm to solve it. But exact cover is a NP-complete program, so even for the small board with 38 positions, Carl program took two weeks to complete.

Using dynamic programming this problem can be solved in a few milliseconds. I posted how to do it in Perl here. Then, Carl made an accurate and complete analysis of my script.

I keep playing with the script, modifying it to solve polyomino problems. For instance, the current version tries to complete a 12x12 grid using pentominos.

It's interesting to see how the number of parallel combinations that need to be examined explodes. In practice that means that the memory required also explodes. After all, dynamic programming is not a panacea and NP-complete problems are still hard.

A light and rough analysis makes me thing that dynamic programming transforms those 2D problems from the O(e^N) complexity and O(N) memory usage of the depth-first search approach into O(e^{N^1/2}) complexity and O(e^{N^1/2}) memory usage.

This question posted on PerlMonks a couple of years ago, can also be solved using the dynamic programming technique (I did it). What surprises me the most, is that nobody else tried to use that technique; and the thing is that the most brilliant minds in PerlMonks got hooked into that question!

Estimating run times

I have replied to this StackOverflow question.

The user wanted to know how to estimate the runtime of its O(N!) function in base to some meassurements performed for small values of N.

There were several errors in the way he was trying to fit a curve to the experimental data as not using the right curve or having too few data samples.

But the interesting thing about this node is that he was trying to predict the biggest N he would be able to run over the weekend and the response is that he doesn't need to!