Thursday, June 28, 2012

More natural sorting

I got an email this morning from somebody telling me that my Perl module Sort::Key::Natural was buggy because it was not sorting strings with embedded floating point numbers correctly.

Well, it's true, but that's not a bug, it is a feature!

The real issue is that there is not just one way to do natural sorting. On the contrary, there are several possibilities and depending on the problem you want to solve you will have to choose one or another.

Having said that, in my experience, when somebody asks for floating point support, it is often because they are (ab)using natural sorting to sort structured data that could be better sorted in other ways.

In any case, implementing a natural sorting function with float support is quite simple. Here is how to do it using the key generation approach I explained in my previous post about natural sorting.

 

What is a float?

Before getting into the implementation details, we first have to define precisely what a floating point number is going to be in our context.

The usual floating point representation is composed of an optional sign, followed by an integer part, optionally followed by a dot and a fractional part, and optionally followed by an exponent part. For instance, +12.3e3, 12.3E+3, 0.123E5, 12300, +12300.00 are all representations of the same number.

The first thing we are going to do is to get rid of the exponent part. We just do not support floating point numbers with an exponent part.

The problem here is not really the nuisance of handling all these possible representations of a number, but the ambiguity introduced by the letter e, which can be part of both a number and a word.

Our second requirement is that the integer part cannot be empty. For instance, .12 is not going to be recognized as 0.12 but as 12. That's because the dot is a common token separator, as in foo.12.

In summary, the floating point numbers we are going to support are those matching the regular expression /^[+\-]?\d+(\.\d*)?$/.
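For illustration, here is the same acceptance rule sketched in C (my sketch; is_simple_float is a hypothetical helper, not part of the module):

```c
#include <ctype.h>

/* Return 1 if s entirely matches [+-]?\d+(\.\d*)?, i.e. the
   restricted floating point format defined above; 0 otherwise. */
static int is_simple_float(const char *s) {
    if (*s == '+' || *s == '-') s++;            /* optional sign */
    if (!isdigit((unsigned char)*s)) return 0;  /* integer part is required */
    while (isdigit((unsigned char)*s)) s++;
    if (*s == '.')                              /* optional fractional part */
        for (s++; isdigit((unsigned char)*s); s++) ;
    return *s == '\0';
}
```

Note that it accepts "12." (the fractional part may be empty) but rejects ".12" and anything with an exponent, exactly as discussed above.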

 

mkkey_natural

We start from the key generation function for natural sort with natural numbers:

  sub mkkey_natural {
    my $nat = shift;
    my @parts = $nat =~ /\d+|\p{IsAlpha}+/g;
    for (@parts) {
      if (/^\d/) {
        s/^0+//;
        my $len = length;
        my $nines = int ($len / 9);
        my $rest = $len - 9 * $nines;
        $_ = ('9' x $nines) . $rest . $_;
      }
    }
    return join("\0", @parts);
  }

 

The tokenizer

The first thing to do is to modify the tokenizer to accept signed floating point numbers, using the following regular expression:

  $nat =~ /[+\-]?\d+(?:\.\d*)?|\p{IsAlpha}+/g;

 

The fractional part

Now, we go for the fractional part. Handling it is really simple: we only have to remove any trailing zeros from its right side (as, for instance, 0.1, 0.10 and 0.100 should be equivalent) and then append it to the key generated for the integer part.

It results in:

  if (my ($sign, $number, $dec) = /^([+-]?)(\d+)(?:\.(\d*))?$/) {
    $number =~ s/^0+//;
    $dec = '' unless defined $dec;
    $dec =~ s/0+$//;
    my $len = length $number;
    my $nines = int ($len / 9);
    my $rest = $len - 9 * $nines;
    $_ = ('9' x $nines) . $rest . $number . $dec;
  }

Note how two keys generated in this way are compared: the integer parts are compared first by length, then by their digits; finally, the two fractional parts are compared by their digits from left to right.

Negative numbers

Negative numbers should always be ordered before positive ones, so we just prepend a minus sign to the key when the number is negative (in ASCII, the minus character sorts before the digits).

Finally, in order to get negative numbers sorted in ascending order, we invert their digits replacing 0 by 9, 1 by 8, 2 by 7, etc. In Perl, we can use the tr operator for that:

  tr/0-9/9876543210/
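The same mapping can be sketched in C (a hypothetical invert_digits helper, not code from the module):

```c
/* Invert decimal digits in place ('0' <-> '9', '1' <-> '8', ...),
   the C counterpart of Perl's tr/0-9/9876543210/. */
static void invert_digits(char *s) {
    for (; *s; s++)
        if (*s >= '0' && *s <= '9')
            *s = '9' - (*s - '0');
}
```

For instance, the keys for -12 and -3 are "212" and "13"; after inversion and prepending the sign they become "-787" and "-86", which compare in the right order: -12 sorts before -3.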

 

Negative zero

There is still a special case we have to handle: 0 and -0 are the same number.

We add some conditional code to do it.

 

mkkey_natural_with_floats

And that's it...

  sub mkkey_natural_with_floats {
    my $nat = @_ ? shift : $_;
    my @parts = $nat =~ /[+\-]?\d+(?:\.\d*)?|\p{IsAlpha}+/g;
    for (@parts) {
      if (my ($sign, $number, $dec) = /^([+-]?)(\d+)(?:\.(\d*))?$/) {
        $number =~ s/^0+//;
        $dec = '' unless defined $dec;
        $dec =~ s/0+$//;
        my $len = length $number;
        my $nines = int ($len / 9);
        my $rest = $len - 9 * $nines;
        $_ = ('9' x $nines) . $rest . $number . $dec;
        if ($sign eq '-' and $_ ne '0') {
          tr/0123456789/9876543210/;
          $_ = "-$_";
        }
      }
    }
    return join("\0", @parts);
  }

Monday, June 25, 2012

Coprime numbers

Given two random numbers from the interval [1,n], what is the probability of both numbers being coprime as n→∞?

My intuition told me that the probability should approach 0, but it was completely wrong: it seems to converge to a value around 0.608!

Given two integers a and b, they are not coprime if both are divisible by two; or, if at least one of them is not divisible by two, but both are divisible by three; or, if neither of those cases holds, but both are divisible by five; and so on.

Numerically, the probability of both being divisible by two is p2 = (1/2)²; the probability of otherwise both being divisible by three is p3 = (1 − p2)·(1/3)²; the probability of otherwise both being divisible by five is p5 = (1 − p2 − p3)·(1/5)²; and so on.

The probability of the two numbers being coprime is q = 1 - p2 - p3 - p5 - p7 - ...

My maths are too rusty to prove that this series converges. But calculating it over the list of the first prime numbers available from Wikipedia seems to indicate that it does, towards 0.6080...



#!/usr/bin/perl

use strict;
use warnings;

my @p = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53,
         59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113,
         127, 131, 137, 139, 149, 151, 157, 163, 167, 173, 179, 181,
         191, 193, 197, 199, 211, 223, 227, 229, 233, 239, 241, 251,
         257, 263, 269, 271, 277, 281, 283, 293, 307, 311, 313, 317,
         331, 337, 347, 349, 353, 359, 367, 373, 379, 383, 389, 397,
         401, 409, 419, 421, 431, 433, 439, 443, 449, 457, 461, 463,
         467, 479, 487, 491, 499, 503, 509, 521, 523, 541, 547, 557,
         563, 569, 571, 577, 587, 593, 599, 601, 607, 613, 617, 619,
         631, 641, 643, 647, 653, 659, 661, 673, 677, 683, 691, 701,
         709, 719, 727, 733, 739, 743, 751, 757, 761, 769, 773, 787,
         797, 809, 811, 821, 823, 827, 829, 839, 853, 857, 859, 863,
         877, 881, 883, 887, 907, 911, 919, 929, 937, 941, 947, 953,
         967, 971, 977, 983, 991, 997);

my $p = 1;
for (@p) {
    $p -= $p/($_*$_);
    print "$p\n";
}
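Incidentally, the limit has a well-known closed form: the product Π(1 − 1/p²) over all primes equals 1/ζ(2) = 6/π² ≈ 0.6079, by Euler's product formula for the zeta function. A quick cross-check of the same computation, sketched in C:

```c
#include <math.h>

/* Probability that two random integers are coprime, approximated by
   the product over primes p <= limit of (1 - 1/p^2).  Euler's product
   formula identifies the full product with 1/zeta(2) = 6/pi^2. */
static double coprime_probability(int limit) {
    double q = 1.0;
    for (int n = 2; n <= limit; n++) {
        int is_prime = 1;
        for (int d = 2; d * d <= n; d++)    /* trial division */
            if (n % d == 0) { is_prime = 0; break; }
        if (is_prime)
            q *= 1.0 - 1.0 / ((double)n * n);
    }
    return q;
}
```

With limit = 1000 this already agrees with 6/π² to about four decimal places.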

Saturday, June 23, 2012

Room assignment

I am always delighted to see how problems like this are so beautifully solved using Prolog and CLP!

Bit manipulation trick

Given a 32-bit integer number d and a bit position x, find an efficient way to shift left by one the bits of d at positions x to 31 while leaving the bits at positions 0 to x-1 untouched.

For instance, given d = 0b01001110 and x = 3, the return value should be 0b10010110.

Think about it before looking at the response here.
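(Spoiler warning.) In case you want to check your answer afterwards, here is a sketch of the usual masking trick, assuming, as in the example above, that the x low bits are the ones to preserve:

```c
#include <stdint.h>

/* Shift the bits of d at positions x and above one place to the
   left, leaving the low bits 0..x-1 untouched. */
static uint32_t shift_high_bits(uint32_t d, unsigned x) {
    uint32_t low = ((uint32_t)1 << x) - 1;   /* mask for bits 0..x-1 */
    return ((d & ~low) << 1) | (d & low);
}
```

With d = 0b01001110 and x = 3, the low mask is 0b111, the high part 0b01001000 shifts to 0b10010000, and OR-ing the untouched low bits back in gives 0b10010110, as in the example.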

Cross polytopes

How to enumerate the surface points of a cross polytope lying on a multidimensional grid:

#!/usr/bin/perl

use strict;
use warnings;

sub enumerate {
    my ($d, $r) = @_;

    if ($d == 1) {
        return ($r ? ([-$r], [$r]) : [0])
    }
    else {
        my @r;
        for my $i (0..$r) {
            for my $s (enumerate($d - 1, $r - $i)) {
                for my $j ($i ? (-$i, $i) : 0) {
                    push @r, [@$s, $j]
                }
            }
        }
        return @r;
    }
}

@ARGV == 2 or die "Usage:\n  $0 dimension radius\n\n";
my ($d, $r) = @ARGV;

my @r = enumerate($d, $r);
print "[", join(',', @$_), "]\n" for @r;

This comes from this StackOverflow question. Initially, I overlooked the paragraph stating that the Manhattan distance was going to be used, so I solved it for the hypersphere defined by the Euclidean distance.

BTW, the cross polytope is the Manhattan-space equivalent of the Euclidean n-sphere.
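As a cross-check of that equivalence, here is a C sketch (mine, counting rather than enumerating) of the number of grid points on the surface, that is, integer points with |x1| + ... + |xd| = r:

```c
/* Count the integer grid points on the surface of the d-dimensional
   cross polytope of radius r, i.e. points with |x1| + ... + |xd| = r.
   Mirrors the recursion of the Perl enumerator. */
static long surface_points(int d, long r) {
    if (d == 1)
        return r ? 2 : 1;                  /* {-r, r}, or just {0} */
    long n = 0;
    for (long i = 0; i <= r; i++)          /* |last coordinate| = i */
        n += (i ? 2 : 1) * surface_points(d - 1, r - i);
    return n;
}
```

For instance, in two dimensions the surface of radius 3 is the diamond |x| + |y| = 3 with 12 grid points, and in three dimensions the surface of radius 1 is just the 6 vertices of the octahedron.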

This script enumerates the grid points in the n-ball (not just the surface of the n-sphere):


#!/usr/bin/perl

use strict;
use warnings;
use POSIX 'floor';

sub enumerate {
    my ($d, $r2) = @_;
    my $l = floor sqrt $r2;
    if ($d == 1) {
        return map [$_], -$l..$l;
    }
    else {
        my @r;
        for my $i (-$l..$l) {
            my @s = enumerate($d - 1, $r2 - $i * $i);
            push @$_, $i for @s;
            push @r, @s;
        }
        return @r;
    }
}

@ARGV == 2 or die "Usage:\n  $0 dimension radius\n\n";
my ($d, $r) = @ARGV;

my @r = enumerate($d, $r * $r);
print "[", join(',', @$_), "]\n" for @r;


Well, the most interesting thing about all that is that I had never noticed before that the cross-polytope is a generalization of the square into higher dimensions. Until now, my mind had always followed the sequence segment, square, cube, hypercube.

But, segment, square, octahedron, cross-polytope is also possible!!!


Friday, June 22, 2012

Hexagonal puzzles, dynamic programming and polyominoes

I don't know why, but it seems that dynamic programming is not a popular technique among programmers.

For example, a couple of months ago, Carl Mäsak, an active member of the Perl and especially the Perl 6 community, blogged about a puzzle played on a hexagonal grid and how to count the number of ways it could be filled with pieces of length two.

His approach was to consider it an exact cover problem and use the dancing links algorithm to solve it. But exact cover is an NP-complete problem, so even for the small board with 38 positions, Carl's program took two weeks to complete.

Using dynamic programming this problem can be solved in a few milliseconds. I posted how to do it in Perl here. Then, Carl made an accurate and complete analysis of my script.

I keep playing with the script, modifying it to solve polyomino problems. For instance, the current version tries to complete a 12x12 grid using pentominos.

It's interesting to see how the number of parallel combinations that need to be examined explodes. In practice that means that the memory required also explodes. After all, dynamic programming is not a panacea and NP-complete problems are still hard.

A light and rough analysis makes me think that dynamic programming transforms those 2D problems from the O(e^N) time complexity and O(N) memory usage of the depth-first search approach into O(e^√N) time complexity and O(e^√N) memory usage.


This question, posted on PerlMonks a couple of years ago, can also be solved using the dynamic programming technique (I did it). What surprises me most is that nobody else tried that technique, even though the most brilliant minds on PerlMonks got hooked on that question!

Estimating run times

I have replied to this StackOverflow question.

The user wanted to know how to estimate the runtime of his O(N!) function based on some measurements performed for small values of N.

There were several errors in the way he was trying to fit a curve to the experimental data, such as not using the right kind of curve and having too few data samples.

But the interesting thing about this question is that he was trying to predict the biggest N he would be able to run over the weekend, and the answer is that he doesn't need to!


Thursday, June 21, 2012

Natural sorting

The other day, somebody asked in StackOverflow how to do "natural" sort of a list of strings containing both words and numbers.
For instance, given the following list of items:
  foo2 foo1 foo23 foo11 bar11 fo4
When ordered by the natural order of its elements it becomes:
  bar11 fo4 foo1 foo2 foo11 foo23
Roughly, the strings are tokenized into words and numbers, and then words are compared as words and numbers as integers.
When a word is compared with an integer (e.g., when comparing "foo12" and "45bar"), a predefined order is used. The usual custom is to place numbers before words, but it could be the other way around if that fits your problem better.
Another point to consider is how to handle non-alphanumeric characters. For instance, how do "foo12", "foo-12", and "foo.12" compare?
All the responses to the StackOverflow question concentrated on providing a custom comparison function; likewise, all the pages found when searching for "natural sort" on Google seem to focus on the same approach.
But, as the author of the Sort::Key family of Perl modules, I had a different background, and so my initial approach to the problem was quite different. Let me explain briefly how sorting in Perl works...

 

Sorting in Perl (a digression)

In Perl, sorting is performed by a highly optimized mergesort function written in C that handles numeric and alphanumeric comparisons natively and that also supports using a Perl callback to compare the values. The problem is that using a custom Perl callback slows the function down to a crawl, because calling and running Perl code is several orders of magnitude slower than running a simple and optimized C function like strcmp.
A simple workaround for this performance issue is to generate, for every element to be sorted, a key such that sorting the keys alphanumerically yields the same order as applying the custom comparison function to the original strings.
In order to generate the keys we still have to call Perl code, but it is called O(N) times, whereas using a custom comparison function inside the mergesort requires calling Perl code O(N log N) times.
For an oversimplified example, given the integers (1, 0, 201, 72), we can generate the respective sorting keys ("001", "000", "201", "072"), that when sorted alphanumerically become ("000", "001", "072", "201"). Then we can reverse the mapping and obtain the list of integers sorted as (0, 1, 72, 201).
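That toy transformation can be sketched in C (the fixed width of 3 is just for this example):

```c
#include <stdio.h>

/* Zero-pad an integer into buf so that strcmp on the padded keys
   agrees with the numeric order of the original values. */
static void int_key(char *buf, size_t size, int n) {
    snprintf(buf, size, "%03d", n);
}
```

Once the keys are built, a plain lexicographic sort (strcmp) puts them, and therefore the original integers, in numeric order.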
And that's roughly what my Perl modules do: they generate keys that can be sorted fast using a finite set of predefined C functions.
There is a lot of folklore around this Perl technique and its variations, the most popular being the Schwartzian transform, whose learning is like an initiation rite into the waters of intermediate Perl.

 

Generating sorting keys

Back to the StackOverflow question: after reading it, my thinking was that, right, you can write a fast natural comparison function in C (say, natcmp) and pass it to your favorite mergesort or quicksort implementation, but in any case this natcmp function is not going to be as fast as a raw strcmp. If we can do some preprocessing and then do less work inside the quicksort (let's assume we use quicksort), we can probably get a faster natural sort.
So, let's start by designing a way to generate a natural sorting key that can be compared lexicographically.
But first, as I explained above, there are several possible natural orderings, so we have to settle on a specific one.
For now, let's assume we want pairs of numbers compared as numbers and everything else compared in lexicographic order. For instance, we want "foo23" tokenized as ("foo", 23), "foo-23" as ("foo-", 23) and "foo..23" as ("foo..", 23). In the end, this just reduces to breaking the strings into substrings containing only digits and substrings containing everything else.
The simplest way to generate a sorting key for that order is probably to pad every numeric substring with zeros on the left, up to the length of the longest numeric substring. For instance, given ("foo1", "foo23", "bar23456"), the longest numeric substring is "23456", whose length is 5, so we can pad every other numeric substring to this length, resulting in ("foo00001", "foo00023", "bar23456"). It is easy to see that comparing these keys lexicographically results in the desired order. The drawback of this key generation is that memory usage can be quite high.
A better way to generate the keys is to prefix every numeric substring with its length. For instance, ("1", "23", "123") become ("11", "223", "3123"). Again, it is easy to see that these also sort in the desired order... most of the time. It breaks in two cases. First, when a numeric substring begins with one or more zeros: ("0001", "23") becomes ("40001", "223"), which is not sorted correctly, as numerically "0001" is less than "23". But this problem is easy to work around: just check for that condition and drop any zeros on the left of the substring. That even works for the string "0", whose key becomes just "0".
The other case where this key generation fails is when it finds numeric substrings longer than 9 digits. For instance, given ("1234567890", "123"), the keys generated are ("101234567890", "3123"), resulting in the wrong order.
Overcoming this problem requires a little trick: when the length of the numeric substring is longer than 8 digits, we prefix the substring with the digit "9" repeated int(length/9) times, followed by the digit representing length%9. For instance, if we have a substring with 34 digits, then, as int(34/9) is 3 and 34%9 is 7, we prefix it with "9997".
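In C, generating that prefix could look like this (a sketch, not the actual implementation from the module):

```c
#include <string.h>

/* Build the length prefix for a numeric substring of len digits
   (leading zeros already stripped): int(len/9) '9' characters
   followed by the digit for len%9.  Lexicographic order on these
   prefixes matches numeric order on len. */
static void length_prefix(char *buf, size_t len) {
    size_t nines = len / 9;
    memset(buf, '9', nines);
    buf[nines] = (char)('0' + len % 9);
    buf[nines + 1] = '\0';
}
```

For lengths up to 8 this degenerates into the plain single-digit prefix, so the same rule can be applied uniformly.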
In the worst case, the length of a key generated by this method is 3/2 of that of the original string.

 

Implementation technicalities

The most difficult decision I had to take while implementing the key generation stuff was how to allocate memory for the keys. I didn't want to waste memory by allocating significantly more than strictly required, but I didn't want to call malloc once for every single key either, as that would degrade performance.
I considered preallocating memory for blocks of keys, estimating approximately the memory required, or some maximum, and several other variations, but in all cases I was able to come up with some edge case where it would perform badly.
In the end I decided to go for a two pass approach. In the first pass, the key lengths are computed and then the total required memory is allocated. In the second pass, the keys are calculated and stored in the already allocated memory.
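Schematically, the two pass idea looks like this (a sketch where a verbatim string copy stands in for the real key generation):

```c
#include <stdlib.h>
#include <string.h>

/* Two-pass key building: measure every key length first, allocate the
   whole buffer once, then write the keys into it.  Here the "key" is
   just a copy of the string; the real generator would apply the
   natural-sort transformation instead. */
static char **build_keys(const char **strs, size_t n) {
    size_t total = 0;
    for (size_t i = 0; i < n; i++)          /* pass 1: total size */
        total += strlen(strs[i]) + 1;
    char **keys = malloc(n * sizeof *keys);
    char *buf = malloc(total);
    for (size_t i = 0; i < n; i++) {        /* pass 2: fill in place */
        size_t len = strlen(strs[i]) + 1;
        keys[i] = memcpy(buf, strs[i], len);
        buf += len;
    }
    return keys;
}
```

The point is that malloc is called a constant number of times, regardless of the number of keys.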
The two pass approach may seem more expensive, as keys have to be generated twice, but estimating the key lengths would require some processing anyway, and on the second pass, as we know that enough memory is available, we don't need to bother checking for overflows.
Anyway, a possible improvement would be to perform the two pass generation in blocks. That would require more calls to malloc (one per block) but, on the other hand, if the size of the blocks is carefully adjusted so that the data fits in the L2 cache, the key calculation would probably run quite a bit faster.
Another interesting thing is how to do the reverse mapping from the ordered array of keys to the original strings. My solution has been to store a pointer to the corresponding original string just before every key. That way, the reverse mapping is an O(N) task.

 

A natcmp function

Ok, now we have our sorting function and we think it should be faster than the common approach, but how much faster? We need to run some benchmarks... but wait, where is the highly optimized natcmp function we are going to compare against? Nowhere; we have to create our own...
We start with the simple, straightforward approach: loop over the two strings s1 and s2 in parallel; when either of the two characters being compared is not a digit, perform a normal comparison: if they are equal, advance to the following character; if they are not, return -1 or 1 appropriately.
If the two characters are digits, we have to compare the numbers numerically: first, discard any leading zeros, then check the lengths of the numeric substrings. If they have different lengths, we can conclude that the shorter number is to be ordered before the other. If the two numeric substrings have the same length, we compare them character by character.
Here is a C implementation:
  int
  natcmp(const char *s1, const char *s2) {
    char c1, c2;
    while (1) {
      c1 = *(s1++);
      c2 = *(s2++);
      if ((c1 <= '9') && (c2 <= '9') && (c1 >= '0') && (c2 >= '0')) {
        /* so, both c1 and c2 are digits */
        const char *e1, *e2;
        /* skip leading zeros */
        while (c1 == '0') c1 = *(s1++);
        while (c2 == '0') c2 = *(s2++);
        /* find the end of the numeric substrings */
        for (e1 = s1; (c1 >= '0') && (c1 <= '9'); c1 = *(e1++));
        for (e2 = s2; (c2 >= '0') && (c2 <= '9'); c2 = *(e2++));
        /* if the lengths are different we can conclude the result */
        if (e1 - s1 < e2 - s2) return -1;
        if (e1 - s1 > e2 - s2) return 1;
        /* otherwise we just compare them as strings */
        s1--; s2--; e1--;
        while (s1 < e1) {
          c1 = *(s1++);
          c2 = *(s2++);
          if (c1 < c2) return -1;
          if (c1 > c2) return 1;
        }
      }
      /* at least one of the characters is not a digit */
      else if (c1 < c2) return -1;
      else if (c1 > c2) return 1;
      else if (c1 == '\0') return 0;
    }
  } 

 

A better natcmp function

We have a working natcmp function; the problem is that at every position it has to check whether the characters are both digits or not. In relative terms, this is a lot of work.
Let's take an optimistic approach: ignore that we may have found a number and just look for the point where the two strings diverge. When we reach that point, we know that both strings are identical to its left. There may be digits to the left of the divergence point or not; we don't know.
Well, at this point you have to realize that what is at the left doesn't really matter (most of the time). For instance, comparing ("foo123", "foo1245"), when we reach the divergence point we have ("3", "45") left; we can see that they are numbers, compare them numerically and conclude that the first goes before the second.
Err, most of the time, right. The problem, again, is leading zeros. If either of the characters at the divergence point is a zero, we have to look back and see if it belongs to a number with leading zeros. If it does, we have to discard them. For instance, given ("foo0112", "foo0002"), at the divergence point we have ("112", "002"); we see that one of the characters is a zero, so we look back, see that all the digits to the left are zeros and discard them. After that, we have the strings ("112", "2"), which when compared as numbers result in the first going after the second.
Here is the C implementation:
  int
  natcmp_alt(const char *s1, const char *s2) {
    const char *start1 = s1;
    const char *start2 = s2;
    while (1) {
      char c1 = *(s1++);
      char c2 = *(s2++);
      if (c1 == c2) {
        if (c1 == '\0') return 0;
      }
      else {
        /* we are at the divergence point */
        if ((c1 <= '9') && (c2 <= '9') && (c1 >= '0') && (c2 >= '0')) {
          /* both characters at the divergence point are digits */
          /* if any of them is a zero we have to perform the leading zeros check */
          if (c1 == '0') {
            const char *p = s1 - 2;
            while ((p >= start1) && (*p == '0')) p--;
            if ((p < start1) || (*p < '0') || (*p > '9'))
              while ((c1 = *(s1++)) == '0');
            if ((c1 < '0') || (c1 > '9'))
              return -1;
          }
          else if (c2 == '0') {
            const char *p = s2 - 2;
            while ((p >= start2) && (*p == '0')) p--;
            if ((p < start2) || (*p < '0') || (*p > '9'))
              while ((c2 = *(s2++)) == '0');
            if ((c2 < '0') || (c2 > '9'))
              return 1;
          }
          /* numerical comparison: first by length then by the diverging digit */
          while (1) {
            char d1 = *(s1++);
            char d2 = *(s2++);
            if ((d1 < '0') || (d1 > '9'))
              return ((c1 > c2) && ((d2 < '0') || (d2 > '9')) ? 1 : -1);
            if ((d2 < '0') || (d2 > '9'))
              return 1;
          }
        }
        /* they are not digits */
        return (c1 < c2 ? -1 : 1);
      }
    }
  }
 

 

The code

The full code discussed in this post is available on GitHub: https://github.com/salva/c-natsort.
It includes several variations of the comparison functions. The version using the key pre-generation algorithm also supports sorting ignoring non alphanumeric characters.
Some variations on the strcmp function used in the quicksort are also included. In theory, GCC provides a built-in for strcmp that should be blazingly fast, but my benchmarks show that, depending on the data and the optimization flags, my variations are sometimes faster. Check for yourself and see.

 

Benchmarking

So let's see how the two approaches to natural sorting described above perform.
The script bm.pl is a driver that runs the natsort program with the different flags selecting the sorting algorithm and implementation, times the runs, and generates a nice table with the results.
After benchmarking the different algorithms with different data sets, these are my findings:
  • Using the key pre-generation approach is usually a little faster, but not by much, typically between 0% and 15%.
  • The alternative natcmp function is slightly faster than the simple version, by around 10%.

 

Conclusion

For natural sorting of strings in C, there is not much point in using the key pre-generation approach, as it only gives a small performance improvement while using a lot more memory.