jueves, 28 de junio de 2012

More natural sorting

I got an email this morning from somebody telling my that my Perl module Sort::Key::Natural was buggy because it was not sorting strings with embedded floating point numbers correctly.

Well, it's true, but that's not a bug, it is a feature!

The real issue is that there is not just one way to do natural sorting. On the contrary, there are several possibilities and depending on the problem you want to solve you will have to choose one or another.

Having said that, in my experience, when somebody asks for floating point number support, it's common that he would be just (ab)using natural sorting for sorting structured data that can be better sorted in other ways.

In any case, implementing a natural sorting function with float support is quite simple, here is how to do it using the key generation approach I explained on my previous post about natural sorting.

 

What is a float?

Before getting into the implementation details, first we have to define precisely what a floating point number ios going to be in our context.

The usual floating point representation is composed of an optional sign, followed by an integer part optionally followed by a dot and a fractional part and optionally followed by an exponent part. For instance, +12.3e2, 12.3E+2, 0.123E5, 12300, +12300.00 are all representations of the same number.

The first thing we are going to do is to get rid of the exponent part. We just do not support floating point numbers with an exponent part.

The problem here is not really the nuisance introduced by handling all this possible representations of a number but the ambiguity introduced by the letter e, that can be both part of a number and of a word.

Our second requirement is that the integer part can not be empty. For instance, .12 is not going to be recognized as 0.12 but as 12. That's because the dot is a common token separator as in foo.12.

In sumary, the floating point numbers we are going to support are those that match the regular expression /^[+\-]?\d+(\.\d*)$/

 

mkkey_natural

We start from the key generation function for natural sort with natural numbers:

  sub mkkey_natural {
    my $nat = shift;
    my @parts = $nat =~ /\d+|\p{IsAlpha}+/g;
    for (@parts) {
      if (/^\d/) {
        s/^0+//;
        my $len = length;
        my $nines = int ($len / 9);
        my $rest = $len - 9 * $nines;
        $_ = ('9' x $nines) . $rest . $_;
      }
    }
    return join("\0", @parts);
  }

 

The tokenizer

The first thing we should do is to modify the tokenizer to accept negative floating point numbers, using the following regular expression:

  $nat =~ /[+\-]?\d+(?:\.\d*)?|\p{IsAlpha}+/g;

 

The fractional part

Now, we go for the fractional part. Handling it is really simple, we only have to remove any leading zeros from its right side as, for instance, 0.1, 0.10 and 0.100 should be equivalent; then, append it to the key generated for the integer part.

It results on:

  if (my ($sign, $number, $dec) = /^([+-]?)(\d+)(?:\.(\d*))?$/) {
    $number =~ s/^0+//;
    $dec = '' unless defined $dec;
    $dec =~ s/0+$//;
    my $len = length $number;
    my $nines = int ($len / 9);
    my $rest = $len - 9 * $nines;
    $_ = ('9' x $nines) . $rest . $number . $dec;

  }

Note how two keys generated in this way are compared: the integer part is first compared by length, then by its digits and then the two fractional parts are compared by its digits from left to right.

Negative numbers

Negative numbers should always be ordered before positive integers, so we just prepend a minus sign to the key in case of a negative number (in ASCII, the minus character is placed before the digits).

Finally, in order to get negative numbers sorted in ascending order, we invert their digits replacing 0 by 9, 1 by 8, 2 by 7, etc. In Perl, we can use the tr operator for that:

  tr/0-9/9876543210/

 

Negative zero

There is still an special case we have to handle: 0 and -0 are the same number.

We add some conditional code to do it.

 

mkkey_natural_with_floats

And that's it...

  sub mkkey_natural_with_floats {
    my $nat = @_ ? shift : $_;
    my @parts = $nat =~ /[+\-]?\d+(?:\.\d*)?|\p{IsAlpha}+/g;
    for (@parts) {
      if (my ($sign, $number, $dec) = /^([+-]?)(\d+)(?:\.(\d*))?$/) {
        $number =~ s/^0+//;
        $dec = '' unless defined $dec;
        $dec =~ s/0+$//;
        my $len = length $number;
        my $nines = int ($len / 9);
        my $rest = $len - 9 * $nines;
        $_ = ('9' x $nines) . $rest . $number . $dec;
        if ($sign eq '-' and $_ ne '0') {
          tr/0123456789/9876543210/;
          $_ = "-$_";
        }
      }
    }
    return join("\0", @parts);

  }

No hay comentarios:

Publicar un comentario