The other day, somebody asked on StackOverflow how to do a "natural" sort of a list of strings containing both words and numbers.
For instance, given the following list of items:
foo2 foo1 foo23 foo11 bar11 fo4
When ordered by the natural order of its elements it becomes:
bar11 fo4 foo1 foo2 foo11 foo23
Roughly, the strings are tokenized into words and numbers, and then words are compared as words and numbers as integers.
When a word is compared with an integer (e.g. when comparing "foo12" and "45bar") a predefined order is used. The usual custom is to place numbers before words, but it could be the other way if it fits your problem better.
Another point that should be considered is how to handle non-alphanumeric characters. For instance, how do "foo12", "foo-12", and "foo.12" compare?
All the responses to the StackOverflow question concentrated on providing a custom comparison function; likewise, all the pages found when searching for "natural sort" on Google seem to focus on the same approach.
But, as the author of the Sort::Key family of Perl modules, I had a different background, and so my initial approach to the problem was quite different. Let me briefly explain how sorting in Perl works...
Sorting in Perl (a digression)
In Perl, sorting is performed by a highly optimized mergesort function written in C that handles numeric and alphanumeric comparisons natively and that also supports using a Perl callback to compare the values. The problem is that using a custom Perl callback slows the function down to death, because calling and running Perl code is several orders of magnitude slower than running a simple, optimized C function such as strcmp.
A simple workaround for this performance issue is to generate, for every element to be sorted, a key that, when sorted alphanumerically, yields the same order as the custom comparison function applied to the original strings.
In order to generate the keys we still have to call Perl code, but it is called on the order of O(N) times, whereas using a custom comparison function inside the mergesort requires calling Perl code on the order of O(N log N) times.
For an oversimplified example, given the integers (1, 0, 201, 72), we can generate the respective sorting keys ("001", "000", "201", "072"), which when sorted alphanumerically become ("000", "001", "072", "201"). Then we can reverse the mapping and obtain the list of integers sorted as (0, 1, 72, 201).
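For illustration, here is a minimal C sketch of that key trick applied to the integer example above (the names sort_via_keys, KEY_LEN and the struct are mine, not from the modules): build zero-padded string keys, sort them with plain strcmp, then read the values back in key order.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define KEY_LEN 3            /* wide enough for this toy example */

struct keyed { char key[KEY_LEN + 1]; int value; };

static int cmp_keyed(const void *a, const void *b) {
    return strcmp(((const struct keyed *)a)->key,
                  ((const struct keyed *)b)->key);
}

/* Sorts n integers from in[] into out[] through zero-padded string keys. */
void sort_via_keys(const int *in, int *out, size_t n) {
    struct keyed *k = malloc(n * sizeof *k);
    for (size_t i = 0; i < n; i++) {
        snprintf(k[i].key, sizeof k[i].key, "%0*d", KEY_LEN, in[i]);
        k[i].value = in[i];
    }
    qsort(k, n, sizeof *k, cmp_keyed);  /* only strcmp runs per comparison */
    for (size_t i = 0; i < n; i++)
        out[i] = k[i].value;            /* reverse the mapping */
    free(k);
}
```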
And that's roughly what my Perl modules do: they generate keys that can be sorted fast using a finite number of predefined C functions. There is a lot of folklore around this Perl technique and its different variations, the most popular being the Schwartzian transform, whose learning is like an initiation rite into the waters of intermediate Perl.
Generating sorting keys
Back to the StackOverflow question: after reading it, my thinking was that, right, you can write a fast natural comparison function in C (say, natcmp) and pass it to your favorite mergesort or quicksort implementation, but in any case this natcmp function is not going to be as fast as a raw strcmp. If we can do some preprocessing and then do less work inside the quicksort (let's assume we use quicksort), we could probably get a faster natural sort.
So, let's start by designing a way to generate a natural sorting key that can be compared lexicographically.
But first, as I explained above, there are several possible natural orderings, so we have to settle on a specific one.
For now, let's assume we want numbers compared as numbers and everything else compared in lexicographic order. For instance, we want "foo23" tokenized as ("foo", 23), "foo-23" as ("foo-", 23) and "foo..23" as ("foo..", 23). In the end, this just reduces to breaking the strings into substrings containing just digits and substrings containing everything else.
The simplest way to generate a sorting key for that order is probably to pad every numeric substring with zeros on its left, up to the length of the longest numeric substring. For instance, given ("foo1", "foo23", "bar23456"), the longest numeric substring is "23456", its length being 5, so we can pad every other numeric substring to this length, resulting in ("foo00001", "foo00023", "bar23456"). It is easy to see that when these keys are compared lexicographically, the desired order results. The drawback of this key generation is that the memory usage can be quite high.
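A sketch of this padding scheme in C (the function name and interface are my own; width would be the length of the longest numeric run, found in a previous scan over all the strings):

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Writes into dst a copy of src where every numeric run is left-padded
   with zeros to `width` characters.  dst must be large enough. */
void make_padded_key(const char *src, char *dst, size_t width) {
    char *d = dst;
    while (*src) {
        if (isdigit((unsigned char)*src)) {
            size_t len = 0;
            while (isdigit((unsigned char)src[len]))
                len++;                       /* measure the numeric run */
            for (size_t i = len; i < width; i++)
                *d++ = '0';                  /* pad on the left */
            memcpy(d, src, len);
            d += len;
            src += len;
        } else {
            *d++ = *src++;                   /* non-digits copied verbatim */
        }
    }
    *d = '\0';
}
```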
A better way to generate the keys is to prefix every numeric substring with its length. For instance, ("1", "23", "123") become ("11", "223", "3123"). Again, it is easy to see that these also sort in the desired order... most of the time. It breaks in two cases. First, when some numeric substring begins with one or more zeros: e.g. ("0001", "23") becomes ("40001", "223"), which is not sorted correctly, as numerically "0001" is less than "23". But this problem is easy to work around: just check for that condition and drop any zeros at the left of the substring. That even works for the string "0", which is converted to "0".
The other case where this key generation fails is when it finds numeric substrings longer than 9 digits. For instance, given ("1234567890", "123"), the keys generated are ("101234567890", "3123"), resulting in the wrong order. Overcoming this problem requires a little trick: when the length of the numeric substring is longer than 8 digits, we prefix the substring with the digit "9" repeated int(length/9) times, followed by the digit representing length%9. For instance, if we have a string with 34 digits, then as int(34/9) is 3 and 34%9 is 7, we prefix it with "9997".
In the worst case, the length of the keys generated by this method is 3/2 that of the original string.
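Putting together the length prefix, the leading-zeros drop and the repeated-9s trick, the key generation could be sketched in C like this (the function name and the caller-provided destination buffer are my own assumptions; note that the len/9 loop emits no "9"s at all for runs of 8 digits or fewer, so it covers both the short and the long case):

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Writes the natural sort key for src into dst (which must be large
   enough for the key plus terminator) and returns the key's length. */
size_t make_natkey(const char *src, char *dst) {
    char *d = dst;
    while (*src) {
        if (isdigit((unsigned char)*src)) {
            size_t len = 0;
            while (*src == '0')
                src++;                       /* drop leading zeros */
            while (isdigit((unsigned char)src[len]))
                len++;                       /* measure the run */
            for (size_t i = 0; i < len / 9; i++)
                *d++ = '9';                  /* '9' repeated len/9 times */
            *d++ = (char)('0' + len % 9);    /* then the digit for len%9 */
            memcpy(d, src, len);             /* then the digits themselves */
            d += len;
            src += len;
        } else {
            *d++ = *src++;                   /* non-digits copied verbatim */
        }
    }
    *d = '\0';
    return (size_t)(d - dst);
}
```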
Implementation technicalities
The most difficult decision I had to make while implementing the key generation was how to allocate memory for the keys. I didn't want to waste memory by allocating significantly more than strictly required, but I didn't want to call malloc once for every single key either, as that would degrade performance.
I considered preallocating memory for blocks of keys, estimating approximately the memory required or some maximum, and several other variations, but in all cases I was able to come up with some edge case where it would perform badly.
In the end I decided to go for a two-pass approach. In the first pass, the key lengths are computed and then the total required memory is allocated. In the second pass, the keys are calculated and stored in the already allocated memory.
The two-pass approach may seem more expensive, as keys have to be generated twice, but estimating the key lengths would require some processing anyway, and in the second pass, as we now know that enough memory is available, we don't need to bother checking for overflows.
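The two-pass pattern can be sketched like this, with the real key generator abstracted behind placeholder key_len/key_write helpers (all names here are mine; the trivial identity "key" just shows the memory discipline):

```c
#include <stdlib.h>
#include <string.h>

static size_t key_len(const char *s) { return strlen(s) + 1; }

static char *key_write(const char *s, char *out) {
    size_t n = strlen(s) + 1;
    memcpy(out, s, n);
    return out + n;                  /* next free position */
}

/* Returns one block holding all the keys back to back; keys[i] points
   at the i-th key inside it.  The caller frees the returned block. */
char *build_keys(const char **strings, size_t n, char **keys) {
    size_t total = 0;
    for (size_t i = 0; i < n; i++)   /* pass 1: sizes only */
        total += key_len(strings[i]);
    char *block = malloc(total), *p = block;
    for (size_t i = 0; i < n; i++) { /* pass 2: no overflow checks needed */
        keys[i] = p;
        p = key_write(strings[i], p);
    }
    return block;
}
```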
Anyway, a possible improvement would be to perform the two-pass generation in blocks. That would require more calls to malloc (one per block), but on the other hand, if the size of the blocks is carefully adjusted so that the data fits in the L2 cache, the key calculation would probably run quite a bit faster.
Another interesting question is how to do the reverse mapping from the ordered array of keys to the original strings. My solution has been to store a pointer to the corresponding original string just before every key. That way the reverse mapping is an O(N) task.
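A possible sketch of this layout (helper names are mine): each key is written right after a pointer to its original string, so stepping back sizeof(char *) from a sorted key recovers the source string.

```c
#include <stdlib.h>
#include <string.h>

/* Allocates a buffer laid out as [pointer to orig][key bytes] and
   returns a pointer to the key part.  memcpy avoids any alignment
   concerns when reading the pointer back. */
char *make_keyed_entry(const char *orig, const char *key) {
    size_t klen = strlen(key) + 1;
    char *buf = malloc(sizeof(char *) + klen);
    memcpy(buf, &orig, sizeof(char *));        /* pointer first ...  */
    memcpy(buf + sizeof(char *), key, klen);   /* ... then the key   */
    return buf + sizeof(char *);
}

/* Recovers the original string from a key built by make_keyed_entry. */
const char *orig_of(const char *key) {
    const char *orig;
    memcpy(&orig, key - sizeof(char *), sizeof(char *));
    return orig;
}
```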
A natcmp function
Ok, now we have our sorting function, and we think it should be faster than the common approach, but how much faster? We need to run some benchmarks... but wait, where is the highly optimized natcmp function we are going to compare against? Nowhere; we have to create our own...
We start with the simple, straightforward approach: loop over the two strings s1 and s2 in parallel; when either of the two characters being compared is not a digit, perform a normal comparison; if they are equal we advance to the following character, and if they are not equal we return -1 or 1 appropriately.
If the two characters are digits, we have to compare them numerically: first discard any leading zeros, then check the length of the numeric substrings. If they are of different lengths, we can conclude that the string with the shorter length is ordered before the other. If the two numeric substrings have the same length, we compare them character by character.
Here is a C implementation:
int natcmp(const char *s1, const char *s2) {
    char c1, c2;
    while (1) {
        c1 = *(s1++);
        c2 = *(s2++);
        if ((c1 <= '9') && (c2 <= '9') && (c1 >= '0') && (c2 >= '0')) {
            /* so, both c1 and c2 are digits */
            const char *e1, *e2;
            /* skip leading zeros */
            while (c1 == '0') c1 = *(s1++);
            while (c2 == '0') c2 = *(s2++);
            /* find the end of the numeric substrings */
            for (e1 = s1; (c1 >= '0') && (c1 <= '9'); c1 = *(e1++));
            for (e2 = s2; (c2 >= '0') && (c2 <= '9'); c2 = *(e2++));
            /* if the lengths are different we can conclude the result */
            if (e1 - s1 < e2 - s2) return -1;
            if (e1 - s1 > e2 - s2) return 1;
            /* otherwise we just compare them as strings */
            s1--; s2--; e1--;
            while (s1 < e1) {
                c1 = *(s1++);
                c2 = *(s2++);
                if (c1 < c2) return -1;
                if (c1 > c2) return 1;
            }
        }
        /* at least one of the characters is not a digit */
        else if (c1 < c2) return -1;
        else if (c1 > c2) return 1;
        else if (c1 == '\0') return 0;
    }
}
A better natcmp function
We have a working natcmp function; the problem is that at every position it has to check whether the characters are both digits or not. In relative terms, this is a lot of work.
Let's take an optimistic approach: ignore that we may have found a number and just look for the point where the two strings diverge. When we reach this point, we know that both strings are identical to the left of it. There may be digits to the left of the divergence point or not; we don't know.
Well, at this point you have to realize that what is to the left doesn't really matter (most of the time). For instance, comparing ("foo123", "foo1245"), when we reach the divergence point we have ("3", "45") left; we can see that they are numbers, compare them numerically and conclude that the first goes before the second.
Err, most of the time, right. The problem, again, is the leading zeros. If either of the characters at the divergence point is a zero, we have to look back and see if this is a number with leading zeros. If it is, we have to discard them. For instance, given ("foo0112", "foo0002"), at the divergence point we have ("112", "002"); we see that one of the characters is a zero, so we look back and see that all the digits to the left are zeros, so we have to discard them. After that we have the strings ("112", "2"), which when compared as numbers result in the first going after the second.
Here is the C implementation:
int natcmp_alt(const char *s1, const char *s2) {
    const char *start1 = s1;
    const char *start2 = s2;
    while (1) {
        char c1 = *(s1++);
        char c2 = *(s2++);
        if (c1 == c2) {
            if (c1 == '\0')
                return 0;
        }
        else {
            /* we are at the divergence point */
            if ((c1 <= '9') && (c2 <= '9') && (c1 >= '0') && (c2 >= '0')) {
                /* both characters at the divergence point are digits */
                int r;
                /* if any of them is a zero we have to perform the
                   leading zeros check */
                if (c1 == '0') {
                    const char *p = s1 - 2;
                    while ((p >= start1) && (*p == '0')) p--;
                    if ((p < start1) || (*p < '0') || (*p > '9'))
                        while ((c1 = *(s1++)) == '0');
                    if ((c1 < '0') || (c1 > '9'))
                        return -1;
                }
                else if (c2 == '0') {
                    const char *p = s2 - 2;
                    while ((p >= start2) && (*p == '0')) p--;
                    if ((p < start2) || (*p < '0') || (*p > '9'))
                        while ((c2 = *(s2++)) == '0');
                    if ((c2 < '0') || (c2 > '9'))
                        return 1;
                }
                /* numerical comparison: first by length, then by the
                   first differing digit */
                r = (c1 < c2 ? -1 : c1 > c2 ? 1 : 0);
                while (1) {
                    char d1 = *(s1++);
                    char d2 = *(s2++);
                    int dig1 = (d1 >= '0') && (d1 <= '9');
                    int dig2 = (d2 >= '0') && (d2 <= '9');
                    if (!dig1 && !dig2) {
                        if (r) return r;
                        /* numerically equal (only possible after a
                           leading zeros skip): compare the tails */
                        return natcmp_alt(s1 - 1, s2 - 1);
                    }
                    if (!dig1) return -1;   /* first number is shorter */
                    if (!dig2) return 1;    /* second number is shorter */
                    if (!r && (d1 != d2))
                        r = (d1 < d2 ? -1 : 1);
                }
            }
            /* they are not digits */
            return (c1 < c2 ? -1 : 1);
        }
    }
}
The code
It includes several variations of the comparison functions. The version using the key pre-generation algorithm also supports ignoring non-alphanumeric characters while sorting.
Some variations of the strcmp function used in the quicksort are also included. In theory, GCC provides a built-in for strcmp that should be blazingly fast, but my benchmarks show that, depending on the data and the optimization flags, my variations are sometimes faster. Check for yourself and see.
Benchmarking
So let's see how the two approaches to natural sorting presented above perform.
The script bm.pl is a driver that runs the natsort program with the different flags selecting the sorting algorithm and implementation, measures the running times and generates a nice table with the results.
After benchmarking the different algorithms with different data sets, these are my findings:
- Using the key pre-generation approach is usually a little faster, but not by much, typically ranging from 0% to 15%.
- The alternative natcmp function is slightly faster than the simple version, around 10% faster.
Conclusion
For natural sorting of strings in C, there is not much point in using the key pre-generation approach, as it only gives a small performance improvement while, on the other hand, using a lot more memory.