Sunday, October 7, 2012

Bucket Sorting with Maps

For most sorting needs, general-purpose algorithms like quicksort or merge sort will fit the bill quite nicely and are often provided as part of a language's standard library. Sometimes, however, a specialized sorting algorithm can greatly outperform the generic ones. One such case is when there are far more records than unique values to sort by, e.g. if you're sorting a customer list by state or country. In those cases, we can reach for bucket sort, which is stable and can sort a collection in linear time, potentially with no comparisons at all.

Basically, the way that bucket sort works is that we split the items to be sorted into progressively smaller buckets until either the input is sorted to the granularity that we want or the bucket is small enough that an algorithm like insertion sort will be faster. As an example, if you're sorting by dates you could break the input down into buckets by year, since every date in 1990 comes before every date in 1991. Because no comparisons are needed to split the input into buckets, that pass can be done in linear time. If a particular year then has a lot of entries, we can split it up by month and then by day of the month, if need be.
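
For instance, the first pass of a split by year might look something like the following sketch (the class and method names here are just for illustration):

    import java.util.ArrayList;
    import java.util.Calendar;
    import java.util.Date;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class YearBuckets {
        // Split dates into per-year buckets: every date in one year comes before
        // every date in the next, so no cross-bucket comparisons are needed.
        public static Map<Integer, List<Date>> splitByYear(List<Date> dates) {
            Map<Integer, List<Date>> byYear = new HashMap<Integer, List<Date>>();
            Calendar cal = Calendar.getInstance();
            for (Date d : dates) {
                cal.setTime(d);
                int year = cal.get(Calendar.YEAR);
                List<Date> bucket = byYear.get(year);
                if (bucket == null) {
                    bucket = new ArrayList<Date>();
                    byYear.put(year, bucket);
                }
                bucket.add(d);
            }
            // An oversized bucket can be split again by month and then by day;
            // a small one can simply be handed off to a regular sort.
            return byYear;
        }
    }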

When we have discrete values, like in the states example above, we can take the straightforward approach and just use the values themselves as the buckets. It's certainly possible to implement a custom way of mapping keys to buckets, but it's easier to use the map implementation that comes with your programming language. In Java, that leaves us with a decision between a TreeMap and a HashMap. At first blush, TreeMap looks like the obvious choice, since we could just do the inserts and then iterate over its values in key order. If we look closer at the expected runtime, though, the HashMap should be faster: every one of the n TreeMap inserts involves an O(log k) tree search (where k is the number of distinct keys), whereas HashMap inserts are constant time on average and we only have to sort the k distinct keys once at the end, which is cheap when k is much smaller than n.

Here's an example implementation using a hash map.
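
In outline it looks something like the sketch below (simplified; the complete version is in the source linked at the end of the post):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class HashMapSort {

        /** Given an item, returns the key that the item should be sorted by. */
        public interface KeyExtractor<T, K extends Comparable<K>> {
            K extractKey(T item);
        }

        /**
         * Stable bucket sort: group items into per-key buckets with a HashMap,
         * sort only the (much smaller) set of distinct keys, then concatenate
         * the buckets in key order.
         */
        public static <T, K extends Comparable<K>> List<T> sort(List<T> items,
                                                                KeyExtractor<T, K> extractor) {
            Map<K, List<T>> buckets = new HashMap<K, List<T>>();
            for (T item : items) {
                K key = extractor.extractKey(item);
                List<T> bucket = buckets.get(key);
                if (bucket == null) {
                    bucket = new ArrayList<T>();
                    buckets.put(key, bucket);
                }
                bucket.add(item); // insertion order is preserved, which keeps the sort stable
            }

            // The only comparisons happen here, over the distinct keys.
            List<K> sortedKeys = new ArrayList<K>(buckets.keySet());
            Collections.sort(sortedKeys);

            List<T> sorted = new ArrayList<T>(items.size());
            for (K key : sortedKeys) {
                sorted.addAll(buckets.get(key));
            }
            return sorted;
        }
    }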


The KeyExtractor argument is just a simple interface that, given an item, returns the key for that item.

Given that the whole point of this exercise was to beat the performance of a generic sort, we should see how it actually performs. To test that, I generated a 10,000-record list with 50 distinct keys and then sorted a copy of it using both Collections.sort and the bucket sort 10,001 times to give the optimizer a chance to run. Then I ran each sort individually 100 times, capturing the fastest and slowest times and the total time (all in nanoseconds). The results are summarized below:
Run                 Min time (ns)    Max time (ns)    Total time (ns)
Collections.sort        1,589,705        3,290,003        167,857,942
Bucket sort               191,537          243,981         20,913,723
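
The kind of harness that produces numbers like these might look roughly as follows (a sketch under the setup described above; each Runnable is assumed to copy the list and sort it):

    public class SortTimer {

        /** Times a single execution of the given task, in nanoseconds. */
        static long time(Runnable task) {
            long start = System.nanoTime();
            task.run();
            return System.nanoTime() - start;
        }

        /** Warm up first, then record the min, max, and total time over the timed runs. */
        static void benchmark(String name, Runnable task, int warmups, int timedRuns) {
            for (int i = 0; i < warmups; i++) {
                task.run(); // give the JIT a chance to optimize before measuring
            }
            long min = Long.MAX_VALUE;
            long max = 0;
            long total = 0;
            for (int i = 0; i < timedRuns; i++) {
                long elapsed = time(task);
                min = Math.min(min, elapsed);
                max = Math.max(max, elapsed);
                total += elapsed;
            }
            System.out.println(name + ": min=" + min + " ns, max=" + max
                    + " ns, total=" + total + " ns");
        }
    }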

The full source code is available on GitHub at https://github.com/adbrowning/blog-code/blob/master/src/com/adbrowning/sort/HashMapSort.java