Monday, October 18, 2010

Algorithms and surprises

The other day, there was a question on StackOverflow that basically boiled down to, "Why is my O(n*n) algorithm running so slowly?" The poster was trying to find all of the elements in one array that exist in another array. It didn't take long for someone to suggest sorting both lists and iterating over them once, but that got me thinking: would a single linear traversal be faster, or would hopping back and forth between the two sorted lists with binary searches win? Intuitively the binary search should, since finding the next term is O(log(n)) instead of a potential O(n), but I wanted to make sure. The answer was surprising, to say the least. The O(n*n) algorithm was by far the slowest, but the fastest turned out to be the linear traversal.
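
For concreteness, here is a minimal sketch of the linear traversal over the two sorted arrays. This is my own reconstruction from the description above, not the code from the question, and the class and method names are made up.

    import java.util.ArrayList;
    import java.util.List;

    public class LinearIntersect {
        // Both arrays must already be sorted. Walk them once, advancing
        // whichever index points at the smaller value; equal values are
        // common to both arrays.
        static List<Integer> intersect(int[] a, int[] b) {
            List<Integer> common = new ArrayList<Integer>();
            int i = 0, j = 0;
            while (i < a.length && j < b.length) {
                if (a[i] < b[j]) {
                    i++;
                } else if (a[i] > b[j]) {
                    j++;
                } else {
                    common.add(a[i]);
                    i++;
                    j++;
                }
            }
            return common;
        }

        public static void main(String[] args) {
            int[] a = {1, 3, 5, 7, 9};
            int[] b = {3, 4, 5, 9, 12};
            System.out.println(intersect(a, b)); // prints [3, 5, 9]
        }
    }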

Cumulative times (in ms) for 100 iterations of each approach, by input array size:

Array size:       1,000   2,000   4,000   8,000  16,000  32,000  64,000
O(n*n):              52     174     727    2698   10648   42516  171441
Linear:              43      36      74     161     332     718    1484
Binary Search:       28      41      87     182     385     825    1727
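
The post doesn't include the harness, so the following is only a rough guess at how numbers like these could be produced: build two sorted random arrays per size, run the approach under test 100 times, and report the elapsed wall-clock time. The bound passed to Random.nextInt and the reuse of the intersect sketch above are assumptions.

    import java.util.Arrays;
    import java.util.Random;

    public class IntersectBenchmark {
        public static void main(String[] args) {
            int[] sizes = {1000, 2000, 4000, 8000, 16000, 32000, 64000};
            int iterations = 100;
            Random rnd = new Random();
            for (int size : sizes) {
                int[] a = randomSortedArray(rnd, size);
                int[] b = randomSortedArray(rnd, size);
                long start = System.currentTimeMillis();
                for (int iter = 0; iter < iterations; iter++) {
                    LinearIntersect.intersect(a, b); // swap in the approach under test
                }
                System.out.println(size + ": " + (System.currentTimeMillis() - start) + " ms");
            }
        }

        // The bound on nextInt is a guess; the post only says Random.nextInt was used.
        static int[] randomSortedArray(Random rnd, int size) {
            int[] a = new int[size];
            for (int i = 0; i < size; i++) {
                a[i] = rnd.nextInt(size);
            }
            Arrays.sort(a);
            return a;
        }
    }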

Actually looking at the data (all of the arrays were populated with calls to Random.nextInt) revealed the answer. The binary searches never let either index jump ahead by more than 2 or 3 entries at a time, while the linear approach never had to examine more than 2 or 3 entries before finding the value (or a higher value that forced a switch to the other list). Because the element being sought was always so close to the current position, each binary search ended up evaluating more elements than the linear scan did.
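
To make that behavior concrete, here is a rough sketch (again my reconstruction, and it glosses over duplicate handling) of the back-and-forth binary search. On dense random input, Arrays.binarySearch keeps returning a match or insertion point only 2 or 3 slots past the current index, so every search spends several comparisons to make the same small step the linear scan makes almost for free.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class BinarySearchIntersect {
        // Both arrays must already be sorted. Instead of stepping through
        // the other array one element at a time, binary search it (from the
        // current position onward) for the value currently in hand.
        static List<Integer> intersect(int[] a, int[] b) {
            List<Integer> common = new ArrayList<Integer>();
            int i = 0, j = 0;
            while (i < a.length && j < b.length) {
                int pos = Arrays.binarySearch(b, j, b.length, a[i]);
                if (pos >= 0) {
                    common.add(a[i]);      // a[i] exists in b; move past it
                    i++;
                    j = pos + 1;
                } else {
                    // a[i] is absent; resume from b's insertion point and
                    // swap the roles of the two arrays for the next search.
                    int insertion = -(pos + 1);
                    int[] tmpArray = a; a = b; b = tmpArray;
                    int tmpIndex = i; i = insertion; j = tmpIndex;
                }
            }
            return common;
        }

        public static void main(String[] args) {
            int[] a = {1, 3, 5, 7, 9};
            int[] b = {3, 4, 5, 9, 12};
            System.out.println(intersect(a, b)); // prints [3, 5, 9]
        }
    }

With this data the roles swap every couple of elements, so the number of searches stays close to n, and each one costs several comparisons where the linear version pays only one or two.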

The take-home from this test is that it's important to test your assumptions when it comes to performance. Intuitively, an approach that pays only O(log(n)) per lookup should handily beat one that walks both lists element by element (roughly 2n comparisons), but the actual data distribution may make it not so.
