Sunday, June 23, 2013

CS 101 Part 2

It's time for the (probably not so) eagerly awaited part 2 of my intro to comp. sci. series! We're moving on now to algorithms, which, despite their name, have nothing to do with the former vice president's musical stylings. Algorithms are basically recipes for doing things, and the word comes from the name of the guy who gave us algebra (al-Khwarizmi). Since computers spend an inordinate amount of their time sorting, that's where we'll start, after a brief detour into Big-O notation.

Big O Notation

For most things that you need a computer to do, it takes more time the more things it has to operate on; for example, it takes longer to calculate the sum of 100 numbers than of 10. Makes sense, right? What Big O notation gives us is a way to specify an upper bound on how badly an algorithm slows down as the number of elements it has to consider grows. Some examples might help make that clearer. Calculating the sum of n numbers takes O(n) time, since the time grows linearly with the number of elements (we say that it runs in linear time, for obvious reasons). If you want to calculate the result of multiplying every number in that list by every other number in it, that's O(n²), since for each of the n elements you have to perform n calculations, so the total number of calculations is n * n.
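To make that concrete, here's a rough Python sketch of both cases (the function names are just for illustration):

def sum_all(numbers):
    # O(n): one pass over the list, so the work grows linearly with its size.
    total = 0
    for x in numbers:
        total += x
    return total

def all_pairwise_products(numbers):
    # O(n^2): for each of the n elements we do n multiplications, n * n in total.
    products = []
    for x in numbers:
        for y in numbers:
            products.append(x * y)
    return products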

To simplify things, with Big O notation we focus on the dominating aspect of the algorithm. By that, I mean we focus on the aspect of the algorithm that tends to chew up the most time. For example, if you have x + y = z, and x is 0.00001 and y is 1,000,000, very nearly all of the value of z is determined by y, so we'd say that y dominates. Likewise, n³ dominates n², which dominates n. Focusing on the dominating component lets us simplify by dropping the less significant portions, e.g. O(5n³ + 975n² + 8,000,000) is simplified to just O(n³) since, as n approaches infinity, the n³ portion will dominate the other components.

It's important to note that Big O notation does not tell us which of two algorithms is faster; it just gives us an idea of how an algorithm will behave as the input size grows. For example, there are some algorithms for operations on matrices that have asymptotically better behavior (that is, behavior as the input size approaches infinity) than the algorithms that are generally used. The catch, though, is that they have hidden constants so large that they're slower than the simpler algorithms for any input size smaller than the number of molecules in the known universe.

Some runtimes that you'll need to know because they come up frequently are:
O(n)
Linear time; the runtime grows in direct proportion to the input size, so if you double the input, the runtime doubles, and if you give it ten times as much input, it takes ten times as long.
O(n log n)
The log is base 2, which means that n has to double before the log factor increases by 1, so this grows only a little faster than linear time.
O(n²)
Quadratic time; the runtime grows as the square of the input size, so if you double n, the runtime goes up 4 times; quadruple it and the runtime goes up 16 times.

Sorting

It's at about this point that most books introduce bubble sort. I'm not sure why, since it's fairly complicated, not particularly intuitive, and it has a worst-case runtime of O(n²). Instead, I'm going to introduce insertion sort and selection sort, both of which are fairly intuitive since they're the way that humans sort. Most of the sorts that we'll be examining depend on this one observation: a list of size one is always sorted.

Selection Sort

The idea behind selection sort is that you scan through the list looking for the smallest element. Once you find it, you put it at the head of the list. Then you look through the rest of the list for the 2nd smallest element and put it in the 2nd position. Just keep repeating that until you're done. Selection sort is simple and easy to implement; it also happens to take O(n²) time, so it's fairly slow on large lists.
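Here's a minimal Python sketch of selection sort; it sorts the list in place, which is the usual way to implement it:

def selection_sort(items):
    n = len(items)
    for i in range(n):
        # Find the smallest element in the not-yet-sorted tail of the list.
        smallest = i
        for j in range(i + 1, n):
            if items[j] < items[smallest]:
                smallest = j
        # Put it at the front of that tail, extending the sorted prefix by one.
        items[i], items[smallest] = items[smallest], items[i]
    return items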

Insertion Sort

Insertion sort is not vastly different from selection sort, even though it's conceptually working in exactly the opposite way. With insertion sort, you take each element, find the position in the list where it belongs, and put it there. This depends on that key observation from before, that a list of size one is already sorted; we use that to say that the first element of the list is sorted, and then we proceed to extend the length of that sorted list one element at a time. Insertion sort is simple, easy to implement and really fast when the list is small, but its worst-case runtime (when the list is in exactly the opposite order of what you want) is O(n²), so it can be horrendously slow.

We can make insertion sort a little bit more complicated, though, and have its runtime approach linear the closer the list is to being sorted: only look for the proper position of an element if it's not already in its proper position, i.e. if it's smaller than the element before it (assuming you're sorting from smallest to largest; otherwise, look for its proper position if it's larger than the element before it). With that check, an already-sorted list is "sorted" in linear time. You could also use binary search to find each element's position, which cuts the number of comparisons to O(n log n), but shifting elements to make room still takes O(n²) in the worst case, and you'd lose the nearly-linear behavior on nearly-sorted lists that makes insertion sort attractive.
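A rough sketch of insertion sort in Python, with that early-exit check built into the inner loop (on an already-sorted list the inner loop never runs, which is what gives the linear-time best case):

def insertion_sort(items):
    for i in range(1, len(items)):
        current = items[i]
        j = i - 1
        # Shift larger elements right until we find where current belongs;
        # if current is already in place, this loop doesn't run at all.
        while j >= 0 and items[j] > current:
            items[j + 1] = items[j]
            j -= 1
        items[j + 1] = current
    return items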

Divide and Conquer Sorting Algorithms

Now we're getting into a class of sorting algorithms that trade simplicity and obviousness for speed. The underlying observation that makes all of them work in theory is the one above, that a list of size one is always sorted (in practice, we usually recurse down to sublists of a couple dozen elements and then use insertion sort on those). Building from that observation, we can say that a list is made up of a lot of sorted sub-lists that we just need to merge. All of the following algorithms boil down to two questions: how do we split the list into sublists, and how do we merge them back together?

Quick Sort

With quick sort, we pick some value from the list, called a pivot. Then we split the list into two lists, one containing all of the elements less than the pivot, the other containing all of the elements greater than the pivot, and do the same to each of the sub-lists. This will eventually get you to the point where all of the sublists are sorted, because they each contain one element. In terms of implementation: start from the first element of the list and move toward the end until you find an element larger than the pivot; then start from the end and move toward the beginning until you find an element smaller than the pivot; swap the two and keep going. When the indexes meet, you've got the list split and it's time to do the same to the two sublists.
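Here's a bare-bones sketch of the idea in Python. Note that this version builds new sublists rather than doing the in-place index-swapping described above; it's easier to read, at the cost of extra memory:

def quick_sort(items):
    if len(items) <= 1:
        return items  # a list of zero or one elements is already sorted
    pivot = items[0]
    smaller = [x for x in items[1:] if x < pivot]
    larger = [x for x in items[1:] if x >= pivot]
    return quick_sort(smaller) + [pivot] + quick_sort(larger)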

The runtime of quick sort is a bit tricky. Its average-case runtime is O(n log n), but its worst case is O(n²). Fortunately, we rarely see the worst case (there are tricks that can be pulled to make the odds of seeing it 1/n!; that's n-factorial, not just an exclamation mark).

Quick sort is a good default sorting algorithm because it's fairly simple & cache friendly and, since the splitting phase does all of the work, the merge phase is trivial.

Merge Sort

Where quick sort does all of its work in the split phase, merge sort does all of its work in the merge phase. The split phase just splits the list in half and then sorts each half with merge sort. Doing that repeatedly will eventually get you down to n sorted sublists, each with one element. The merge phase proceeds by taking two sorted sublists and combining them into one sorted list: repeatedly take the smaller of the two lists' front elements and put it in the next position in the combined list, until both lists are exhausted.
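A minimal Python sketch; the merge helper below is just an illustrative name for the merge phase:

def merge_sort(items):
    if len(items) <= 1:
        return items
    middle = len(items) // 2
    # Split in half, sort each half, then merge the two sorted halves.
    return merge(merge_sort(items[:middle]), merge_sort(items[middle:]))

def merge(left, right):
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        # Take the smaller front element; <= keeps equal elements in their
        # original order, which is what makes the sort stable.
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One list is exhausted; whatever remains of the other is already sorted.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged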

Merge sort has a very simple run-time to consider; its best, worst and average case times are all O(n log n). 

Merge sort has the advantages of being pretty simple to implement and cache friendly, but it has the disadvantage of requiring additional memory proportional to the amount taken by the list being sorted. There are ways of implementing merge sort that do not require additional memory, but they are more complicated and slower (not slower in Big-O terms, but slower in practice). Merge sort also has the nice property that it's a stable sort: if the list contains two elements that compare equally, the one that came first stays first. For example, if you want students sorted by age and then, within an age bracket, by name, you can sort by name and then sort by age and everything lines up properly. Finishing off the list of nice properties is that merge sort's cache-friendliness extends to disk; if your data is too large to fit into RAM, you're probably going to be implementing at least the merge phase of merge sort on disk.

Heap Sort

This is a bit of an oddity in that it's not a divide-and-conquer algorithm; it just exploits the properties of a heap. Insert all of the elements to be sorted into a heap, then remove them one at a time and they'll come out in order. Like the other comparison-based sorts, heap sort is O(n log n), though all of its other attributes (like cache friendliness and stability) depend entirely on the heap implementation.
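Here's a short sketch using Python's built-in heapq module as the heap; a hand-rolled binary heap would work the same way:

import heapq

def heap_sort(items):
    heap = []
    for x in items:
        heapq.heappush(heap, x)  # each insert is O(log n)
    # Popping the minimum n times yields the elements in sorted order.
    return [heapq.heappop(heap) for _ in range(len(heap))]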

Comparison-Free Sorting

Now we're leaving the friendly, inclusive world of general-purpose sorting algorithms to get into the real speed demons: the sorting algorithms that don't do any comparisons to sort! These all exploit peculiarities in the data to be sorted to eliminate all comparisons and run in linear time.

Counting Sort

Counting sort relies on its input data being constrained to a small set of possible values, e.g. student ages. To do a counting sort, make one pass through the data, counting how many elements there are with each value. Then, using those counts, work out the position at which each value should start, and make a second pass through the data, putting each element in the position where it belongs. Going back to the student ages example: if there are three 17-year-olds, four 18-year-olds and two 19-year-olds, the starting positions are 0, 3 and 7. If the first element is a 19-year-old, they go in position 7 and the 19-year-olds' position is incremented to 8.
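A rough sketch of counting sort for the student-ages example; the min_age and max_age parameters are just illustrative assumptions about the range of valid values:

def counting_sort(ages, min_age=17, max_age=19):
    # Pass 1: count how many elements there are with each value.
    counts = [0] * (max_age - min_age + 1)
    for age in ages:
        counts[age - min_age] += 1
    # Turn the counts into starting positions (0, 3 and 7 in the example above).
    positions = [0] * len(counts)
    running = 0
    for i, count in enumerate(counts):
        positions[i] = running
        running += count
    # Pass 2: place each element at its value's next open position.
    output = [None] * len(ages)
    for age in ages:
        output[positions[age - min_age]] = age
        positions[age - min_age] += 1
    return output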

Counting sort runs in linear time and is very cache friendly, but it does require some extra memory (to store the counts) and won't work if the set of potentially valid inputs is unbounded.

Radix Sort

I'm going to deviate from the normal way of discussing sorting by using a list of strings (on the off chance that you're not familiar with what strings are, they're just sequences of characters) instead of a list of integers to discuss radix sort, because it's more obvious what's going on with strings.

To radix sort a list of strings, sort all of the strings based on their first character, then within the sublist for each first letter, sort again by the second letter, and so on until the entire list is completely sorted. By sorting on only one character at a time, we open up the possibility of using counting sort for each pass, so we still avoid doing comparisons.

Now, to take radix sort back to integers, we have to do something a little odd: if you're considering arbitrary-length integers and you choose your radix (base) to be 10, you have to do the sort starting with the least significant digit. When dealing with most integers, though, we'd choose the radix to be 256 so that we're sorting a byte at a time; combine that with most integers being either 32- or 64-bit numbers and we get really good, linear-time performance.
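Here's a rough sketch of a least-significant-digit radix sort for non-negative integers; the lists of buckets stand in for the per-digit counting sort described above, and the radix parameter is just illustrative:

def radix_sort(numbers, radix=256):
    if not numbers:
        return numbers
    place = 1
    # Sort by each "digit" (a byte, with radix 256), least significant first.
    while place <= max(numbers):
        buckets = [[] for _ in range(radix)]
        for x in numbers:
            buckets[(x // place) % radix].append(x)
        # Reading the buckets back in order is a stable pass on this digit.
        numbers = [x for bucket in buckets for x in bucket]
        place *= radix
    return numbers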

The runtime of radix sort is linear, but it's linear both in the number of elements and in the average length of the prefixes of those elements that have to be considered to get them sorted (technically all of the comparison-based sorts have the same caveat, but it gets hidden inside the comparison, which is typically treated as a constant-time operation).

Bucket Sort

Bucket sort has a lot in common with counting and radix sort and can be easily combined with them. The basic idea behind bucket sort is that you put each element into a bucket, sort the contents of each bucket, and then read the buckets back out in order. Going back to the old standby of sorting people by age, you might do it by putting people into a bucket for their age, so everyone of the same age clumps together. Bucket sort plays nicely with hash maps, but isn't as cache friendly as counting and radix sort. One big benefit of bucket sort, though, is that the set of valid values doesn't have to be known ahead of time, since you can create buckets as needed instead of having to create them all up front.
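A small sketch of the ages example, assuming the input is a list of (name, age) tuples (a format I've made up for illustration). Reading the buckets back out does involve putting the bucket keys in order, but that's cheap as long as there are far fewer distinct ages than people:

from collections import defaultdict

def bucket_sort_by_age(people):
    buckets = defaultdict(list)  # a hash map of age -> bucket, created as needed
    for person in people:
        buckets[person[1]].append(person)  # bucket on the age field
    result = []
    for age in sorted(buckets):  # visit the buckets in age order
        result.extend(buckets[age])  # each bucket could be sorted further if needed
    return result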

Well, that hits the highlights of sorting! Next time, we'll look into searching (a collection for an item; searching for documents containing a word is a more advanced topic that I may cover later) and maybe touch on some graph algorithms.
