Thursday, May 30, 2013

Crash Course on CS 101

This is part 1 of a (probably short) series that'll be a survey of some of the basic concepts in computer science. I'll start with basic, fundamental stuff and work up from there.

The Basic Data Structures

Lists

The granddaddy of 'em all, lists (or their cousin, the humble array) underlie most other data structures and algorithms. Lists provide one wonderful guarantee: they keep things in the order that you tell them to. The basic operations on a list are adding things to it, removing things from it, and enumerating what's in it. Lists are implemented in lots of ways, but the two most common are linked lists and array lists.

A linked list is probably the easiest list to implement and is formed from nodes that point at each other. Linked lists come in two (basic) forms, singly-linked and doubly-linked, with the only difference being that doubly-linked lists also keep a pointer to the prior node. Adding things to, and removing things from, the beginning of a linked list is extremely fast, since it's just a pointer assignment. Depending on the implementation (basically, whether you store a tail pointer or not), the same can be said for operating on the end of the list. There are even algorithms for allowing concurrent modification of linked lists without locking, but that's a whole other level of complexity from what we're discussing here; if you're dying to dig into those details, search for "lock-free linked list".
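To make that concrete, here's a minimal sketch of a singly-linked list in Java; the class and method names are just illustrative, not from any standard library:

class SinglyLinkedList<E> {
    // Each node holds a value and a pointer to the next node.
    private static class Node<T> {
        T value;
        Node<T> next;
        Node(T value, Node<T> next) { this.value = value; this.next = next; }
    }

    private Node<E> head;

    // Adding to the front is just a pointer assignment: O(1).
    void addFirst(E value) {
        head = new Node<>(value, head);
    }

    // Removing from the front is equally cheap.
    E removeFirst() {
        if (head == null) throw new java.util.NoSuchElementException("empty list");
        E value = head.value;
        head = head.next;
        return value;
    }

    // Enumerating means hopping from node to node, one pointer at a time.
    void forEach(java.util.function.Consumer<E> action) {
        for (Node<E> node = head; node != null; node = node.next) {
            action.accept(node.value);
        }
    }
}

Notice that there's no way to jump straight to the nth element; you have to walk the pointers, which is exactly the weakness discussed next.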

So, linked lists are dirt simple to implement, offer instantaneous access to the first & last elements in them and can even be implemented in a thread-safe, lock-free manner; given all that awesomeness, you may be wondering why anyone would use any other kind of list. Linked lists have a handful of issues. First, for every element in them, you have at least one pointer so you can get to the next item (two if it's a doubly-linked list and you're not storing the pointers packed into one). Second, to access any particular element in the list, you have to walk through every element before it. Finally, there's the problem that linked lists are so not cache-friendly that it's probably more accurate to call them cache-hostile. It's entirely possible that every single element that you access will mean a cache miss and, if you're phenomenally unlucky, a TLB (Translation Lookaside Buffer; really important, but the details are tangential to this discussion) miss or even a page fault. In plain English, they can make your program very, very slow.

The other common implementation of lists backs them with an array. If you're using a C++ vector, or a Java ArrayList or Vector (the only difference between those two is that operations on Vector are all synchronized), you're using an array-backed implementation of a list. An array-backed list can access any of its elements in constant time (i.e. the time doesn't depend on how many elements are in the list) and it's extremely cache-friendly when accessed in order.

Sadly, array-backed lists are also not a panacea. Insertions and deletions at the head of an array-backed list require shifting every element of the list. Also, the list can get full, requiring resizing it and potentially copying the elements (you may get lucky and have realloc succeed at just resizing the array in place). To reduce the cost of resizing, we double the capacity of the list whenever it's full, so that the time to add to the list is, on average, constant.
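Here's a rough sketch of that doubling trick; it's purely illustrative and glosses over plenty of details that a real ArrayList handles:

class ArrayBackedList<E> {
    private Object[] elements = new Object[8];  // start with a small backing array
    private int size = 0;

    void add(E value) {
        if (size == elements.length) {
            // Full: double the capacity and copy the old elements over.
            elements = java.util.Arrays.copyOf(elements, elements.length * 2);
        }
        elements[size++] = value;
    }

    @SuppressWarnings("unchecked")
    E get(int index) {
        // Constant-time access: just index straight into the backing array.
        return (E) elements[index];
    }

    int size() { return size; }
}

Doubling (rather than growing by some fixed amount) is what makes the average cost of an add constant: the occasional expensive copy is paid for by all the cheap adds that came before it.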

Stacks

Stacks are the "hidden" data structure; you use them all the time even if you're not aware of it. Every function invocation uses the stack. A stack is exactly what it sounds like: like a stack of plates, the last one that you put on it is the first one that'll be removed. That's why they're also called LIFOs (Last-In, First-Out). Stacks have a much more restricted set of operations than lists, supporting only push (to put things on top of the stack) and pop (to remove things from the top of the stack). It's fairly easy to implement a stack in terms of a list and, in fact, that's usually how they're implemented.
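A quick sketch of a stack built on top of an array-backed list (the class itself is mine, just for illustration):

class ListBackedStack<E> {
    private final java.util.ArrayList<E> items = new java.util.ArrayList<>();

    // Put a new element on top of the stack.
    void push(E value) {
        items.add(value);
    }

    // Take the most recently pushed element back off.
    E pop() {
        if (items.isEmpty()) throw new java.util.EmptyStackException();
        return items.remove(items.size() - 1);
    }

    boolean isEmpty() { return items.isEmpty(); }
}

Both operations work on the end of the backing list, so neither one ever has to shift elements around.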

Queues

If you've ever had to stand in line for something, you're familiar with queues. They're a lot like lists, except they don't support random access: you only get things out in the order in which they were added (with the exception of priority queues, which let some items cut in line).

Queues are pretty easy to implement with a list; however, for performance we usually use a circular array. Instead of shifting everything toward the front whenever we take an element out, we keep track of the positions to read from and write to, increment them as we go, and take the result mod the size of the array to map it back into the available slots. In other words, the positions just wrap around and start over at the beginning of the array.
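Here's a minimal sketch of that wrap-around trick; to keep it short, it has a fixed capacity and doesn't bother with resizing:

class CircularQueue<E> {
    private final Object[] slots;
    private int head = 0;   // next position to read from
    private int tail = 0;   // next position to write to
    private int count = 0;

    CircularQueue(int capacity) { slots = new Object[capacity]; }

    void enqueue(E value) {
        if (count == slots.length) throw new IllegalStateException("queue is full");
        slots[tail] = value;
        tail = (tail + 1) % slots.length;   // wrap around to the start if needed
        count++;
    }

    @SuppressWarnings("unchecked")
    E dequeue() {
        if (count == 0) throw new java.util.NoSuchElementException("queue is empty");
        E value = (E) slots[head];
        head = (head + 1) % slots.length;   // the read position wraps around too
        count--;
        return value;
    }
}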

Sets

A set (in the context of data structures) is just a collection that does not allow duplicates. Unlike lists, queues & stacks, sets have no particular ordering. Sets generally support adding and removing elements and asking if an item is a member of the set. You can implement a set using a list, but that's fairly inefficient, so they're usually implemented with hash tables (which I'll get to shortly).

Bags

Bags are basically just sets that permit duplicates and are usually implemented the same way. There's really not much more to say about them.

More Advanced Data Structures

Now it's time to start digging into some more advanced data structures; we're definitely not in Kansas anymore!

Graphs

I'm a bit hesitant to call a graph a data structure, since there's enough research on graphs for them to be a field of study unto themselves. That said, they are a way of organizing data, and some of the data structures that I'm going to discuss shortly are defined in terms of a graph, so here's a quick description (I'll discuss them in more depth when I cover graph algorithms). A graph is a set of nodes, called vertices (vertex is the singular), connected by edges; it's a lot like a subway system, where the stations are the vertices and the rail lines are the edges. Graphs of one variety or another are one of the most widely used data structures, in no small part because edges can have weights, which opens the way for us to answer a lot of interesting questions.

Graphs can be directed or undirected. If the graph is directed, then you can only traverse edges in one direction, like one-way roads. Some graphs also have loops (more formally, cycles) leading from a node back to itself via its children (not counting an undirected link to/from a child node); the ones that do are called cyclic graphs. Graphs without loops are called, reasonably enough, acyclic.

The two most common implementations for graphs are adjacency lists & adjacency matrices. Adjacency lists store the nodes and edges explicitly; e.g. you may have an array of nodes, each with a list of the edges connected to it. An adjacency matrix is an n x n matrix storing whether each node has an edge to each other node. Adjacency lists are a nice, compact form for storing sparse graphs (that is, graphs that have relatively few edges connecting their nodes); adjacency matrices are well-suited to storing dense graphs, where many nodes are connected by edges.
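Here's a small sketch of both representations for the same four-node undirected graph with edges 0-1, 0-2 and 2-3 (all the names are mine, just for illustration):

import java.util.ArrayList;
import java.util.List;

class GraphRepresentations {
    public static void main(String[] args) {
        int n = 4;

        // Adjacency list: one list of neighbors per node; compact for sparse graphs.
        List<List<Integer>> adjacencyList = new ArrayList<>();
        for (int i = 0; i < n; i++) adjacencyList.add(new ArrayList<>());
        addEdge(adjacencyList, 0, 1);
        addEdge(adjacencyList, 0, 2);
        addEdge(adjacencyList, 2, 3);

        // Adjacency matrix: an n x n grid of booleans; better suited to dense graphs.
        boolean[][] adjacencyMatrix = new boolean[n][n];
        adjacencyMatrix[0][1] = adjacencyMatrix[1][0] = true;
        adjacencyMatrix[0][2] = adjacencyMatrix[2][0] = true;
        adjacencyMatrix[2][3] = adjacencyMatrix[3][2] = true;

        System.out.println("Neighbors of 0: " + adjacencyList.get(0));
        System.out.println("Edge 1-3? " + adjacencyMatrix[1][3]);
    }

    static void addEdge(List<List<Integer>> adj, int a, int b) {
        adj.get(a).add(b);
        adj.get(b).add(a);   // undirected, so record both directions
    }
}

The trade-off is the usual space-versus-time one: the matrix answers "is there an edge between a and b?" in constant time but always uses n x n space, while the list only stores the edges that actually exist.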

Trees

Trees are, formally, directed acyclic graphs with a single most ancestral node, called the root node. We call the nodes farthest from the root leaves. Trees come in lots of shapes and sizes, but one of the most common is the binary tree, and especially the binary search tree. A binary search tree has the handy feature that, for each node, its left child (and all of that child's descendants) is less than it and, likewise, its right child and that child's descendants are greater. That lets us search the tree for any particular element in O(log n) time (assuming the tree is balanced, i.e. the leaves are all at about the same distance from the root); I'll discuss exactly what O(log n) means in the algorithms post, but for now it's enough to know that it means you have to square the number of elements in the tree to double the time that it takes to find an element.

The most obvious implementation of trees is almost identical to a linked list except, instead of a Next pointer, you have Left and Right pointers. A more clever, and efficient, implementation stores the nodes as elements in an array without needing the pointers: for any node in position n of the array (with the root at position 0), the left child goes in position (2*n + 1) and the right child goes in position (2*n + 2). That layout works best when the tree is complete (no gaps), which is exactly the case for the heaps we'll get to shortly.
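Here's a sketch of searching a pointer-based binary search tree, along with the index math for the array layout (again, just illustrative):

class TreeNode {
    int value;
    TreeNode left, right;
    TreeNode(int value) { this.value = value; }
}

class BinarySearchTree {
    // Walk down the tree, going left for smaller values and right for larger ones.
    static boolean contains(TreeNode node, int target) {
        while (node != null) {
            if (target == node.value) return true;
            node = (target < node.value) ? node.left : node.right;
        }
        return false;
    }

    // For the array layout: relatives of the node stored at index n.
    static int leftChild(int n)  { return 2 * n + 1; }
    static int rightChild(int n) { return 2 * n + 2; }
    static int parent(int n)     { return (n - 1) / 2; }
}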

If you're interested in learning about other types of trees (and you certainly should be!), look into tries (a tree variant that's really useful for searching for strings), suffix trees, B-trees and B+ trees. Also worth looking into are red-black trees, which provide a mechanism for keeping trees balanced.

Heaps

Heaps are a form of (typically binary) tree with the property that the value of every node is less than the values of its children, which, by extension, means that it's less than all of its descendants. We can use that to sort a bunch of information or to select the top (or bottom) n elements (where n is however many elements you want). The usual trick to maintaining the heap when you remove the root is to move the last element in the heap up into the root's spot, then repeatedly swap it with the lesser of its children until it's smaller than both of them (or it hits the bottom of the heap). Adding an element follows a similar path, but backwards: add it at the bottom, then, if it's smaller than its parent, swap them and repeat until it's either larger than its parent or it's at the root.
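Here's a minimal array-backed min-heap sketch showing both of those moves; it's meant to illustrate the idea, not to be production-ready:

class MinHeap {
    private final java.util.ArrayList<Integer> items = new java.util.ArrayList<>();

    void add(int value) {
        items.add(value);                    // put it at the bottom of the heap...
        int i = items.size() - 1;
        while (i > 0) {                      // ...then bubble it up toward the root
            int parent = (i - 1) / 2;
            if (items.get(i) >= items.get(parent)) break;
            swap(i, parent);
            i = parent;
        }
    }

    int removeMin() {
        if (items.isEmpty()) throw new java.util.NoSuchElementException("empty heap");
        int min = items.get(0);
        int last = items.remove(items.size() - 1);
        if (!items.isEmpty()) {
            items.set(0, last);              // move the last element up to the root...
            siftDown(0);                     // ...and push it back down into place
        }
        return min;
    }

    private void siftDown(int i) {
        int n = items.size();
        while (true) {
            int left = 2 * i + 1, right = 2 * i + 2, smallest = i;
            if (left < n && items.get(left) < items.get(smallest)) smallest = left;
            if (right < n && items.get(right) < items.get(smallest)) smallest = right;
            if (smallest == i) break;        // both children are larger; we're done
            swap(i, smallest);
            i = smallest;
        }
    }

    private void swap(int a, int b) {
        int tmp = items.get(a);
        items.set(a, items.get(b));
        items.set(b, tmp);
    }
}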

Hash Tables

Hash tables (also called hash maps, maps and dictionaries) are a vitally important data structure. As their alternate names suggest, hash tables let you map one value to another, e.g. a word to its definition. The beauty of hash tables is that they let you retrieve the value that you associated with a key in (usually) constant time; that is, the time to retrieve an item is independent of how many items are in the hash table. Internally, a hash table is an array of "buckets". The buckets are where you put the element to which you're mapping the key.

The way that hash tables give such quick lookup times is that they apply a function (in the math sense of the word) to the key to generate a number. That function is expected to always return the same value for any given input (e.g. if you pass it the phrase "Bacon is awesome!" three times today, twice tomorrow, and seven times the day after, it should give you the same number back each time). The table then takes that number, divides it by the number of buckets, and uses the remainder (this is called taking the modulo, or modular arithmetic; most programming languages have an operator that'll do it for you) as the index of the bucket to put the element into.

You're probably thinking, "Hey, what if two keys end up hashing to the same bucket?!" There are numerous ways of dealing with that, but the simplest are called chaining and linear probing. With chaining, you don't store the element directly in the bucket; you store a list in the bucket and just add the element to the end of that list. With linear probing, if you find that the bucket the key hashed to is already occupied, you try to fit it into the next bucket; if that's full, try the one after that, and keep going until you find an empty bucket (or you decide to resize your array of buckets). Linear probing can be a bit trickier, but, due to cache effects, it tends to be faster.
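To tie that together, here's a toy hash table using chaining; it keeps a fixed number of buckets and skips resizing, so it's a sketch of the idea rather than something you'd actually use:

class ChainedHashTable<K, V> {
    private static class Entry<A, B> {
        final A key;
        B value;
        Entry(A key, B value) { this.key = key; this.value = value; }
    }

    private final java.util.List<java.util.LinkedList<Entry<K, V>>> buckets;

    ChainedHashTable(int bucketCount) {
        buckets = new java.util.ArrayList<>();
        for (int i = 0; i < bucketCount; i++) buckets.add(new java.util.LinkedList<>());
    }

    // Hash the key, then take the remainder to pick a bucket.
    private int bucketFor(K key) {
        return Math.floorMod(key.hashCode(), buckets.size());
    }

    void put(K key, V value) {
        java.util.LinkedList<Entry<K, V>> chain = buckets.get(bucketFor(key));
        for (Entry<K, V> entry : chain) {
            if (entry.key.equals(key)) { entry.value = value; return; }  // key already present
        }
        chain.add(new Entry<>(key, value));   // collision or not, just append to the chain
    }

    V get(K key) {
        for (Entry<K, V> entry : buckets.get(bucketFor(key))) {
            if (entry.key.equals(key)) return entry.value;
        }
        return null;   // key not in the table
    }
}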

To implement a set with a hash table, just use the element you're adding to the set as the key and map it to some value; the exact value doesn't really matter. If the table contains the key, then the element is in the set. If you make the value an integer, though, you can keep a count and turn the set into a bag.
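For example, with Java's built-in HashMap it might look something like this:

import java.util.HashMap;
import java.util.Map;

class SetAndBagDemo {
    public static void main(String[] args) {
        // A set: map each element to any old value; membership is just containsKey.
        Map<String, Boolean> set = new HashMap<>();
        set.put("bacon", Boolean.TRUE);
        System.out.println(set.containsKey("bacon"));       // true

        // A bag: map each element to a count instead.
        Map<String, Integer> bag = new HashMap<>();
        bag.merge("bacon", 1, Integer::sum);                 // add one "bacon"
        bag.merge("bacon", 1, Integer::sum);                 // and another
        System.out.println(bag.getOrDefault("bacon", 0));    // 2
    }
}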

To Be Continued...

This post has gotten pretty long, but it's covered most of the basic data structures. In the next post (coming at some bat-time, on the same bat-channel), we'll cover some sorting algorithms.
