Thursday, May 30, 2013

Crash Course on CS 101

This is part 1 of a (probably short) series that'll be a survey of some of the basic concepts in computer science. I'll start with basic, fundamental stuff and work up from there.

The Basic Data Structures

Lists

The granddaddy of 'em all, lists (or their cousin, the humble array) underlie most other data structures and algorithms. Lists provide one wonderful guarantee: they keep things in the order that you tell them to. The basic operations on a list are adding things to it, removing things from it and enumerating what's in it. Lists are implemented in lots of ways, but the two most common are linked lists and array lists.

A linked list is probably the easiest list to implement and is formed by nodes which point at each other. Linked lists come in two (basic) forms, singly-linked and doubly-linked, with the only difference being that doubly-linked lists also retain a pointer to the prior node. Adding things to, and removing things from, the beginning of a linked list is extremely fast, since it's just assigning a pointer. Depending on the implementation (basically whether you store a tail pointer or not), the same can be said for operating on the end of the list. There are even algorithms for allowing concurrent modification of linked lists without locking, but that's a whole other level of complexity from what we're discussing here; if you're dying to dig into those details, search for "lock free linked list".
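
To make that concrete, here's a minimal sketch of a singly-linked list in Java; the class and method names (SimpleLinkedList, addFirst, forEach) are mine, not any standard library's.

public class SimpleLinkedList<T> {
   private static class Node<T> {
      T value;
      Node<T> next;                // the pointer to the following node
      Node(T value) { this.value = value; }
   }

   private Node<T> head;           // a doubly-linked list would also keep prev pointers and a tail

   // Adding at the head is just a couple of pointer assignments, no matter how long the list is
   public void addFirst(T value) {
      Node<T> node = new Node<>(value);
      node.next = head;
      head = node;
   }

   // Enumerating means walking node to node; reaching the i-th element touches every node before it
   public void forEach(java.util.function.Consumer<T> action) {
      for (Node<T> n = head; n != null; n = n.next) {
         action.accept(n.value);
      }
   }
}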

So, linked lists are dirt simple to implement, offer instantaneous access to the first & last elements in them and can even be implemented in a thread-safe lock-free manner; given all that awesomeness, you may be wondering why anyone would use any other kind of list. Linked lists have a handful of issues. The first is that for every element in them, you need at least one pointer just to get to the next item (two if it's a doubly-linked list and you're not storing the pointers packed into one). The second is that to access any particular element in the list, you have to walk through every element before it. Finally, there's the problem that linked lists are so not cache-friendly that it's probably more accurate to call them cache hostile. It's entirely possible that every single element that you access will mean a cache miss and, if you're phenomenally unlucky, a TLB (Translation Lookaside Buffer; really important, but the details are tangential to this discussion) miss or even a page fault. In plain English, they can make your program very, very slow.

The other common implementation of lists backs them with an array. If you're using a C++ vector, or a Java ArrayList or Vector (the only difference between those two is that operations on Vector are all synchronized), you're using an array-backed implementation of a list. An array-backed list can access any of its elements in constant time (i.e. the time doesn't depend on how many elements are in the list) and it's extremely cache-friendly when accessed in order.

Sadly, array-backed lists are not a panacea either. Insertions and deletions at the head of an array-backed list require shifting every element of the list. Also, the backing array can get full, requiring a resize and potentially copying the elements (you may get lucky and have realloc succeed at just resizing the array in place). To reduce the cost of resizing, we double the capacity of the list whenever it's full, so that the time to add to the list is, on average (amortized), constant.
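
Here's a bare-bones sketch of the doubling trick; the names (ArrayBackedList, elements) are illustrative, and real implementations like ArrayList handle a lot more (removals, bounds checks, and so on).

import java.util.Arrays;

public class ArrayBackedList<T> {
   private Object[] elements = new Object[8];
   private int size = 0;

   public void add(T value) {
      if (size == elements.length) {
         // Double the capacity so that, averaged over many adds, each add costs constant time
         elements = Arrays.copyOf(elements, elements.length * 2);
      }
      elements[size++] = value;
   }

   @SuppressWarnings("unchecked")
   public T get(int index) {        // constant time: just index straight into the array
      return (T) elements[index];
   }
}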

Stacks

Stacks are the "hidden" data structure; you use them all the time even if you're not aware of it. Every function invocation uses the stack. A stack is exactly what it sounds like: like a stack of plates, the last thing that you put on it is the first thing that'll be removed. That's why they're also called LIFOs (Last-In First-Out). Stacks have a much more restricted set of operations than lists, supporting only push (to put things on top of the stack) and pop (to remove things from the top of the stack). It's fairly easy to implement a stack in terms of a list and, in fact, that's usually how they are implemented.
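
In Java you'd typically not roll your own; here's a quick usage sketch with the standard ArrayDeque.

import java.util.ArrayDeque;
import java.util.Deque;

public class StackExample {
   public static void main(String[] args) {
      Deque<String> plates = new ArrayDeque<>();   // the usual stack implementation in Java
      plates.push("bottom plate");
      plates.push("top plate");
      System.out.println(plates.pop());   // prints "top plate" -- last in, first out
      System.out.println(plates.pop());   // prints "bottom plate"
   }
}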

Queues

If you've ever had to stand in line for something, you're familiar with queues. They're a lot like lists, except they don't support random access: you only get things out in the order in which they were added (the exception is priority queues, which let some items cut in line).

Queues are pretty easy to implement with a list; however, for performance we usually use a circular array. Instead of shifting everything toward the front whenever we take an element off the front, we just keep incrementing the positions to read from and write to, and do a mod on them to map them into the available slots. In other words, the positions simply wrap around and start again at the beginning of the array.
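
Here's a sketch of a fixed-capacity circular-array queue (the names and the fixed capacity are just for illustration); note the mod that wraps the positions around.

public class CircularQueue<T> {
   private final Object[] slots;
   private int head = 0;   // position to read from
   private int tail = 0;   // position to write to
   private int size = 0;

   public CircularQueue(int capacity) { slots = new Object[capacity]; }

   public void enqueue(T value) {
      if (size == slots.length) throw new IllegalStateException("queue is full");
      slots[tail] = value;
      tail = (tail + 1) % slots.length;   // wrap around instead of shifting elements
      size++;
   }

   @SuppressWarnings("unchecked")
   public T dequeue() {
      if (size == 0) throw new IllegalStateException("queue is empty");
      T value = (T) slots[head];
      slots[head] = null;                 // let the element be garbage collected
      head = (head + 1) % slots.length;
      size--;
      return value;
   }
}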

Sets

A set (in the context of data structures) is just a collection that does not allow duplicates. Unlike lists, queues & stacks, sets have no particular ordering. Sets generally support adding and removing elements and asking if an item is a member of the set. You can implement a set using a list, but that's fairly inefficient, so they're usually implemented with hash tables (which I'll get to shortly).

Bags

Bags are basically just sets that permit duplicates and are usually implemented the same way. There's really not much more to say about them.

More Advanced Data Structures

Now it's time to start digging in to some more advanced data structures; we're definitely not in Kansas anymore!

Graphs

I'm a bit hesitant to call a graph a data structure, since there's enough research on graphs for them to be a field of study unto themselves. That said, they are a way of organizing data, and some of the data structures that I'm going to discuss shortly are defined in terms of a graph, so here's a quick description (I'll discuss them in more depth when I cover graph algorithms). A graph is a collection of nodes, called vertices (vertex is the singular), connected by edges; it's a lot like a subway system, where the stations are the vertices and the rail lines themselves are the edges. Graphs of one variety or another are one of the most widely used data structures, in no small part because edges can have weights, which opens the way for us to answer a lot of interesting questions.

Graphs can be directed or undirected. If the graph is directed, then you can only traverse edges in one direction, like one-way roads. Some graphs also have loops (more formally, cycles) leading from a node back to itself via its children (not counting an undirected link to/from a child node); the ones that do are called cyclic graphs. Graphs without loops are called, reasonably enough, acyclic.

The two most common implementations for graphs are adjacency lists & adjacency matrices. Adjacency lists store the nodes and edges explicitly; e.g. you may have an array of nodes, each of which has a list of the edges connected to it. An adjacency matrix is an n x n matrix storing whether each node has an edge to every other node. Adjacency lists are a nice, compact form for storing sparse graphs (that is, graphs that have relatively few edges connecting their nodes); adjacency matrices are well-suited to storing dense graphs, where many nodes are connected by edges.
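
Here's a quick adjacency-list sketch with weighted, directed edges; the names (Graph, Edge, addVertex) are mine.

import java.util.ArrayList;
import java.util.List;

public class Graph {
   public static class Edge {
      final int to;
      final double weight;
      Edge(int to, double weight) { this.to = to; this.weight = weight; }
   }

   // Each vertex keeps a list of the edges leaving it
   private final List<List<Edge>> adjacency = new ArrayList<>();

   public int addVertex() {
      adjacency.add(new ArrayList<>());
      return adjacency.size() - 1;        // return the new vertex's index
   }

   public void addEdge(int from, int to, double weight) {
      adjacency.get(from).add(new Edge(to, weight));
      // For an undirected graph you'd also add the reverse edge:
      // adjacency.get(to).add(new Edge(from, weight));
   }

   public List<Edge> edgesFrom(int vertex) { return adjacency.get(vertex); }
}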

Trees

Trees are, formally, directed acyclic graphs with a single most ancestral node, called the root node. We call the nodes at the bottom of the tree, the ones with no children, leaves. Trees come in lots of shapes and sizes, but one of the most common is the binary tree, and especially the binary search tree. A binary search tree has the handy feature that, for each node, its left child (and all of that child's descendants) holds a value less than the node's own and, likewise, its right child and that child's descendants hold greater values. That lets us search the tree for any particular element in O(log n) time (assuming the tree is balanced, i.e. the leaves are all at about the same distance from the root); I'll discuss exactly what O(log n) means in the algorithms post, but for now it's enough to know that you have to square the number of elements in the tree to double the time that it takes to find an element.

The most obvious implementation of trees is almost identical to a linked list except, instead of a Next pointer, you have Left and Right pointers. A more clever, and efficient, implementation stores the nodes as elements in an array without needing the pointers: for any node in position n of the array (with the root starting at 0), the left child goes in position (2*n + 1) and the right child goes in (2*n + 2).
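
Here's a sketch of both ideas: a pointer-based binary search tree with a lookup that follows the left/right rule above, plus the index arithmetic for the array-based layout. The names are illustrative.

public class BinarySearchTree {
   private static class Node {
      int value;
      Node left, right;       // replaces the single Next pointer of a linked list
      Node(int value) { this.value = value; }
   }

   private Node root;

   public boolean contains(int value) {
      Node current = root;
      while (current != null) {
         if (value == current.value) return true;
         current = (value < current.value) ? current.left : current.right;   // discard half the tree each step
      }
      return false;
   }

   // For the array-based layout described above:
   static int leftChildIndex(int n)  { return 2 * n + 1; }
   static int rightChildIndex(int n) { return 2 * n + 2; }
   static int parentIndex(int n)     { return (n - 1) / 2; }
}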

If you're interested in learning about other types of trees (and you certainly should be!), you should look into tries (very similar to trees, but really useful for searching for strings), suffix trees, B-trees and B+ trees. Also worth looking into are red-black trees, which provide a mechanism for keeping trees balanced.

Heaps

Heaps are a form of (typically binary) tree with the property that the value of every node is less than the values of its children which, by extension, means that it's less than all of its descendants (that's a min-heap; flip the comparison and you get a max-heap). We can use that to sort a bunch of information or to select the top (or bottom) n elements (where n is however many elements you want). The trick to maintaining a heap as you remove the root is to move the last element of the heap into the root's spot, then repeatedly swap it with the smaller of its children until it's no larger than either of them. Adding an element follows a similar path, but backwards: add it to the bottom, then if it's smaller than its parent, swap them and repeat until it's either larger than its parent or it's at the root.
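
Here's a sketch of an array-backed binary min-heap using the 2*n + 1 / 2*n + 2 layout from the trees section; the class name and the use of a List<Integer> are just for illustration.

import java.util.ArrayList;
import java.util.List;

public class MinHeap {
   private final List<Integer> items = new ArrayList<>();

   public void add(int value) {
      items.add(value);                              // add at the bottom...
      int i = items.size() - 1;
      while (i > 0 && items.get(i) < items.get((i - 1) / 2)) {
         swap(i, (i - 1) / 2);                       // ...and swap with the parent while it's smaller
         i = (i - 1) / 2;
      }
   }

   public int removeMin() {                          // assumes the heap isn't empty
      int min = items.get(0);
      int last = items.remove(items.size() - 1);     // move the last element into the root's spot
      if (!items.isEmpty()) {
         items.set(0, last);
         int i = 0;
         while (true) {
            int left = 2 * i + 1, right = 2 * i + 2, smallest = i;
            if (left < items.size() && items.get(left) < items.get(smallest)) smallest = left;
            if (right < items.size() && items.get(right) < items.get(smallest)) smallest = right;
            if (smallest == i) break;
            swap(i, smallest);                       // sift down toward the smaller child
            i = smallest;
         }
      }
      return min;
   }

   private void swap(int a, int b) {
      int tmp = items.get(a);
      items.set(a, items.get(b));
      items.set(b, tmp);
   }
}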

Hash Tables

Hash tables (also called hash maps, maps and dictionaries) are a vitally important data structure. As their alternate names suggest, hash tables let you map one value to another, e.g. a word to its definition. The beauty of hash tables is that they let you retrieve the value that you associate with the key in (usually) constant time; that is, the time to retrieve an item is independent of how many items are in the hash table. Internally, a hash table is a list of "buckets". The buckets are where you put the element to which you're mapping the key. 

The way that hash tables give such quick lookup times is that they apply a function (in the math sense of the word) to the key to generate a number. That function is expected to always return the same value for any given input (e.g. if you pass it the phrase "Bacon is awesome!" three times today, twice tomorrow, and seven times the day after, it should give you the same number back each time). It then takes that number, divides it by the number of buckets in the table, and then uses the remainder (this is called taking the modulo or modular arithmetic; most programming languages have an operator that'll do it for you) as the index of the bucket to put the element into. You're probably thinking, "Hey, what if two keys end up hashing to the same bucket?!" There are numerous ways of dealing with that, but the simplest are called chaining and linear probing. With chaining, you don't store the element directly in the bucket, you store a list in the bucket and just add the element to the end of that list. In linear probing, if you find that the bucket that the key hashed to is full, you try to fit it into the next bucket; if that's full, try the one after that and keep trying until you find an empty bucket (or you decide to resize your list of buckets). Linear probing can be a bit trickier, but due to cache effects, tends to be faster.
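
Here's a stripped-down sketch of the chaining approach (the names are mine and resizing is left out): the hash plus the modulo picks the bucket, and the list in each bucket soaks up collisions.

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class ChainedHashTable<K, V> {
   private static class Entry<K, V> {
      final K key;
      V value;
      Entry(K key, V value) { this.key = key; this.value = value; }
   }

   private final List<List<Entry<K, V>>> buckets = new ArrayList<>();

   public ChainedHashTable(int bucketCount) {
      for (int i = 0; i < bucketCount; i++) buckets.add(new LinkedList<>());
   }

   private List<Entry<K, V>> bucketFor(K key) {
      int index = Math.floorMod(key.hashCode(), buckets.size());   // hash the key, then take the remainder
      return buckets.get(index);
   }

   public void put(K key, V value) {
      for (Entry<K, V> e : bucketFor(key)) {
         if (e.key.equals(key)) { e.value = value; return; }       // key already present: replace the value
      }
      bucketFor(key).add(new Entry<>(key, value));                 // otherwise chain it onto the bucket's list
   }

   public V get(K key) {
      for (Entry<K, V> e : bucketFor(key)) {
         if (e.key.equals(key)) return e.value;
      }
      return null;
   }
}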

To implement a set with a hash table, just use the element to add to the set as the key and map it to some value; the exact value doesn't really matter. If the table contains the key, then the element is in the set. If you make the value an integer, though, you can keep a count and turn the set into a bag. 
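
Here's a tiny sketch of that trick on top of Java's HashMap; the class name (HashBag) is mine.

import java.util.HashMap;
import java.util.Map;

public class HashBag<T> {
   private final Map<T, Integer> counts = new HashMap<>();

   public void add(T element) {
      counts.merge(element, 1, Integer::sum);   // bump the count; with a fixed value this is just a set
   }

   public boolean contains(T element) {
      return counts.containsKey(element);
   }

   public int count(T element) {
      return counts.getOrDefault(element, 0);
   }
}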

To Be Continued...

This post has gotten pretty long, but it's covered most of the basic data structures. In the next post (coming at the same bat-time, same bat-channel), we'll cover some sorting algorithms.

Saturday, May 25, 2013

SSL Gently

A lot of people find SSL (and TLS) confusing; hopefully this will help clarify how it works. Before we get started, let's define some terms.

Symmetric Key Encryption
This is what you probably think of when you think of encryption. It's the oldest form of encryption and relies on a shared secret. Basically, if two parties have the shared key and no one else does, they can use that key to communicate without anyone else being able to read what they're saying. This has the benefit of being relatively fast, with the obvious downside of requiring a way to securely share a secret.
Asymmetric Key Encryption
Also known as public key encryption and PKI (though PKI is more than just asymmetric key encryption), asymmetric key encryption uses a pair of keys instead of a single key known by both sides. Of the pair of keys, one is called the public key and the other the private key. As the name suggests, you share the public key with the parties with whom you're communicating, but keep the private key private. Asymmetric key encryption uses math concepts like exponentiation, modular arithmetic and elliptic curves to allow you to encode information with one key and decode it with the other, but not allow anyone who doesn't have the other key to decode it. Don't worry too much if you don't understand modular arithmetic or elliptic curves; crypto libraries will take care of the math for you (while also protecting from vulnerabilities that could be introduced by coding errors). People who want to send information to you that only you can read would encrypt it with your public key and you would then decrypt it with your private key. On the other hand, you can encrypt something with your private key and then people can see that it really is from you by checking it against your public key.
Cipher Suite
Basically just a named set of cryptographic algorithms that you're willing to use, similar to announcing protocols that you understand. An example would be a suite using AES256, which says that you're willing to use 256-bit AES (Advanced Encryption Standard) for the bulk encryption; a full cipher suite name also spells out the key exchange, authentication and integrity algorithms.
Digital Signature
A digital signature is a way to prove that something came from you. The basic way that they work is that you use a secure hashing algorithm to take a digest of the message that you want to send and then encrypt that digest with your private key (there's a small code sketch of signing and verifying a message right after these definitions). If you're not familiar with digests & secure hashing, all that it means is that you use a function that maps a stream of information of arbitrary length into a sequence of bytes of (frequently shorter) fixed size; the most important features of the function are that given any two messages, it is highly unlikely that they'll map to the same sequence of bytes and, if you make any change to a message, even if it's a single bit, the output value will be vastly different (the goal is for ~50% of the bits to change and for the changed bits to be spread throughout the result).
X.509 Certificate
Often just called a certificate, an X.509 certificate is a way to assert your identity by pairing some information about you (frequently a name, an organization with which you're affiliated, and a location) with the public key of your public/private key pair. To keep people from messing with the name/key pairing in transit, the certificate gets signed. It can either be signed by you, in which case it's called a self-signed certificate, or it can be signed by a certificate authority.
Certificate Authority
Frequently abbreviated as CA, a certificate authority is a trusted third party; specifically, you trust the CA when it tells you that a certificate is owned by the entity that claims to own it. Technically anyone can be a CA if they set up a key with which to sign certificates, but that doesn't mean you have to trust them. As an example, if you have a friend named Bill, who tells you that the person you just met is Jill, then Bill is basically taking the same role as a CA.
Trust Chain
A trust chain is basically transitive trust. A CA doesn't necessarily just assert that a person is who they say they are; it can also assert that that person is trustworthy to identify other people for you. Continuing our example, if Bill tells you that he trusts Jill to identify people, and Jill tells you that a person you're meeting is Steve, then that's a trust chain. You trust Steve's identity because you trust Bill, who trusts Jill, who asserted that the person is Steve.
Trust Store
I'm not sure how widely this term is used outside of Java, but pretty much everything that does asymmetric key encryption has a similar concept. The trust store is just a central place to store the certificates (and public keys) of the CAs that you trust.
Key Store
This stores your certificate and private key and, thus, needs to be protected. This & the trust store cause quite a bit of confusion for folks since, with Java, they're both implemented as keystores.
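
Here's the signing-and-verifying sketch promised in the digital signature definition, using Java's built-in java.security APIs; the throwaway RSA key pair is generated on the fly purely for illustration.

import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

public class SignatureExample {
   public static void main(String[] args) throws Exception {
      // Generate a throwaway RSA key pair just for this example
      KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
      generator.initialize(2048);
      KeyPair keyPair = generator.generateKeyPair();

      byte[] message = "Bacon is awesome!".getBytes(StandardCharsets.UTF_8);

      // Sign: hash the message (SHA-256) and encrypt the digest with the private key
      Signature signer = Signature.getInstance("SHA256withRSA");
      signer.initSign(keyPair.getPrivate());
      signer.update(message);
      byte[] signature = signer.sign();

      // Verify: anyone with the public key can check the message is unchanged and came from us
      Signature verifier = Signature.getInstance("SHA256withRSA");
      verifier.initVerify(keyPair.getPublic());
      verifier.update(message);
      System.out.println("Signature valid? " + verifier.verify(signature));
   }
}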

Ok, I admit, that may have been a fire hose of information, but hopefully you're still with me, because armed with it the next bit is a fair bit simpler. I'm going to leave out a lot of details (like exchanging pre-master secret information) that are essential to implementing TLS but just muddy the waters when you're trying to understand how it works at a high level. If you're looking for all of the nitty-gritty details, they're laid out in the TLS RFC.


  1. Client Hello - The client sends a message asking to start the TLS handshake, announcing various bits of information like whether it supports compression and what cipher suites it supports
  2. Server Hello - The server responds, confirming the use of compression and the cipher suite that'll be used
  3. Server Certificate - The server presents its certificate (from its key store) to the client
  4. The server asks the client for its certificate if it's doing mutual authentication
  5. The server then sends a message that it's finished saying Hello
  6. If the server requested the client's certificate, the client provides it
  7. The server looks in its trust store for the CA that signed the client's certificate; if it doesn't trust the CA that signed the client's cert, it won't accept that certificate as valid
  8. The client generates a random secret (this is the pre-master secret mentioned earlier), encrypts it with the server's public key and sends it to the server; only the server can decrypt it, and both sides use it to derive the symmetric keys they'll switch to at the end of the handshake
  9. The client then takes a digital signature of everything that has transpired so far (including the random numbers that have been passed back and forth) using the client's private key, and sends that to the server
  10. The server then checks the digital signature provided by the client and, if it matches, then it knows that the client is in possession of the private key associated with the certificate
  11. The client & server exchange ChangeCipherSpec messages that basically say that everything from this point will be encrypted with the symmetric key that they agreed upon during the handshake (they switch to symmetric key encryption because it's much faster)
  12. The client & server exchange messages to say that they're finished with the handshake; periodically they'll re-send ChangeCipherSpec messages to change the symmetric key that they're using because reusing a symmetric key has a certain amount of risk and the more messages you send with it, the greater the risk
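
To see where the key store and trust store show up in code, here's a minimal Java sketch of setting up a TLS client; the host name, file names and passwords are placeholders, and error handling is omitted.

import javax.net.ssl.*;
import java.io.FileInputStream;
import java.security.KeyStore;

public class TlsClientExample {
   public static void main(String[] args) throws Exception {
      // Load the trust store (the CAs we trust); path and password are made up for illustration
      KeyStore trustStore = KeyStore.getInstance("JKS");
      try (FileInputStream in = new FileInputStream("truststore.jks")) {
         trustStore.load(in, "changeit".toCharArray());
      }
      TrustManagerFactory tmf = TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
      tmf.init(trustStore);

      // Load the key store (our certificate and private key); only needed for mutual authentication
      KeyStore keyStore = KeyStore.getInstance("JKS");
      try (FileInputStream in = new FileInputStream("keystore.jks")) {
         keyStore.load(in, "changeit".toCharArray());
      }
      KeyManagerFactory kmf = KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
      kmf.init(keyStore, "changeit".toCharArray());

      // Build an SSLContext; the handshake described above runs when the socket is first used
      SSLContext ctx = SSLContext.getInstance("TLS");
      ctx.init(kmf.getKeyManagers(), tmf.getTrustManagers(), null);

      try (SSLSocket socket = (SSLSocket) ctx.getSocketFactory().createSocket("example.com", 443)) {
         socket.startHandshake();   // Client Hello, Server Hello, certificate exchange, etc.
         System.out.println("Negotiated cipher suite: " + socket.getSession().getCipherSuite());
      }
   }
}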



Sunday, May 19, 2013

Multithreading & Performance

If you're writing multithreaded code, you're probably doing it to improve performance; either that or you know who'll be maintaining the code and you really don't like them. That makes it all the more painful that making code multithreaded can potentially not only not help, but actually hurt performance if you're not careful.  I'll try to cover some of the common performance pitfalls here and offer some recommendations to fix them.

Excessive Synchronization

By definition, only one thread can hold a mutex (e.g. be in a particular synchronized block) at once, so whenever you synchronize you serialize your code. From that, we can see that the more of your code is in a mutex, the less benefit you're going to get from using multiple threads. This gotcha seems to be fairly common in web applications, especially in its most insidious form, the static synchronized method. When you see static synchronized in a web application, it's telling you that for the duration of that method you can have exactly one concurrent user; not exactly the height of scalability. Synchronized methods are painful because they acquire the mutex on method entry and don't release it until the method is done; adding to the cost is that they synchronize on this, so you can only be in one synchronized method of that instance at a time. Synchronized static methods kick it up a notch by synchronizing on the actual class object, so that only one thread can be in any static synchronized method of the class.

There are two aspects to avoiding excess synchronization. The first, and easiest, is to not hold a lock if you don't need to. For example, a method that splits a string on commas shouldn't need synchronized (unless it's doing so in a streaming fashion so that it's maintaining state whose scope is outside the method).

The second, and more complicated, aspect is increasing the granularity of your locks; i.e. introducing additional lock objects so that each one protects a smaller portion of the code. A concurrent hash map is probably the canonical example of replacing a coarse lock with fine locks. A thread-safe hash map could use one lock to protect its state, but that would limit interaction with it to one thread at a time. A more scalable implementation may use a separate lock per bucket, so that different threads can access the map simultaneously as long as they're not accessing the same bucket.
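
Here's a sketch of the lock-per-bucket (lock striping) idea; it's only meant to illustrate the granularity point, not how java.util.concurrent.ConcurrentHashMap is actually implemented, and the names and stripe count are made up.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

public class StripedMap<K, V> {
   private static final int STRIPES = 16;
   private final ReentrantLock[] locks = new ReentrantLock[STRIPES];
   @SuppressWarnings("unchecked")
   private final Map<K, V>[] stripes = new Map[STRIPES];

   public StripedMap() {
      for (int i = 0; i < STRIPES; i++) {
         locks[i] = new ReentrantLock();
         stripes[i] = new HashMap<>();
      }
   }

   private int stripeFor(K key) { return Math.floorMod(key.hashCode(), STRIPES); }

   public void put(K key, V value) {
      int s = stripeFor(key);
      locks[s].lock();                    // only threads hitting this stripe wait here
      try { stripes[s].put(key, value); }
      finally { locks[s].unlock(); }
   }

   public V get(K key) {
      int s = stripeFor(key);
      locks[s].lock();
      try { return stripes[s].get(key); }
      finally { locks[s].unlock(); }
   }
}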

Sharing is (Not) Caring

Writing to shared state from multiple threads will lead to one of two things (in a multi-core system):
1) Lots of memory traffic and cache misses (poor performance)
2) Incorrect results because you don't have proper synchronization

If you're really unlucky, you get both poor performance and incorrect results. A simple example of where you may run into this would be with counters, either visit counters, performance counters or even simply summing the elements of a large list. Take the following code, for example:


import java.util.List;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class Summation {
   private int sum = 0;
   private final Lock lock = new ReentrantLock();

   private class SumCalculator extends Thread {
      private final List<Integer> list;
      private final int start, end;
      public SumCalculator(List<Integer> list, int start, int end) {
         this.list = list; this.start = start; this.end = end;
      }
      @Override
      public void run() {
         for (int i = start; i < end; ++i) {
            lock.lock();               // every addition contends for the same lock...
            try {
               sum += list.get(i);     // ...and writes to the same shared field
            } finally {
               lock.unlock();
            }
         }
      }
   }
}

The sum will be correctly calculated by the SumCalculator threads (assuming the list is split up correctly between them); however, it's highly likely that performance will be as bad as, if not significantly worse than, calculating the sum in a single thread. One very simple change, however, will dramatically improve performance: if the sum is moved into the thread (so that its scope is limited to a single SumCalculator), then there's no need for locking and there's far less traffic on the memory bus. Of course, you still need to calculate the sum of the threads' sums, which could take a while if there are a lot of threads, but in that case the same approach can be applied to calculate their sums in parallel (at which point you should be looking at fork-join, since it's well suited to divide-and-conquer problems like that).
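
Here's a sketch of that change; the class and field names (LocalSumCalculator, localSum) are mine. After start()ing the calculators and join()ing them, the main thread just adds up the partial sums.

import java.util.List;

public class LocalSumCalculator extends Thread {
   private final List<Integer> list;
   private final int start, end;
   private long localSum = 0;   // private to this thread: no lock, no writes to shared state

   public LocalSumCalculator(List<Integer> list, int start, int end) {
      this.list = list; this.start = start; this.end = end;
   }

   @Override
   public void run() {
      for (int i = start; i < end; ++i) {
         localSum += list.get(i);   // the hot loop needs no synchronization at all
      }
   }

   public long getLocalSum() { return localSum; }   // read after join() and combine the partial sums
}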

If you've been reading the Javadoc for java.util.concurrent.atomic, you may be wondering why not use AtomicInteger, since that would remove the need for the lock. The problem with AtomicInteger in this case is that it would probably experience worse performance than the locking approach because it wouldn't reduce memory traffic (in fact, it may greatly increase it) and it would be highly contended. The Atomic* classes use Compare-and-Set (CAS) instead of locking. What that means is that addAndGet (and all of the other mutating methods) get the value, apply the mutation and then atomically set the value if the value had not changed since the get. In pseudocode, it looks like this:

while (true) {
   val origValue = getValue()
   val newValue = origValue + delta
   atomic {
      if (origValue == getValue()) {
         setValue(newValue)
         break
      }
   }
}


As you may have guessed, if the value is changing frequently that loop may execute many, many times, leading to potentially very poor performance.