Sunday, June 23, 2013

CS 101 Part 2

It's time for the (probably not so) eagerly awaited part 2 of my intro to comp. sci series! We're moving on now to the algorithms which, despite their name, have nothing to do with the former vice president's musical stylings. Algorithms are basically recipes for doing things, and the word comes from the name of the guy who gave us algebra (al-Khwarizmi). Since computers spend an inordinate amount of their time sorting, that's where we'll start, after a brief detour into Big-O notation.

Big O Notation

For most things that you need a computer to do, it takes more time the more things you need it to operate on; for example, it takes longer to calculate the sum of 100 numbers than the sum of 10. Makes sense, right? What Big O notation gives us is a way to specify an upper bound on how badly an algorithm slows down as the number of elements it has to consider grows. Some examples might help make that clearer. Calculating the sum of n numbers takes O(n) time, since the time grows linearly with the number of elements (we say that it runs in linear time, for obvious reasons). If you want to calculate the result of multiplying every number in that list by every other number in it, that's O(n²), since for each of the n elements you have to perform n calculations, so the total number of calculations is n * n.
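
To make that concrete, here's a minimal sketch (the method names are mine, not from any library): summing touches each element once, while computing every pairwise product does n units of work for each of the n elements.

static long sumAll(int[] nums) {
   long sum = 0;
   for (int x : nums) sum += x;                      // n additions: O(n)
   return sum;
}

static long[][] allPairProducts(int[] nums) {
   int n = nums.length;
   long[][] products = new long[n][n];
   for (int i = 0; i < n; i++)
      for (int j = 0; j < n; j++)
         products[i][j] = (long) nums[i] * nums[j];  // n * n multiplications: O(n^2)
   return products;
}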

To simplify things, with Big O notation we focus on the dominating aspect of the algorithm. By that, I mean we focus on the aspect of the algorithm that tends to chew up the most time. For example, if you have x + y = z, and x is 0.00001 and y is 1,000,000, very nearly all of the value of z is determined by y, so we'd say that y dominates. Likewise, n³ dominates n², which dominates n. Focusing on the dominating component lets us simplify by dropping the less significant portions, e.g. O(5n³ + 975n² + 8,000,000) is simplified to just O(n³) since, as n approaches infinity, the n³ portion will dominate the other components.

It's important to note that Big O notation does not tell us which of two algorithms is faster; it just gives us an idea of how an algorithm will behave as the input size grows. For example, there are some algorithms for operations on matrices that have asymptotically better behavior (that is, better behavior as the input size approaches infinity) than the algorithms that are generally used. The catch, though, is that they have hidden constants so large that they're slower than the simpler algorithms for any input size smaller than the number of molecules in the known universe.

Some runtimes that you'll need to know because they come up frequently are:
O(n)
Linear time; if you double the input size, the runtime doubles, and if you multiply the input size by ten, the runtime goes up tenfold.
O(n log n)
The log is base 2, which means that for the log factor to increase by 1, n has to double.
O(n²)
Quadratic time; the runtime grows as the square of the input size, so if you double n, the runtime goes up 4 times; quadruple it and the runtime goes up 16 times.

Sorting

It's at about this point that most books introduce bubble sort. I'm not sure why, since it's fairly complicated, not particularly intuitive and it has a worst-case runtime of O(n²). Instead, I'm going to introduce insertion sort and selection sort, both of which are fairly intuitive since they're the way that humans sort. Most of the sorts that we'll be examining depend on this one observation: a list of size one is always sorted.

Selection Sort

The idea behind selection sort is that you look through the list for the smallest element. Once you find it, you put it at the head of the list. Then you look through the list for the 2nd smallest element and put it in the 2nd position. Just keep repeating that until you're done. Selection sort is simple and easy to implement; it also happens to take O(n²) time, so it's fairly slow on large lists.
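
Here's a minimal selection sort sketch along those lines (the method name is mine): find the smallest remaining element and swap it into the next position, repeating until done.

static void selectionSort(int[] a) {
   for (int i = 0; i < a.length - 1; i++) {
      int smallest = i;
      for (int j = i + 1; j < a.length; j++) {
         if (a[j] < a[smallest]) smallest = j;  // remember the smallest so far
      }
      int tmp = a[i];                           // swap it into position i
      a[i] = a[smallest];
      a[smallest] = tmp;
   }
}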

Insertion Sort

Insertion sort is not vastly different from selection sort, even though it's conceptually working in exactly the opposite way. With insertion sort, you take each element, find the position in the list where it belongs, and put it there. This depends on that key observation from before, that a list of size one is already sorted; we use that to say that the first element of the list is sorted, and then we proceed to extend the length of that sorted list one element at a time. Insertion sort is simple, easy to implement and really fast when the list is small, but its worst-case runtime (when the list is in exactly the opposite order of what you want) is O(n²), so it can be horrendously slow.

We can make insertion sort a little bit more complicated, though, and have its runtime approach linear the closer the list is to being sorted. The trick is to only look for the proper position of an element if it's not already in its proper position, i.e. if it's smaller than the element before it (assuming you're sorting from smallest to largest; otherwise, look for its proper position if it's larger than the element before it); with that check in place, a list that's already sorted is "sorted" in a single linear pass. You could also use binary search to find each element's proper position, which limits the number of comparisons to O(n log n), but shifting elements into place still makes the worst case O(n²) and you'd lose much of insertion sort's simplicity.
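
Here's a sketch of that version (the method name is mine). Because the inner loop only shifts when the element is smaller than its predecessor, an already-sorted array costs just one comparison per element.

static void insertionSort(int[] a) {
   for (int i = 1; i < a.length; i++) {
      int current = a[i];
      int j = i - 1;
      while (j >= 0 && a[j] > current) {  // only runs if current is out of place
         a[j + 1] = a[j];                 // shift the larger element right
         j--;
      }
      a[j + 1] = current;
   }
}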

Divide and Conquer Sorting Algorithms

Now we're getting into a class of sorting algorithms that trade simplicity and obviousness for speed. The underlying observation that makes all of them work in theory is the one above, that a list of size one is always sorted (in practice, we usually get down to sublists of a couple dozen elements and then use insertion sort on those). Building from that observation, we can say that a list is composed of a lot of sorted sub-lists that we just need to merge. All of the following algorithms boil down to two questions: how do we split the list into sublists, and how do we merge them back together?

Quick Sort


With quick sort, we pick some value from the list, called a pivot. Then we split the list into two lists, one containing all of the elements less than the pivot and the other containing all of the elements greater than the pivot, and then do the same to each of the sub-lists. This will eventually get you to the point where all of the sublists are sorted, because they each contain one element. In terms of implementation, start from the first element of the list and move toward the end until you find an element larger than the pivot; then start from the end and move toward the beginning until you find an element smaller than the pivot, and swap the two. When the indexes meet, you've got the list split and it's time to do the same to the two sublists.
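
A sketch of that in-place splitting scheme (the middle-element pivot choice is arbitrary, and the method name is mine); calling quickSort(a, 0, a.length - 1) sorts the whole array.

static void quickSort(int[] a, int lo, int hi) {
   if (lo >= hi) return;                    // zero or one element: already sorted
   int pivot = a[(lo + hi) / 2];
   int i = lo, j = hi;
   while (i <= j) {
      while (a[i] < pivot) i++;             // scan forward for a too-big element
      while (a[j] > pivot) j--;             // scan backward for a too-small element
      if (i <= j) {
         int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
         i++; j--;
      }
   }
   quickSort(a, lo, j);                     // the indexes have met: recurse on both halves
   quickSort(a, i, hi);
}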

The runtime of quick sort is a bit tricky. Its average-case runtime is O(n log n), but its worst case is O(n²). Fortunately, we rarely see the worst case (there are tricks that can be pulled to make the odds of hitting it 1/n!; that's n factorial, not just an exclamation mark).

Quick sort is a good default sorting algorithm because it's fairly simple & cache friendly and, since the splitting phase does all of the work, the merge phase is trivial.

Merge Sort

Where quick sort does all of its work in the split phase, merge sort does all of its work in the merge phase. The split phase just splits the list in half and then sorts the two sublists with merge sort. Doing that repeatedly will eventually get you down to n sorted sublists, each with one element. The merge phase proceeds by taking two sorted sublists and combining them into one sorted list by taking the smallest element of the two lists and putting it in the next position in the combined list until both lists are exhausted. 
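
A sketch of merge sort along those lines, merging into a scratch array (the names are mine); calling mergeSort(a, new int[a.length], 0, a.length) sorts the whole array.

static void mergeSort(int[] a, int[] scratch, int lo, int hi) {
   if (hi - lo < 2) return;                 // a list of size one is already sorted
   int mid = (lo + hi) / 2;
   mergeSort(a, scratch, lo, mid);          // split: just cut the range in half
   mergeSort(a, scratch, mid, hi);
   int left = lo, right = mid;
   for (int i = lo; i < hi; i++) {          // merge: always take the smaller head
      if (left < mid && (right >= hi || a[left] <= a[right])) {
         scratch[i] = a[left++];
      } else {
         scratch[i] = a[right++];
      }
   }
   System.arraycopy(scratch, lo, a, lo, hi - lo);
}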

Merge sort has a very simple run-time to consider; its best, worst and average case times are all O(n log n). 

Merge sort has the advantages of being pretty simple to implement and cache friendly, but it has the disadvantage of requiring additional memory proportional to the amount taken by the list to be sorted. There are ways of implementing merge sort that do not require additional memory, but they are more complicated and slower (they're not slower in Big-O terms, but they are slower in practice). Merge sort also has the nice property that it's a stable sort (if the list contains two elements that compare equally, the first one will end up first; e.g. if you want students sorted by age and then, within an age bracket by name, you can sort by name and then sort by age and it'll line up properly). Finishing off the list of nice properties of merge sort is that its cache-friendliness extends to disk; if your data is too large to fit into RAM, you're probably going to be implementing at least the merge phase of merge sort on disk.

Heap Sort

This is a bit of an oddity in that it's not a divide-and-conquer algorithm, it just exploits the properties of a heap. Insert all of the elements to be sorted into a heap, then remove them and they'll be ordered. Like the other comparison-based sorts, heap sort is O(n log n), though all of its other attributes (like cache friendliness and stability) depend entirely on the heap implementation.
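
Since java.util.PriorityQueue is a binary min-heap, heap sort can be sketched as nothing more than pushing everything in and pulling it back out (this leans on the library's heap rather than building one by hand; the method name is mine).

import java.util.PriorityQueue;

static void heapSort(int[] a) {
   PriorityQueue<Integer> heap = new PriorityQueue<>();
   for (int x : a) heap.offer(x);           // O(log n) per insert
   for (int i = 0; i < a.length; i++) {
      a[i] = heap.poll();                   // removal order is ascending
   }
}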

Comparison-Free Sorting

Now we're leaving the friendly, inclusive world of general-purpose sorting algorithms to get into the real speed demons: the sorting algorithms that don't do any comparisons to sort! These all exploit peculiarities in the data to be sorted to eliminate all comparisons and run in linear time.

Counting Sort

Counting sort relies on its input data being constrained to a small set of possible values, e.g. student ages. To do a counting sort, make one pass through the data counting how many elements there are with each value. Then, using that information, compute the position at which each value should start, and run through the data again, putting each element in the position where it belongs. Going back to the student ages example, if there are three 17-year-olds, four 18-year-olds and two 19-year-olds, the starting positions are 0, 3 and 7. If the first element is a 19-year-old, they go in position 7 and the 19-year-olds' starting position is incremented to 8.
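
A minimal sketch for values known to lie between 0 and some maximum (the names are mine):

static int[] countingSort(int[] a, int maxValue) {
   int[] counts = new int[maxValue + 1];
   for (int x : a) counts[x]++;                       // pass 1: count each value
   int[] starts = new int[maxValue + 1];
   for (int v = 1; v <= maxValue; v++) {
      starts[v] = starts[v - 1] + counts[v - 1];      // where each value begins
   }
   int[] sorted = new int[a.length];
   for (int x : a) sorted[starts[x]++] = x;           // pass 2: place, then bump
   return sorted;
}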

Counting sort runs in linear time and is very cache friendly, but it does require some extra memory (to store the counts) and won't work if the set of potentially valid inputs is unbounded.

Radix Sort

I'm going to deviate from the normal way of discussing sorting by using a list of strings (on the off chance that you're not familiar with what strings are, they're just a sequence of characters) instead of a list of integers to discuss radix sort, because it's more obvious what's going on with strings.

To radix sort a list of strings, sort all of the strings based on their first character; then, within the sublist for each first letter, sort again by the second letter, and so on until the entire list is completely sorted. Because each pass only looks at a single character, we can use counting sort to perform it, so we still avoid doing comparisons.

Now, to take radix sort back to integers, we may have to do something a little odd. If you're considering arbitrary-length integers and you choose your radix (base) to be 10, you have to do the sort starting with the least significant digit. When dealing with most integers, though, we'd choose the radix to be 256 so that we're sorting a byte at a time; combine that with most integers being either 32- or 64-bit numbers and we get really good, linear-time performance.
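
Here's a least-significant-digit sketch for non-negative ints with a radix of 256, using a counting sort for each byte (negative numbers need extra handling for the sign bit, so they're left out; the method name is mine).

static void radixSort(int[] a) {
   int[] buffer = new int[a.length];
   for (int shift = 0; shift < 32; shift += 8) {      // one pass per byte
      int[] counts = new int[257];
      for (int x : a) counts[((x >>> shift) & 0xFF) + 1]++;
      for (int i = 1; i < 257; i++) counts[i] += counts[i - 1];   // starting positions
      for (int x : a) buffer[counts[(x >>> shift) & 0xFF]++] = x; // stable placement
      System.arraycopy(buffer, 0, a, 0, a.length);
   }
}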

The runtime of radix sort is linear, but it's linear both in terms of the number of elements and the average length of the prefixes of those elements that have to be considered to get them sorted (technically all of the comparison-based sorts also have that caveat, but it gets hidden as a comparison, which is typically treated as a constant-time operation).

Bucket Sort

Bucket sort has a lot in common with counting & radix sort and can easily be combined with them. The basic idea behind bucket sort is that you put each element into a bucket and then sort the buckets. Going back to the old standby of sorting people by age, you might do it by putting people into a bucket for their age, so that everyone of the same age clumps together. Bucket sort plays nicely with hash maps, but isn't as cache friendly as counting and radix sort. One big benefit of bucket sort, though, is that the set of valid values can be infinite, since you can make buckets as needed instead of having to create them all ahead of time.

Well, that hits the highlights of sorting! Next time, we'll look into searching a collection for an item (searching for documents containing a word is a more advanced topic that I may cover later) and maybe touch on some graph algorithms.

Thursday, May 30, 2013

Crash Course on CS 101

This is part 1 of a (probably short) series that'll be a survey of some of the basic concepts in computer science. I'll start with basic, fundamental stuff and work up from there.

The Basic Data Structures

Lists

The granddaddy of 'em all, lists (or their cousin, the humble array) underlie most other data structures and algorithms. Lists provide one wonderful guarantee: they keep things in the order that you tell them to. The basic operations we have on a list are to add things to it, remove things from it and enumerate what's in it. Lists are implemented in lots of ways, but the two most common are linked lists and array lists.

A linked list is probably the easiest list to implement and is formed by nodes which point at each other. Linked lists come in two (basic) forms, singly-linked and doubly-linked, with the only difference being that doubly-linked lists also retain a pointer to the prior node. Adding things to, and removing things from, the beginning of a linked list is extremely fast, since it's just assigning a pointer. Depending on the implementation (basically whether you store a tail pointer or not), the same can be said for operating on the end of the list. There are even algorithms for allowing concurrent modification of linked lists without locking, but that's a whole other level of complexity from what we're discussing here; if you're dying to dig into those details, search for "lock free linked list".
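
A bare-bones singly-linked list sketch (the names are mine): adding at the head is one pointer assignment, but reaching element i means walking i nodes.

public class SinglyLinkedList<T> {
   private static class Node<T> {
      T value;
      Node<T> next;
      Node(T value, Node<T> next) { this.value = value; this.next = next; }
   }

   private Node<T> head;

   public void addFirst(T value) { head = new Node<>(value, head); }

   public T get(int index) {                // no bounds checking, to keep it short
      Node<T> current = head;
      for (int i = 0; i < index; i++) current = current.next;
      return current.value;
   }
}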

So, linked lists are dirt simple to implement, offer instantaneous access to the first & last elements in them and can even be implemented in a thread-safe, lock-free manner; given all that awesomeness, you may be wondering why anyone would use any other kind of list. Linked lists have a handful of issues; the first is that for every element in them, you have at least one pointer so you can get to the next item (two if it's a doubly-linked list and you're not storing the pointers packed into one). The second issue is that to access any particular element in the list, you have to access every element before it. Finally, there's the problem that linked lists are so not cache-friendly that it's probably more accurate to call them cache hostile. It's entirely possible that every single element that you access will mean a cache miss and, if you're phenomenally unlucky, a TLB (Translation Lookaside Buffer; really important, but the details are tangential to this discussion) miss or even a page fault. In plain English, they can make your program very, very slow.

The other common implementation of lists backs them with an array. If you're using a C++ vector, or a Java ArrayList or Vector (the only difference between those two is that operations on Vector are all synchronized), you're using an array-backed implementation of a list. An array-backed list can access any of its elements in constant time (i.e. it doesn't depend on how many elements are in the list) and it's extremely cache-friendly when accessed in order.

Sadly, array backed lists are also not a panacea. Insertions and deletions at the head of an array-backed list require shifting every element of the list. Also, the list can get full, requiring resizing it and potentially copying the elements (you may get lucky and have realloc succeed at just resizing the array in place). To reduce the cost of resizing, we double the capacity of the list whenever it's full, so that the time to add to the list is, on average, constant.
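
Here's a toy sketch of the doubling strategy (the names are mine): when the backing array fills up, allocate one twice the size and copy, so adds are constant time on average.

import java.util.Arrays;

public class ArrayBackedList {
   private int[] elements = new int[8];
   private int size = 0;

   public void add(int value) {
      if (size == elements.length) {
         elements = Arrays.copyOf(elements, elements.length * 2);  // grow by doubling
      }
      elements[size++] = value;
   }

   public int get(int index) { return elements[index]; }           // constant time
}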

Stacks

Stacks are the "hidden" data structure; you use them all the time even if you're not aware of it. Every function invocation uses the stack. A stack is exactly what it sounds like, it's like a stack of plates, the last one that you put on it is the first thing that'll be removed. That's why they're also called LIFOs (Last-In First-Out). Stacks have a much more restricted set of operations than lists, supporting only push  (to put things on top of the stack) and pop (to remove things from the top of the stack). It's fairly easy to implement a stack in terms of a list and, in fact, that's usually how they are implemented. 

Queues

If you've ever had to stand in line for something, you're familiar with queues. They're a lot like lists, except they don't support random access; you only get things out in the order in which they were added (except for priority queues, which let some items cut in line).

Queues are pretty easy to implement with a list; however, for performance we usually use a circular array. Basically, instead of shifting everything toward the front as elements are removed, we just keep incrementing the positions we read from and write to, and take them modulo the array size to map them into the available slots. In other words, the positions simply wrap around and start again at the beginning.
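
A toy circular-array queue sketch (the names are mine; resizing when full is omitted):

public class CircularQueue {
   private final int[] slots;
   private int head = 0, size = 0;

   public CircularQueue(int capacity) { slots = new int[capacity]; }

   public void enqueue(int value) {                   // assumes the queue isn't full
      slots[(head + size) % slots.length] = value;    // write position wraps around
      size++;
   }

   public int dequeue() {                             // assumes the queue isn't empty
      int value = slots[head];
      head = (head + 1) % slots.length;               // read position wraps around
      size--;
      return value;
   }
}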

Sets

A set (in the context of data structures) is just a collection that does not allow duplicates. Unlike lists, queues & stacks, sets have no particular ordering. Sets generally support adding and removing elements and asking if an item is a member of the set. You can implement a set using a list, but that's fairly inefficient, so they're usually implemented with hash tables (which I'll get to shortly).

Bags

Bags are basically just sets that permit duplicates and are usually implemented the same way. There's really not much more to say about them.

More Advanced Data Structures

Now it's time to start digging in to some more advanced data structures; we're definitely not in Kansas anymore!

Graphs

I'm a bit hesitant to call a graph a data structure, since there's enough research on graphs for them to be a field of study unto themselves. That said, they are a way of organizing data, and some of the data structures that I'm going to discuss shortly are defined in terms of a graph, so here's a quick description (I'll discuss them in more depth when I cover graph algorithms). A graph is a set of nodes, called vertices (vertex is the singular), connected by edges; it's a lot like a subway system, where the stations are the vertices and the rail lines are the edges. Graphs of one variety or another are among the most widely used data structures, in no small part because edges can have weights, which opens the way for us to answer a lot of interesting questions.

Graphs can be directed or undirected. If the graph is directed, then you can only traverse edges in one direction, like one-way roads. Some graphs also have loops (more formally, cycles) leading from a node back to itself via its children (not counting an undirected link to/from a child node); the ones that do are called cyclic graphs. Graphs without loops are called, reasonably enough, acyclic.

The two most common implementations for graphs are as adjacency lists & adjacency matrices. Adjacency lists store the nodes and edges explicitly; e.g. you may have an array of nodes, each of which with a list of the edges connected to it. An adjacency matrix is an n x n matrix storing whether each node has an edge to each other node. Adjacency lists are a nice, compact form for storing sparse graphs (that is, graphs that have relatively few edges connecting their nodes); adjacency matrices are well-suited to storing graphs where many nodes are connected by edges.

Trees

Trees are, formally, directed acyclic graphs with a single most ancestral node, called the root node. The nodes with no children are called leaves. Trees come in lots of shapes and sizes, but one of the most common is the binary tree, and especially the binary search tree. A binary search tree has the handy feature that, for each node, its left child node (and all of that child's descendants) is less than it and, likewise, its right child and that child's descendants are greater. That lets us search the tree for any particular element in O(log n) time (assuming the tree is balanced, i.e. the leaves are all at about the same distance from the root); I'll discuss exactly what O(log n) means in the algorithms post, but for now it's enough to know that you have to square the number of elements in the tree to double the time that it takes to find an element.
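
A minimal binary search tree lookup sketch (the names are mine): at each node, go left if the target is smaller and right if it's larger, discarding about half of a balanced tree at every step.

class TreeNode {
   int value;
   TreeNode left, right;
}

static boolean contains(TreeNode node, int target) {
   while (node != null) {
      if (target == node.value) return true;
      node = (target < node.value) ? node.left : node.right;  // smaller: go left, larger: go right
   }
   return false;
}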

The most obvious implementation of trees is almost identical to a linked list except, instead of a Next pointer, you have Left and Right pointers. A more clever, and efficient, implementation stores the nodes as elements in an array without needing the pointers. For any node in position n of the array (with the root starting at 0), the left child goes in position (2*n + 1) and the right child goes in position (2*n + 2).

If you're interested in learning about other types of trees (and you certainly should be!), you should look into tries (very similar to trees, but really useful for searching for strings), suffix trees, B-trees and B+ trees. Also worth looking into are red-black trees, which provide a mechanism for keeping trees balanced.

Heaps

Heaps are a form of (typically binary) tree with the property that the value of every node is less than the value of its children, which, by extension, means that it's less than all of its descendants. We can use that to sort a bunch of information or to select the top (or bottom) n elements (where n is however many elements you want). The trick to maintaining heaps as you remove elements is to select the lesser of the root's children and make it the new root, then repeat for the empty spot that it left (all the way down the heap). Adding an element follows a similar path, but backwards. Add it to the bottom, then if it's smaller than its parent, swap them and repeat until it's either larger than its parent or it's at the root.
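
A sketch of the add path for a binary min-heap stored in an array (the method name is mine): after appending the new element at the end, swap it upward while it's smaller than its parent. With children at 2*n + 1 and 2*n + 2, the parent of position i is (i - 1) / 2.

static void siftUp(int[] heap, int index) {
   while (index > 0) {
      int parent = (index - 1) / 2;
      if (heap[index] >= heap[parent]) break;  // parent is smaller or equal: done
      int tmp = heap[index];                   // otherwise swap with the parent
      heap[index] = heap[parent];
      heap[parent] = tmp;
      index = parent;
   }
}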

Hash Tables

Hash tables (also called hash maps, maps and dictionaries) are a vitally important data structure. As their alternate names suggest, hash tables let you map one value to another, e.g. a word to its definition. The beauty of hash tables is that they let you retrieve the value that you associate with the key in (usually) constant time; that is, the time to retrieve an item is independent of how many items are in the hash table. Internally, a hash table is a list of "buckets". The buckets are where you put the element to which you're mapping the key. 

The way that hash tables give such quick lookup times is that they apply a function (in the math sense of the word) to the key to generate a number. That function is expected to always return the same value for any given input (e.g. if you pass it the phrase "Bacon is awesome!" three times today, twice tomorrow, and seven times the day after, it should give you the same number back each time). It then takes that number, divides it by the number of buckets in the table, and then uses the remainder (this is called taking the modulo or modular arithmetic; most programming languages have an operator that'll do it for you) as the index of the bucket to put the element into. You're probably thinking, "Hey, what if two keys end up hashing to the same bucket?!" There are numerous ways of dealing with that, but the simplest are called chaining and linear probing. With chaining, you don't store the element directly in the bucket, you store a list in the bucket and just add the element to the end of that list. In linear probing, if you find that the bucket that the key hashed to is full, you try to fit it into the next bucket; if that's full, try the one after that and keep trying until you find an empty bucket (or you decide to resize your list of buckets). Linear probing can be a bit trickier, but due to cache effects, tends to be faster.
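
A toy chaining sketch (the names are mine; resizing and null keys are omitted): hash the key, take the hash modulo the bucket count, and keep a small list per bucket for keys that collide.

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class ChainedMap<K, V> {
   private static class Entry<K, V> {
      final K key;
      V value;
      Entry(K key, V value) { this.key = key; this.value = value; }
   }

   private final List<List<Entry<K, V>>> buckets = new ArrayList<>();

   public ChainedMap(int bucketCount) {
      for (int i = 0; i < bucketCount; i++) buckets.add(new LinkedList<Entry<K, V>>());
   }

   private List<Entry<K, V>> bucketFor(K key) {
      int index = (key.hashCode() & 0x7FFFFFFF) % buckets.size();  // mask off the sign bit
      return buckets.get(index);
   }

   public void put(K key, V value) {
      for (Entry<K, V> e : bucketFor(key)) {
         if (e.key.equals(key)) { e.value = value; return; }       // overwrite existing key
      }
      bucketFor(key).add(new Entry<>(key, value));
   }

   public V get(K key) {
      for (Entry<K, V> e : bucketFor(key)) {
         if (e.key.equals(key)) return e.value;
      }
      return null;                                                 // not in the map
   }
}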

To implement a set with a hash table, just use the element to add to the set as the key and map it to some value; the exact value doesn't really matter. If the table contains the key, then the element is in the set. If you make the value an integer, though, you can keep a count and turn the set into a bag. 

To Be Continued...

This post has gotten pretty long, but it's covered most of the basic data structures. In the next post (coming some bat time to the same bat channel), we'll cover some sorting algorithms. 

Saturday, May 25, 2013

SSL Gently

A lot of people find SSL (and TLS) confusing; hopefully this will help clarify how it works. Before we get started, let's define some terms.

Symmetric Key Encryption
This is what you probably think of when you think of encryption. It's the oldest form of encryption and relies on a shared secret. Basically, if two parties have the shared key and no one else does, they can use that key to communicate without anyone else being able to read what they're saying. This has the benefit of being relatively fast, with the obvious downside of requiring a way to securely share a secret.
Asymmetric Key Encryption
Also known as public key encryption and PKI (though PKI is more than just asymmetric key encryption), asymmetric key encryption uses a pair of keys instead of a single key known by both sides. Of the pair of keys, one is called the public key and the other the private key. As the name suggests, you share the public key with the parties with whom you're communicating, but keep the private key private. Asymmetric key encryption uses math concepts like exponentiation, modular arithmetic and elliptic curves to allow you to encode information with one key and decode it with the other, but not allow anyone who doesn't have the other key to decode it. Don't worry too much if you don't understand modular arithmetic or elliptic curves; crypto libraries will take care of the math for you (while also protecting from vulnerabilities that could be introduced by coding errors). People who want to send information to you that only you can read would encrypt it with your public key and you would then decrypt it with your private key. On the other hand, you can encrypt something with your private key and then people can see that it really is from you by checking it against your public key.
Cipher Suite
Basically just a kind of encryption that you support, similar to announcing which protocols you understand. An example would be AES256, which just says that you're willing to use 256-bit AES (Advanced Encryption Standard).
Digital Signature
A digital signature is a way to prove that something came from you. The basic way that they work is you use a secure hashing algorithm to take a digest of the message that you want to send and then encrypt that digest with your private key. If you're not familiar with digests & secure hashing, all that it means is that you use a function that maps a stream of information of arbitrary length into a sequence of bytes of (frequently shorter) fixed size; the most important features of the function are that given any two messages, it is highly unlikely that they'll map to the same sequence of bytes and, if you make any change to a message, even if it's a single bit, the output value will be vastly different (the goal is for ~50% of the bits to change and for the changed bits to be spread throughout the result)
X.509 Certificate
Often just called a certificate, an X.509 certificate is a way to assert your identity by pairing some information about you (frequently a name, an organization with which you're affiliated, and a location) with the public key of your public/private key pair. To keep people from messing with the name/key pair in transit, the certificate gets signed. It can either be signed by you, in which case it's called a self-signed certificate, or it can be signed by a certificate authority.
Certificate Authority
Frequently abbreviated as CA, a certificate authority is a trusted third party; specifically, you trust the CA when it tells you that a certificate is owned by the entity that claims to own it. Technically anyone can be a CA if they set up a key with which to sign certificates, but that doesn't mean you have to trust them. As an example, if you have a friend named Bill, who tells you that the person you just met is Jill, then Bill is basically taking the same role as a CA.
Trust Chain
A trust chain is basically transitive trust. A CA doesn't necessarily just assert that a person is who they say they are; it can also assert that that person is trustworthy to identify other people for you. Continuing our example, if Bill tells you that he trusts Jill to identify people, and Jill tells you that a person you're meeting is Steve, then that's a trust chain. You trust Steve's identity because you trust Bill, who trusts Jill, who asserted that the person is Steve.
Trust Store
I'm not sure how widely this term is used outside of Java, but pretty much everything that does asymmetric key encryption has a similar concept. The trust store is just a central place to store the certificates (and public keys) of the CAs that you trust
Key Store
This stores your certificate and private key and, thus, needs to be protected. This & the trust store cause quite a bit of confusion for folks since, with Java, they're both implemented as keystores.
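
Since the key store / trust store split is where people most often get tripped up in Java, here's a minimal sketch of loading both into an SSLContext; the store type, file names and passwords are placeholders, not recommendations.

import java.io.FileInputStream;
import java.security.KeyStore;
import javax.net.ssl.KeyManagerFactory;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

static SSLContext buildContext() throws Exception {
   char[] password = "changeit".toCharArray();

   KeyStore keyStore = KeyStore.getInstance("JKS");      // your certificate + private key
   keyStore.load(new FileInputStream("my-keystore.jks"), password);
   KeyManagerFactory kmf = KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
   kmf.init(keyStore, password);

   KeyStore trustStore = KeyStore.getInstance("JKS");    // certificates of the CAs you trust
   trustStore.load(new FileInputStream("my-truststore.jks"), password);
   TrustManagerFactory tmf = TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
   tmf.init(trustStore);

   SSLContext context = SSLContext.getInstance("TLS");
   context.init(kmf.getKeyManagers(), tmf.getTrustManagers(), null);
   return context;
}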

Ok, I admit, that may have been a fire hose of information, but hopefully you're still with me, because armed with all that the next bit is a fair bit simpler. I'm going to leave out a lot of details (like exchanging pre-master secret information) that are essential to implementing TLS but just muddy the waters when you're trying to understand how it works at a high level. If you're looking for all of the nitty-gritty details, they're laid out in the TLS RFC.


  1. Client Hello - The client sends a message asking to start the TLS handshake, announcing various bits of information like whether it supports compression and what cipher suites it supports
  2. Server Hello - The server responds, confirming the use of compression and the cipher suite that'll be used
  3. Server Certificate - The server presents its certificate (from its key store) to the client
  4. The server asks the client for its certificate if it's doing mutual authentication
  5. The server then sends a message that it's finished saying Hello
  6. If the server requested the client's certificate, the client provides it
  7. The server looks in its trust store for the CA that signed the client's certificate; if it doesn't trust the CA that signed the client's cert, it won't accept that certificate as valid
  8. The client generates a message including a random number and encrypts it with the server's public key and then sends it to the server (basically everything sent to the server from this point until the end of the handshake will be encrypted using the server's public key)
  9. The client then takes a digital signature of everything that has transpired so far (including the random numbers that have been passed back and forth) using the client's private key, and sends that to the server
  10. The server then checks the digital signature provided by the client and, if it matches, then it knows that the client is in possession of the private key associated with the certificate
  11. The client & server exchange ChangeCipherSpec messages that basically say that everything from this point will be encrypted with the symmetric key that they agreed upon during the handshake (they switch to symmetric key encryption because it's much faster)
  12. The client & server exchange messages to say that they're finished with the handshake; periodically they'll re-send ChangeCipherSpec messages to change the symmetric key that they're using because reusing a symmetric key has a certain amount of risk and the more messages you send with it, the greater the risk



Sunday, May 19, 2013

Multithreading & Performance

If you're writing multithreaded code, you're probably doing it to improve performance; either that or you know who'll be maintaining the code and you really don't like them. That makes it all the more painful that making code multithreaded can potentially not only not help, but actually hurt performance if you're not careful.  I'll try to cover some of the common performance pitfalls here and offer some recommendations to fix them.

Excessive Synchronization

By definition, only one thread can hold a mutex (e.g. be in a particular synchronized block) at once, so whenever you synchronize, you serialize your code. From that, we can see that the more of your code is inside a mutex, the less benefit you're going to get from using multiple threads. This gotcha seems to be fairly common in web applications, especially in its most insidious form, the static synchronized method. When you see static synchronized in a web application, it's telling you that for the duration of that method you can have exactly one concurrent user; not exactly the height of scalability. Synchronized methods are painful because they acquire the mutex on method entry and don't release it until the method is done; adding to the cost is that they synchronize on this, so you can only be in one synchronized method of a given instance at a time. Synchronized static methods kick it up a notch by synchronizing on the class object itself, so only one thread can be in any static synchronized method of the class.

There are two aspects to avoiding excess synchronization. The first, and easiest, is to not hold a lock if you don't need to. For example, a method that splits a string on commas shouldn't need synchronized (unless it's doing so in a streaming fashion so that it's maintaining state whose scope is outside the method).

The second, and more complicated, aspect is increasing the granularity of your locks; i.e. introducing additional lock objects so that each one protects a smaller portion of the code. A concurrent hash map is probably the canonical example of replacing a coarse lock with fine locks. A thread-safe hash map could use one lock to protect its state, but that would limit interaction with it to one thread at a time. A more scalable implementation may use a separate lock per bucket, so that different threads can access the map simultaneously as long as they're not accessing the same bucket.
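
Here's a toy sketch of that idea using a fixed array of lock "stripes" (the names are mine; ConcurrentHashMap's real implementation is considerably more sophisticated): threads whose keys hash to different stripes never block each other.

import java.util.concurrent.locks.ReentrantLock;

public class StripedCounters {
   private final ReentrantLock[] locks = new ReentrantLock[16];
   private final long[] counts = new long[16];

   public StripedCounters() {
      for (int i = 0; i < locks.length; i++) locks[i] = new ReentrantLock();
   }

   public void increment(Object key) {
      int stripe = (key.hashCode() & 0x7FFFFFFF) % locks.length;  // pick a stripe by hash
      locks[stripe].lock();
      try {
         counts[stripe]++;                                        // only this stripe is locked
      } finally {
         locks[stripe].unlock();
      }
   }
}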

Sharing is (Not) Caring

Writing to shared state from multiple threads will lead to one of two things (in a multi-core system):
1) Lots of memory traffic and cache misses (poor performance)
2) Incorrect results because you don't have proper synchronization

If you're really unlucky, you get both poor performance and incorrect results. A simple example of where you may run into this would be with counters, either visit counters, performance counters or even simply summing the elements of a large list. Take the following code, for example:


import java.util.List;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class Summation {
   private int sum = 0;
   private final Lock lock = new ReentrantLock();

   // Each SumCalculator sums one slice, but every addition contends for the single lock on 'sum'.
   private class SumCalculator extends Thread {
      private final List<Integer> list;
      private final int start, end;
      public SumCalculator(List<Integer> list, int start, int end) {
         this.list = list; this.start = start; this.end = end;
      }
      @Override
      public void run() {
         for (int i = start; i < end; ++i) {
            lock.lock();
            try {
               sum += list.get(i);
            } finally { lock.unlock(); }
         }
      }
   }
}

The sum will be correctly calculated by the SumCalculator threads (assuming the list is split up correctly); however, it's highly likely that performance will be as bad as, if not significantly worse than, calculating the sum in a single thread. One very simple change, however, will dramatically improve performance. If the sum is moved into the thread (so that its scope is limited to a single SumCalculator), then there's no need for locking and there's less traffic on the memory bus. Of course, you still need to calculate the sum of the threads' sums, which could take a while if there are a lot of threads, but in that case the same approach can be applied to calculate their sums in parallel (at which point you should be looking at fork-join, since it's well suited to divide-and-conquer problems like that).
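
Here's a sketch of that change (the class name is mine): each thread sums its own slice into a local field with no locking, and the main thread combines the partial results after join(), which is enough to make them safely visible.

import java.util.List;

class PartialSummer extends Thread {
   private final List<Integer> list;
   private final int start, end;
   private long partialSum = 0;              // only this thread touches it: no lock needed

   PartialSummer(List<Integer> list, int start, int end) {
      this.list = list; this.start = start; this.end = end;
   }

   @Override
   public void run() {
      for (int i = start; i < end; i++) partialSum += list.get(i);
   }

   long getPartialSum() { return partialSum; }  // call only after join()
}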

If you've been reading the Javadoc for java.util.concurrent.atomic, you may be wondering why not use AtomicInteger, since that would remove the need for the lock. The problem with AtomicInteger in this case is that it would probably experience worse performance than the locking approach because it wouldn't reduce memory traffic (in fact, it may greatly increase it) and it would be highly contended. The Atomic* classes use Compare-and-Set (CAS) instead of locking. What that means is that addAndGet (and all of the other mutating methods) get the value, apply the mutation and then atomically set the value if the value had not changed since the get. In pseudocode, it looks like this:

// Keep retrying until no other thread changed the value between our read
// and our atomic compare-and-set
while (true) {
   origValue = getValue();
   newValue = origValue + delta;
   atomic {
      if (origValue == getValue()) {
         setValue(newValue);
         break;
      }
   }
}


As you may have guessed, if the value is changing frequently that loop may execute many, many times, leading to potentially very poor performance.

Friday, March 1, 2013

Hodge Podge

This is just a hodge podge of short items; the 2nd part of the multithreading post will probably be up in a few days.

What's In a Name?

The methods in the Java standard libraries are generally well-named, but there are a few astonishingly irritating exceptions, and java.lang.Character seems to collect them.

Q: Which of the following returns true for a space, tab, newline and non-breaking space: isWhitespace, isSpace, isSpaceChar?

A: NONE OF THEM!

isSpace and isWhitespace return false for a non-breaking space; isSpaceChar returns false for newlines and tabs. If you want to know whether a character is any kind of whitespace, you need isSpaceChar || isWhitespace. As an added kick in the shins, the \s regular expression class also doesn't match a non-breaking space.
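
If you'd rather not re-derive that every time, a tiny helper does the trick (the method name is mine):

static boolean isAnyWhitespace(char c) {
   // isWhitespace catches tabs and newlines; isSpaceChar catches non-breaking spaces
   return Character.isWhitespace(c) || Character.isSpaceChar(c);
}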

Keep on Moving (Conditionally)

Branch mispredictions on modern processors hurt. Badly. Because of that, anything that you can do to make branches (e.g. if statements & the conditionals on loops) easier to predict will generally make your code faster. Sometimes, though, you just reach the point where you can't make the branches easier to predict. In those cases, you can sometimes play some tricks to replace them with conditional moves/sets and pick up some performance. For example, here's the body of the main loop for merging two arrays in merge sort:

if(arr1[index1] < arr2[index2]) {  
   mergeInto[index3++] = arr1[index1++];
} else {
   mergeInto[index3++] = arr2[index2++];
}

and here's a version that replaces the branches with conditional moves/sets

int read = arr1[index1] < arr2[index2] ? arr1[index1] : arr2[index2];
mergeInto[index3++] = read;
int addTo1 = arr1[index1] == read ? 1 : 0;
index1 += addTo1;
index2 += addTo1 == 0 ? 1 : 0;

On an Ivy Bridge laptop running Java 1.7, the latter code is about 1.50 nanoseconds (0.0000015 ms) / element faster for the low, low price of making your code's intentions a little less obvious and the code a little harder to read. It should go without saying, but you shouldn't drop to this level of optimization unless your code is too slow, you've profiled it and know that your bottleneck is there and at least strongly suspect that the problem is branch mispredictions.

Singletonish

java.util.concurrent has a lot of neat classes, including the ever-useful ConcurrentHashMap. ConcurrentHashMap has a nifty little method called putIfAbsent that we can use to create singleton-ish objects, for example if you need to run some expensive initialization for each user who logs on, but only want to do it once per user. The downside to putIfAbsent is that it takes an object, which would appear to mean that we have to instantiate the thing anyway. We can, however, take a page from Future's playbook and put in an object that will instantiate the object.

One thing you'll have to keep in mind is that you must check the return value from putIfAbsent, otherwise you'll be defeating the whole purpose and reinstantiating every time. Basically, here's the general pattern you'll want to use:

Instantiator instantiator = new Instantiator();
Instantiator fromMap = theMap.putIfAbsent(theKey, instantiator);
theValue = fromMap != null ? fromMap.getTheValue() : instantiator.getTheValue();

Instantiator's just a run-of-the-mill object that builds the object once and returns it. You'll need to do some synchronization to make sure you only instantiate its object once, but that's fairly cookie-cutter code.
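
For completeness, here's one way that cookie-cutter code might look (ExpensiveThing is a placeholder type, not anything from the post): the synchronized lazy getter guarantees the expensive object is built at most once per Instantiator.

public class Instantiator {
   private ExpensiveThing theValue;

   public synchronized ExpensiveThing getTheValue() {
      if (theValue == null) {
         theValue = new ExpensiveThing();   // runs at most once per Instantiator
      }
      return theValue;
   }
}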

Tuesday, February 26, 2013

Crash Course on Multiple Threads and Java


Despite rumors to the contrary, multithreaded programming in Java is really easy -- unless you want your code to be fast and correct, of course (many other languages fare even worse, having all of the trickiness with none of the performance benefits). Having seen a lot of apps (and even some widely used web frameworks!) end up subtly wrong, horribly slow or both, I'm throwing this quick introduction together to hopefully save you the same fate.

 The Root of the Problem

Shared Memory and Race Conditions

Java threads use shared memory (whether it be RAM, disk, or network) and having one thread modify that memory while another is accessing it, whether it be through a write or a read, is where we get the race conditions that'll make you tear out your hair. This can be fairly obvious like in the case of the following counter.


public class Counter {
   private int counter;
   // Not atomic: the read, the add and the write can interleave across
   // threads, so concurrent increments can be lost.
   public void increment() { counter = counter + 1; }
}


Race conditions can also be very non-obvious and more than a little tricky to track down, like one thread adding an entry to a HashMap while another tries to retrieve an entry from it. Just to make your life a little more fun, the behavior that that exhibits is that the call to get can go into an infinite loop. Oh joy.

This Whole Program's Out of Order!

The other hidden gremlin that's just waiting to throw a monkey wrench into the works is that the compiler and the CPU are perfectly free to reorder your program almost willy-nilly to get better performance (and quite a few of them absolutely will if it means that they can hide the latency of a memory access). I don't have an example handy that's not also just a pure race condition; bugs that jump out because of reordering will make your head hurt and then result in lots of wailing and gnashing of teeth. 

Stop the Madness!

Now that we know that race conditions are what make our code wrong, how do we fix it? JSR-133 kindly gives us a beacon shining in the dark by defining Java's memory model, complete with the effects of memory synchronization points (and now you know why the keyword is synchronized and not something like critical or atomic, in case you're hanging out with people bored enough to ask!). I'll get into the JMM in a bit, but first a tip that'll make dealing with it (and getting your code correct in general) easier.

Limit Your Scope!

The more restrictive the scope of your variables, the easier it is to reason about what modifies them and when. For example, if you don't need a variable to be at the class level (static), make it an instance variable and you've instantly reduced the ways that your variable can be messed with; if the object isn't shared, the variable gets thread-safety for free.

You can also limit your scope from a threading perspective (though not in the sense of making the overall code easier to get right) by using ThreadLocal variables. These nifty gadgets have been around since the early days (Java 1.2), but they're not very commonly used because they're kind of weird. Looking at them, they just look like an object that holds another object, but the catch is that each thread ends up with its own value, so a value one thread sets won't be visible to any other thread. The upshot is that if all you're modifying are thread locals, you do not need to synchronize, since you're operating on memory that is not shared.
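
A common example, sketched below: SimpleDateFormat isn't thread-safe, so instead of synchronizing around one shared instance, give each thread its own via a ThreadLocal (the field and method names are mine).

import java.text.SimpleDateFormat;
import java.util.Date;

private static final ThreadLocal<SimpleDateFormat> FORMATTER =
      new ThreadLocal<SimpleDateFormat>() {
         @Override
         protected SimpleDateFormat initialValue() {
            return new SimpleDateFormat("yyyy-MM-dd");   // one formatter per thread
         }
      };

static String format(Date date) {
   return FORMATTER.get().format(date);   // each thread only ever sees its own formatter
}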

Controlling Time

The Java Memory Model (JMM from here on out) defines its ordering in terms of happens-before and happens-after relationships, which are exactly what they sound like. If x happens before y, then if you've seen the effects of y, you'll also see the effects of x. This gives us a pretty simple way to think about our programs without having to mentally run through every possible interleaving of threads.

We get our happens-before (and happens-after) relationships using locks, synchronized blocks and volatiles (oh my!). Volatile variables are a little bit weird, so let's start simple with locking and synchronized blocks. When you acquire a lock (or enter a synchronized block), anything that happened before the lock was last released on any other thread will have happened; i.e. you've established a happens-before relationship. When you release the lock (or exit the synchronized block), then everything that you've done up to that point is done and visible to other threads.

Volatiles are kind of weird in that a read from one has the same memory effects as acquiring a lock and writing to them has the same effect as a memory release. That means you can have the memory effects of acquiring a lock without the effects of releasing one, the effects of releasing a lock without acquiring it, or even the effects of a release before an acquire (though I can't think of any sane reason why you'd want to do the last of these). 

Where volatile variables are most frequently used are as signalling variables, e.g. to tell a thread that it's time to start shutting down. As an example of using a volatile as a signalling variable, in the following code, if the variable shuttingDown is not declared volatile, then it's entirely correct for the JVM to go into an infinite loop.

while(!shuttingDown) {
   doSomeStuff();
}

In this post, we've gone over the basics of making the code correct. In the next post (which hopefully will be up in a few days), I'll go over some issues to keep in mind to actually make it fast.