Sunday, May 19, 2013

Multithreading & Performance

If you're writing multithreaded code, you're probably doing it to improve performance; either that, or you know who'll be maintaining the code and you really don't like them. That makes it all the more painful that making code multithreaded not only can fail to help, but can actually hurt performance if you're not careful. I'll try to cover some of the common performance pitfalls here and offer some recommendations for fixing them.

Excessive Synchronization

By definition, only one thread can hold a mutex (e.g. be in a particular synchronized block) at once, so whenever you synchronize you serialize your code. From that, we can see that the more of your code is under a mutex, the less benefit you're going to get from using multiple threads. This gotcha seems to be fairly common in web applications, especially in its most insidious form, the static synchronized method. When you see static synchronized in a web application, it's telling you that for the duration of that method you can have exactly one concurrent user; not exactly the height of scalability. Synchronized methods are painful because they acquire the mutex on method entry and don't release it until the method is done; adding to the cost is that they synchronize on this, so you can only be in one synchronized method of that instance at a time. Synchronized static methods kick it up a notch by synchronizing on the actual class object, so that only one thread can be in any static synchronized method of the class, as the sketch below illustrates.
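To make the locking concrete, here's a contrived example of my own (HitCounter is a hypothetical class, not from any real application) showing what each flavor of synchronized actually locks on:

public class HitCounter {
   private static long totalHits = 0;
   private long myHits = 0;

   // Locks HitCounter.class: only one thread in the entire JVM can be
   // in ANY static synchronized method of this class at a time.
   public static synchronized void recordHit() {
      totalHits++;
   }

   // Locks 'this': only one thread can be in any synchronized
   // instance method of a given HitCounter at a time.
   public synchronized void recordMyHit() {
      myHits++;
   }
}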

There are two aspects to avoiding excess synchronization. The first, and easiest, is to not hold a lock if you don't need to. For example, a method that splits a string on commas shouldn't need synchronized (unless it's doing so in a streaming fashion so that it's maintaining state whose scope is outside the method).
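For instance, the synchronized keyword below buys nothing (this splitter is a made-up example of mine): the method touches only its parameter and local variables, so each thread already works on its own private data and there's no shared state to protect:

public class Splitter {
   // synchronized here only serializes callers; the method has no
   // shared state, so dropping the keyword is both safe and faster.
   public synchronized String[] splitOnCommas(String input) {
      return input.split(",");
   }
}

Simply deleting synchronized makes the method fully concurrent without changing its behavior.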

The second, and more complicated, aspect is making your locks finer-grained; i.e. introducing additional lock objects so that each one protects a smaller portion of the code or data. A concurrent hash map is probably the canonical example of replacing a coarse lock with fine locks. A thread-safe hash map could use one lock to protect its state, but that would limit interaction with it to one thread at a time. A more scalable implementation may use a separate lock per bucket, so that different threads can access the map simultaneously as long as they're not accessing the same bucket.
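Here's a minimal sketch of the per-bucket idea; the StripedMap class is my own toy illustration, and real implementations like java.util.concurrent.ConcurrentHashMap are considerably more sophisticated:

import java.util.HashMap;
import java.util.Map;

public class StripedMap<K, V> {
   private static final int STRIPES = 16;
   private final Object[] locks = new Object[STRIPES];
   private final Map<K, V>[] buckets;

   @SuppressWarnings("unchecked")
   public StripedMap() {
      buckets = (Map<K, V>[]) new Map[STRIPES];
      for (int i = 0; i < STRIPES; ++i) {
         locks[i] = new Object();
         buckets[i] = new HashMap<>();
      }
   }

   // Hash the key to a stripe; different keys usually land on
   // different stripes, so threads rarely block each other.
   private int stripeFor(Object key) {
      return (key.hashCode() & 0x7fffffff) % STRIPES;
   }

   public V put(K key, V value) {
      synchronized (locks[stripeFor(key)]) {   // locks one stripe only;
         return buckets[stripeFor(key)].put(key, value);
      }
   }

   public V get(K key) {
      synchronized (locks[stripeFor(key)]) {   // other stripes stay available
         return buckets[stripeFor(key)].get(key);
      }
   }
}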

Sharing is (Not) Caring

Writing to shared state from multiple threads will lead to at least one of two things (on a multi-core system):
1) Lots of memory traffic and cache misses (poor performance)
2) Incorrect results because you don't have proper synchronization

If you're really unlucky, you get both poor performance and incorrect results. A simple example of where you may run into this is counters, whether visit counters, performance counters, or even simply summing the elements of a large list. Take the following code, for example:


import java.util.List;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class Summation {
   private int sum = 0;
   private final Lock lock = new ReentrantLock();

   private class SumCalculator extends Thread {
      private final List<Integer> list;
      private final int start, end;

      public SumCalculator(List<Integer> list, int start, int end) {
         this.list = list; this.start = start; this.end = end;
      }

      // Every addition takes the shared lock, so the threads
      // serialize on it and fight over the same cache line.
      @Override public void run() {
         for (int i = start; i < end; ++i) {
            lock.lock();
            try { sum += list.get(i); } finally { lock.unlock(); }
         }
      }
   }
}

The sum will be correctly calculated by the SumCalculator threads (assuming the list is split up correctly), but it's highly likely that performance will be as bad as, if not significantly worse than, calculating the sum in a single thread. One very simple change, however, will dramatically improve performance: if the sum is moved into the thread (so that its scope is limited to a single SumCalculator), then there's no need for locking and there's less traffic on the memory bus. Of course, you still need to add up the threads' individual sums, which could take a while if there are a lot of threads, but in that case the same approach can be applied to calculate their sums in parallel (at which point you should be looking at fork-join, since it's well suited to divide-and-conquer problems like that).
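Here's roughly what that change looks like (a sketch under my own assumptions about how the list gets partitioned; the ParallelSummation name and its sum helper are illustrative, not from the original):

import java.util.List;

public class ParallelSummation {
   private static class SumCalculator extends Thread {
      private final List<Integer> list;
      private final int start, end;
      private long partialSum = 0;   // private to this thread: no lock needed

      SumCalculator(List<Integer> list, int start, int end) {
         this.list = list; this.start = start; this.end = end;
      }

      @Override public void run() {
         for (int i = start; i < end; ++i) {
            partialSum += list.get(i);   // no shared state, no contention
         }
      }
   }

   public static long sum(List<Integer> list, int threads) throws InterruptedException {
      SumCalculator[] workers = new SumCalculator[threads];
      int chunk = list.size() / threads;
      for (int t = 0; t < threads; ++t) {
         int end = (t == threads - 1) ? list.size() : (t + 1) * chunk;
         workers[t] = new SumCalculator(list, t * chunk, end);
         workers[t].start();
      }
      long total = 0;
      for (SumCalculator w : workers) {
         w.join();               // wait for the thread, then fold in its sum
         total += w.partialSum;
      }
      return total;
   }
}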

If you've been reading the Javadoc for java.util.concurrent.atomic, you may be wondering why not just use AtomicInteger, since that would remove the need for the lock. The problem with AtomicInteger in this case is that it would probably perform even worse than the locking approach, because it wouldn't reduce memory traffic (in fact, it may greatly increase it) and it would be highly contended. The Atomic* classes use Compare-and-Set (CAS) instead of locking. What that means is that addAndGet (and all of the other mutating methods) gets the value, applies the mutation and then atomically sets the value only if it hasn't changed since the get. In pseudocode, it looks like this:

while (true) {
   origValue = getValue();
   newValue = origValue + delta;
   atomic {
      if (origValue == getValue()) {
         setValue(newValue);
         break;
      }
   }
}


As you may have guessed, if the value is changing frequently that loop may execute many, many times, leading to potentially very poor performance.
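In real Java, you can write the same retry loop yourself against AtomicInteger's public API (get and compareAndSet both exist on it); this sketch mirrors how addAndGet has classically been implemented, though modern JVMs may compile it down to a single hardware instruction:

import java.util.concurrent.atomic.AtomicInteger;

public class CasAdd {
   // Retry until no other thread changed the value between our
   // get and our compareAndSet; each failed CAS is a wasted trip.
   static int addAndGet(AtomicInteger value, int delta) {
      while (true) {
         int origValue = value.get();
         int newValue = origValue + delta;
         if (value.compareAndSet(origValue, newValue)) {
            return newValue;
         }
      }
   }
}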
