Thursday, October 13, 2011

RIP dmr

Dennis Ritchie has passed away, and the world is a lesser place for it.



#include <stdio.h>

int main(void) {
    printf("Goodbye, dmr, and thanks for everything\n");
    return 0;
}

Monday, February 14, 2011

High-Level Hadoop

At a high level, MapReduce converts each input record into zero or more intermediate records, and then combines those intermediate records into the final output records. The way that it does this can seem like magic, since so much of it is handled by the framework. Hopefully, this will demystify some of what's going on.

You start by configuring the job with a Configuration. The configuration tells Hadoop things like the location of your input, where to put your output, which classes implement your mapper and reducer, and how the input and output are formatted.
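To make that concrete, here's a rough sketch (in Scala, against the org.apache.hadoop.mapreduce API) of a driver for the classic word-count job; the WordCountMapper and WordCountReducer names are placeholders for the classes sketched further down, not anything Hadoop provides:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}
import org.apache.hadoop.mapreduce.lib.output.{FileOutputFormat, TextOutputFormat}

object WordCountDriver {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val job = new Job(conf, "word count")

    // which classes implement the mapper and reducer
    job.setJarByClass(classOf[WordCountMapper])
    job.setMapperClass(classOf[WordCountMapper])
    job.setReducerClass(classOf[WordCountReducer])

    // how the output records are typed
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])

    // how the input is read and the output is written
    job.setInputFormatClass(classOf[TextInputFormat])
    job.setOutputFormatClass(classOf[TextOutputFormat[Text, IntWritable]])

    // where the input lives and where the output should go
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))

    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}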

The input formatter and its record reader take the input file (or files) and split them into records that are passed to the map class.1

For each record that the record reader produces, map gets called once to produce its output records. Those records get stored in memory for a while, and then written to disk when memory runs low or the map phase is finished.2
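For word count, for example, a mapper along these lines (a sketch in Scala against the same API, not production code) turns each line of input into one (word, 1) record per word:

import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

// Called once per input record (here, a line of text); emits a (word, 1) pair per word.
class WordCountMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    for (w <- value.toString.split("""\s+""") if w.nonEmpty) {
      word.set(w)
      context.write(word, one)
    }
  }
}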

Once the mappers are done, their output gets sorted by key, so that the records group together by key. Then those records are split up by key and sent off to reducers (all of the values for any given key will go to the same reducer).

The reducer then gets called with a key and all of the values for that key (as an Iterable, in the Context-based API). It does its thing with those values and writes its results to its Context, which sends them to the output formatter, which deals with persisting them.
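A matching word-count reducer might look something like this sketch, which just sums the counts it is handed for each word:

import java.lang.{Iterable => JIterable}
import scala.collection.JavaConverters._
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Reducer

// Called once per key with every value emitted for that key; writes the total count.
class WordCountReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: JIterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    val sum = values.asScala.map(_.get).sum
    context.write(key, new IntWritable(sum))
  }
}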

Footnotes:
1. The files in HDFS are split into chunks and replicated across the cluster. For performance, Hadoop will try to run the mappers on the same machines that house the blocks of the file. Unless you're writing your own input format, this will be transparent to you.

2. You can also add a combiner, which does the same thing as the reducer and, in fact, can be the same class as your reducer. Adding one can dramatically improve performance by reducing the amount of data that has to be sorted and distributed to the reducers.
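For the word-count sketch above, the reduce operation is just addition, so the reducer can double as the combiner with one extra line in the hypothetical driver:

// Pre-aggregate each mapper's output locally, so far less data gets sorted and shipped.
job.setCombinerClass(classOf[WordCountReducer])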

Sunday, February 6, 2011

Know When to Fold 'Em

The Map & Reduce approach to dealing with large data sets, as implemented in Hadoop, has its background in functional programming, where both are used to manipulate the elements of a collection. Specifically, map is used to transform one element from the collection to some other type, for example to turn a string representation of an integer into an actual integer. Reduce, on the other hand, is used to combine all of the elements of the collection into a single result (which may, in turn, be another collection).

With just these two “primitives”, you can do a lot of powerful number crunching (or string manipulation, as the case may be), but what if you've got a lot of data coming in constantly and you need to be able to answer questions quickly? (Hadoop tends to have the throughput of a delivery truck full of hard drives and the latency of a delivery truck full of hard drives.)

First, let's tackle the delivering-results-quickly part of the problem. Traditional RDBMSs handle this by building indices on the data for the fields that are most frequently searched. Fortunately, there's absolutely nothing that says we can't do that with MapReduce. By building indices with MapReduce, we can answer the questions we know about ahead of time much more quickly. We can also precompute things like sums and averages on the data, so that answering analytics questions is just a simple value lookup.
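As a toy illustration (plain in-memory Scala rather than an actual Hadoop job, with a made-up Event type), precomputing a per-key sum and count turns "what's the average latency for this service?" into a simple lookup:

// A sketch of precomputing aggregates so analytics queries become lookups.
case class Event(service: String, latencyMs: Long)

def buildIndex(events: Seq[Event]): Map[String, (Long, Long)] =
  events
    .groupBy(_.service)                                     // group records by key
    .map { case (svc, evts) =>
      svc -> (evts.map(_.latencyMs).sum, evts.size.toLong)  // (sum, count) per key
    }

def averageLatency(index: Map[String, (Long, Long)], service: String): Option[Double] =
  index.get(service).map { case (sum, count) => sum.toDouble / count }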

Now, on to the constant-influx-of-data problem. In functional programming, reduce has a “sister” function called fold that does the same thing as reduce, except that it takes a starting value to reduce (or fold) into. With fold, you can take the results that you calculated earlier and incorporate them into the new data (or vice versa). So, right now you may be thinking, “Well hello, Mr. Fancy Pants. I've got news for you, pal: Hadoop doesn't support fold, just reduce!” This is where the relationship between fold and reduce comes in (warning, LISP-like pseudocode follows): (fold f startingValue list) has the same effect as (reduce f (cons startingValue list)). We can take advantage of that equivalence by having a mapper re-emit the previous run's results as input to a reducer, which combines them with the results from the new data.
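Here's that equivalence in Scala (previousTotal and newCounts are made-up values standing in for last run's output and the newly arrived data):

val previousTotal = 42               // the result computed on the last run
val newCounts     = List(3, 1, 4, 1) // values derived from the newly arrived data

// fold: combine the new data into the previously computed result...
val foldedTotal  = newCounts.foldLeft(previousTotal)(_ + _)

// ...is the same as reduce over the old result prepended to the new data,
// which is exactly what a mapper that re-emits last run's output achieves.
val reducedTotal = (previousTotal :: newCounts).reduce(_ + _)

assert(foldedTotal == reducedTotal)  // both are 51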

Wednesday, January 26, 2011

Not Your Type

Not too long ago, I was asked by a friend why I like Scala's type system and how it helps develop software faster (relative to dynamically typed languages, like Ruby). The short answer is that it doesn't; the slightly longer answer is that that isn't the point. An effective type system is more about making the computer help you get your code right, because throwing $125,000,000 down the drain when one person uses metric and the other uses English units looks bad on the next budget request.

I suspect that most detractors of type systems are arguing from the perspective of weak type systems (like Java's), which really do get in the way as much as, if not more than, they help. A more powerful type system (e.g. Haskell's or Scala's), however, can actually turn some subtle mistakes into compile errors. As an example, here's some Scala that makes Velocity the type you get by dividing Meters by Seconds:


class Meters(val dist: Float) {
  def /(time: Seconds): Velocity = new Velocity(dist / time.magnitude)
}

class Seconds(val magnitude: Float)

class Velocity(val magnitude: Float) {
  override def toString() = magnitude + " m/s"
}

class Feet(val magnitude: Float)
With that definition, you can define something of type Meters, divide it by something of type Seconds, and get a Velocity. Ok, so far so good, but that doesn't seem to be that helpful... until you find that you cannot mix the Meters with a bare, unitless number or accidentally with Feet! Now, you may be thinking, "Hey, that's still getting in my way! I have a parameter in Feet and I don't want to bother changing it to meters!" That's where the second trick up Scala's sleeve comes in: a way to implicitly convert from Feet to Meters.
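For example, with only the definitions above in scope, neither of these lines will compile, because no / is defined for those operand types (the numbers are arbitrary):

// new Meters(1000) / 60F              // error: a bare Float carries no units
// new Meters(1000) / new Feet(196.8F) // error: Feet is not Seconds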

By adding this little snippet of code:
implicit def feetToMeters(feet: Feet) = new Meters(feet.magnitude * 0.3048F)
we can now divide Feet by Seconds and get Velocity in meters / second!
So now, these two statements print the same value:
println(new Meters(1000) / new Seconds(60))
println(new Feet(3280.8399F) / new Seconds(60))


And helping avoid turning a Mars Orbiter into a Mars Crasher is just one more reason why I really like Scala.

Saturday, January 8, 2011

QuickBuzz

A few of my friends use FizzBuzz to screen candidates, and that's a good thing, because it's really good at weeding out the completely incompetent. The problem comes when you need someone who's more than just competent; you need someone who actually knows something. In those cases, I propose using Quicksort as a second-level FizzBuzz problem (or even a replacement for FizzBuzz). Quicksort is simple enough that it can be implemented in almost no time and almost no code by anyone who knows their programming language. It also has the benefit of opening the door to follow-on questions, like how you avoid the algorithm taking O(n^2) time when the list is already in order, or what you do if the list is too big to fit into main memory.
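For reference, here's the kind of minimal sketch (in Scala, and deliberately the simple, not-in-place version) a candidate ought to be able to produce in a few minutes; its choice of the head as the pivot is also exactly what makes already-sorted input degrade to O(n^2), which tees up the follow-on questions nicely:

// Simple functional quicksort: not in-place, head element as the pivot.
def quicksort(xs: List[Int]): List[Int] = xs match {
  case Nil => Nil
  case pivot :: rest =>
    val (smaller, larger) = rest.partition(_ < pivot)
    quicksort(smaller) ::: pivot :: quicksort(larger)
}

// quicksort(List(3, 1, 4, 1, 5, 9, 2, 6)) == List(1, 1, 2, 3, 4, 5, 6, 9)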

Wednesday, January 5, 2011

The New Alchemy

Software development has given rise to entirely new alchemists, with entirely new approaches, including the "Powerful Process" and "Magical Language" schools. The one thing both schools have in common is a belief that there exists some incantation or ritual that will transmute incompetent developers into great developers. They'd be better served trying to turn hydrogen into gold, since that's at least physically possible (if horrifically impractical). Let's look at these schools a little more closely, and we'll see why ITIL and CMMI (and a lot of "agile" processes!) are incompatible with Agile, and why Paul Graham was right and wrong about LISP being a secret weapon in software development.

Arcane Rituals: The Powerful Process School
Acolytes of this school believe that if they can just find the one true process and follow it with sufficient rigor, they will be guaranteed success in all of their software development endeavors. The holy grail of this school would be to find the process that lets them say, with a high degree of certainty, that it will take x person-weeks to build a particular system, regardless of who comprises their development team. In effect, magically turning even incompetent developers into highly productive developers. It's this fixation on process over people that makes them ideologically incompatible with Agile. Sadly, even many proponents of Agile fall prey to these beliefs. You can tell if an "Agilist" has joined this school by observing their response to failed projects. If the response is to defend the process by saying, "You did it wrong! You didn't do thing x as prescribed by the process!", then you know you've met an agile alchemist.

Occasionally, processes are put into place that do succeed in making the composition of a team a minor factor in how long it takes to develop software. Unfortunately, this isn't accomplished by improving the worst developers, but by dragging everyone down to their level. There is some good news, though! The best developers aren't held down at severely depressed productivity for long; they leave, which makes the outcome of future efforts easy to predict by leaving the team with no chance of success.

Words of Power: The Magical Language School
Developers are often counted amongst the acolytes of this school, constantly searching for the programming language, IDE, or new technology that'll make hard problems easy to solve and easy to implement. That's certainly understandable, since a lot of things in software development are really hard, and many developers love tinkering with new things. Certain commercial entities exacerbate the problem by hyping every new thing that comes along, profiting from the sales of books, tools, and software. Unfortunately for developers (and fortunately for snake-oil salesmen), some problems are just hard! No tool or language will ever make the intrinsically hard problems easy; at best, they can stay out of the way and let the easy parts of the problem be easy.

Earlier, I mentioned that Paul Graham was right and wrong about LISP being a secret weapon, and you may be wondering just how that could be. To be sure, LISP is a very powerful language, very well suited to being extended with powerful DSLs, and someone comfortable with it can be extremely productive. That said, LISP alone isn't enough to make one team an order of magnitude more productive than another. What does let LISP make teams significantly more productive is that it's a great filter. Requiring LISP automatically rules out everyone other than really good developers. Since LISP is not a sure-fire path to a job, only people who actually care about software development know anything about it. On top of that, LISP has a fairly high barrier to entry, requiring an understanding of functional programming and recursion. Basically, any team that's using LISP is most likely a 10x team, and that is the secret weapon.

So, if neither languages nor processes can work the magic, what _can_ we use to improve our odds of success at software development? The answer is simple: hire really good developers who work well together. Then tell them what you need them to do, work with them to clarify any issues that come up, and make sure they find out as soon as possible when needs change.