Sunday, February 1, 2015

Why Go Columnar?

If you've had to deal with storing data over the past few years, there's a good chance that you've heard people talking about columnar data stores. A lot of the NoSQL data stores are columnar and most of the old guard RDBMS vendors are even adding support for hybrid columnar/row-based data stores. It's gotten to the point that if you announced the release of a database that doesn't have at least hybrid columnar support you'd have people either collapsing from a terrible case of the vapors or deriding your database as a dreadfully gauche homage to obsolescence! I have no idea why you'd announce a new database to a room full of art critics and people from antebellum Georgia.

You may be wondering why so many people care so much about columnar stores, considering that, whether you store data in a columnar format or a row-wise one, you're still storing the same amount of data. Before I get into the gory details of how a columnar store can be better than a row-based store, though, I'm going to explain just what it means for something to be a columnar data store.

For this example, we'll use the following database of people with their surname, first name and year of birth.

{
   ("Addams", "Morticia", 1977), 
   ("Addams", "Gomez", 1973), 
   ("Addams", "Pugsley", 2000),
   ("Addams", "Wednesday", 2003)
}

In a non-columnar (or row-wise) store, each record is stored contiguously in one block. It would look something like this:

Addams0Morticia0x07B9Addams0Gomez0x07B5Addams0Pugsley0x07D0Addams0Wednesday0x07D3

What makes a columnar store different is that it splits up the records and stores each type of data together; i.e. it stores all of the surnames together, all of the first names together and all of the years of birth together. In a columnar store, the data would be stored like this:

Addams0Addams0Addams0Addams0Morticia0Gomez0Pugsley0Wednesday0x07B9x07B5x07D0x07D3
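To make the two layouts concrete, here's a minimal Python sketch that serializes the sample records both ways, using a 0 byte as the string terminator and big-endian 2-byte years to match the byte strings above:

```python
import struct

records = [
    ("Addams", "Morticia", 1977),
    ("Addams", "Gomez", 1973),
    ("Addams", "Pugsley", 2000),
    ("Addams", "Wednesday", 2003),
]

def row_wise(recs):
    """Serialize each record contiguously: surname, first name, year."""
    out = b""
    for surname, first, year in recs:
        out += surname.encode() + b"\x00" + first.encode() + b"\x00"
        out += struct.pack(">H", year)  # 2-byte big-endian year
    return out

def columnar(recs):
    """Serialize one column at a time: all surnames, then names, then years."""
    surnames = b"".join(s.encode() + b"\x00" for s, _, _ in recs)
    firsts = b"".join(f.encode() + b"\x00" for _, f, _ in recs)
    years = b"".join(struct.pack(">H", y) for _, _, y in recs)
    return surnames + firsts + years

print(len(row_wise(records)), len(columnar(records)))  # 69 69
```

Both layouts come out to exactly the same 69 bytes; only the arrangement of those bytes differs.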

Now that we've covered what a columnar store is, you're probably thinking either "OK, that's nice. So what?" or "Are you mad?! You've lost spatial locality and will have to do multiple reads to pull a single record!" Ultimately, it comes down to economics: the difference in speed and cost between RAM and hard drives (even solid state drives) is astonishing. A 2 TB hard drive will set you back about $100 and can be found in fairly low-end PCs. 2 TB of RAM will set you back in the ballpark of $50,000, and you'll only see it in high-end servers. On the other hand, RAM is roughly a million times faster than a spinning disk. Even though the columnar store and the row-wise store hold the same amount of data, a closer examination shows that the columnar store has its redundancy more tightly packed. We can use that to encode the columns separately and save space. For example, there's only one distinct value in the surname column, so we can store it once and encode each occurrence as a 1, giving us:

Addams11111Morticia0Gomez0Pugsley0Wednesday0x07B9x07B5x07D0x07D3
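That trick is a dictionary encoding: each distinct value in a column gets a small integer code, and the column then stores only the codes. A minimal sketch (with codes starting at 1, as in the string above):

```python
def dict_encode(values):
    """Map each distinct value to a small integer code; store only the codes."""
    codes, dictionary = [], {}
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary) + 1  # codes start at 1
        codes.append(dictionary[v])
    return dictionary, codes

dictionary, codes = dict_encode(["Addams", "Addams", "Addams", "Addams"])
print(dictionary, codes)  # {'Addams': 1} [1, 1, 1, 1]
```

With only one distinct surname, the six-byte string is stored once and each record pays a single byte for its code.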

Since we're storing our data in a columnar fashion, we can also take advantage of being able to use different encodings for different fields. For example, the year of birth field's minimum value is 1973, so instead of having to store the years as 2-byte values, we can rebase them so that 1973 becomes 0 and each year fits in a single byte, leaving us with:

Addams11111Morticia0Gomez0Pugsley0Wednesday0x04x00x1Bx1E
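This is often called frame-of-reference encoding: store the column's minimum once, then store each value as an offset from it. For these four years the offsets are 4, 0, 27, and 30 (0x04, 0x00, 0x1B, 0x1E), each fitting in one byte:

```python
def rebase(values):
    """Frame-of-reference encoding: a base plus one-byte offsets."""
    base = min(values)
    return base, bytes(v - base for v in values)

base, offsets = rebase([1977, 1973, 2000, 2003])
print(base, offsets.hex())  # 1973 04001b1e
```

Decoding is just `base + offset`, so individual values can be recovered without touching their neighbors.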

By applying those few simple transformations, we've cut the size by about 30%, even on this tiny data set. We could compress it further still by run-length encoding the surname column (so we'd just encode Addams4), but the simple dictionary encoding has the benefit of letting us operate on the compressed data directly, which is significantly more efficient in terms of both speed and memory usage.
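As a sketch of what "operating on the compressed data" looks like with the dictionary encoding above: an equality filter on the surname column becomes one dictionary lookup followed by integer comparisons, and no strings are ever decoded. (The `filter_eq` name here is mine, for illustration.)

```python
def filter_eq(dictionary, codes, wanted):
    """Return the indices of records whose encoded value matches `wanted`."""
    code = dictionary.get(wanted)
    if code is None:
        return []  # the value appears nowhere in the column
    return [i for i, c in enumerate(codes) if c == code]

dictionary = {"Addams": 1}
codes = [1, 1, 1, 1]
print(filter_eq(dictionary, codes, "Addams"))   # [0, 1, 2, 3]
print(filter_eq(dictionary, codes, "Munster"))  # []
```

A value that isn't in the dictionary can be rejected without scanning the column at all, which is one reason dictionary-encoded columns filter so quickly.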

Now that we know how this works in theory, here's how it works in practice.

Encoding                     Size in Bytes
Raw Non-columnar Data           10,192,244
GZipped Non-columnar Data        1,031,028
Raw Columnar Data               10,192,244
GZipped Columnar Data               90,464

The columnar layout compressed an order of magnitude better than the row-wise one! Of course, we've got the obligatory caveat that your mileage may vary, depending on your data and the compression algorithm that you choose. I wouldn't recommend using gzip in general, since you have to decompress everything up to the record you want just to retrieve it (you can ameliorate that somewhat by compressing in blocks, which puts an upper bound on how much you have to decompress); I only used gzip because it was convenient.
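If you want to reproduce the flavor of that experiment, here's a rough sketch on synthetic data (the extended monster-family names and the record count are made up for illustration). The exact sizes depend on your data and seed, but the columnar layout should gzip substantially better:

```python
import gzip
import random
import struct

random.seed(1)  # make the synthetic data reproducible
surnames = ["Addams", "Munster", "Frankenstein", "Dracula", "Talbot"]
firsts = ["Morticia", "Gomez", "Pugsley", "Wednesday", "Herman",
          "Lily", "Eddie", "Marilyn", "Grandpa", "Larry"]
records = [(random.choice(surnames), random.choice(firsts),
            random.randint(1900, 2005)) for _ in range(10_000)]

# Same bytes, two arrangements.
row_wise = b"".join(s.encode() + b"\x00" + f.encode() + b"\x00" +
                    struct.pack(">H", y) for s, f, y in records)
columnar = (b"".join(s.encode() + b"\x00" for s, _, _ in records) +
            b"".join(f.encode() + b"\x00" for _, f, _ in records) +
            b"".join(struct.pack(">H", y) for _, _, y in records))

print(len(gzip.compress(row_wise)), len(gzip.compress(columnar)))
```

Grouping like values together hands the compressor long runs of nearly identical bytes, which is exactly what dictionary-based compressors like gzip feast on.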

Of course, like many technologies, columnar stores aren't all rainbows & sunshine. If you recall, I mentioned earlier that storing data in a columnar format wreaks havoc on spatial locality. Since spatial locality is so important to performance on modern computers, what do we do? Lose the performance benefits of locality? Lose the performance benefits of storing more of our data in RAM? No! We rebel! We reject rigid hierarchies! We embrace the hybrid approach! Ahem, sorry, I got a bit carried away there. If we stick to compression schemes that don't require decoding the first n-1 records to get the value of the nth record (or, better still, that let us work with the data in its compressed form), we can encode the columns as if they were being stored in a columnar format, but then write each record's encoded values out together. This way, we end up with a data store that looks like this:

Addams11Morticia0x041Gomez0x001Pugsley0x1B1Wednesday0x1E
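Here's a sketch of that hybrid encoding in Python: the analysis passes run column by column (building the surname dictionary, finding the year base), but the encoded bytes are written out record by record. In a real store, the dictionary and the base year would live in a header alongside the data:

```python
records = [("Addams", "Morticia", 1977), ("Addams", "Gomez", 1973),
           ("Addams", "Pugsley", 2000), ("Addams", "Wednesday", 2003)]

# Column-wise analysis passes.
dictionary = {}
for s, _, _ in records:
    dictionary.setdefault(s, len(dictionary) + 1)
base = min(y for _, _, y in records)

# Row-wise output pass: each record's encoded fields stay contiguous.
out = b""
for s, f, y in records:
    out += bytes([dictionary[s]]) + f.encode() + b"\x00" + bytes([y - base])

print(out.hex())
```

Each record shrinks from columnar-style encodings, yet all of its fields still sit next to each other on disk, preserving spatial locality for record-at-a-time reads.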

Using this storage scheme does have the downside of making it harder to scan a column for a particular value, but ultimately that's why we have indexes.




