Thursday, June 26, 2014

Storing Strings, Succinctly

Alas, I lack an alliterative algorithm for squashing strings into less space, but hopefully I can make it up to you by sharing an improvement on a simple way of storing strings. To make it easier to follow, I'll build up to the full algorithm, step-by-step. 

Step 1: Reduce Randomness

Compression algorithms generally work better if you can reduce the randomness in your data. That may seem impossible (after all, the information is what the information is), but it can be done. The way we're going to reduce randomness is by sorting our strings, i.e. putting them in alphabetical order. As an example, if we have the strings {cat, carrot, car}, we can store them as [car, carrot, cat]. That may not seem like it helps, but it does.
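
To make that concrete, here's a tiny Python sketch (the word list is just the example above); sorting really is the entire step:

    # Sorting puts words that share prefixes next to each other,
    # which is exactly what the next step exploits.
    words = ["cat", "carrot", "car"]
    words.sort()
    print(words)  # ['car', 'carrot', 'cat']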

Step 2: Remove Redundancy

Looking at our list above ([car, carrot, cat]), we see that "car" is a prefix of "carrot", and that "carrot" and "cat" share the prefix "ca". We can use that information to cut down on what we have to store by recording the length of the shared prefix and then writing just the suffix. What we end up with is [c,a,r,3,r,o,t,2,t], where 3 and 2 are the shared prefix lengths. Now, you may be asking how we can tell when one word ends and the next begins, and if you are, good, because that's a great question! One approach, which will probably be warmly embraced by C programmers, is to null-terminate the strings, i.e. end them with a 0 (that's a zero, not an O). That approach works pretty well and leaves us with [c,a,r,0,3,r,o,t,0,2,t].
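
Here's a rough Python sketch of that idea; the function names are mine, and it encodes each word against the previous one purely to illustrate the format, not as the actual implementation:

    def shared_prefix_len(a, b):
        # Number of leading characters the two strings have in common.
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n

    def encode_null_terminated(sorted_words):
        # The first word is stored whole; each later word stores the length
        # of the prefix it shares with the previous word, then just its
        # suffix. Every word ends with a 0 byte.
        out = bytearray()
        prev = ""
        for i, word in enumerate(sorted_words):
            if i > 0:
                n = shared_prefix_len(prev, word)
                out.append(n)
                out.extend(word[n:].encode("ascii"))
            else:
                out.extend(word.encode("ascii"))
            out.append(0)
            prev = word
        return bytes(out)

    # encode_null_terminated(["car", "carrot", "cat"])
    # -> b'car\x00\x03rot\x00\x02t\x00'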

Step 3: Look at Lengths

English words, by and large, aren't terribly long. With one byte, like the one we used to store the length of the shared prefix or the one with which we terminate strings, we can store numbers up to 255. Most words aren't anywhere near that long and, unless you're dealing with protein names, you're not likely to encounter any of the few that are. To take advantage of that, instead of null-terminating, let's store the length of the prefix and the length of the suffix. Since we're storing lengths, we can also take advantage of words being short by packing the prefix and suffix lengths together in one byte. If we devote 4 bits to the length of the prefix, we can avoid storing up to 15 bytes of prefix; if the shared prefix is longer than that, we just store the extra characters again as part of the suffix. (4 bits isn't a magic number; I arrived at it through trial and error using a list of 99,171 English words. My initial guess at a good number was 3 bits.) That leaves us with another 4 bits to store the suffix length, and a bit of a quandary if the remaining suffix is more than 16 bytes long (the limit would be 15 but, since the suffix must be at least 1 byte, we trade away the ability to store 0 and store the length minus 1 instead). Thankfully, another compression trick, called (among several other things) variable byte encoding, comes to our rescue.
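
As a sketch of that byte layout (my own reading of the split described above, with the prefix in the high nibble), packing the two lengths looks something like this, with the overflow case deferred to the variable byte encoding below:

    def pack_lengths(prefix_len, suffix_len):
        # prefix_len has already been capped at 15 by the caller; the suffix
        # length is stored minus 1, since a suffix is never empty. A stored
        # value of 15 means "15 or more", and the overflow is written with
        # the variable byte encoding described next.
        return (prefix_len << 4) | min(suffix_len - 1, 15)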

The standard way that variable byte encoding works is to use one bit of each byte to encode whether there are more bytes in the number. We're going to deviate from that just a bit by first using up to the entire 4 bits to store the length. If the length is 15 or more, we store 1's in all 4 bits, subtract 15 from what remains, and then use standard variable byte encoding to store the rest (even if that's a byte of all zeros).
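
Putting the pieces together, here's a hedged sketch of the whole encoder in Python. The escape value of 15, the minus-1 offset on the suffix length, and the choice to re-store over-long prefixes as part of the suffix are my reading of the scheme above, not a reference implementation:

    from os.path import commonprefix

    def write_varbyte(out, n):
        # Standard variable byte encoding: 7 bits of payload per byte,
        # with the high bit set when more bytes follow.
        while True:
            if n < 0x80:
                out.append(n)
                return
            out.append((n & 0x7F) | 0x80)
            n >>= 7

    def encode(sorted_words):
        # Assumes the words are sorted and de-duplicated, so each suffix is
        # at least one byte long.
        out = bytearray()
        prev = ""
        for word in sorted_words:
            prefix = min(len(commonprefix([prev, word])), 15)
            suffix = word[prefix:].encode("ascii")
            s = len(suffix) - 1              # suffix length, offset by 1
            if s >= 15:
                out.append((prefix << 4) | 15)
                write_varbyte(out, s - 15)   # the overflow, even if it's 0
            else:
                out.append((prefix << 4) | s)
            out.extend(suffix)
            prev = word
        return bytes(out)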

Results

So, what has all this bit-twiddling bought us? Encoding 99,171 words that originally took a total of 938,848 bytes brings them down to 323,370 bytes. That's still not tiny, but it's roughly a 66% reduction in size, about a third of the original. Gzip gets the list down to about 256,000 bytes, which is considerably smaller, but this scheme has the benefit that it should be fairly easy to implement so that it encodes and decodes faster than gzip and permits something much closer to random access to individual elements.

Hey, what do you know, I did manage to describe the algorithm using alliteration! Bonus! I hope this has been at least slightly entertaining, I hope it has been useful, and I hope it has been informative.