Monday, January 11, 2016

Crypto Kindly

Encryption can be a daunting topic: with symmetric keys, asymmetric keys, public and private keys, RSA and Diffie-Hellman, HMACs and hashing algorithms, it can seem like too much for anyone to ever understand! To try to cut through the confusion, I've put together this brief overview of some of the relevant topics. To keep things approachable, I'm leaving out a lot of the details that you'd need to actually implement any of these algorithms, but it's usually best to use previously written and tested libraries anyway, since subtle things like differences in the time an operation takes to complete can be used to break encryption.

Hashing vs Encryption

The first thing to get out of the way is the difference between hashing and encryption. One of the most obvious differentiating factors between the two is that if you encrypt something you generally intend to get that information back, whereas if you hash it, you explicitly do not want to be able to get the data back from the hash. 

Another important difference between the two is that the output of a hash algorithm is a fixed size, regardless of how much or how little data you put through it; encryption, on the other hand, produces output that's at least as large as its input.

In a perfect world, a hash algorithm would have an irreversible (that is, given a hash it's impossible to determine what data produced it) mapping from input to output, with no two input values producing the same output value. Unfortunately, since there are infinitely many potential input values to a hash algorithm and only a finite number of output values (since the output length is fixed), for any particular input there may be infinitely many other input values that will produce the same output. When two different input values produce the same output value, we call that a collision. Making it as hard as possible to intentionally create a collision is one of the key design goals of hashing algorithms, since being able to create collisions at will leads to attacks against most of the uses of hashing. When security researchers refer to a hashing algorithm as being broken, what they usually mean is that it has become relatively easy to find collisions.

Hashing is most often used either to verify that a piece of data has not changed or to obfuscate data like passwords, so that you can verify that a given value matches a previously known value without having to store that value.

Verifying that data hasn't changed using a hash involves combining the hash with variants of encryption. If the party sending the data has a secret value that it shares with the party receiving the data, you can use a hash-based message authentication code (HMAC) to make a verifiable tag for the data. Basically, the way that HMAC works is by mixing the secret in with the data and then hashing the lot, potentially with multiple rounds of the hash algorithm. The receiving party can recalculate the tag using the shared secret and, if the tags match, know that whoever sent the data also had the shared secret.
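As a concrete illustration, here's a minimal sketch using Python's standard hmac module; the key and message values are just placeholders:

import hashlib
import hmac

shared_secret = b"a key both parties already know"   # placeholder value
message = b"the data being sent"

# Sender computes the tag and sends it along with the message.
tag = hmac.new(shared_secret, message, hashlib.sha256).digest()

# Receiver recomputes the tag from the data it received and compares using a
# constant-time check (compare_digest avoids the timing leaks mentioned in the intro).
is_authentic = hmac.compare_digest(tag, hmac.new(shared_secret, message, hashlib.sha256).digest())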

If you can't reliably share a secret between the sender & receiver, another option is to use asymmetric key encryption (I'll explain that in more detail a little bit later). To create the tag that will allow verification with asymmetric key encryption, you generate the hash and then encrypt it with your private key. The receiving party gets your public key through some means, e.g. by looking it up in LDAP or pulling it from an x.509 certificate, and uses that to verify that the hash was encrypted with the private key.
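If you happen to have the pyca/cryptography package available, that sign-the-hash flow looks roughly like this sketch (RSA with PKCS #1 v1.5 padding is assumed purely for illustration; the library hashes the data for you as part of signing):

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
data = b"the data whose origin we want to prove"

# Sign: hash the data and encrypt that hash with the private key.
signature = private_key.sign(data, padding.PKCS1v15(), hashes.SHA256())

# Verify: anyone holding the public key can check the signature; this raises
# InvalidSignature if either the data or the signature was altered.
private_key.public_key().verify(signature, data, padding.PKCS1v15(), hashes.SHA256())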

Symmetric vs Asymmetric Key Encryption

Symmetric & asymmetric key encryption get their names from whether you use a single key for both encryption & decryption (symmetric key) or one key for encryption and a different one for decryption (asymmetric key) -- although either key in an asymmetric key pair can be used for encryption, changing the effect of the encryption from privacy (using the public key) to verifying the sender (using the private key). 

Symmetric key encryption typically involves "small" keys (e.g. 256 bits) and tends to be relatively fast, especially when implemented in hardware. A symmetric key algorithm typically involves several iterations (called rounds) where the key is mixed in with the data (e.g. via bitwise xor), the bits of the data are shuffled, bytes are swapped based on a lookup table and the key is mutated (any particular algorithm may only perform a subset of those and they may be done in other orders). Generally, all of the steps in symmetric key encryption are fairly inexpensive to perform in software and can "easily" be implemented in hardware, like the AES-NI instructions, leading to the relatively low cost of symmetric key encryption.
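To give a feel for how cheap those per-round building blocks are, here's a toy round function; it's nothing like a real cipher (no key schedule, a made-up substitution table), just an illustration of the xor/substitute/shuffle steps described above:

SBOX = [(i * 7 + 3) % 256 for i in range(256)]   # made-up substitution table

def toy_round(block, round_key):
    out = bytearray()
    for b, k in zip(block, round_key):
        mixed = b ^ k                      # mix the key in with the data
        substituted = SBOX[mixed]          # swap bytes via a lookup table
        out.append(((substituted << 1) | (substituted >> 7)) & 0xFF)   # shuffle the bits
    return bytes(out)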

Asymmetric key encryption generally revolves around math operations that are easy to perform but hard to reverse, e.g. modular arithmetic (basically divide two numbers and the result is the remainder; e.g. 5 mod 3 is 2). Some of the most common asymmetric key algorithms also take advantage of exponentiation, because (x^y)^z = x^(yz) = (x^z)^y, though other algorithms take different approaches, like elliptic curves. Since the security of exponentiation & mod based algorithms rests on it being computationally expensive to factor large numbers (or to compute discrete logarithms, which is closely related), these algorithms have to use large key sizes (frequently 2048 bits or more). The computational cost of encryption with an exponentiation & mod style algorithm also grows with the key length: square-and-multiply exponentiation needs a modular multiplication or two for every bit of the exponent, and each of those multiplications gets more expensive as the numbers get bigger. Needing large keys and taking time proportional to the key length makes these algorithms significantly slower than symmetric key algorithms (elliptic curve algorithms are a different beast).
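To see why the cost tracks the key length, here's the square-and-multiply idea these algorithms rely on (Python's built-in pow(base, exp, mod) does the same thing, only much faster):

def mod_exp(base, exponent, modulus):
    # One squaring per bit of the exponent, plus a multiply for each 1 bit, so the
    # number of multiplications grows with the bit length of the exponent/key.
    result = 1
    base %= modulus
    while exponent:
        if exponent & 1:
            result = (result * base) % modulus
        base = (base * base) % modulus
        exponent >>= 1
    return result

assert 5 % 3 == 2                                # the "mod" example from above
assert mod_exp(5, 117, 19) == pow(5, 117, 19)    # matches the built-in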

With asymmetric key encryption, you have a public key, which you share with everyone else (i.e. make public) and a private key that you keep private. When determining which key to use for encryption vs signature, the rule of thumb is that you use the key opposite to the one that the entity accessing the data would have. For example, if you're signing something, the other party will have your public key, so you use the private key. If you want to send something to someone such that only they will be able to read it, you'd use their public key (since they have their private key).

Symmetric + Asymmetric Key Encryption

Symmetric and asymmetric key encryption both have pros and cons; symmetric key is fast, but requires that you have found some way to securely share a key. Asymmetric key is slow, but doesn't require a shared key. We can (and do) combine the two to get the best of both worlds by using asymmetric key encryption to create a secure channel to transmit a symmetric key. The symmetric key is small, so the cost of the asymmetric encryption is limited.
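Here's a sketch of that hybrid pattern, assuming a recent version of the pyca/cryptography package; RSA-OAEP moves the small symmetric key and AES-GCM handles the bulk data (the variable names are mine, not from any particular protocol):

import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

recipient_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()), algorithm=hashes.SHA256(), label=None)

# Sender: encrypt the bulk data with a fresh symmetric key...
session_key = AESGCM.generate_key(bit_length=256)
nonce = os.urandom(12)
ciphertext = AESGCM(session_key).encrypt(nonce, b"lots and lots of bulk data", None)

# ...and pay the asymmetric cost only on the small session key.
wrapped_key = recipient_key.public_key().encrypt(session_key, oaep)

# Receiver: unwrap the session key with the private key, then decrypt the bulk data.
unwrapped_key = recipient_key.decrypt(wrapped_key, oaep)
plaintext = AESGCM(unwrapped_key).decrypt(nonce, ciphertext, None)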

Authenticated Encryption

If you need to encrypt your data and ensure that the data hasn't been modified, you can calculate a hash over the encrypted data and sign it, or you can calculate an HMAC over the encrypted data (always calculate the hash over the encrypted data; otherwise you leave yourself open to denial-of-service attacks and potentially timing attacks that can expose your key). Signing the whole stream means you can't verify the data until you've received it all, which may take too long or use too much memory. Calculating HMACs as you go will let you verify & decrypt the data as you go, but it's a bit of a pain to implement correctly.

To simplify life, cryptographers created AEAD (authenticated encryption with associated data) modes for symmetric key algorithms like AES. These modes make it possible to use a symmetric key algorithm with more than one block of data while protecting underlying patterns in the data and producing a "tag" that can be used to verify that the data was received unaltered from the sender. One of these AEAD modes, Galois Counter Mode (GCM), also has the added benefit of allowing multiple blocks of data to be encrypted in parallel, since the IV for a block does not depend on the result of encrypting the prior block (unlike many modes, e.g. Cipher FeedBack, also known as CFB mode). The downside of AEAD modes, including GCM, is that they increase the amount of data to be stored/transmitted by the tag length, which can double the size of very small inputs. If you can afford the increase in size, however, the added security (and performance) makes using an AEAD mode a good choice.
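To show what the tag buys you, here's a small sketch with AES-GCM (again assuming the pyca/cryptography package): flip a single bit of the ciphertext and decryption fails outright instead of silently handing back garbage.

import os
from cryptography.exceptions import InvalidTag
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)
nonce = os.urandom(12)
header = b"record header that is authenticated but not encrypted"   # the "associated data"

sealed = AESGCM(key).encrypt(nonce, b"secret payload", header)   # ciphertext plus 16-byte tag

tampered = bytearray(sealed)
tampered[0] ^= 0x01                                              # attacker flips one bit
try:
    AESGCM(key).decrypt(nonce, bytes(tampered), header)
except InvalidTag:
    print("modification detected")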

Tuesday, September 15, 2015

A Tale of Two Methods

Python has some interesting aspects related to functional programming. Its classes and class instances both have methods that can be assigned to other variables and passed around, and their methods can even be replaced (a technique, called monkey patching, that usually leads to pain and suffering).

One issue with assigning a method from a class or instance to a variable that can cause significant confusion is that the behavior is different depending on whether the method's being taken from the class or from an instance of the class. Once you know what's going on, the behavior's fairly intuitive, but until then it can be really confusing.

For the sake of discussion, let's define a class, Foo. (The examples below are Python 2, where unbound methods still exist; Python 3 did away with them, though bound methods behave the same way.)

class Foo:
    def __init__(self, name):
        self.name = name
        self.aka = "who knows?"

    def bar(self):
        print("Greetings, from {0} (aka {1})".format(self.name, self.aka))

    def set_aka(self, aka):
        self.aka = aka

Then we can instantiate a Foo:
foo1 = Foo("Batman")

Now that we've got Foo and an instance of Foo, let's take their methods and see what we get!

foos_bar = Foo.bar
foo_1s_bar = foo1.bar

foos_bar: <unbound method Foo.bar>
foo_1s_bar: <bound method Foo.bar of <__main__.Foo instance at 0x7f8ae4028b48>>

I've highlighted an important distinction between the two: the method taken from Foo is an unbound method, whereas the one from foo1 is a bound method. So, what exactly does that mean? Well, let's try calling the methods and see!

foos_bar()

TypeError: unbound method bar() must be called with Foo instance as first argument (got nothing instead)

So that's what that unbound meant; it has no "self" tied to it, so there's no argument being passed for self. Given that that's an unbound method, but foo_1s_bar was a bound method, it seems likely that its behavior will differ. Let's see!

foo_1s_bar()

Greetings, from Batman (aka who knows?)

Ok, that worked, but why? I can't claim to know the details of Python's implementation, however the behavior suggests that the call to Foo is creating an object and using currying to create, for each of the class' methods, a method with self bound into it; i.e. foo1's bar is a method that takes no arguments, because it has already had foo1 bound to it as self. We can test that by instantiating another Foo and setting its bar method.

foo2 = Foo("Joker")
foo2.bar = foo_1s_bar
foo2.bar()

Greetings, from Batman (aka who knows?)

Aha! foo_1s_bar retains its self, even if you attach it to a different instance of Foo! Now that we've seen what the bound method is all about, what happens if we pass an instance of Foo to foos_bar?

foos_bar(foo1)
foos_bar(foo2)

Greetings, from Batman (aka who knows?)
Greetings, from Joker (aka who knows?)


We've mostly covered the difference between bound and unbound methods, but there's one more oddball case. What if we attach foo1's bar to Foo and then instantiate another Foo?

Foo.bar = foo_1s_bar
foo3 = Foo("The Shadow")
foo3.bar()

Greetings, from Batman (aka who knows?)

Out of curiosity, I also tested calling bar on a Foo that was instantiated before replacing Foo's bar (foo2; the one set to be Joker). Reasonably, you might expect that it'd call that foo's bar, totally ignoring the change to Foo. If it was doing currying to create the methods, then it definitely wouldn't have any effect. Alas...

Greetings, from Batman (aka who knows?)

Well, that rules out currying at instantiation as the implementation of bound methods, and gives us a major warning about the dangers of replacing unbound methods!

There are two more things to test: setting a bound method from one instance onto another instance and setting a bound method onto a class as a method that wasn't previously defined. The results of the latter aren't exactly a shocker now that we've seen what happens when you replace an unbound method with a bound one: it has the exact same behavior. The former could conceivably go either way. Unfortunately, it does not rebind; its self is still bound to the object from which the method was originally taken.
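Here's a compact sketch of those two tests (foo4, foo5, and the baz name are mine, invented for illustration):

foo4 = Foo("Two-Face")
foo4.bar = foo_1s_bar      # bound method from foo1 attached to another instance
foo4.bar()                 # -> Greetings, from Batman (aka who knows?)

Foo.baz = foo_1s_bar       # bound method attached to the class under a new name
foo5 = Foo("Riddler")
foo5.baz()                 # -> still Batman; self stays bound to foo1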

Thursday, February 26, 2015

Trust Chains in SSL: Distributing Identity Verification

Knowing with whom you're dealing is an important part of any transaction, and it has been since time immemorial. Over the years, we've developed a variety of ways to verify one another's identity, with varying levels of rigor.

Identity Verification in the Non-electronic World

You're probably familiar with a lot of the ways that we verify identities outside of the electronic world. In cases where there's little to lose if we're misled and/or the other party has little to gain by misleading us, we just tell each other our names and assume that the person is who they say they are. When the stakes get a little higher, we look to our friends to verify someone's identity, for example by introducing us in person. At the highest level of rigor, we require formal identification, usually from some entity like the government (for example, requiring a person to show their driver's license or passport).

From these ways of identifying one another, we've already seen examples of both federated trust (I trust you are who you claim to be because my friend said that's who you are and I trust my friend) and using a trusted third party (I trust you are who you say you are because the government said you are). It probably doesn't come as a huge surprise, but we end up using both in the electronic world.

Moving to the Electronic World

Verifying an entity's identity in the electronic world poses some new challenges; after all, you can't exactly ask a website for its driver's license! To deal with that, a standard named x.509 was co-opted and combined with SSL to allow websites to verify their identity. x.509 is basically a way of tying an identity to a public key (here's my previous post explaining how SSL works, including info on public & private keys). To make it work, an x.509 certificate combines information identifying an entity with a public key (and more information as well, but those are the relevant bits), then takes a secure hash of that combination and encrypts it with a private key. Taking that hash and encrypting it is known as signing the certificate, and a certificate signed with its own private key (the one that goes with the public key in the certificate) is called a self-signed certificate.

A self-signed certificate allows a web site to say who it is, with exactly the same level of assurance as a person just telling you their name. Considering we don't trust that level of assurance in important interpersonal dealings, we certainly don't trust it on the internet. To meet the need for a higher level of assurance, we have certificate authorities (usually just referred to as a CA). The service that a CA provides is that they'll sign your certificate for you, asserting that you are who you claim to be; they're basically filling the same role as the government does with the IDs that they provide. When you connect to a site that's providing a certificate signed by a CA, your browser verifies that the site has the private key tied to that x.509 certificate (as part of the SSL handshake) and then verifies that that x.509 certificate was signed by the CA.

If you need to identify a lot of servers, e.g. you're a large corporation that needs to secure email servers, web servers and database servers, it can get very inconvenient (not to mention costly!) to send off to have each certificate signed individually. To deal with that, you can get a certificate that lets you act as the trusted third party that verifies the identity of a set of servers. For example, if you're in charge of example.com, you can get a certificate that would let you sign the certificates for smtp.example.com, pop3.example.com and www.example.com as if you were a CA. If you then hit www.example.com, you'll get a certificate from www.example.com that's signed by example.com, which is in turn signed by a CA. You then verify that certificate by walking back up the chain, verifying each certificate until you get to a root CA. This is called a trust chain and is the internet equivalent of a friend that you trust introducing you to someone and telling you that you can trust them to introduce you to other people.
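As a rough illustration of "walking back up the chain", here's a sketch using a recent version of the pyca/cryptography package; it assumes RSA-signed certificates and checks only the signatures (real validation also checks names, validity dates, key usage, revocation and so on):

from cryptography import x509
from cryptography.hazmat.primitives.asymmetric import padding

def verify_signatures(pem_chain):
    # pem_chain is ordered leaf first, then each signer in turn, ending at a root CA.
    certs = [x509.load_pem_x509_certificate(pem) for pem in pem_chain]
    for cert, issuer in zip(certs, certs[1:] + [certs[-1]]):   # the root signs itself
        issuer.public_key().verify(
            cert.signature,
            cert.tbs_certificate_bytes,
            padding.PKCS1v15(),
            cert.signature_hash_algorithm,
        )   # raises InvalidSignature if any link in the chain is broken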

Who Gets to be a Certificate Authority?

In the real world, it's relatively easy to figure out who will be accepted as a central entity for verifying identities. Governments give ID cards and we (usually) accept that they're a legitimate authority to verify someone's identity. We also have a parallel to trust chains in the form of ID cards from universities, employers, etc., where we trust that they have verified a person's identity, at least in the verifier's own context. For example, if someone has a college ID, we assume that they have been verified as a student at that college.

In the electronic world, we have root certificate authorities. Your web browser comes with a lot of them, some of them you may trust, others you may not, many you'll never encounter being used in your day-to-day browsing. You can add CAs to your list of trusted authorities or remove them as you see fit, though most people just accept the browser vendor defaults. Being in the default list of trusted certificates makes those CAs de facto trusted third parties for verifying identities.

Breaking Chains & Violating Trust

There's a good chance that you've heard about Superfish, the malware that shipped on some Lenovo laptops. Superfish installs a proxy (a piece of software that sits between your browser and the internet at large) that intercepts your web traffic to inject advertisements into sites that you visit. Normally, TLS (the successor to SSL) would protect you from software like that, but Superfish installs its own certificate as a root CA. Since it makes itself a root CA, Superfish is able to generate a certificate for any site that you visit and sign it so that it appears to be a legitimately signed certificate, prompting absolutely no warning from your browser.

On its own, Superfish is annoying, and potentially dangerous if its ad server were ever compromised, but design errors make it exponentially worse. Superfish uses a single signing certificate, with a single private key, for all installations. That means that if anyone gets their hands on that private key, they can intercept and modify the web traffic of anyone with the software installed on their computer. That means they could easily get the passwords to your email, social media sites and online banking accounts. Even worse (yes, amazingly, it can get worse!) they can inject malicious content into web sites that you trust, potentially even exploiting vulnerabilities in your browser or plugins to infect your computer with additional persistent malware, leaving you even more vulnerable than before.

Sunday, February 1, 2015

Why Go Columnar?

If you've had to deal with storing data over the past few years, there's a good chance that you've heard people talking about columnar data stores. A lot of the NoSQL data stores are columnar and most of the old guard RDBMS vendors are even adding support for hybrid columnar/row-based data stores. It's gotten to the point that if you announced the release of a database that doesn't have at least hybrid columnar support you'd have people either collapsing from a terrible case of the vapors or deriding your database as a dreadfully gauche homage to obsolescence! I have no idea why you'd announce a new database to a room full of art critics and people from antebellum Georgia.

You may be wondering why so many people care so much about columnar stores, considering that whether you store data in a columnar format or a row-wise format, you're still storing the same amount of data. Before I get into the gory details of how a columnar store can be better than a row-based store, though, I'm going to explain just what it means for something to be a columnar data store.

For this example, we'll use the following database of people with their surname, first name and year of birth.

{
   ("Addams", "Morticia", 1977), 
   ("Addams", "Gomez", 1973), 
   ("Addams", "Pugsley", 2000),
   ("Addams", "Wednesday", 2003)
}

In a non-columnar (or row-wise) store, each record is kept together in one block. It would look something like this:

Addams0Morticia0x07B9Addams0Gomez0x07B5Addams0Pugsley0x07D0Addams0Wednesday0x07D3

What makes a columnar store different is that it splits up the records and stores each type of data together; i.e. it stores all of the surnames together, all of the first names together and all of the years of birth together. In a columnar store, the data would be stored like this:

Addams0Addams0Addams0Addams0Morticia0Gomez0Pugsley0Wednesday0x07B9x07B5x07D0x07D3

Now that we've covered what a columnar store is, you're probably thinking either, "Ok, that's nice. So what?" or "Are you mad?! You've lost spatial locality and will have to do multiple reads to pull a single record!" Ultimately, it comes down to economics: the difference in speed and cost between RAM and hard drives (even solid state drives) is astonishing. A 2 TB hard drive will set you back about $100 and can be found in fairly low-end PCs. 2 TB of RAM will set you back in the ballpark of about $50,000 and you'll only see it in high end servers. On the other hand, the RAM is about a million times faster than the disk.

Even though the columnar store and the row-wise store have the same amount of data, a closer examination shows that the columnar store has its redundancy more tightly packed. We can use that to encode the columns separately and save space. For example, we see that there's only one value in surname, so we can encode that as a 1 and get:

Addams11111Morticia0Gomez0Pugsley0Wednesday0x07B9x07B5x07D0x07D3

Since we're storing our data in a columnar fashion, we can also take advantage of being able to use different encodings for different fields. For example, the year of birth field's minimum value is 1973, so instead of having to store those as 2-byte values, we can rebase them to having 1973 as 0, leaving us with:

Addams11111Morticia0Gomez0Pugsley0Wednesday0x04x00x1Bx1E

By applying those few simple transformations, we've got the size down by about 25%, even on this tiny data set. We could definitely compress it further by using run-length encoding on the surname (so we'd just encode Addams4), but this has the benefit of being able to operate on the compressed data directly, which is significantly more efficient, both in terms of speed and memory usage.
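A rough sketch of those two transformations (dictionary-encoding the surname column and rebasing the year column); this is purely illustrative, not how any particular database lays out its pages:

rows = [("Addams", "Morticia", 1977), ("Addams", "Gomez", 1973),
        ("Addams", "Pugsley", 2000), ("Addams", "Wednesday", 2003)]

# Pivot the rows into columns.
surnames, first_names, years = (list(col) for col in zip(*rows))

# Dictionary-encode the surname column: one copy of each distinct value plus tiny codes.
surname_dict = sorted(set(surnames))
surname_codes = [surname_dict.index(s) + 1 for s in surnames]   # -> [1, 1, 1, 1]

# Rebase the year column against its minimum so each value fits in a single byte.
base_year = min(years)
year_offsets = bytes(y - base_year for y in years)              # -> b'\x04\x00\x1b\x1e'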

Now that we know how this works in theory, here's how it works in practice.

Encoding                        Size in Bytes
Raw Non-columnar Data              10,192,244
GZipped Non-columnar Data           1,031,028
Raw Columnar Data                  10,192,244
GZipped Columnar Data                  90,464

The columnar data storage compressed an order of magnitude better than the row-wise! Of course, we've got the obligatory caveat that your mileage may vary, depending on your data and the compression algorithm that you choose. I wouldn't recommend using gzip in general, since you have to decompress everything up to the record that you want in order to retrieve it (you can ameliorate that somewhat by compressing in blocks, so you put an upper bound on how much you have to decompress); I only used gzip because it was convenient.

Of course, like many technologies, columnar stores aren't all rainbows & sunshine. If you recall, I mentioned earlier that storing data in a columnar format wreaks havoc on spatial locality. Since spatial locality is so important to performance in modern computers, what do we do? Lose the performance benefits of locality? Lose the performance benefits of storing more of our data in RAM? No! We rebel! We reject rigid hierarchies! We embrace the hybrid approach!  Ahem, sorry, I got a bit carried away there. If we stick to compression approaches that don't require decoding the n-1 records to get the value of the nth record (or, better still, that allow you to work with the data in its compressed form), we can encode the columns as if they were being stored in a columnar format, but then write them out concatenated with the other values of their record. This way, we end up with a data store that looks like this:

Addams11Morticia0x041Gomez0x001Pugsley0x1B1Wednesday0x1E

Using this storage scheme does have the downside of making it harder to scan a column for a particular value, but ultimately that's why we have indexes.

Sunday, December 14, 2014

Padding Oracles: The Poo in Poodle

Buckle up, 'cause this involves cryptography and it's going to get a bit technical. That said, I'll try to keep it at a high level (nothing like S-boxes and avalanching); as long as you understand exclusive-or (xor) you'll have no trouble following along. To (hopefully) make it easier, I'll put major points in sections with headings.

First of all, a padding oracle has nothing to do with putting sumo suits on databases. It's a vulnerability in cryptographic systems arising from encryption algorithms needing data (we'll call it plaintext) to come in blocks of a fixed size (inventively called the blocksize). If your encryption algorithm works on blocks of size 4, but you've only got 3 bytes of data to send it, you're going to need a way to get it up to 4 bytes. The way that that's done is called padding.

Padding

We need a way to make our messages a multiple of our encryption algorithm's blocksize while still being able to tell what the plaintext was when we decrypt it (i.e. if you pad with random bytes, how will you know what's padding vs plaintext?). Fortunately, there's already a handy standard defined for us (it's PKCS #7 if you're interested). The way it works is that you encrypt each block as normal until you get to the last one. For the last block, you calculate the padding as the blocksize - the length of the last block. E.g. if your blocksize is 4 and your last block is 3 bytes long, your padding is 1, so you append a 1 to the last block; if your last block was of length 2, you'd append 2 bytes with the value of 2. More concretely, with a blocksize of 4 bytes, the plaintext {1,2,3} becomes {1,2,3,1} and {1,3} would become {1,3,2,2}. Now, you may be wondering what would happen if your plaintext was {1,2,3,1}; would the last 1 be discarded?  Nope, we always add padding, so {1,2,3,1} would become {1,2,3,1,4,4,4,4} (it's not perfect since it bloats the data, but it has the nice effect that it means we can decrypt the data).
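Here's a minimal sketch of that padding scheme in Python, just to make the rule concrete (the invalid-padding error in unpad is exactly what a padding oracle leaks):

def pad(data, blocksize):
    # Always add padding; a message that exactly fills the last block
    # gets a whole extra block of padding bytes.
    pad_len = blocksize - (len(data) % blocksize)
    return data + bytes([pad_len] * pad_len)

def unpad(padded):
    pad_len = padded[-1]
    if pad_len == 0 or pad_len > len(padded) or padded[-pad_len:] != bytes([pad_len] * pad_len):
        raise ValueError("invalid padding")
    return padded[:-pad_len]

assert pad(b"\x01\x02\x03", 4) == b"\x01\x02\x03\x01"
assert pad(b"\x01\x03", 4) == b"\x01\x03\x02\x02"
assert pad(b"\x01\x02\x03\x01", 4) == b"\x01\x02\x03\x01\x04\x04\x04\x04"
assert unpad(pad(b"\x01\x02\x03\x01", 4)) == b"\x01\x02\x03\x01"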

Now we've covered the case where we don't have enough data to fill a block for encryption. We've also seen that you may have more than one block of data to encrypt, and (most) encryption algorithms work on exactly one block of data, so we need to figure out what to do about that. Turning a block cipher (encryption algorithms that work on data a block at a time; i.e. most of them) into a stream cipher (one that works on an arbitrary amount of data) is an important area of research and one that's subtly easy to mess up. So, let's dig into some of those methods (which are frequently called "modes" in crypto-speak).

From Block to Stream

The most obvious approach to encrypting a stream of data longer than one block is to just keep applying the algorithm, block by block. This particular way of doing things is called Electronic Code Book (ECB), and it has one really nasty problem: using ECB means that any particular block of plaintext will always encrypt to the same value for a given key. A demonstration may help show why that's so bad.

Starting with this plaintext image:

I applied a Caesar cipher (a really simple algorithm that does a constant mapping from one byte to another; you'd never use this in practice, but it exaggerates the effects of ECB so they're easier to see), chaining the blocks with ECB.

The result was this image:

That's obviously not doing a great job of hiding what the plaintext was. In fact, with this particular encryption algorithm, it's no more secure than a newspaper puzzle.

It's pretty clear that we're going to need something better, something that muddles the plaintext of each block before it goes into the encryption algorithm. 

Our next contender mixes up each block of plaintext by applying xor to it and the result of encrypting the previous block and is called (again with the imaginative names) Cipher Block Chaining (or CBC for short; we LOVE acronyms!)


To see how CBC fares, I put it through the same encryption algorithm used for ECB and got this image (spoiler alert: you can still kind of see the shape; that's because the encryption has a blocksize of 1):

Even with such a painfully simple encryption algorithm, CBC isn't a complete failure. You actually have to look at the image to see the schooner Pac Man and you can't readily tell that there are two of them overlapping. 

The only obvious problem with CBC is what do you do about the first block? You could just work from its plaintext, but then the first block would have the same issues as ECB (and the first block in any data exchange tends to have a lot of data that's the same every time; e.g. GET / HTTP/1.1\r\n)



To avoid having our first block basically be in ECB mode, we add a block of random data, called an Initialization Vector (IV; goooooooo acronyms!) Using an IV, CBC mode looks pretty secure and it was used for a very long time. Unfortunately, appearances can be deceiving and there's a devastating vulnerability hidden there. To see it, let's make a trip to Delphi to see the padding oracle.

The Padding Oracle (of CBC Mode)

Before getting into the details of the padding oracle, a quick refresher on the only thing you'll need to know about xor -- xor undoes itself. By that, I mean that if you have a xor b = c, then c xor a = b and c xor b = a. It is to itself as subtraction is to addition. 

The base assumptions for the padding oracle are that you have the ciphertext (one or more blocks of it) and that, if one end of the communications channel receives data that it can't decrypt properly, it'll respond with an error message. One reason it may not decrypt properly is that the padding is wrong. Looking back, we see two things: 1) we know what the last byte of a block has to be if it's the last block of a stream and 2) the plaintext is modified by xor'ing it against the prior block's ciphertext before encrypting (which means that the decryption will yield the xor of the plaintext and the previous block's ciphertext).

Given those assumptions, here's how the padding oracle works. For the sake of simplicity, I'll refer to the end of the communications channel to which we're sending data to be decrypted as the server.
 
  1. Take one block of ciphertext out of the stream (doesn't really matter which one)
  2. Send that block to the server with a custom IV
  3. If you get an error that decryption failed, change the last byte of the IV and try again; repeat this step until decryption succeeds. At that point you know that the last byte of the plaintext xor'ed with your IV is almost certainly 1 (it could instead be 2 if the penultimate byte of the xor'ed plaintext happens to be 2, or 3 if the two bytes before it are both 3, etc.)
  4. At this point, you suspect that the plaintext xor the IV that you're sending ends in 1; so let's say that the last byte of the original plaintext is P and the last byte of the IV is I and the last byte of the xor'ed plaintext is Px. We know that P xor I = Px and we know that Px is probably 1, so we can reorder that so that Px xor I = P and now we can calculate P. The oracle has started speaking and is pronouncing the doom of your message's secrecy.
  5. Now that we know what the last byte of the actual plaintext is (what we found in 4), we can change our IV so that the last byte will decrypt to 2 and repeat step 3, modifying the next-to-last byte, until the server doesn't report a decryption error. 
  6. If you get through all 256 possible values for that byte without a success, then you know that you did not find the actual value of the last byte, so you drop back to step 3 and continue until (with the IV that you were using in 3, without 4's changes) it successfully decrypts again. 
  7. Once you get a success, you've definitely found the actual value of the last byte. You can use that and the value that you got the first time to figure out what value you had generated the first time and then you'll know that the plaintext has a consecutive run of one less than that many of that number; i.e. if you find out that you had originally gotten a 3, then you know that the two penultimate bytes of the plaintext are 3. 
  8. From steps 3-7, you now know at least the last two bytes of plaintext and you can keep applying step 3-7 moving one byte towards the front each time until you have the entire plaintext and then just keep repeating those steps for each block
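To make steps 3 and 4 concrete, here's a minimal sketch of recovering the last byte of one block; the oracle function is hypothetical and stands in for sending an (IV, ciphertext block) pair to the server and noticing whether it complains about padding:

def recover_last_byte(oracle, block, blocksize=16):
    # oracle(iv, block) is assumed to return True unless the server reports a padding error.
    for guess in range(256):
        iv = bytearray(blocksize)
        iv[-1] = guess
        if oracle(bytes(iv), block):
            # The last byte of the decryption xor'ed with our guess is (probably) 0x01.
            # To rule out the rare case where we actually hit 0x02,0x02 (etc.), change the
            # next-to-last IV byte and ask again; a padding of 0x01 doesn't care about it.
            iv[-2] ^= 0xFF
            if oracle(bytes(iv), block):
                # Xor this with the last byte of the block that actually preceded `block`
                # (or the real IV) to get the last byte of the real plaintext.
                return guess ^ 0x01
    raise ValueError("padding never validated; maybe the oracle is closed for business")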

Silencing the Oracle

Things look bleak; the Oracle's pronouncing our doom and Cassandra's gone hoarse from warning us. Is there anything we can do? Fortunately, yes, there is! Newer versions of TLS (Transport Layer Security, the successor to SSL) also provide other modes for block ciphers like GCM (Galois/Counter Mode) that don't give an attacker that level of control over the IV that gets fed into the xor operation. 

To make sure that your traffic isn't exposed, the most straightforward protection is to completely disable SSLv3 (and earlier, obviously; SSLv2 was broken almost 20 years ago!) and only use newer versions of TLS (v 1.2 if you can, because that's where AES-GCM was introduced as an available algorithm).

Thursday, June 26, 2014

Storing Strings, Succinctly

Alas, I lack an alliterative algorithm for squashing strings into less space, but hopefully I can make it up to you by sharing an improvement on a simple way of storing strings. To make it easier to follow, I'll build up to the full algorithm, step-by-step. 

Step 1: Reduce Randomness

Compression algorithms in general work better if you can reduce the randomness in your data. That may seem impossible (after all, the information is what the information is), but it is actually possible. The way that we're going to reduce randomness is by sorting our strings (e.g. putting them in alphabetical order). As an example, if we have the strings {cat, carrot, car}, we can store them as [car, carrot, cat]. That may not seem like it helps, but it does.

Step 2: Remove Redundancy

Looking at our list above ([car, carrot, cat]), we see that car is a prefix of carrot and carrot and cat share the prefix "ca". We can use that information to cut down on what we have to store, by recording the length of the shared prefix and then just writing the suffix.  What we end up with is [c,a,r,3,r,o,t,2,t], where 3 and 2 are the shared prefix lengths. Now, you may be asking how we can tell when one word ends and the next begins, and if you are, good, because that's a great question! One approach, that will probably be warmly embraced by C programmers, is to null terminate the strings, i.e. end them with a 0 (that's a zero, not an O). That approach works pretty well and leaves us with [c,a,r,0,3,r,o,t,0,2,t]
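A minimal sketch of that front-coding scheme, with null-terminated suffixes as described above:

def encode(sorted_words):
    out = bytearray()
    prev = ""
    for i, word in enumerate(sorted_words):
        shared = 0
        while shared < min(len(prev), len(word)) and prev[shared] == word[shared]:
            shared += 1
        if i:
            out.append(shared)        # length of the prefix shared with the previous word
        out += word[shared:].encode("ascii") + b"\x00"   # the new suffix, null terminated
        prev = word
    return bytes(out)

print(encode(["car", "carrot", "cat"]))   # b'car\x00\x03rot\x00\x02t\x00'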

Step 3: Look at Lengths

English words, by and large, aren't terribly long. With one byte, like the one that we used to store the length of the shared prefix or the one with which we terminate strings, we can store up to the number 255. Most words aren't nearly that long and, unless you're dealing with protein names, you're not likely to encounter any of the few that are longer than that. To take advantage of that, instead of null-terminating, let's store the length of the prefix and the length of the suffix. Since we're storing lengths, we can also take advantage of words being short by storing the prefix and suffix lengths in one byte, together. If we devote 4 bits to the length of the prefix, we can avoid storing up to 15 bytes of prefix; if the shared prefix is longer than that, we can just duplicate the extra (4 bits isn't a magic number; I arrived at it through trial and error using a list of 99,171 English words, my initial guess at a good number was 3 bits). That leaves us with another 4 bits to store the remainder and a bit of a quandary if the remaining suffix is more than 16 bytes long (it would be 15 but, since the suffix must be at least 1 byte, we trade the ability to store 0 to make the remainder 1 smaller). Thankfully, another compression trick called (among several other things) variable byte encoding, comes to our rescue.

The standard way that variable byte encoding works is to use one bit to encode if there are more bytes in the number. We're going to deviate from that just a bit by instead taking up to the entire 4 bits to store the length. If the length is larger than, or equal to, 15 we store 1's in all 4 bits, subtract 15 from the remainder, and then use standard variable byte encoding to store the rest (even if that's a byte of all zeros). 

Results

So, what has all this bit-twiddling bought us? Encoding 99,171 words, which took a total of 938,848 bytes, got it down to 323,370 bytes. Now, that's still not tiny, but it's roughly a 66% reduction in size. Gzip gets it down to about 256,000 bytes, which is considerably smaller, but this approach has the benefit that it should be fairly easy to implement it such that it can encode & decode faster than gzip and permit more nearly random access to elements.

Hey, what do you know, I did manage to describe the algorithm using alliteration! Bonus! I hope this has been at least slightly entertaining, I hope it has been useful, and I hope it has been informative.

Sunday, June 23, 2013

CS 101 Part 2

It's time for the (probably not so) eagerly awaited part 2 of my intro to comp. sci series! We're moving on now to the algorithms which, despite their name, have nothing to do with the former vice president's musical stylings. Algorithms are basically recipes for doing things, and the word comes from the name of the guy who gave us algebra (al-Khwarizmi). Since computers spend an inordinate amount of their time sorting, that's where we'll start, after a brief detour into Big-O notation.

Big O Notation

For most things that you need a computer to do, it takes more time the more things you need it to operate on, for example it takes longer to calculate the sum of 100 numbers than it does for 10; makes sense, right? What Big O notation gives us is a way to specify an upper bound on how badly an algorithm slows down as the number of elements that it has to consider grows. Some examples might help make that clearer.  Calculating the sum of n numbers takes O(n) time, since the time grows linearly with the number of elements (we say that it runs in linear time, for obvious reasons). If you want to calculate the result of multiplying every number in that list by every other number in it, that's O(n²) since for all n elements, you have to perform n calculations, so the total number of calculations is n * n.

To simplify things, with Big O notation we focus on the dominating aspect of the algorithm. By that, I mean we focus on the aspect of the algorithm that tends to chew up the most time. For example, if you have x + y = z, and x is 0.00001 and y is 1,000,000, very nearly all of the value of z is determined by y, so we'd say that y dominates. Likewise, n³ dominates n², which dominates n. Focusing on the dominating component lets us simplify by dropping the less significant portions, e.g. O(5n³ + 975n² + 8,000,000) is simplified to just O(n³) since, as n approaches infinity, the n³ portion will dominate the other components.

It's important to note that Big O notation does not tell us which of two algorithms is faster, it just gives us an idea for how an algorithm will behave as the input size grows. For example, there are some algorithms for operations on matrices that have asymptotically better behavior (that is, behavior as the input size approaches infinity) than the algorithms that are generally used. The catch, though, is that they have hidden constants so large that they're slower than the simpler algorithm for any input size smaller than every molecule in the known universe.

Some runtimes that you'll need to know because they come up frequently are:
O(n)
Linear time; if you double the input size, the runtime doubles and if you square it, then the runtime squares
O(n log n)
The log is base 2, which means that n has to double for the log factor to increase by 1
O(n²)
Quadratic time; the runtime grows as the square of its input size, so if you double n, the runtime goes up 4 times; quadruple it and the runtime goes up 16 times.

Sorting

It's at about this point that most books introduce bubble sort. I'm not sure why, since it's fairly complicated, not particularly intuitive and it has a worst-case runtime of O(n²). Instead, I'm going to introduce insertion sort and selection sort, both of which are fairly intuitive since they're the way that humans sort. Most of the sorts that we'll be examining depend on this one observation: a list of size one is always sorted.

Selection Sort

The idea behind selection sort is that you look through the list looking for the smallest element. Once you find it, you put it at the head of the list. Then you look through the list for the 2nd smallest element and put it in the 2nd position. Just keep repeating that until you're done. Selection sort is simple and easy to implement; it also happens to take O(n²) time, so it's fairly slow on large lists.
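A sketch of selection sort:

def selection_sort(items):
    for i in range(len(items)):
        smallest = i
        for j in range(i + 1, len(items)):    # scan the unsorted tail for the smallest element
            if items[j] < items[smallest]:
                smallest = j
        items[i], items[smallest] = items[smallest], items[i]   # put it in position i
    return items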

Insertion Sort

Insertion sort is not vastly different from selection sort, even though it's conceptually working in exactly the opposite way. With insertion sort, you take each element, find the position in the list where it belongs, and put it there. This depends on that key observation before, that a list of size one is already sorted; we use that to say that the first element of the list is sorted and then we proceed to extend the length of that list one element at a time. Insertion sort is simple, easy to implement and really fast when the list is small, but its worst-case runtime (when the list is in exactly the opposite order of what you want) is O(n²), so it can be horrendously slow.

We can make insertion sort a little bit more complicated, though, and have its runtime approach linear the closer the input is to being sorted. If we only look for the proper position of an element when it's not already in its proper position, then an already sorted list will be "sorted" in linear time. To do that, we only look for the proper position of an element if it's smaller than the element before it (assuming you're sorting from smallest to largest; otherwise look for its proper position if it's larger than the element before it). You could also use binary search to find where each element belongs, which limits the comparisons to O(n log n), but shifting the elements into place is still O(n²) and you'd lose insertion sort's near-linear behavior on nearly sorted lists.
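Here's a sketch of insertion sort with that early exit (an element is only moved if it's smaller than the one before it), which is what makes it nearly linear on nearly sorted input:

def insertion_sort(items):
    for i in range(1, len(items)):             # everything before index i is already sorted
        value = items[i]
        j = i
        while j > 0 and items[j - 1] > value:  # walk back only as far as needed
            items[j] = items[j - 1]
            j -= 1
        items[j] = value                       # drop the element into its proper position
    return items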

Divide and Conquer Sorting Algorithms

Now we're getting into a class of sorting algorithms that trade simplicity and obviousness for speed. The underlying observation that makes all of them work in theory is the one above, that a list of size one is always sorted (in practice, we usually get down to a sublist of a couple dozen elements and then use insertion sort on those). Building from that observation, we can say that a list is composed of a lot of sorted sub-lists that we just need to merge. All of the following algorithms boil down to how we split the list into sublists and how we merge them back together.

Quick Sort


With quick sort, we pick some value from the list, called a pivot. Then, we split the list into two lists, one containing all of the elements less than the pivot, the other containing all of the elements greater than the pivot, then do the same to each of the sub-lists. This will eventually get you to the point where all of the sublists are sorted, because they all contain 1 term. In terms of implementation, start from the first element of the list moving toward the end until you find an element larger than the pivot; then starting from the end move towards the beginning until you find an element smaller than the pivot and then swap them. When the indexes meet, you've got the lists split and it's time to do the same to the two sublists.
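Here's a sketch of quick sort; for readability it builds new sublists instead of swapping in place (an in-place partition like the one just described is what you'd use in practice):

def quick_sort(items):
    if len(items) <= 1:                 # a list of size one (or zero) is always sorted
        return items
    pivot = items[len(items) // 2]
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    return quick_sort(smaller) + equal + quick_sort(larger)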

The runtime of quick sort is a bit tricky. Its average case run time is O(n log n), but its worst case is O(n²). Fortunately, we rarely see its worst case (there are tricks that can be pulled to make the odds of seeing its worst case 1/n!; that's n-factorial, not just an exclamation mark).

Quick sort is a good default sorting algorithm because it's fairly simple & cache friendly and, since the splitting phase does all of the work, the merge phase is trivial.

Merge Sort

Where quick sort does all of its work in the split phase, merge sort does all of its work in the merge phase. The split phase just splits the list in half and then sorts the two sublists with merge sort. Doing that repeatedly will eventually get you down to n sorted sublists, each with one element. The merge phase proceeds by taking two sorted sublists and combining them into one sorted list by taking the smallest element of the two lists and putting it in the next position in the combined list until both lists are exhausted. 
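A sketch of merge sort, where the merge step does the real work:

def merge_sort(items):
    if len(items) <= 1:                            # a list of size one is already sorted
        return items
    mid = len(items) // 2
    left, right = merge_sort(items[:mid]), merge_sort(items[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):        # repeatedly take the smaller front element
        if left[i] <= right[j]:                    # <= keeps equal elements in order (stable)
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]           # one side ran out; append whatever is left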

Merge sort has a very simple run-time to consider; its best, worst and average case times are all O(n log n). 

Merge sort has the advantages of being pretty simple to implement and cache friendly, but it has the disadvantage of requiring additional memory proportional to the amount taken by the list to be sorted. There are ways of implementing merge sort that do not require additional memory, but they are more complicated and slower (they're not slower in Big-O terms, but they are slower in practice). Merge sort also has the nice property that it's a stable sort (if the list contains two elements that compare equally, the first one will end up first; e.g. if you want students sorted by age and then, within an age bracket by name, you can sort by name and then sort by age and it'll line up properly). Finishing off the list of nice properties of merge sort is that its cache-friendliness extends to disk; if your data is too large to fit into RAM, you're probably going to be implementing at least the merge phase of merge sort on disk.

Heap Sort

This is a bit of an oddity in that it's not a divide-and-conquer algorithm, it just exploits the properties of a heap. Insert all of the elements to be sorted into a heap, then remove them and they'll be ordered. Like the other comparison-based sorts, heap sort is O(n log n), though all of its other attributes (like cache friendliness and stability) depend entirely on the heap implementation.
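With Python's heapq module, the whole idea fits in a few lines:

import heapq

def heap_sort(items):
    heap = list(items)
    heapq.heapify(heap)                                       # build the heap
    return [heapq.heappop(heap) for _ in range(len(heap))]    # popping yields sorted order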

Comparison-Free Sorting

Now we're leaving the friendly, inclusive world of general-purpose sorting algorithms to get into the real speed demons: the sorting algorithms that don't do any comparisons to sort! These all exploit peculiarities in the data to be sorted to eliminate all comparisons and run in linear time.

Counting Sort

Counting sort relies on its input data being constrained to a small set of possible values; e.g. student ages. To do a counting sort, make one pass through the data counting how many elements there are with each value. Then, using that information, calculate the position at which each value should start and run through the data again, putting each element in the position where it belongs. Going back to the student ages example, if there are three 17 year olds, four 18 year olds and two 19 year olds, we have starting positions 0, 3, 7. If the first element is a 19-year-old, they go in position 7 and the 19 year olds' position is incremented to 8. 
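A sketch of counting sort on the student-ages example:

def counting_sort(ages, lowest=17, highest=19):
    counts = [0] * (highest - lowest + 1)
    for age in ages:                       # first pass: count how many of each value there are
        counts[age - lowest] += 1
    starts, total = [], 0
    for count in counts:                   # turn the counts into starting positions
        starts.append(total)
        total += count
    result = [None] * len(ages)
    for age in ages:                       # second pass: drop each element into its slot
        result[starts[age - lowest]] = age
        starts[age - lowest] += 1
    return result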

Counting sort runs in linear time and is very cache friendly, but it does require some extra memory (to store the counts) and won't work if the potentially valid inputs are unbounded.

Radix Sort

I'm going to deviate from the normal way of discussing sorting by using a list of strings (on the off chance that you're not familiar with what strings are, they're just a sequence of characters) instead of a list of integers to discuss radix sort, because it's more obvious what's going on with strings.

To radix sort a list of strings, sort all of the strings based on their first character, then within the sublist of each first letter, sort again by the second letter, and so on until the entire list is completely sorted. By restricting it to the first letter, we open up the possibility of using counting sort to perform that sort, so we still avoid doing comparisons. 

Now, to take radix sort back to integers, we may have to do something a little odd. If you're considering arbitrary-length integers and you choose your radix (base) to be 10, you have to do the sort starting with the least significant digit. When dealing with most integers, though, we'd choose the radix to be 256 so that we're sorting a byte at a time; combine that with most integers being either 32 or 64-bit numbers and we get really good, linear-time performance. 
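A sketch of least-significant-digit radix sort on non-negative integers, a byte (radix 256) at a time; each pass stably distributes the numbers by one byte (a counting sort per byte would work just as well):

def radix_sort(numbers, width_bytes=4):
    for shift in range(0, width_bytes * 8, 8):            # least significant byte first
        buckets = [[] for _ in range(256)]
        for n in numbers:
            buckets[(n >> shift) & 0xFF].append(n)        # stable: order within a bucket is kept
        numbers = [n for bucket in buckets for n in bucket]
    return numbers

print(radix_sort([1977, 2003, 1973, 2000]))   # [1973, 1977, 2000, 2003]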

The runtime of radix sort is linear, but it's linear both in terms of the number of elements and the average length of the prefixes of those elements that have to be considered to get them sorted (technically all of the comparison-based sorts also have that caveat, but it gets hidden as a comparison, which is typically treated as a constant-time operation).

Bucket Sort

Bucket sort has a lot in common with counting & radix sort and can be easily combined with them. The basic idea behind bucket sort is that you put each element into a bucket and then sort the buckets. Going back to the old standby of sorting people by age, you might do it by putting people into buckets by their age so everyone of a given age clumps together. Bucket sort plays nicely with hash maps, but isn't as cache friendly as counting and radix sort. One big benefit of bucket sort, though, is that the valid values can be unbounded, since you can make buckets as needed instead of having to create them all ahead of time.
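A sketch of bucket sort by age, using a dict so buckets are created only as they're needed:

from collections import defaultdict

def bucket_sort_by_age(people):
    buckets = defaultdict(list)            # buckets are created lazily, so ages can be unbounded
    for name, age in people:
        buckets[age].append((name, age))
    ordered = []
    for age in sorted(buckets):            # order the (small number of) buckets themselves
        ordered.extend(buckets[age])
    return ordered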

Well, that hits the highlights of sorting! Next time, we'll look into searching a collection for an item (things like searching for documents containing a word are a more advanced topic, and I may cover them later) and maybe touch on some graph algorithms.