Monday, January 11, 2016

Crypto Kindly

Encryption can be a daunting topic, with symmetric keys, asymmetric keys, public and private keys, RSA and Diffie-Hellman, HMACs and hashing algorithms, it can seem like too much for anyone to ever understand! To try to cut through the confusion, I've put together this brief overview of some of the relevant topics. To keep things approachable, I'm leaving out a lot of the details that you'd need to actually implement any of these algorithms, but it's usually best to use previously written and tested libraries anyway, since subtle things like differences in the time to complete an operation can be used to break encryption.

Hashing vs Encryption

The first thing to get out of the way is the difference between hashing and encryption. One of the most obvious differentiating factors between the two is that if you encrypt something you generally intend to get that information back, whereas if you hash it, you explicitly do not want to be able to get the data back from the hash. 

Another important difference between the two is that the result of hash algorithms is constant, regardless of how much or how little data you put through them; encryption, on the other hand, produces output that's at least as large as its input.

In a perfect world, a hash algorithm would have an irreversible (that is, given a hash it's impossible to determine what data produced it) mapping from input to output, with no two input values producing the same output value. Unfortunately, since there are infinitely many potential input values to a hash algorithm and a finite number of output values (since its length is constant), for any particular input there may be infinitely many other input values that will produce the same output. When two different input values produce the same output value, we call that a collision. Making intentionally creating a collision as hard as possible is one of the key design goals of hashing algorithms, since being able to create collisions at will leads to attacks against most of the uses of hashing. When security researchers refer to a hashing algorithm as being broken, what they usually mean is that it has become relatively easy to find collisions. 

Hashing is most often used to either verify that a piece of data has not changed or to obfuscate data like passwords so that you can verify that a given value matched a prior known value without having to store that value. 

Verifying that data hasn't changed using a hash involves combining the hash with variants of encryption. If the party sending the data has a secret value that it shares with the party receiving the data, you can use a hash-based message authentication code (HMAC) to make a verifiable tag for the data. Basically, the way that HMAC works is by mixing the secret in with the data and then hashing the lot, potentially with multiple rounds of the hash algorithm. The receiving party can recalculate the hash with the secret and, if they match, know that whoever sent the data also had the shared secret.

If you can't reliably share a secret between the sender & receiver, another option is to use asymmetric key encryption (I'll explain that in more detail a little bit later). To create the tag that will allow verification with asymmetric key encryption, you generate the hash and then encrypt it with your private key. The receiving party gets your public key through some means, e.g. by looking it up in LDAP or pulling it from an x.509 certificate, and uses that to verify that the hash was encrypted with the private key.

Symmetric vs Asymmetric Key Encryption

Symmetric & asymmetric key encryption get their names from whether you use a single key for both encryption & decryption (symmetric key) or one key for encryption and a different one for decryption (asymmetric key) -- although either key in an asymmetric key pair can be used for encryption, changing the effect of the encryption from privacy (using the public key) to verifying the sender (using the private key). 

Symmetric key encryption typically involves "small" keys (e.g. 256 bits) and tends to be relatively fast, especially when implemented in hardware. A symmetric key algorithm typically involves several iterations (called rounds) where the key is mixed in with the data (e.g. via bitwise xor), the bits of the data are shuffled, bytes are swapped based on a lookup table and the key is mutated (any particular algorithm may only perform a subset of those and they may be done in other orders). Generally, all of the steps in symmetric key encryption are fairly inexpensive to perform in software and can "easily" be implemented in hardware, like the AES-NI instructions, leading to the relatively low cost of symmetric key encryption.

Asymmetric key encryption generally revolves around math operations that are easy to perform but hard to reverse, e.g. modular arithmetic (basically divide two numbers and the result is the remainder; e.g. 5 mod 3 is 2). Some of the most common asymmetric key algorithms also take advantage of exponentiation, because xyz = xyz, though other algorithms take different approaches, like elliptic curves.  Since the security of exponentiation & mod based algorithms is based entirely on factoring numbers being computationally expensive, these algorithms have to use large key sizes (frequently 2048-bits or more). The computational cost of encryption using an exponentiation & mod style algorithm is linear in the number of bits in the key (exponentiation is log2 of the key, but the exponential in the number of bits, so overall it's linear in the number of bits). The fact that these algorithms take time proportionate to the key length and require large keys makes them tend to be significantly slower than symmetric key algorithms (elliptic curve algorithms are a different beast).

With asymmetric key encryption, you have a public key, which you share with everyone else (i.e. make public) and a private key that you keep private. When determining which key to use for encryption vs signature, the rule of thumb is that you use the key opposite to the one that the entity accessing the data would have. For example, if you're signing something, the other party will have your public key, so you use the private key. If you want to send something to someone such that only they will be able to read it, you'd use their public key (since they have their private key).

Symmetric + Asymmetric Key Encryption

Symmetric and asymmetric key encryption both have pros and cons; symmetric key is fast, but requires that you have found some way to securely share a key. Asymmetric key is slow, but doesn't require a shared key. We can (and do) combine the two to get the best of both worlds by using asymmetric key encryption to create a secure channel to transmit a symmetric key. The symmetric key is small, so the cost of the asymmetric encryption is limited.

Authenticated Encryption

If you need to encrypt your data and ensure that the data hasn't been modified, you can calculate a hash over the encrypted data and sign it, or you can calculate an HMAC over the encrypted data (always calculate the hash over the encrypted data, otherwise you leave yourself open to denial-of-service attacks and potentially timing attacks that can expose your key). Signing the whole stream means you can't verify the data until you've received it all, which may take too long or use too much memory. Calculating HMACs as you go will let you verify & decrypt the data as you go, but it's a bit of a pain to implement correctly. To simplify life, cryptographers created AEAD (authenticated encryption with optional additional data) modes for symmetric key algorithms like AES. These modes make it possible to use a symmetric key algorithm with more than one block of data while protecting underlying patterns in the data and producing a "tag" that can be used to verify that the data was received unaltered from the sender. One of these AEAD modes, Galois Counter Mode (GCM), also has the added benefit of allowing for encrypting multiple blocks of data in parallel, since the IV for a block does not depend on the result of encryption the prior block (unlike many modes, e.g. Cipher FeedBack, also known as CFB mode). The downside of AEAD modes, including GCM, is that they increase the amount of data to be stored/transmitted by the tag length, which may double the size of the input. If you can afford the increase in size, however, the added security (and performance) makes using an AEAD mode a good choice.

No comments: