Tuesday, September 15, 2015

A Tale of Two Methods

Python has some interesting aspects related to functional programming. Its classes and class instances both have methods that can be assigned to other variables and passed around and they can have their methods replaced (a technique, called monkey patching, that usually leads to pain and suffering).

One issue with assigning a method from a class or instance to a variable that can cause significant confusion is that the behavior is different depending on whether the method's being taken from the class or from an instance of the class. Once you know what's going on, the behavior's fairly intuitive, but until then it can be really confusing.

For the sake of discussion, let's define a class, Foo.

class Foo:
    def __init__(self, name):
        self.name=name
        self.aka = "who knows?"

    def bar(self):
        print("Greetings, from {0} (aka {1})".format(self.name, self.aka))

    def set_aka(self, aka):
        self.aka = aka

Then we can instantiate a Foo:
foo1 = Foo("Batman")

Now that we've got Foo and an instance of Foo, let's take their methods and see what we get!

foos_bar = Foo.bar
foo_1s_bar = foo1.bar

foos_bar: <unbound method Foo.bar>
foo_1s_bar: <bound method Foo.bar of <__main__.Foo instance at 0x7f8ae4028b48>>

I've highlighted an important distinction between the two; the method taken from Foo, is an unbound method, whereas the one from foo1 is a bound method. So, what exactly does that mean? Well, let's try calling the methods and see!

foos_bar()

TypeError: unbound method bar() must be called with Foo instance as first argument (got nothing instead)

So that's what that unbound meant; it has no "self" tied to it, so there's no argument being passed for self. Given that that's an unbound method, but foo_1s_bar was a bound method, it seems likely that its behavior will differ. Let's see!

foo_1s_bar()

Greetings, from Batman (aka who knows?)

Ok, that worked, but why? I can't claim to know the details of Python's implementation, however the behavior suggests that the call to Foo is creating an object and using currying to create a method for each of the class' methods that have a
self
bound into them; i.e. foo1's bar is a method that takes no arguments, because it has already had foo1 bound to it as self. We can test that by instantiating another Foo and setting its bar method.

foo2 = Foo("Joker")
foo2.bar = foo_1s_bar
foo2.bar()

Greetings, from Batman (aka who knows?)

Aha! foo_1s_bar retains its self, even if you attach it to a different instance of Foo! Now that we've seen what the bound method is all about, what happens if we pass an instance of Foo to foos_bar?

foos_bar(foo1)
foos_bar(foo2)

Greetings, from Batman (aka who knows?)
Greetings, from Joker (aka who knows?)


We've mostly covered the difference between bound and unbound methods, but there's one more oddball case. What if we attach foo1's bar to Foo and then instantiate another Foo?

Foo.bar = foo_1s_bar
foo3 = Foo("The Shadow")
foo3.bar()

Greetings, from Batman (aka who knows?)

Out of curiosity, I also tested calling bar on a Foo that was instantiated before replacing Foo's bar (foo2; the one set to be Joker). Reasonably, you might expect that it'd call that foo's bar, totally ignoring the change to Foo. If it was doing currying to create the methods, then it definitely wouldn't have any effect. Alas...

Greetings, from Batman (aka who knows?)

Well, that rules out currying at instantiation as the implementation of bound methods, and gives us a major warning about the dangers of replacing unbound methods!

There are two more things to test: setting a bound method from one instance onto another instance and setting a bound method onto a class as a method that wasn't defined. The results of the latter aren't exactly a shocker now that we've seen what happens when you replace an unbound method with a bound one. It has the exact same behavior. The former, on the other hand, could go either way. Unfortunately, it does not. Its self is still bound to the object from which the method was referenced.

Thursday, February 26, 2015

Trust Chains in SSL: Distributing Identity Verification

Knowing with whom you're dealing is an important part of any transaction, and it has been since time immemorial. Over the years, we've developed a variety of ways to verify one another's identity, with varying levels of rigor.

Identity Verification in the Non-electronic World

You're probably familiar with a lot of the ways that we verify identities outside of the electronic world. In cases where there's little to lose if we're misled and/or the other party has little to gain by misleading us, we just tell each other our names and just assume that the person is who they say they are. When the stakes get a little higher, we look to our friends to verify someone's identity, for example by introducing us in person. At the highest level of rigor, we require formal identification, usually from some entity like the government (for example, requiring a person to show their driver's license or passport).

From these ways of identifying one another, we've already seen examples of both federated trust (I trust you are who you claim to be because my friend said that's who you are and I trust my friend) and using a trusted third party (I trust you are who you say you are because the government said you are). It probably doesn't come as a huge surprise, but we end up using both in the electronic world.

Moving to the Electronic World

Verifying an entity's identity in the electronic world poses some new challenges; after all, you can't exactly ask a website for its driver's license! To deal with that, a standard named x.509 was co-opted and combined with SSL to allow websites to verify their identity. x.509 is basically a way of tying an identity to a public key (here's my previous post explaining how SSL works, including info on public & private keys). To make it work, an x.509 certificate combines information identifying an entity with a public key (and more information as well, but that's the relevant bits), then takes a secure hash of that combination and encrypts it with the private key that goes with the public key in the certificate. Taking that hash and encrypting it is known as signing the certificate, and a certificate that is signed by itself is called a self-signed certificate.

A self-signed certificate allows a web site to say who it is, with exactly the same level of assurance as a person just telling you their name. Considering we don't trust that level of assurance in important interpersonal dealings, we certainly don't trust it on the internet. To meet the need for a higher level of assurance, we have certificate authorities (usually just referred to as a CA). The service that a CA provides is that they'll sign your certificate for you, asserting that you are who you claim to be; they're basically filling the same role as the government does with the IDs that they provide. When you connect to a site that's providing a certificate signed by a CA, your browser verifies that the site has the private key tied to that x.509 certificate (as part of the SSL handshake) and then verifies that that x.509 certificate was signed by the CA.

If you need to identify a lot of servers, e.g. you're a large corporation that needs to secure email servers, web servers and database servers, it can get very inconvenient (not to mention costly!) to send off to have each certificate signed individually. To deal with that, you can get a certificate that lets you act as the trusted third party that verifies the identity of a set of servers. For example, if you're in charge of example.com, you can get a certificate that would let you sign the certificates for smtp.example.com, pop3.example.com and www.example.com as if you were a CA. If you then hits www.example.com you'll get a certificate from www.example.com that's signed by example.com, which is then signed by a CA. You then verify that certificate by walking back up the chain, verifying each certificate until you get to a root CA. This is called a trust chain and is the internet equivalent of a friend that you trust introducing you to someone and telling you that you can trust them to introduce you to other people.

Who Gets to be a Certificate Authority?

In the real world, it's relatively easy to figure out who will be accepted as a central entity for verifying identities. Governments give ID cards and we (usually) accept that they're a legitimate authority to verify someone's identity. We also have a a parallel to trust chains in the form of ID cards from universities, employers, etc, where we trust that they have verified a person's identity, at least in the context of the verifier. For example, if someone has a college ID, we assume that they have been verified as students at that college.

In the electronic world, we have root certificate authorities. Your web browser comes with a lot of them, some of them you may trust, others you may not, many you'll never encounter being used in your day-to-day browsing. You can add CAs to your list of trusted authorities or remove them as you see fit, though most people just accept the browser vendor defaults. Being in the default list of trusted certificates makes those CAs de facto trusted third parties for verifying identities.

Breaking Chains & Violating Trust

There's a good chance that you've heard about Superfish, the malware that shipped on some Lenovo laptops. Superfish installs a proxy (a piece of software that sits between your browser and the internet at large) that intercepts your web traffic to inject advertisements into sites that you visit. Normally, TLS (the successor to SSL) would protect you from software like that, but Superfish installs its own certificate as a root CA. Since it makes itself a root CA, Superfish is able to generate a certificate for any site that you visit and sign it so that it appears to be a legitimately signed certificate, prompting absolutely no warning from your browser.

On its own, Superfish is annoying and potentially dangerous if their ad server was compromised, but design errors make it exponentially worse. Superfish uses a single signing certificate for all installations. That means that if anyone gets their hands on that certificate, they can intercept and modify the web traffic of anyone with the software installed on their computer. That means they could easily get the passwords to your email, social media sites and online banking accounts. Even worse (yes, amazingly, it can get worse!) they can inject malicious content into web sites that you trust, potentially even exploiting vulnerabilities in your browser or plugins to infect your computer with additional persistent malware, leaving you even more vulnerable than before.

Sunday, February 1, 2015

Why Go Columnar?

If you've had to deal with storing data over the past few years, there's a good chance that you've heard people talking about columnar data stores. A lot of the NoSQL data stores are columnar and most of the old guard RDBMS vendors are even adding support for hybrid columnar/row-based data stores. It's gotten to the point that if you announced the release of a database that doesn't have at least hybrid columnar support you'd have people either collapsing from a terrible case of the vapors or deriding your database as a dreadfully gauche homage to obsolescence! I have no idea why you'd announce a new database to a room full of art critics and people from antebellum Georgia.

You may be wondering why so many people care so much about columnar stores, considering whether you store data in a columnar format or row-wise format, you're still storing the same amount of data. Before I get into the gory details of how a columnar store can be better than a row-based store, though, I'm going to explain just what it means for something to be a columnar data store.

For this example, we'll use the following database of people with their surname, first name and year of birth.

{
   ("Addams", "Morticia", 1977), 
   ("Addams", "Gomez", 1973), 
   ("Addams", "Pugsley", 2000),
   ("Addams", "Wednesday", 2003)
}

In a non-columnar (or row-wise) store, the data would be stored with each entry stored in one block. It would look something like this:

Addams0Morticia0x07B9Addams0Gomez0x07B5Addams0Pugsley0x07D0Addams0Wednesday0x07D3

What makes a columnar store different is that it splits up the records and stores each type of data together; i.e. it stores all of the surnames together, all of the first names together and all of the years of birth together. In a columnar store, the data would be stored like this:

Addams0Addams0Addams0Addams0Morticia0Gomez0Pugsley0Wednesday0x07B9x07B5x07D0x07D3

Now that we've covered what a columnar store is, you're probably thinking either, "Ok, that's nice. So what?" or "Are you mad?! You've lost spatial locality and will have to do multiple reads to pull a single record!" Ultimately, it comes down to economics: the difference in speed and cost between RAM and hard drives (even solid state drives) is astonishing. A 2 TB hard drive will set you back about $100 and can be found in fairly low-end PCs. 2 TB of RAM will set you back in the ballpark of about $50,000 and you'll only see it in high end servers. On the other hand, the RAM is about a million times faster than the disk. Even though the columnar store and the row-wise store have the same amount of data, a closer examination shows that the columnar store has its redundancy more tightly packed. We can use that to encode the columns separately and save space. For example, we see that there's only one value in surname, so we can encode that as a 1 and get:

Addams11111Morticia0Gomez0Pugsley0Wednesday0x07B9x07B5x07D0x07D3

Since we're storing our data in a columnar fashion, we can also take advantage of being able to use different encodings for different fields. For example, the year of birth field's minimum value is 1973, so instead of having to store those as 2-byte values, we can rebase them to having 1973 as 0, leaving us with:

Addams11111Morticia0Gomez0Pugsley0Wednesday0x04x00x17xA0

By applying those few simple transformations, we've got the size down by about 25%, even on this tiny data set. We could definitely compress it further by using run-length encoding on the surname (so we'd just encode Addams4), but this has the benefit of being able to operate on the compressed data directly, which is significantly more efficient, both in terms of speed and memory usage.

Now that we know how this works in theory, here's how it works in practice.

EncodingSize in Bytes
Raw Non-columnar Data10,192,244
GZipped Non-columnar Data1,031,028
Raw Columnar Data10,192,244
GZipped Columnar Data90,464

The columnar data storage compressed an order of magnitude better than the row-wise! Of course, we've got the obligatory caveat that your mileage may vary, depending on your data and the compression algorithm that you choose. I wouldn't recommend using gzip in general, since you have to decompress everything up to the record that you want to retrieve the record (you can ameliorate that somewhat by compressing in blocks, so you put an upper bound on how much you have to decompress); I only used gzip because it was convenient.

Of course, like many technologies, columnar stores aren't all rainbows & sunshine. If you recall, I mentioned earlier that storing data in a columnar format wreaks havoc on spatial locality. Since spatial locality is so important to performance in modern computers, what do we do? Lose the performance benefits of locality? Lose the performance benefits of storing more of our data in RAM? No! We rebel! We reject rigid hierarchies! We embrace the hybrid approach!  Ahem, sorry, I got a bit carried away there. If we stick to compression approaches that don't require decoding the n-1 records to get the value of the nth record (or, better still, that allow you to work with the data in its compressed form), we can encode the columns as if they were being stored in a columnar format, but then write them out concatenated with the other values of their record. This way, we end up with a data store that looks like this:

Addams11Morticia0x041Gomez0x001Pugsley0x171Wednesday0xA0

Using this storage scheme does have the downside of making it harder to scan a column for a particular value, but ultimately that's why we have indexes.