What’s So Great About Cassandra’s Composite Columns?

There are a lot of things I really like about Cassandra. But one thing in particular I like when creating a schema is having access to composite columns (read about composite columns and their origins here on DataStax’s blog). Let’s start simple by explaining what a composite column is, and then we can dive right into why they are so much fun to work with.

A composite column is basically a column whose name has a comparator made up of other comparators. That is to say, rather than just having something like:

UTF8Type()

You can do something like:

CompositeType(UTF8Type,UTF8Type)
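One way to picture what that comparator does is with plain Python tuples, which are also compared one component at a time. This is only a sketch of the ordering, not Cassandra's API, and the sample names are purely illustrative.

```python
# Composite column names modeled as (first, second) tuples.
# Python compares tuples component by component, which mirrors how a
# CompositeType(UTF8Type, UTF8Type) comparator orders column names.
names = [
    ("tag", "database"),
    ("meta", "title"),
    ("author", "eric"),
    ("tag", "cassandra"),
    ("meta", "id"),
]

# Sorting groups names by their first component, then orders each
# group by the second component.
sorted_names = sorted(names)

for first, second in sorted_names:
    print(f"{first}:{second}")
```

Because names that share a first component end up next to each other, everything in one "group" can be read as a single contiguous range.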

But making storage more complex isn’t a great thing either. So how does adding more to a name make things easier? Think of it as a method of grouping. So let’s say you want to store information about a blog post URL (and for the sake of example, let’s say that the hashed UUID version of the URL is the rowkey). How do you break up what’s important about the URL into storage? In a common WordPress post, you have tags, categories, published date, title, URL, id, content, author(s), and even custom fields. So I would structure that as follows:

2311d5d0-8574-c25d-69e8-1e17aba15c67:
    meta:title => "What's So Great About Cassandra's Composite Columns?"
    tag:cassandra => ''
    tag:database => ''
    category:databases => ''
    meta:published_date => '2012-08-01 08:00:00 UTC'
    meta:content => 'post content'
    custom:field1 => 'value1'
    custom:field2 => 'value2'
    meta:url => 'http://eric.lubow.org/2012/databases/whats-so-great-about-cassandras-composite-columns/'
    meta:id => '1034'
    author:eric => ''

Now that we have a layout, let’s see why it’s good to have it broken up like this. If you want to load up an edit screen or render the page, then you can grab the entire row. But if you just want part of the row (which is a great thing when you have much wider rows), then you can ask Cassandra for just the meta: columns or just the tag: columns. Like any other data store, Cassandra returns data more efficiently when asked for something smaller and more specific than an entire row.
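To make the "just the tag: data" idea concrete, here is a minimal sketch that models the row above as a sorted list in plain Python and slices out one prefix. It is not a Cassandra client call; `slice_prefix` is a hypothetical helper written for this illustration.

```python
from bisect import bisect_left

# The example row, modeled as a list of ((group, name), value) pairs
# sorted by composite column name, the way Cassandra keeps columns ordered.
row = sorted({
    ("meta", "title"): "What's So Great About Cassandra's Composite Columns?",
    ("tag", "cassandra"): "",
    ("tag", "database"): "",
    ("category", "databases"): "",
    ("meta", "published_date"): "2012-08-01 08:00:00 UTC",
    ("meta", "content"): "post content",
    ("custom", "field1"): "value1",
    ("custom", "field2"): "value2",
    ("meta", "id"): "1034",
    ("author", "eric"): "",
}.items())


def slice_prefix(columns, prefix):
    """Return the contiguous run of columns whose first component is prefix."""
    names = [name for name, _ in columns]
    # (prefix,) sorts before every (prefix, anything), so this finds the
    # first column in the group.
    start = bisect_left(names, (prefix,))
    end = start
    while end < len(columns) and columns[end][0][0] == prefix:
        end += 1
    return columns[start:end]


tags = [name[1] for name, _ in slice_prefix(row, "tag")]
meta = dict((name[1], value) for name, value in slice_prefix(row, "meta"))
```

Here `tags` holds only the tag names and `meta` only the meta: columns; the rest of the row is never touched.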

I know this is a bit of a contrived example and is therefore overly simplistic. But the idea here is to show how one can extrapolate a storage methodology for a column family using composite columns. For a slightly more complex example, think of an event tracking system where we are going to be tracking things based on a URL (also in hashed UUID form as above). We’ll build a composite row key out of the timestamp, in milliseconds since epoch, of the beginning of the day and the hashed URL UUID. Note: for example’s sake, I’m going to ignore the fact that this row key methodology could create hotspots in the cluster.

1343520000000:2311d5d0-8574-c25d-69e8-1e17aba15c67:
  pageview:1343605189000:a3faae00-6286-11e1-8b6f-fbab07b8ddb9 => '{"referrer":"http://eric.lubow.org/",
       "user-agent":"Mozilla/5.0 (Windows NT 6.0; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
       "languages":"en-US,en;q=0.8","ip":"24.118.178.211"}'
  pageview:1343605189100:58e507e0-9ecb-11e1-a4e6-b50fe5e7b8a8 => '{"referrer":"http://eric.lubow.org/",
       "user-agent":"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0E; .NET4.0C)",
       "languages":"en-US,en;q=0.8","ip":"50.88.24.197"}'
  click_through:1343605199200:66036860-9ec9-11e1-90f5-fbb6d7b0017e => '{"initial":"http://eric.lubow.org/","final":"http://www.simplereach.com"}'
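For illustration, a row key like the one above could be put together roughly as follows. Two assumptions are baked in: the day boundary is taken in UTC, and the "hashed UUID" is a name-based UUID (`uuid.uuid5`), which will not reproduce the exact UUID shown in the example. The helper names are hypothetical.

```python
import uuid

DAY_MS = 24 * 60 * 60 * 1000


def day_start_ms(ts_ms):
    """Round a millisecond epoch timestamp down to the start of its UTC day."""
    return ts_ms - (ts_ms % DAY_MS)


def url_uuid(url):
    # Assumption: a name-based (SHA-1) UUID stands in for whatever hashing
    # scheme produced the UUIDs in the example rows.
    return uuid.uuid5(uuid.NAMESPACE_URL, url)


def row_key(ts_ms, url):
    """Build '<day start in ms>:<url uuid>' for the given event time and URL."""
    return f"{day_start_ms(ts_ms)}:{url_uuid(url)}"


key = row_key(1343605189000, "http://eric.lubow.org/")
print(key)
```

Every event for the same URL on the same day lands in the same row, which is what makes the per-day slices possible (and also what causes the hotspot risk mentioned in the note).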

What this allows you to do is create queries where you say, “give me all pageviews that took place between 2am and 3am” (with the times expressed in milliseconds since epoch). I’ve only put pageviews and a single custom event in the example above. But you can imagine a situation where you create any number of event types and have the event type be the first comparator in the CompositeType(), the timestamp be the second comparator, and a Time UUID be the third, so that if you have more than one column per event, you can grab all the columns for a particular event.
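Here is a rough sketch of that kind of time-window query, again using a sorted Python list as a stand-in for the row rather than a real Cassandra client. The payloads are shortened, `slice_events` is a hypothetical helper, and since the sample columns are stamped around 23:39 UTC, the window used is 11pm to midnight rather than 2am to 3am.

```python
from bisect import bisect_left

# (event_type, timestamp_ms, event_uuid) -> payload, kept in sorted order.
events = sorted({
    ("pageview", 1343605189000, "a3faae00-6286-11e1-8b6f-fbab07b8ddb9"):
        '{"referrer":"http://eric.lubow.org/"}',
    ("pageview", 1343605189100, "58e507e0-9ecb-11e1-a4e6-b50fe5e7b8a8"):
        '{"referrer":"http://eric.lubow.org/"}',
    ("click_through", 1343605199200, "66036860-9ec9-11e1-90f5-fbb6d7b0017e"):
        '{"initial":"http://eric.lubow.org/","final":"http://www.simplereach.com"}',
}.items())


def slice_events(columns, event_type, start_ms, end_ms):
    """Return columns of one event type with start_ms <= timestamp < end_ms."""
    names = [name for name, _ in columns]
    # A two-component bound sorts before any three-component name that
    # shares the same first two components.
    lo = bisect_left(names, (event_type, start_ms))
    hi = bisect_left(names, (event_type, end_ms))
    return columns[lo:hi]


HOUR_MS = 60 * 60 * 1000
day_start = 1343520000000
pageviews = slice_events(
    events, "pageview", day_start + 23 * HOUR_MS, day_start + 24 * HOUR_MS
)
```

The slice works because the event type is fixed and only the second component varies, so the matching columns form one contiguous range.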

As you can see from the two over-simplified examples above, composite columns in Cassandra have the potential to be very powerful if you know how your data is going to look and how you’d like to access it. Note that in the second example, you can’t find all events of every type that happened between 2am and 3am with a single query, because the event type comes first in the column name. The only option is to loop over the list of event types and slice out the events for the time period once per event type. So knowing your data access patterns is important for creating your data storage patterns. There may be a little trial and error involved as you scale and add more data. But knowing the tools that are available to you is an incredibly important step in that process.
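The per-event-type loop described above might look something like this. As before, this is a plain-Python model of the sorted row, not a Cassandra client; `slice_events` and `slice_all_types` are hypothetical helpers and the payloads are placeholders.

```python
from bisect import bisect_left


def slice_events(columns, event_type, start_ms, end_ms):
    """Return columns of one event type with start_ms <= timestamp < end_ms."""
    names = [name for name, _ in columns]
    lo = bisect_left(names, (event_type, start_ms))
    hi = bisect_left(names, (event_type, end_ms))
    return columns[lo:hi]


def slice_all_types(columns, event_types, start_ms, end_ms):
    """Run one slice per event type, since the type leads the column name."""
    results = {}
    for event_type in event_types:
        results[event_type] = slice_events(columns, event_type, start_ms, end_ms)
    return results


# ((event_type, timestamp_ms, event_uuid), payload) pairs in sorted order.
events = sorted([
    (("pageview", 1343605189000, "a3faae00-6286-11e1-8b6f-fbab07b8ddb9"), "{}"),
    (("pageview", 1343605189100, "58e507e0-9ecb-11e1-a4e6-b50fe5e7b8a8"), "{}"),
    (("click_through", 1343605199200, "66036860-9ec9-11e1-90f5-fbb6d7b0017e"), "{}"),
])

found = slice_all_types(
    events, ["pageview", "click_through"], 1343602800000, 1343606400000
)
```

If "everything in a time window" is the more common query, putting the timestamp first in the composite would turn this loop into a single slice, at the cost of making per-type queries harder.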