Batching Work For Efficiency and Tuning

I’ve been talking a lot about message systems in distributed architectures lately. And one of the slides I show in my talks is a slide about compressing messages before writes to the database. In other words, if you have 150k messages per second coming in which would translate 1:1 in writes and force your database(s) to incur a 150k write per second load, you pull in all those messages in to memory for a short period (say one minute) and group them and write the group in batch. Depending on how much you can group, you can easily cut your write load by an order of magnitude.

For example, if you are writing page views for URLs, the more frequently you see a URL, the better the batch compression is. If you see a URL 100 times in one minute which translates to one message per page view or 100 messages, then instead of writing 100 messages which is 100 database writes, you just write 1 message and incur one database write. If you extrapolate this out, that is a lot of savings and can easily change the complexion of your database workload.

This is the slide in question:

Aggregator Flow Slide

Flow of message aggregation

Now the question I frequently get asked around this slide is, when you are using Cassandra (which is generally a write optimized database), why are you batching your writes so heavily?

The answer is because Cassandra is most difficult to tune for a mixed workload. A mixed workload is when your database is doing a roughly equal amount of reads and writes. It is much easier to tune Cassandra for either a write heavy or a read heavy workload. By batching your writes, you can allow yourself Cassandra to be a more read heavy system and then optimize your settings accordingly. I would even contend that batching and minimizing write load is similar in mindset to caching to minimize the read load allowing your database to be performant.