Adding Cross Zone Load Balancing in AWS

One of the new hotness features that Amazon added to their Elastic Load Balancers is cross zone load balancing. This offers the ability to have an unbalanced number of nodes per availability zone within an Amazon region. For instance, if you were load balancing across us-east-1a, us-east-1b, and us-east-1c, you previously needed to have the same number of instances in each zone; otherwise the traffic would skew and overload the zone with fewer instances. If you are auto-scaling, using spot instances, or just happen to lose instances from time to time, you can easily see where this becomes a problem.
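To put rough numbers on that skew, here is a minimal sketch. It assumes a simplified model: without cross zone load balancing, traffic is split evenly across zones and then evenly across the instances inside each zone; with it, traffic is split evenly across every instance. The function name and the fleet sizes are made up for illustration.

```python
from typing import Dict


def per_instance_share(instances_per_zone: Dict[str, int], cross_zone: bool) -> Dict[str, float]:
    """Return the fraction of total traffic that one instance in each zone receives.

    Simplified model (an assumption, not a measurement):
    - cross_zone=False: each zone gets an equal slice, shared by its own instances.
    - cross_zone=True: every instance in every zone gets an equal slice.
    """
    # Ignore zones that have no instances registered.
    zones = {zone: count for zone, count in instances_per_zone.items() if count > 0}
    if cross_zone:
        total_instances = sum(zones.values())
        return {zone: 1.0 / total_instances for zone in zones}
    zone_share = 1.0 / len(zones)
    return {zone: zone_share / count for zone, count in zones.items()}


# Hypothetical fleet: one zone has lost half of its instances.
fleet = {"us-east-1a": 4, "us-east-1b": 4, "us-east-1c": 2}
without_cross_zone = per_instance_share(fleet, cross_zone=False)
with_cross_zone = per_instance_share(fleet, cross_zone=True)
```

In this model, each instance in us-east-1c carries twice the traffic of an instance in the other two zones until cross zone load balancing evens it out.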

Pros and Cons of Redis-Resque and SQS

Every system or application has its upsides and downsides. The two queueing systems that I want to explore are Resque and Amazon’s Simple Queue Service (SQS). Resque is essentially a set of queuing APIs that run on Redis. Redis is an in-memory data store and is what actually handles the queues. It’s capable of handling complex data structures like lists (what Resque queues use), sets, or sorted sets. Amazon’s SQS is an eventually consistent, sharded messaging/queueing system.
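To make the "queue on top of a list" idea concrete, here is a minimal sketch that stands in for a Redis list with an in-memory deque. It assumes jobs are pushed onto the tail and popped from the head (mirroring the Redis RPUSH and LPOP commands) and that each job is a small JSON payload; the class, the helper names, and the job names are hypothetical and are not Resque's actual API.

```python
import json
from collections import deque
from typing import Any, Deque, Dict, Optional


class InMemoryListQueue:
    """In-memory stand-in for a single Redis list used as a FIFO queue."""

    def __init__(self) -> None:
        self._items: Deque[str] = deque()

    def rpush(self, payload: str) -> int:
        """Append to the tail, like Redis RPUSH; return the new length."""
        self._items.append(payload)
        return len(self._items)

    def lpop(self) -> Optional[str]:
        """Remove and return the head, like Redis LPOP; None when empty."""
        return self._items.popleft() if self._items else None


def enqueue(queue: InMemoryListQueue, job_class: str, *args: Any) -> None:
    """Serialize a job as JSON and push it onto the queue (hypothetical helper)."""
    queue.rpush(json.dumps({"class": job_class, "args": list(args)}))


def reserve(queue: InMemoryListQueue) -> Optional[Dict[str, Any]]:
    """Pop the oldest job and decode it, or return None if the queue is empty."""
    raw = queue.lpop()
    return json.loads(raw) if raw is not None else None
```

A worker loop would call `reserve` repeatedly and run whatever job comes back. Because the list lives in one place, jobs come out in the order they went in, which is a useful contrast with an eventually consistent, sharded system.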

Pig Queries Parsing JSON on Amazon’s Elastic Map Reduce Using S3 Data

I know the title of this post is a mouthful, but that’s the fun of pushing the envelope of existing technologies. What I am looking to do is take my log data stored on S3 (which is in compressed JSON format) and run queries against it. I don’t want to have to learn everything about setting up Hadoop, I still want the ability to leverage the power of Hadoop’s distributed data processing framework, I don’t want to have to learn how to write map reduce jobs, and … (this could go on for a while so I’ll just stop here). For all these reasons, I chose to use Amazon’s Elastic Map Reduce infrastructure and Pig.

Distributed Flume Setup With an S3 Sink

I have recently spent a few days getting up to speed with Flume, Cloudera’s distributed log offering. If you haven’t seen this and deal with lots of logs, you are definitely missing out on a fantastic project. I’m not going to spend time talking about it because the user’s guide and the Quora Flume Topic describe it better than I can. What I will tell you about is my experience setting up Flume in a distributed environment to sync logs to an Amazon S3 sink.

As CTO of SimpleReach, a company that does most of its work in the cloud, I’m constantly strategizing on how we can take advantage of the cloud for auto-scaling. Depending on the time of day or how much content distribution we are dealing with, we will spawn new instances to accommodate the load. We will still need the logs from those machines for later analysis (batch jobs, such as those that make use of Elastic Map Reduce).