After being a fairly heavy user of NSQ over the past year or so and finding that it was missing a few features, I decided to jump in and try to add them myself. The only issue was that I didn’t know Go. Since something as simple as not knowing the language the application was written in has never stopped me before, I wasn’t going to let it stop me now.
Backing up for a moment, let’s talk about what NSQ does. NSQ is a realtime distributed message platform designed to operate at scale. It is focused on distributed and decentralized topologies to remove the idea of single points of failure. We use it quite heavily at SimpleReach as it is the backbone of the way data flows through our architecture.
Working with data at large quantities like we do, sometimes you only want a small portion to test things. Since that wasn’t really possible in the current architecture, a few small tweaks were doable. In total, I only added 2 things: finite channel tailing and channel sampling.
Finite Channel Tailing
One of the applications that already ships with NSQ is nsq_tail. nsq_tail allows you to look at the messages coming through a topic by adding a temporary channel to the topic that disappears when its clients disappear. I wanted the ability to only get a finite number of messages similar to the way *nix tail works. In other words, the NSQ equivalent of tail -5 file. To accomplish this, it’s now:
nsq_tail -n 5 --lookupd-http-address=lookup-001.example.com:4161 --topic mytopic
The -n stands for the number of messages that you would like to see pulled from the topic. This was added with this pull request.
The other feature that was added was the ability to take only a percentage of the message being put on a channel from a topic. There are a few use-cases for this. The easiest of these use-cases to think about is statistical information. For starters, if you have 100 million messages passing through a topic per day and want to run some statistical calculations against the data in these messages, statistics obviates the need to look at every message to get an answer. Depending on what you are looking for, you may be able to get away with looking at 10% of the messages. Prior to channel sampling, the client/consumer still had to ingest all the messages from the producer. This meant a lot of extra and wasted network and processor utilization. By shifting the onus from the consumer to the producer, it leaves more processing power for the consumer and makes all of the processing more efficient.
If you want to use channel sampling in Python (assuming you are using nsqd 0.2.25+), when you connect to an NSQ instance, just add sample_rate to the initialization. The following code will get you approximately 10% of the messages that are sent to the topic.
r = nsq.Reader(message_handler=handler,
topic='nsq_reader', channel='asdf', lookupd_poll_interval=15, sample_rate=10)
I still have a little more to do. And these features are available as of 0.2.25. There is a front end monitoring system for NSQ called nsqadmin. I would still like to the ability to see whether or not the channel is being sampled. There is an issue for this here. I’m still getting used to Go, but it’s a been an interesting ride so far learning a complex application.