It’s been a long time since I was able to run a repair on my Cassandra cluster. Basically, since I went to 1.2, it just hasn’t been possible. And since repairs in Cassandra are pretty much a requirement for normal operation, this is clearly a problem. So, to deal with the disarray that is Cassandra repairs in 1.2, I found a script originally written by Matt Stump and edited to work with virtual nodes (vnodes) by Brian Gallew. The tl;dr is that the script breaks a repair down into manageable chunks, which allows the repair to finish. It is available here.
The way this script works (and I highly recommend reading the project README for a full explanation) is that it gets the primary range for a node and breaks it up into smaller sub-ranges. Then, instead of running the repair on the full range like a normal repair, it runs the repair on each of the smaller sub-ranges. This works for a variety of reasons: it dodges timeout/latency issues, and it helps in dense node situations where large Merkle trees are required to properly check consistency. It is also useful in Analytics and Solr node setups where there is only a single token per node (also known as num_tokens: 1), since it just breaks the range down into more manageable chunks.
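To make the idea concrete, here is a minimal sketch of the sub-range splitting in Python. It is not the script's actual code, just an illustration of the technique. The function names `split_range` and `repair_command` are made up for this example, the keyspace name is a placeholder, and it assumes integer tokens (as with Murmur3Partitioner), a range that does not wrap around the ring, and a `nodetool` that accepts the `-st` and `-et` start/end token options.

```python
def split_range(start, end, steps):
    """Split the token range (start, end] into contiguous sub-ranges.

    Assumes integer tokens and a non-wrapping range (start < end).
    Returns a list of (sub_start, sub_end) tuples that together cover
    exactly the original range.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    if start >= end:
        raise ValueError("this sketch does not handle wrapping ranges")

    width = end - start
    # Never create more sub-ranges than there are tokens in the range.
    steps = min(steps, width)

    # Integer arithmetic keeps the boundaries exact even for 64-bit tokens.
    bounds = [start + (width * i) // steps for i in range(steps + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def repair_command(keyspace, start, end):
    """Build (but do not run) a nodetool command for one sub-range.

    Assumption: the installed nodetool supports -st / -et.
    """
    return ["nodetool", "repair", keyspace, "-st", str(start), "-et", str(end)]


if __name__ == "__main__":
    # Hypothetical primary range and keyspace, for illustration only.
    for sub_start, sub_end in split_range(0, 1000000, 4):
        print(" ".join(repair_command("my_keyspace", sub_start, sub_end)))
```

Each printed command covers a quarter of the original range, so each repair builds a Merkle tree over far less data. The real script does more than this (it discovers the node's ranges for you, for one thing), so check the README before relying on any of the details above.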
Examples of how to use the script are in the README. Hope this is helpful.
Update: The link now points to Brian Gallew’s version on GitHub, since he was kind enough to accept my pull request.