Recently, I had the opportunity to rework a 15,000-line code base in Go. The intention was never a total rewrite, merely an update to a neglected app that had hummed along for years with few updates, just doing what it was supposed to do. The exercise raised the question: how should I approach this? Should I just start rewriting sections of code? Should I profile everything and rewrite the components that performed poorly? Should I rewrite the entire application more cleanly? There are any number of ways to attack the problem. The way I chose was to let the application's instrumentation suggest the priorities.
The app in question is a web app written in Go. It works in the following fashion: a web request comes in, is validated, and is passed off to a handler. The handler munges the data, performs the relevant checks, and converts the request to a consistent internal format. From that format, a message is created and added to the queueing system to be passed down the rest of the data pipeline for processing and archiving. The instrumentation was done with Statsd and Datadog, but any instrumentation system that provides you with similar data will work. To fully understand what I did, it helps to understand the different data types statsd has to offer.
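As a refresher, statsd speaks a tiny plain-text protocol, and three of its data types matter here: counters, gauges, and timers. A minimal sketch of the wire format (the metric names are made up for illustration; a real client sends these lines over UDP):

```go
package main

import "fmt"

// statsd's plain-text wire protocol is "name:value|type":
// counters (|c) accumulate, gauges (|g) record a point-in-time value,
// and timers (|ms) record a duration in milliseconds.
func counter(name string, n int) string     { return fmt.Sprintf("%s:%d|c", name, n) }
func gauge(name string, v float64) string   { return fmt.Sprintf("%s:%g|g", name, v) }
func timing(name string, ms float64) string { return fmt.Sprintf("%s:%g|ms", name, ms) }

func main() {
	fmt.Println(counter("handlers.signup.calls", 1))  // handlers.signup.calls:1|c
	fmt.Println(gauge("runtime.goroutines", 42))      // runtime.goroutines:42|g
	fmt.Println(timing("handlers.signup.time", 12.5)) // handlers.signup.time:12.5|ms
}
```

Counters answer "how often," timers answer "how long," and gauges answer "how much right now" — which is exactly the split the rest of this exercise leans on.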
Step 1: Adding Instrumentation
The running version of the code had limited instrumentation: basically just some counters on errors and some timing on the handlers. So I went through and added counters to all the handlers, including all the error states. I added gauges, counters, and timing where appropriate for memory, garbage collection, and goroutine information. Then I put a single instance of the application back into production, primarily to ensure that the additional load created by the instrumentation wouldn't be excessive. After letting the application bake in, I went into Datadog and made a list of the handlers, ordered by call volume and average execution time.
Step 2: Deciding on Tasks
Once I had the information to work with, I looked at where the application was spending most of its time in the production environment. If one handler took 300ms per run and all the rest took less than 50ms, it would seem the 300ms handler should be the first thing to rewrite. The problem with that logic is that if the 300ms handler runs only a hundred times per minute while a 50ms handler runs a few hundred times per second, spending time cleaning up the longer-running handler isn't going to give you much bang for your buck.
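That trade-off is just arithmetic: a handler's aggregate cost is its average latency multiplied by its call rate. Plugging in the hypothetical numbers above makes the gap obvious:

```go
package main

import "fmt"

// totalCostMS returns the aggregate time a handler consumes per minute,
// in milliseconds: average latency times calls per minute.
func totalCostMS(avgMS, callsPerMinute float64) float64 {
	return avgMS * callsPerMinute
}

func main() {
	// The "slow" 300ms handler at 100 calls/minute:
	fmt.Println(totalCostMS(300, 100)) // 30000 ms consumed per minute
	// The "fast" 50ms handler at 300 calls/second (18,000 calls/minute):
	fmt.Println(totalCostMS(50, 300*60)) // 900000 ms consumed per minute
}
```

The 50ms handler eats thirty times more aggregate time than the 300ms one, so a 10% improvement to it is worth far more than a 10% improvement to the "slow" handler.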
Step 3: Profiling
Since I already knew how long each handler took to run, and now knew how frequently each handler was being used, it was much easier to decide where to start. While I could have gathered some of this via code profiling alone, there is a big difference between knowing where your app spends its CPU cycles in production and where it spends them on contrived calls in a controlled environment. But since I already knew which handlers I wanted to spend my time on, it was just a matter of deciding what to tackle within them. Here is where profiling came in handy: I ran a few URLs through the app to see where the most time was being spent, and that told me where my time would be best invested.
Now that I knew where the most time was spent and how much time was spent there, it was time to write some benchmarks. To start, I took the slowest-running and most frequently used methods and wrote benchmarks for them. Where there were improvements to be made, I made them until I was sufficiently happy with the performance. Then I rinsed and repeated until I felt there would be a noticeable difference in overall performance.
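Go's `testing` package makes this step cheap. Here `buildMessage` is a hypothetical stand-in for one of the hot conversion methods, not the app's actual code; the benchmark function would normally live in a `_test.go` file next to it and run with `go test -bench=. -benchmem`:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"testing"
)

// buildMessage is a hypothetical stand-in for a hot conversion method:
// it renders fields deterministically as "key=value;" pairs in sorted order.
func buildMessage(fields map[string]string) string {
	keys := make([]string, 0, len(fields))
	for k := range fields {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	for _, k := range keys {
		b.WriteString(k)
		b.WriteByte('=')
		b.WriteString(fields[k])
		b.WriteByte(';')
	}
	return b.String()
}

// Placed in a _test.go file, this runs via: go test -bench=. -benchmem
func BenchmarkBuildMessage(b *testing.B) {
	in := map[string]string{"user": "u1", "event": "signup"}
	for i := 0; i < b.N; i++ {
		buildMessage(in)
	}
}

func main() {
	fmt.Println(buildMessage(map[string]string{"user": "u1", "event": "signup"}))
	// event=signup;user=u1;
}
```

The `-benchmem` flag reports allocations per operation alongside ns/op, which is often the more useful number when the goal is reducing garbage-collection pressure.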
It is worth noting that the deeper into this I got, the more some of the improvements started to feel like micro-optimizations. Given that an entire tier of servers was each processing thousands of requests per second, attacking micro-optimizations after the big stuff wasn't a bad use of time. If the capacity required for this service can be reduced by 5–10%, you are still talking hundreds or thousands of dollars a month in savings, depending on the number of app servers. I also added better test coverage along the way to ensure that my changes didn't yield unexpected results. An additional benefit of this approach is that I now have a continuously updated set of data showing where the performance and utilization of the application actually stand. It also gives monitoring systems the ability to detect when something is a little off and notify you that things have changed.
While there are many ways to prioritize code rewrites, I find this methodology the most useful. It gives you plenty of insight into your application while building up a performance history. The final side effect (and arguably my favorite) is that it makes the ops people happy. No one likes unhappy ops people. And since they are typically the ones on the front lines, assessing how bad a problem is before alerting the developers, the more information they have at their disposal, the better.