Debugging Go Program Memory Leak

I was debugging a memory leak in our Go application recently. The service is moderately complex, and what I originally intended was just a bake test: throwing requests at the service at a constant concurrency level and observing the application's memory footprint over time. To my surprise, the application's resident memory grew almost linearly with time: after running for 5 minutes it was around 160 MB, and after 10 minutes it was almost 320 MB. This was in sharp contrast to the 40 MB from the author's original test results.

Apparently memory was leaking, and the leak most likely happened along with requests. But where? The application has a mechanism to dump its memory profile to a file, which is what the service's author used to conveniently capture the in-use memory footprint during development.
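I do not have the service's actual dump code at hand, but a minimal sketch of such a mechanism, built on Go's standard runtime/pprof package, looks like this:

package main

import (
	"os"
	"runtime"
	"runtime/pprof"
)

// dumpHeapProfile writes the in-use heap profile to path for later
// inspection with `go tool pprof`.
func dumpHeapProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	runtime.GC() // flush recently freed objects out of the in-use numbers
	return pprof.WriteHeapProfile(f)
}

func main() {
	if err := dumpHeapProfile("heap.out"); err != nil {
		panic(err)
	}
}

The resulting file can then be rendered as a call graph with go tool pprof. In such a visual format, the memory profile looks like the following.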

I tried to dig something out of the graph, only to find that in the 320 MB memory profile almost every allocation point was occupying twice the amount of memory as its counterpart in the 160 MB profile, which meant the leak was everywhere!

Strictly speaking, it is not easy for a garbage-collected language like Go to leak memory the way C/C++ can. Basically, you need to keep references to objects so that they never go out of scope, and you need to keep doing that all the time, which is hard to do without noticing. The memory profile report can tell me that "line X in file Y has allocated 80 MB worth of objects that are still in use at the instant you took the snapshot", which you know is not right, but it cannot tell you why those objects are being kept alive or who is (annoyingly) keeping them.
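In other words, to leak in Go you almost have to do it on purpose. A contrived sketch (not code from the service) of the classic shape, a long-lived structure that only ever grows:

package main

import "fmt"

// cache lives as long as the process does; anything added here stays
// reachable, so the garbage collector can never reclaim it.
var cache = map[string][]byte{}

func handle(key string, payload []byte) {
	cache[key] = payload // grows forever unless something evicts entries
}

func main() {
	for i := 0; i < 1000; i++ {
		handle(fmt.Sprintf("req-%d", i), make([]byte, 1<<20)) // +1 MB per call
	}
	fmt.Println("entries held:", len(cache))
}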

I spent hours tracking down the usage of a few suspicious allocation points but got no clue about where the leak could be. After making quite a few speculative changes and having no luck rerunning the test, I came across a post saying that goroutines can pile up if they hang, and cause memory leaks. That sounded plausible. So I added the following line, which prints the stack traces of all live goroutines to the console when the application exits.

pprof.Lookup("goroutine").WriteTo(os.Stdout, 1) // needs the "os" and "runtime/pprof" imports

I reran the test and was able to see the stack traces this time, and the culprit was immediately clear: there were thousands of clones of one goroutine, all blocked on the same line of code, writing to a channel! Further analysis showed that, because of a design oversight, the goroutine that was supposed to read from that channel almost always returned earlier than the goroutine writing to it, and this happened on every request. So a thousand requests left behind a thousand hanging goroutines, and each such goroutine held on to a variety of objects. That is why, as mentioned above, it looked like the leak was happening everywhere.
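Distilled to its essence, the bug looked something like the sketch below (hypothetical names and timings; the real code was more involved):

package main

import (
	"fmt"
	"runtime"
	"time"
)

type result struct{ value int }

func doWork() result {
	time.Sleep(50 * time.Millisecond) // slower than the reader's patience
	return result{42}
}

func handleRequest() {
	results := make(chan result) // unbuffered: a send blocks until a receive happens

	go func() {
		results <- doWork() // blocks forever once the reader below has given up
	}()

	select {
	case <-results:
		// normal path: the reader got the value in time
	case <-time.After(10 * time.Millisecond):
		// the reader bails out early; the sender goroutine is now stuck
		// for good, pinning every object it references
	}
}

func main() {
	for i := 0; i < 100; i++ {
		handleRequest()
	}
	time.Sleep(100 * time.Millisecond)
	fmt.Println("goroutines still alive:", runtime.NumGoroutine()) // about 101, not 1
}

The usual fix is a buffered channel, make(chan result, 1): the send then always completes, the abandoned goroutine exits, and the unread value is simply garbage collected.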

Digging deeper, I found that the application's memory footprint at the designed concurrency level is just around 10 MB, far smaller than the 40 MB claimed by the original author. A likely explanation is that he was testing without knowing about the leak, and simply ended his test after fewer requests.

The whole process was tiresome, but fun. Only when you finally find the root cause can you connect the dots and realize that all the traces were pointing to it, like figuring out all the mysteries at the end of a thriller movie.

Understanding asynchronous programming with an example in Guava

Asynchronous programming is not easy to understand. Even after working in Java for many years, quite a few junior and even mid-level software engineers are still confused about how to use it and what the word asynchronous really implies (the literal definition is easy to memorize, but not useful if you do not deeply understand the inner workings). Here is a snippet of code (see the bottom of this post) that I wrote with Guava to demonstrate the difference between synchronous and asynchronous programming.

First we have SquareRpcService, which contains a method squareRPC: you give it an integer, and it computes the square of that integer and returns the result in the form of a future. The future is fulfilled after 1 second; the delay is achieved by sleeping a thread and has no real-life meaning. This method is a good stand-in for any asynchronous API that does not do its underlying work on the calling thread and is a little slow to produce a result.
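The full listing is at the bottom of the post; here is a minimal sketch of what such a service can look like (my reconstruction, not necessarily the author's exact code):

import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.SettableFuture;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class SquareRpcService {
    // Worker threads that play the role of the "remote" side of the RPC.
    private final ExecutorService rpcWorkers = Executors.newCachedThreadPool();

    ListenableFuture<Integer> squareRPC(int x) {
        SettableFuture<Integer> future = SettableFuture.create();
        rpcWorkers.execute(() -> {
            try {
                Thread.sleep(1000); // the artificial 1-second delay
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            future.set(x * x); // fulfilled off the calling thread
        });
        return future;
    }
}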

Next, the SyncAsyncCompare class is where you will find the main function. There are two private methods. One is doCalcSync, inside which we call squareRPC, do a blocking wait to get the result, add 1 to it, and return the final value; this blocks whatever thread doCalcSync is invoked on. The other is doCalcAsync, inside which we also call squareRPC, but instead of blocking for the result we chain a callback onto the resulting future and return that future.
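Sketched the same way (assuming the imports above plus com.google.common.util.concurrent.Futures and java.util.concurrent.Executor), the two methods might look like:

static int doCalcSync(SquareRpcService svc, int x) throws Exception {
    ListenableFuture<Integer> f = svc.squareRPC(x);
    return f.get() + 1; // get() parks the current thread for ~1 second
}

static ListenableFuture<Integer> doCalcAsync(SquareRpcService svc, int x, Executor executor) {
    ListenableFuture<Integer> f = svc.squareRPC(x);
    // No blocking here: the +1 runs in a callback on `executor` once f is fulfilled.
    return Futures.transform(f, square -> square + 1, executor);
}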

Now look inside the main function at how we use doCalcSync and doCalcAsync to calculate results for 10 integers "concurrently". Both doCalcSync and doCalcAsync are invoked through a thread pool named executorService, which contains a single thread under the hood, and we use a stopwatch to measure the elapsed time of each approach.
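A sketch of that driver (again a reconstruction; Stopwatch is com.google.common.base.Stopwatch, and ListeningExecutorService/MoreExecutors come from com.google.common.util.concurrent):

public static void main(String[] args) throws Exception {
    ListeningExecutorService executorService =
            MoreExecutors.listeningDecorator(Executors.newSingleThreadExecutor());
    SquareRpcService service = new SquareRpcService();

    // Sync mode: each task blocks the single pool thread for a full second.
    Stopwatch watch = Stopwatch.createStarted();
    List<ListenableFuture<Integer>> sync = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
        final int x = i;
        sync.add(executorService.submit(() -> doCalcSync(service, x)));
    }
    Futures.allAsList(sync).get();
    System.out.println("Sync Mode: total elapsed time "
            + watch.elapsed(TimeUnit.MILLISECONDS) + " ms");

    // Async mode: each task merely fires the RPC and registers a callback.
    watch.reset().start();
    List<ListenableFuture<Integer>> async = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
        final int x = i;
        async.add(Futures.submitAsync(
                () -> doCalcAsync(service, x, executorService), executorService));
    }
    Futures.allAsList(async).get();
    System.out.println("Async Mode: total elapsed time "
            + watch.elapsed(TimeUnit.MILLISECONDS) + " ms");

    executorService.shutdown(); // in real code, shut down the service's pool too
}

Running it produces output like the following.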

Finished sync calc on thread 9, result is 1, start time is 101191, end time is 101192 s
Finished sync calc on thread 9, result is 2, start time is 101192, end time is 101193 s
Finished sync calc on thread 9, result is 5, start time is 101193, end time is 101194 s
Finished sync calc on thread 9, result is 10, start time is 101194, end time is 101195 s
Finished sync calc on thread 9, result is 17, start time is 101195, end time is 101196 s
Finished sync calc on thread 9, result is 26, start time is 101196, end time is 101197 s
Finished sync calc on thread 9, result is 37, start time is 101197, end time is 101198 s
Finished sync calc on thread 9, result is 50, start time is 101198, end time is 101199 s
Finished sync calc on thread 9, result is 65, start time is 101199, end time is 101200 s
Finished sync calc on thread 9, result is 82, start time is 101200, end time is 101201 s
Sync Mode: total elapsed time 10201 ms

Finished async calc on thread 9, result is 1, start time is 101201, end time is 101202 s
Finished async calc on thread 9, result is 2, start time is 101201, end time is 101202 s
Finished async calc on thread 9, result is 5, start time is 101201, end time is 101202 s
Finished async calc on thread 9, result is 10, start time is 101201, end time is 101202 s
Finished async calc on thread 9, result is 17, start time is 101201, end time is 101202 s
Finished async calc on thread 9, result is 26, start time is 101201, end time is 101202 s
Finished async calc on thread 9, result is 37, start time is 101201, end time is 101202 s
Finished async calc on thread 9, result is 50, start time is 101201, end time is 101202 s
Finished async calc on thread 9, result is 65, start time is 101201, end time is 101202 s
Finished async calc on thread 9, result is 82, start time is 101201, end time is 101202 s
Async Mode: total elapsed time 1025 ms

You will find that the doCalcSync approach takes 10 seconds while doCalcAsync takes only 1 second. Note that both approaches carry out all their work on the same executorService. So where does the difference come from?

The difference comes entirely from the blocking wait in ListenableFuture::get. In the doCalcSync case, get blocks the thread for 1 second, leaving it essentially idle yet unable to move on to the next task even though the remaining tasks are already waiting in the thread pool queue. And because the pool has only 1 thread, the execution of the 10 tasks becomes essentially sequential, so it takes roughly 10 * 1 = 10 seconds. In the doCalcAsync case, there is no blocking wait; we just chain a callback onto each future, which basically tells the thread:

Please move on, and as soon as that future is fulfilled, please remember to execute the callback in the same thread pool.

So all 10 tasks, each calling squareRPC once, are fired off in a blink by the one thread, and after 1 second that same thread harvests the 10 fulfilled futures in another blink. The overall elapsed time is therefore roughly just 1 second.

To make the two approaches similarly fast, we would need executorService to contain as many threads as there are concurrent tasks, which is obviously not a scalable option.

Someone once asked me what the magic is behind "as soon as that future is fulfilled, please remember to execute the callback in the same thread pool". Essentially, he was asking: who monitors the fulfillment of a future? It looks like magic, but it is much less magical once you read the implementation of the future class: a future object maintains internal state. When you call addListener to chain your callback, you are registering your function and executor into the future object. On the other end, when the future's producer later fulfills it and calls complete(), the registered callbacks are all submitted to their executors at once. That is how the blocking wait is saved: no one is monitoring or waiting; it is the future's fulfiller that pulls the trigger.
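To make that concrete, here is a toy version of the mechanism (a deliberately simplified sketch; Guava's real AbstractFuture is lock-free and far more careful, but the shape is the same):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executor;

final class TinyFuture<V> {
    private final List<Runnable> pending = new ArrayList<>();
    private boolean done;
    private V value;

    // Consumer side: register a callback plus the executor to run it on.
    synchronized void addListener(Runnable listener, Executor executor) {
        if (done) {
            executor.execute(listener); // already fulfilled: fire immediately
        } else {
            pending.add(() -> executor.execute(listener)); // parked in the future's state
        }
    }

    // Producer side: store the value, then hand every parked callback to
    // its executor. No thread ever polls or blocks waiting for this moment.
    synchronized void complete(V v) {
        value = v;
        done = true;
        for (Runnable r : pending) {
            r.run();
        }
        pending.clear();
    }

    synchronized V valueOrNull() {
        return done ? value : null;
    }
}

The listener list is literally part of the future object's state; complete() simply drains it into the registered executors, which is why nobody ever has to block.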