I was debugging a memory leak issue for our Go application recently. The complexity of the service is moderate, and what I originally intended to do is just bake testing – throwing requests at the service at constant concurrency level and observe the memory footprint of the application as time goes by. To my surprise, the resident memory application grew almost linearly against time – running for 5 minutes, I saw the memory size to be around 160MB; running for 10 minutes it almost got me 320 MB. This was in sharp contrast to the 40 MB from author’s original test results.
Apparently memory was leaking, and the leak highly likely happened along with requests. But where? The application has the mechanism to dump the memory profile into a file and that’s what the service’s author used to capture in-use memory footprint conveniently when he was doing the development. The memory profile in a visual format looks like the following.
I tried to dig something out from such graph, but only to find that in that 320MB memory profile, almost every allocation point was occupying twice the amount of memory as its corresponding point in the 160MB memory profile, which means leaking is everywhere!
Strictly speaking, it is not easy for a language that has garbage collection mechanism such as Go, to leak memory like C/C++. Basically you need keeping references to memory objects and let them not go out of scope, and keep doing this all the time, which is not easy to hide. The memory profile report can tell me that “Line X in file Y has allocated 80MB worth of object that are in use at the instant you took the snapshot”, which you know is not right, but it cannot tell you why it is being kept and who is (annoyingly) keeping them.
I spent hours tracking down the usage of a few suspicious allocation points but got no clue on where the leak could be. Having made quite a few random changes and got no luck rerunning test, I came across a post saying “goroutines can pile up if they hang and cause memory leaking”. Sounds reasonable. So I added this line to code which would make stack traces of all existing goroutines to be printed to console when the application exits.
pprof.Lookup("goroutine").WriteTo(os.Stdout, 1)
I reran the test, was able to see the stack traces this time, and the culprit was immediately clear – there are thousands of clones of one goroutine, all blocked on the same line of code, because of writing to a channel! Further analysis showed that because of a miss consideration, the routine that is supposed to read that channel almost always returns earlier than the routine writing to the channel, and it happens for every request. So a thousand requests leave behind a thousand hanging goroutines, and each such goroutine holds on to a variety of objects. That’s why like mentioned above, it looks like leaking was happening everywhere.
Diving deeper, I found that the application’s memory footprint at the designed concurrency level is just around 10MB, far smaller than 40MB claimed by the original author. A likely reason for this is he was testing unknowing the memory leak, but he terminated the test with fewer requests.
The whole process was tiresome, but fun. Only when you finally find the root cause can you connect the dots and realize all traces were pointing to this hidden root cause, like you finally figured out all mysteries at the end of a thriller movie.