February 9, 2017
By: Paul Bostrom

Clojure memory leak in production

A few weeks ago I started seeing alerts that an abnormal number of instances of our API service were failing health checks. It's not unusual for the occasional instance to fail a health check, at which point it will be taken out of commission and replaced with a fresh instance. But having several instances fail in quick succession indicated there was a larger issue that would need to be investigated. I reviewed the activity history of our auto-scaling group, which is responsible for maintaining the health of our server cluster. I noticed that instances would be up for a couple of days before they began to fail health checks, and that every instance would eventually fail. This smelled like a memory leak to me, so I took a look at our CloudWatch metrics for JVM memory usage. All of our instances displayed similar patterns of memory consumption:

CloudWatch memory usage: memory leak

Based on this I was rather confident that we had a consistent memory leak. Fortunately, "consistent" is the key word here, because I didn't have to reproduce some complicated scenario to expose the leak. Every instance in production would consume, over the course of a couple of days, all of the available JVM memory (which maxes out at 4 GB in our config). This should be relatively straightforward to debug with a JVM profiler (I'll use the YourKit profiler). All of our production JVM processes are started with the YourKit agent enabled, so it's just a matter of tunneling to the agent's port with YourKit. First, I take the instance out of production because 1) I don't want to further deteriorate a production instance during my investigation and 2) I don't want the instance to be terminated by the auto-scaling group while I'm debugging. I open up YourKit and connect to the instance. Selecting the "Memory" view shows me more detail on the memory usage than is available from my CloudWatch metrics:

YourKit memory usage - leak

Using 3.4 GB of 3.9 GB available might be normal for some JVM workloads, but I know from experience that this is not normal usage for our application (later we'll look at our memory metrics once we've patched the problem). I need a greater level of detail about how my application is using memory, so I'll capture a memory snapshot using YourKit:

Capture Memory Snapshot

On my first attempt, I naively select the default option to "capture snapshot in YourKit Java Profiler format", but it almost immediately crashes the JVM. This option attempts to use the JVM's existing memory (what's left of it anyway) during the process of capturing the snapshot, causing the JVM to run out of memory. I'm going to need to use the second option: "Capture snapshot in HPROF format via JVM built-in dumper". I choose this option, then wait a while as YourKit captures and transfers a 3+ GB heap dump. Once I've loaded the memory dump, I click on "Biggest objects (dominators)" to see if there are any smoking guns, and right away I see the culprit:

Clojure var memoized fn

This shows that there is a clojure.lang.Var containing a memoized fn which is using up 2.4 GB of heap space. This is probably not ideal. Clicking on the "Object explorer" view I can inspect the namespace and symbol name of this Var:

Clojure var memoized fn

Now that I've identified the source of the memory leak, I have enough information to debug this memoized function locally. Specifically, I want to determine what range of arguments this function receives under a normal load. It turns out that the code was organized in such a way that almost every function call used a unique set of arguments, so the cache that memoize keeps internally (one entry per distinct argument list) grew without bound. A review of the code led us to the conclusion that memoizing the function was unnecessary and did not provide any apparent performance gains. Once we patched the code in production, our memory consumption stabilized at around 1.3 GB:

CloudWatch memory usage: healthy

YourKit memory usage - patched
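
To make the failure mode concrete, here is a minimal sketch of how a memoized function behaves when nearly every call has unique arguments. It is not our production code: `describe-request` is a hypothetical stand-in for the leaking function, and `memoize-with-visible-cache` is a small helper that follows the same caching strategy as `clojure.core/memoize` (an atom holding a map keyed by the argument list) but also hands back the atom so its size can be inspected.

```clojure
;; Mirrors the caching strategy of clojure.core/memoize, but returns the
;; cache atom alongside the wrapped fn so its size can be inspected.
(defn memoize-with-visible-cache [f]
  (let [cache (atom {})]
    {:cache cache
     :call  (fn [& args]
              (if-let [entry (find @cache args)]
                (val entry)
                (let [ret (apply f args)]
                  (swap! cache assoc args ret)
                  ret)))}))

;; Hypothetical stand-in for the leaking function: the first argument is
;; different on almost every call, so cache hits are rare.
(defn describe-request [request-id payload]
  (str request-id ":" (count payload)))

(def tracked (memoize-with-visible-cache describe-request))

;; Simulate 10,000 calls that each carry a unique request id.
(doseq [id (range 10000)]
  ((:call tracked) id "payload"))

(println "cached entries:" (count @(:cache tracked)))
;; prints "cached entries: 10000"
```

Every unique argument list adds an entry that is never evicted, and because the cache is reachable from a top-level Var, the garbage collector can never reclaim it.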

Conclusion

The most important lesson is that you must be careful when memoizing function calls in Clojure. Carefully review any memoized functions and ensure that the set of all possible inputs is both bounded and small (relative to your overall memory usage). Beyond that, debugging with a JVM profiler like YourKit is fairly straightforward.
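
If a function's inputs can't be guaranteed to be bounded but caching is still worthwhile, one option is to cap the cache size. The following is a rough sketch under that assumption, not what we deployed (we simply removed the memoization). `bounded-memoize` and its `max-entries` parameter are hypothetical names, and the eviction policy is deliberately crude: the whole cache is discarded once it is full. Libraries such as `clojure.core.memoize` offer more refined policies like LRU and TTL.

```clojure
;; Hypothetical helper: memoizes f but never holds more than max-entries
;; results. When the cache is full, it is emptied before the new entry
;; is added (a crude policy, chosen here only for brevity).
(defn bounded-memoize [f max-entries]
  (let [cache (atom {})]
    {:cache cache
     :call  (fn [& args]
              (if-let [entry (find @cache args)]
                (val entry)
                (let [ret (apply f args)]
                  (swap! cache
                         (fn [m]
                           (assoc (if (>= (count m) max-entries) {} m)
                                  args ret)))
                  ret)))}))

(def bounded (bounded-memoize (fn [n] (* n n)) 100))

;; 10,000 unique inputs, as in the leaking scenario.
(doseq [n (range 10000)]
  ((:call bounded) n))

;; The cache size stays at or below the configured limit of 100.
(println "cached entries:" (count @(:cache bounded)))
```

The cache is exposed here only so its size can be checked; a real wrapper would likely keep it private.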


Tags: production profiling memory Clojure JVM