Investigating Memory Leaks in Java

Recently, at work, I was handed a web service that had been largely forgotten. It was written in Java, and its only responsibility was to power a reporting console used by the operations team at the eCommerce company I work with.

The job was simple: one of the backend services that this service depends on was being deprecated, so it needed to be upgraded to use the V2 service instead, satisfying all existing use cases and requirements, and in a way that did not require any UI-side changes, since we were close to a sale and resolving dependencies on other teams efficiently is NP-complete.

While investigating the code, I realised the changes were simple enough, since there was a particularly interesting way in which this service was calling the deprecated service.

[Figure: Proxy API architecture]

This was accomplished using a JAX-RS Client class, something like the gist I created for this demo. Ignore error handling and consider only a GET call.
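The original gist isn't reproduced here, but a minimal sketch of that pattern, assuming a hypothetical proxy class and backend URL, looks roughly like this:

import javax.ws.rs.client.Client;
import javax.ws.rs.client.ClientBuilder;
import javax.ws.rs.client.WebTarget;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

public class ReportProxyResource {

    // A single shared Client instance, reused across requests.
    private final Client client = ClientBuilder.newClient();

    // Proxies a GET call to the deprecated backend service and
    // returns the raw response body to the reporting console.
    public String fetchReport(String reportId) {
        WebTarget target = client.target("http://legacy-backend.internal") // hypothetical base URL
                .path("reports")
                .path(reportId);

        Response response = target.request(MediaType.APPLICATION_JSON).get();
        return response.readEntity(String.class);
    }
}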

So, I thought of a concise solution like this.

[Figure: Updated proxy API architecture]

This would require no changes in the client as long as I could massage the responses properly and there was no loss of information. However, as soon as I tested this, it did not work in one go. A Jackson parsing error was thrown:

com.fasterxml.jackson.core.JsonParseException: Illegal character ((CTRL-CHAR, code 31)): only regular white space (\r, \n, \t) is allowed between tokens

I realised there must be some encoding on the new server that was causing the issue, and found out that the new server always sends a gzip-encoded stream. I am new to the Java stack, so I did what we are all taught to do best, searched for the error, and to no surprise I got this result:

https://stackoverflow.com/questions/17834028/what-is-the-jersey-2-0-equivalent-of-gzipcontentencodingfilter

And so I made a little change like this.
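In rough terms, the change amounted to registering Jersey's GZipEncoder on the WebTarget inside the per-request path, something like this (reusing the hypothetical names from the earlier sketch; the V2 base URL is made up):

import org.glassfish.jersey.message.GZipEncoder;

public String fetchReport(String reportId) {
    WebTarget target = client.target("http://backend-v2.internal") // hypothetical V2 base URL
            .path("reports")
            .path(reportId);

    // Register the GZip encoder so the gzip-encoded V2 responses get decoded.
    target.register(GZipEncoder.class);

    Response response = target.request(MediaType.APPLICATION_JSON).get();
    return response.readEntity(String.class);
}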

Voila!! It works!! All tests pass. Great!

I guessed that was the end of the story, until, a few hours after deployment, I received an alert email. A health check on one of the instances was failing. My first thought was that the hardware had probably failed, since there had been multiple such occasions in the recent past. I thought I could just quickly check and restore my peace of mind. However, the machine was up; something else had caused the process to die. Another instance died a few minutes later. PANIC!!!!

Okay! huh… No other changes except those few lines had been made in the code base recently. Did I cause this? I looked for a heap dump but realised that the process does not run with -XX:+HeapDumpOnOutOfMemoryError. I checked the metrics on Grafana. It seemed like there was significant heap usage on all machines. So I made the change to dump the heap on OOM and redeployed all instances, so that if this happened again I would have someplace to begin an investigation.

Since I was sceptical about the recent changes, I started my local development build and used Postman to run 1000 iterations of the API that contained those changes, while monitoring JVM heap usage.

[Figure: Heap usage showing the memory leak]

This looked like a memory leak, but to be sure it was the last change, I reverted to the previous commit and re-ran the test. And this time…

[Figure: Heap usage after reverting the change]

At this point it was quite clear which code path was causing the issue. But to know where exactly all that memory was leaking, I took a heap dump. Just loading this heap dump in MAT and generating a memory leak suspects report looked like this:

[Figure: Leak suspects report]

This report needed no investigation and told us which class and which corresponding field occupied ~90% of the heap, uugghhhh!!! Something was wrong. So I downloaded the source for JerseyClient. This class does contain what was mentioned in the leak suspects report:

private final LinkedBlockingDeque<ShutdownHook> shutdownHooks = new LinkedBlockingDeque<ShutdownHook>();

I clearly understood that, due to some incorrect usage, I was somehow adding elements to this deque but never removing them. So I looked in the source for anything that clears or removes elements from this deque. However,

void registerShutdownHook(final ShutdownHook shutdownHook) {
    checkNotClosed();
    shutdownHooks.push(shutdownHook);
}

was the only function that seemed to add ShutdownHooks to the deque. It looked like this deque holds things that need to be closed when the application, or the parent object, dies.

I quickly added a breakpoint here and curled the API to validate that every API call adds an element to this deque. These objects probably never get garbage collected in the application's lifetime, hence the memory leak.

Since the deque was a private member of the class and there were no functions that called remove or clear on it, it was apparent that this was due to incorrect usage of some API rather than a bug in JerseyClient. After a long session of debugging, tracing the origin and following the stack trace, I realised it was this line of code:

target.register(GZipEncoder.class);

Every time any modification is made to a ClientConfig, a new ClientRuntime object is created and added as a shutdown hook to the deque. This also happens when we register on the WebTarget, as I was doing here.

Alright, but this was required, because I knew I needed GZip encoding support for the new service call. Hmmm… I was clearly missing some key information about how the Jersey Client is meant to be used. One approach to eradicate the issue could be to create a client on each API call. That would solve the leak, since every client created would get garbage collected when required, but it would end up creating a lot of temporary objects. Also, the official documentation of JerseyClient mentions that an instance of JerseyClient should be reused:

Clients are heavy-weight objects that manage the client-side communication infrastructure. Initialization as well as disposal of a Client instance may be a rather expensive operation. It is therefore advised to construct only a small number of Client instances in the application. Client instances must be properly closed before being disposed to avoid leaking resources.

http://javadox.com/javax.ws.rs/javax.ws.rs-api/2.0-m14/javax/ws/rs/client/Client.html

After some time browsing through the Jersey source, I realised that both the target and the client implement the same interface that exposes the register function, so I modified the code to register this class once on the client while initialising it:

client.register(GZipEncoder.class); instead of target.register(GZipEncoder.class);
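Keeping with the hypothetical sketch from before, the initialisation then ends up looking roughly like this; every WebTarget derived from the client inherits the registration, so the per-request path no longer touches the configuration:

import javax.ws.rs.client.Client;
import javax.ws.rs.client.ClientBuilder;
import org.glassfish.jersey.message.GZipEncoder;

public class ReportProxyResource {

    // GZipEncoder is registered once, while building the shared Client.
    // No register() call remains in the per-request path, so the
    // shutdownHooks deque no longer grows with every API call.
    private final Client client = ClientBuilder.newClient()
            .register(GZipEncoder.class);
}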

I rebuilt the app, restarted the service and re-ran the tests. registerShutdownHook was called just once, the breakpoint no longer hit on every API call, and the snapshot of heap usage was back to normal.

I learned that the places where we try to cut a corner or do a quick fix usually end up consuming all the time we set out to save. Secondly, the usage of libraries does not come for free; one might need to pay special attention to how a library is used as well. Overall, this was quite some fun.

Later, I found a few articles like this one (https://blogs.oracle.com/japod/how-to-use-jersey-client-efficiently), which, despite its title being how to use the Jersey client efficiently, is actually about how not to use the JerseyClient.

Also, at the time of writing this blog, I realised that in newer versions of JerseyClient, the declaration of shutdownHooks has been changed from

private final LinkedBlockingDeque<ShutdownHook> shutdownHooks

to

private final LinkedBlockingDeque<WeakReference<JerseyClient.ShutdownHook>> shutdownHooks 

Now I can understand why this change was required. I guess it was not just me !! 😀
