Troubleshooting Application Pauses

Performance troubleshooting and engineering project for a US Financial Software Company

Project Synopsis

A fast-growing financial software company based in the US had a problem with the production instance of its application: after running for 10 minutes, the application would hang for 4 to 8 seconds. While the application was in this state, no user could perform any operation on it. This was a growing concern for the company's customer services group as well as its technology team. The in-house technical team had tried multiple fixes, but none of them resolved the problem.

The client needed urgent help in dealing with the following challenges:

  • Analyzing the root cause of the application pauses
  • Providing a quick fix for the problem
  • Ensuring that customers no longer face application hangs

Rare Mile Solution

Rare Mile spent a few days analyzing the problem and found that the pauses occurred because the perm gen space of the JVM was filling up and forcing a full garbage collection (full GC). While the full GC was in progress, the JVM paused all running threads until it had freed some perm gen space. This usually took 4 to 5 seconds, and during this time the application appeared to be paused. We also found that, besides the perm gen, both survivor spaces in the JVM heap sometimes filled up and triggered a full GC, leading to similar pauses.
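The kind of observation described above can be approximated from inside the JVM with the standard java.lang.management API. The following is a minimal sketch, not the tooling Rare Mile used: the class name PermGenWatch and its helper methods are hypothetical, and pool names differ between JVM versions and collectors (HotSpot reports a "Perm Gen" pool before Java 8 and "Metaspace" from Java 8 onward).

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists memory pools and GC activity for the running JVM.
public class PermGenWatch {

    // True for pools that hold class metadata. The name match is an
    // assumption: "Perm Gen" on older HotSpot JVMs, "Metaspace" on Java 8+.
    public static boolean isClassMetadataPool(String poolName) {
        String name = poolName.toLowerCase();
        return name.contains("perm gen") || name.contains("metaspace");
    }

    // One line per memory pool, flagging the class metadata pool.
    public static List<String> describePools() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            String tag = isClassMetadataPool(pool.getName()) ? " [class metadata]" : "";
            lines.add(String.format("%-26s used=%d KB%s",
                    pool.getName(), usage.getUsed() / 1024, tag));
        }
        return lines;
    }

    // Sum of collections across all collectors; -1 (undefined) is skipped.
    public static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (String line : describePools()) {
            System.out.println(line);
        }
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": collections=" + gc.getCollectionCount()
                    + ", total time=" + gc.getCollectionTime() + " ms");
        }
    }
}
```

Sampling this output periodically shows whether the class metadata pool keeps growing and whether the collection time of the old-generation collector jumps at the moments users report a hang.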

Under normal circumstances, once an application has run for a few hours or days, its perm gen space stabilizes because it has loaded all the class definitions the application needs. An interesting pattern we observed for this client was that, even after running for days, the perm gen space never stabilized and in fact grew at a very brisk pace. After analyzing the application code, we found the root cause in a class that dynamically generated new classes on every user request. These dynamically generated classes had cryptic names, which led the JVM to load a new definition every time. As a result, the perm gen never got a chance to stabilize.
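The case study does not say how the faulty code was changed. One common remedy for this pattern is to generate each class once under a stable key and reuse it on later requests. The sketch below illustrates that idea only: GeneratedClassCache, the key "pricing-rule-v1", and the stand-in generator are all hypothetical, and a real generator would emit bytecode and define a class instead of returning an existing one.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical cache: runs the (expensive) class generator once per stable key.
public class GeneratedClassCache {

    private final ConcurrentMap<String, Class<?>> cache = new ConcurrentHashMap<>();
    private final Function<String, Class<?>> generator;
    private final AtomicInteger generations = new AtomicInteger();

    public GeneratedClassCache(Function<String, Class<?>> generator) {
        this.generator = generator;
    }

    // Returns the cached class for the key, generating it only on first use.
    public Class<?> classFor(String stableKey) {
        return cache.computeIfAbsent(stableKey, key -> {
            generations.incrementAndGet();
            return generator.apply(key);
        });
    }

    // How many times the generator actually ran.
    public int generationCount() {
        return generations.get();
    }

    public static void main(String[] args) {
        // Stand-in generator (assumption): returns an existing class.
        GeneratedClassCache cache = new GeneratedClassCache(key -> StringBuilder.class);

        // Simulate 1000 user requests that all need the same generated class.
        for (int request = 0; request < 1000; request++) {
            cache.classFor("pricing-rule-v1");
        }
        System.out.println("generations: " + cache.generationCount());
    }
}
```

With a cache like this, the number of generated classes is bounded by the number of distinct keys, not by the number of requests, so the class metadata space can level off once every key has been seen.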

After fixing the faulty code, Rare Mile was able to eliminate the application pauses completely and bring the perm gen growth under control. The JVM heap sizes also had to be recalculated and reconfigured to ensure that the survivor spaces had enough room to avoid full GCs.
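The actual heap settings chosen for this client are not given here. As a rough illustration of the arithmetic behind survivor sizing, HotSpot's -XX:SurvivorRatio=N option sets the ratio of eden to a single survivor space, so the young generation is divided into N + 2 equal parts, with one part for each of the two survivor spaces. The sketch below assumes that model; the class name SurvivorSizing and the 600 MB young generation are hypothetical, and collectors with adaptive sizing may adjust these values at runtime.

```java
// Hypothetical calculator for survivor space size under a fixed SurvivorRatio.
public class SurvivorSizing {

    // Size of ONE survivor space: young generation / (SurvivorRatio + 2).
    public static long survivorBytes(long youngGenBytes, int survivorRatio) {
        if (youngGenBytes <= 0 || survivorRatio < 1) {
            throw new IllegalArgumentException("sizes and ratio must be positive");
        }
        return youngGenBytes / (survivorRatio + 2);
    }

    // Eden is whatever remains after the two survivor spaces.
    public static long edenBytes(long youngGenBytes, int survivorRatio) {
        return youngGenBytes - 2 * survivorBytes(youngGenBytes, survivorRatio);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long youngGen = 600 * mb; // assumed young generation size
        int[] ratios = {8, 6, 4};
        for (int ratio : ratios) {
            System.out.println("SurvivorRatio=" + ratio
                    + " survivor=" + survivorBytes(youngGen, ratio) / mb + " MB"
                    + " eden=" + edenBytes(youngGen, ratio) / mb + " MB");
        }
    }
}
```

For a 600 MB young generation, ratios of 8, 6, and 4 give survivor spaces of 60, 75, and 100 MB each. A lower ratio leaves more room for objects to age in the survivor spaces, at the cost of a smaller eden.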

Project Highlights

  • Completely removed the application pauses problem
  • Optimized code for better performance
  • Completed the full analysis and code fix in 3 weeks
  • Executed the assignment in a contingency model with outcome-linked pricing

About The Project

Bail out of the largest sale day for a UK based eCommerce company

Technologies Used

Java Backend
Adobe Flex
Drools Based Rules Engine

Client Details