Debugging Production Performance Issues With The Power Of The Thread Dump Using jstack
Until recently I had never had to do much performance tuning on production systems, and whenever I had, I had always been able to replicate the problem in the development environment. Recently I ran across a problem that was happening in one production environment but not in another. You see, we have our live production site and our backup site. Our backup site used to be our production site, but we moved to the new environment a few months ago. Users started to get timeouts, instead of 401s, when they tried to access pages that they were not authorised to view.
I was at a loss as to how to debug this issue. However, one of the other developers showed me how to use thread dumps to see what was going on in the server at the time the problem was happening. jstack comes with the JDK. It prints a stack trace for every thread in a particular JVM. We went to one of the offending pages and, while we were waiting for the page to load, ran jstack repeatedly, sending the output to different files, which we then compared to see if any of our code was causing the problem.
Our servers run Linux, so we ran the following command to get our application server's process ID.
ps -ef | grep java
The output was something like this:
developers 3456 1 41 07:11 pts/0 00:00:16 /usr/lib/jvm/java-6-sun-126.96.36.199/bin/java -cp /home/developers/glassfish3/glassfish/modules/glassfish.jar -XX:+UnlockDiagnosticVMOptions -XX:MaxPermSize=192m -XX:NewRatio=2 -Xmx512m -javaagent:/home/developers/glassfish3/glassfish/lib/monitor/flashlight-agent.jar -client -Dfelix.fileinstall.disableConfigSave=false -Djavax.net.ssl.keyStore=/home/developers/glass....
Here 3456 is the process ID. We then used jstack to take a series of thread dumps, like so:
jstack -l 3456 > ~/tdump1.txt
jstack -l 3456 > ~/tdump2.txt
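Typing the command by hand while a page is hanging gets fiddly, so a small loop can help. The following is only a sketch of how the same jstack call could be repeated at a fixed interval. The function name take_dumps, its defaults, and the JSTACK override variable are my own inventions for illustration, not something we actually used.

```shell
#!/bin/sh
# Take COUNT thread dumps of process PID, INTERVAL seconds apart,
# writing each one to OUTDIR/tdump<N>.txt.
# Hypothetical helper: the name, argument order and defaults are assumptions.
take_dumps() {
    pid=$1
    count=${2:-5}
    interval=${3:-2}
    outdir=${4:-$HOME}
    # JSTACK may point at a specific jstack binary if it is not on the PATH
    jstack_cmd=${JSTACK:-jstack}

    i=1
    while [ "$i" -le "$count" ]; do
        # -l adds lock information, as in the commands above
        "$jstack_cmd" -l "$pid" > "$outdir/tdump$i.txt" || return 1
        # no need to wait after the last dump
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
}
```

With the process ID from above, something like take_dumps 3456 5 2 ~ would write tdump1.txt through tdump5.txt to the home directory, two seconds apart.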
We took more than two dumps, and from them we were able to see where our code was stuck. We were doing something silly: catching the exception caused by the 401 and emailing it. We changed the code to write to a log file instead. This problem was part of the cause of some of our pages disappearing from the Google index.
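We compared the dump files by eye, but the comparison can be roughed out with standard tools. The sketch below counts how many dumps each stack frame from a given package appears in, on the theory that a frame showing up in every dump is a good candidate for where a thread is stuck. The function name common_frames and the com.example package prefix are placeholders, not our real code.

```shell
#!/bin/sh
# Print each stack frame that starts with PREFIX, preceded by the number of
# dump files it appears in, most frequent first.
# Hypothetical helper: the name and the package prefix are assumptions.
common_frames() {
    prefix=$1
    shift
    for f in "$@"; do
        # jstack frames look like "	at package.Class.method(File.java:NN)";
        # keep each distinct matching frame once per dump
        grep -F "at $prefix" "$f" | sed 's/^[[:space:]]*//' | sort -u
    done | sort | uniq -c | sort -rn
}
```

Running common_frames com.example ~/tdump*.txt would list the application's own frames, with the ones present in the most dumps at the top.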