I am surprised by the high number of Java developers I meet who do not know what a Java Thread Dump is or how to generate one. I find it a very powerful tool, and it is always available as part of the JVM. I haven't played much with Java 5 yet, but it comes with jstack, a new tool that makes it easier to generate thread dumps.
Earlier this year, I was working on a load test for a well-known airline. We were tuning the environment all we could, monitoring and profiling to know where to focus our optimization efforts. The solution involved a fairly high stack: Apache httpd, WebSphere, FuegoBPM, Tibco messaging, Oracle RAC.
The system was holding the load pretty well up to a certain point, at which it immediately halted and stopped processing new requests. Every time we ran the load-testing scripts we experienced the same symptoms. Not even the official testers (with their allegedly powerful testing and monitoring tools) were able to identify the cause of the problem.
So, I decided to get a few Thread Dumps of WebSphere's JVM. On Unix, you do "kill -3 <pid>" and the dump goes to WebSphere's native_stdout.log. We inspected the dumps but couldn't identify deadlocks or any other obvious anomaly, although the answer was right before our eyes.
Since the thread dump was daunting, I decided to spend a bit of my time doing some fun work: I wrote a short Ruby script to create a graphical representation of the dump, showing the locks each thread was holding and the locks each thread was waiting on. The heavy work of actually drawing and laying out the graph was left to GraphViz's dot.
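If you are curious about the approach, the core idea fits in a few lines. What follows is a minimal sketch, not the actual tdg.rb: the regular expressions are my assumptions about the dump format of the JVM we were using (the "- locked <0x...>" lines appear in the stack trace below; the "waiting to lock" variant is from memory), and the styling is arbitrary.

#!/usr/bin/env ruby
# Minimal sketch of a thread-dump-to-dot converter (not the real tdg.rb).
# Reads a thread dump from stdin or from the files given as arguments and
# writes GraphViz dot source to standard output.

current = nil                             # name of the thread section being parsed
holds   = Hash.new { |h, k| h[k] = [] }   # thread name => lock ids it holds
waits   = {}                              # thread name => lock id it is blocked on

ARGF.each_line do |line|
  case line
  when /^"([^"]+)"/                                   # thread header: "Thread name" ...
    current = $1
  when /-\s+locked\s+<(0x[0-9a-f]+)>/                 # - locked <0x...> (a some.Class)
    holds[current] << $1 if current
  when /-\s+waiting (?:to lock|on)\s+<(0x[0-9a-f]+)>/ # - waiting to lock <0x...>
    waits[current] = $1 if current
  end
end

# Threads become boxes, locks keep dot's default ellipse shape; a solid edge
# means "holds" and a dashed edge means "waits for".
puts 'digraph threads {'
puts '  node [fontsize=10];'
(holds.keys | waits.keys).each do |thread|
  puts "  \"#{thread}\" [shape=box];"
end
holds.each do |thread, locks|
  locks.each { |lock| puts "  \"#{thread}\" -> \"#{lock}\" [label=\"holds\"];" }
end
waits.each do |thread, lock|
  puts "  \"#{thread}\" -> \"#{lock}\" [style=dashed, label=\"waits for\"];"
end
puts '}'

With a layout like this, a pile-up such as ours shows up as a big fan of dashed edges converging on a single lock node, plus one solid edge from the thread that holds it.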
Once the script was usable, I generated the graph for the above-mentioned dump. To our delight, the graph immediately exposed the problem: see for yourself.
Every thread was waiting on a lock that was held by thread Servlet.Engine.Transports 1405. What was this thread doing? Here's the stack, taken from the thread dump:
"Servlet.Engine.Transports : 1405" daemon prio=5 tid=0x020cee40
nid=0x515ea runnable [8648f000..864919c0]
at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:221)
at java.io.File.exists(File.java:680)
at sun.misc.URLClassPath$FileLoader.getResource(URLClassPath.java:887)
at sun.misc.URLClassPath.getResource(URLClassPath.java:157)
at java.net.URLClassLoader$1.run(URLClassLoader.java:191)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:187)
at com.ibm.ws.bootstrap.ExtClassLoader.findClass(ExtClassLoader.java:79)
at java.lang.ClassLoader.loadClass(ClassLoader.java:289)
- locked <0xb4c1be08> (a com.ibm.ws.bootstrap.ExtClassLoader)
at java.lang.ClassLoader.loadClass(ClassLoader.java:235)
...
It was accessing the filesystem! After sharing the findings with the rest of the team, one of the admins revealed that the WebSphere installation was on a mounted NFS drive... WebSphere's JVM was trying to reload some .jar files, and under heavy load that choked NFS.
Unfortunately, moving WebSphere to a non-NFS filesystem was not trivial (you know... a big company, with procedures, standards, bureaucracy), and since the managers were already happy with the results of the load test, we never had a chance to run it all again. I am still wondering how much load it would have withstood without NFS (and why was WebSphere reading those .jars so often to begin with?).
So, we found the bottleneck thanks to a thread dump, and a picture is worth a thousand lines of thread dumps :-)
Here is the quick and dirty script: tdg.rb
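If you only want the rendering step: once the dot source is in a file, say threads.dot, GraphViz turns it into a picture with "dot -Tpng threads.dot -o threads.png" (or -Tsvg / -Tps for other formats).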