Profiling is "easy" and "simple" in the Java world....
We began the week with severe CPU utilization on all of our boxes, even when transaction counts were low.
The application in focus is a Java-based web platform called Liferay, and it runs on several nodes (inside Docker).
The first step was to open the required JMX port and set the configuration so that we are able to connect remotely with a profiler (the example below is a simple one without SSL/authentication, for the sake of the demo):
-Dcom.sun.management.jmxremote=true
-Dcom.sun.management.jmxremote.port=5000
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
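One caveat worth calling out because our JVMs run inside Docker: with plain JMX the RMI layer picks a second random port and advertises the container's internal hostname, so a remote profiler often cannot attach. A minimal sketch of the extra flags we would expect to need (reusing port 5000 from the example above; the hostname is a placeholder for your externally reachable address):
-Dcom.sun.management.jmxremote.rmi.port=5000
-Djava.rmi.server.hostname=<externally-reachable-host>
That port then needs to be published on the container (e.g. docker run -p 5000:5000 ...), after which VisualVM can attach via "Add JMX Connection" using <externally-reachable-host>:5000.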
I decided to use VisualVM since it's free and simple to use (from past experience).
Since the CPU was high pretty much all the time, I connected to the remote machine, attached to the relevant PID, and ran a live CPU profiling session; this is what I saw (sorted by Self Time (CPU)):
It looks like ~50% of the CPU time is related to the cluster (using JGroups), so we started reviewing the cluster configuration and operation, but couldn't really find anything special.
This raised the concern that maybe the profiling data was not good enough, which led me to find and read this great article: "Why (Most) Sampling Java Profilers Are F****g Terrible".
I went ahead and ran "top -H", which gives a view of all the lightweight processes (AKA threads):
The threads above sat at the top most of the time and were consuming most of the CPU!
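For reference, this is roughly how the view was produced (the PID is a placeholder for the Java process ID, not a value from our boxes):
top -H -p <java-pid>    # -H lists individual threads (lightweight processes); -p limits the output to one process
The PID column in this mode is the Linux thread ID, which is the same number jstack later reports as nid (in hex).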
A couple of options in hand:
- Use a different profiler:
  - async-profiler - doesn't require a restart of the service (see the sketch after this list)
  - honest-profiler - requires a restart of the service
- A simpler approach: correlate thread dumps with a per-thread CPU % view
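Since async-profiler came up as the non-intrusive option, here is roughly what a run would look like; it attaches to a live JVM, so no restart is needed. The duration, output path, and exact flags are assumptions (they vary by async-profiler version), so treat this as a sketch rather than the command we actually ran:
./profiler.sh -e cpu -d 30 -f /tmp/flamegraph.svg <java-pid>   # sample CPU for 30 seconds and write a flame graph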
I chose to start with the simpler approach of correlating data, which consists of a couple of actions done at the same time.
Using VisualVM I did a thread dump, but it doesn't show the nid (native thread ID), so we needed to take the dump with the JDK tooling (jstack) instead.
On the server we didn't have jstack (no JDK installed), so we went with the javatools.jar approach and did the following:
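Assuming "javatools.jar" refers to the JDK's tools.jar (which bundles the jstack entry point), the invocation would look roughly like this; the paths and PID are placeholders:
java -classpath /path/to/tools.jar sun.tools.jstack.JStack <java-pid> > /tmp/threads.txt
An alternative that needs no extra jars is kill -3 <java-pid>, which makes the JVM print the thread dump to its own stdout/log rather than to the console.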
The list is long, so I filtered down to the relevant ones:
Pay attention that the nid is in hex, while the lightweight process ID (thread) from top is in decimal - 0x39 --> 57.
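A quick way to do the correlation on the shell (the numbers are just the example values above, and /tmp/threads.txt is the hypothetical dump file from the previous step):
printf '%x\n' 57                    # convert the decimal thread ID from top to hex -> 39
grep 'nid=0x39' /tmp/threads.txt    # locate that thread in the jstack output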
Since the compiler threads were working so hard, we went to check the code cache to see where we stand.
The above shows we are around 90 MB used, with a max of almost 96 MB - weird, since I expected a higher number (Java 1.8 defaults to 240 MB with tiered compilation).
Looking at the command-line arguments, we saw that someone had limited the size to 96 MB:
-XX:InitialCodeCacheSize=32m
-XX:ReservedCodeCacheSize=96m
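To double-check what the JVM is actually running with (assuming a JDK is available somewhere, since these tools ship with it; the PID is a placeholder):
jinfo -flag ReservedCodeCacheSize <java-pid>              # value in effect for the live process
java -XX:+PrintFlagsFinal -version | grep -i codecache    # defaults for this JVM build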
Our next steps include experiments and monitoring:
- Increasing the code cache size (a sketch of the settings follows this list)
- Reviewing the code cache pool and our overall process metrics
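A hedged sketch of the direction for the sizing experiment - the 256m value is a starting point we still need to validate against real usage, not a recommendation:
-XX:ReservedCodeCacheSize=256m
-XX:+UseCodeCacheFlushing    # allow the JVM to flush cold compiled code instead of halting compilation when the cache fills
-XX:+PrintCodeCache          # print code cache usage at JVM exit so the experiment can be tracked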
There were a lot of time wasters; be aware and don't go there:
- Originally we ran "top -H" on the host instead of inside the Docker container --> the PIDs are in different namespaces and cannot be correlated! (see the sketch after this list)
- We were not aware of the Java sampling profiler's inherently biased view (safepoint bias)! (I didn't see it coming)
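The Docker pitfall from the first bullet, in command form (the container name is a placeholder, and this assumes top is available inside the image):
docker exec -it <container-name> top -H    # thread IDs now come from the PID namespace the JVM sees
Running the same command on the host shows thread IDs from the host namespace, which will not match the nid values in the thread dump.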
Profiling is "easy" and "simple" in the Java world....
Reviewed by Ran Davidovitz
on
12:54 AM
Rating:
No comments:
Post a Comment