For many applications, an important metric to monitor is the application response time; for others it is throughput, or both combined. These metrics need to be measured from the outside, from the user's perspective. Many components and factors influence them, and the JVM is only one of them, but an important one.
In addition, monitoring the JVM helps to detect trends early, before the application metrics are affected.
This article describes ways to monitor Azul Platform Prime.
Graphical Interactive Tools
Azul Platform Prime includes the following tools for analyzing the JVM metrics and performance:
- GCLogAnalyzer
- JDK Flight Recorder (JFR) combined with Azul Mission Control
These tools are intended for interactive analysis of recorded data or logs.
For continuous monitoring of production systems, the following methods are available:
Standard JVM Pause Time Logging
The following standard OpenJDK command-line options for monitoring application pauses caused by the GC and safepoints also work on Azul Platform Prime. On Azul Platform Prime no problematic pauses are expected: with its concurrent GC algorithm, the short synchronization steps for phase transitions inside the JVM are independent of the heap size and in most practical use cases do not exceed single-digit milliseconds.
On Java 8:
-XX:+PrintGCApplicationStoppedTime
-Xloggc:gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=20M
Example line from the Java 8 gc.log:
$ grep -E 'Total time.*stopped' gc.log
12.836: Total time for which application threads were stopped: 0.0000602 seconds, Stopping threads took: 0.0000074 seconds
On Java 11 and newer, only the following option is needed; it includes log rotation by default:
-Xlog:gc,safepoint:gc.log
Details about changing the log rotation defaults: https://docs.azul.com/prime/Unified-GC-Logging
Example command to extract the pause times from the Java 11 gc.log:
grep safepoint.*Total gc.log \
| sed -E 's/.*\[([0-9\.]+)s\].*Total[a-z ]*: ([0-9\.]+ [a-z]+).*/\1 \2/'
Often a statistical analysis of those times is interesting, for example percentiles to compare against performance SLAs. Here is an example using the open source datamash tool to calculate the P50, P95, and P99 percentiles and the maximum of the pause times, skipping the first 60 seconds of warmup time:
grep safepoint.*Total gc.log \
| sed -E 's/.*\[([0-9\.]+)s\].*Total[a-z ]*: ([0-9\.]+ [a-z]+).*/\1 \2/' \
| awk '$1 > 60 { print $1,$2,$3 }' \
| datamash -t' ' perc:50 2 perc:95 2 perc:99 2 max 2
Output example:
0.0002 0.0034 0.0035 0.0160
Interpretation of this example output:
50% of all pauses are at or below 0.2 ms, 95% at or below 3.4 ms, 99% at or below 3.5 ms, and the maximum is 16 ms.
Note: The datamash tool does not calculate percentiles correctly when a decimal number is used as the parameter; for example, perc:99.9 returns the same result as perc:99.
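If datamash is not available, or decimal percentiles such as P99.9 are needed, the same analysis can be done with a small program. The following is a minimal sketch (the class name PausePercentiles is made up for illustration, and it uses simple nearest-rank percentiles): it reads the "uptime pause unit" lines produced by the grep/sed command above from standard input, skips the first 60 seconds of warmup, and prints P50, P95, P99.9, and the maximum. On Java 11 and newer it can be run directly with java PausePercentiles.java at the end of the pipeline.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Scanner;

// Minimal sketch: nearest-rank percentiles of the safepoint pause times
// extracted by the grep/sed command shown above (read from standard input).
public class PausePercentiles {
    // Nearest-rank percentile on an already sorted list.
    static double percentile(List<Double> sorted, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sorted.size());
        return sorted.get(Math.max(0, rank - 1));
    }

    public static void main(String[] args) {
        List<Double> pauses = new ArrayList<>();
        try (Scanner in = new Scanner(System.in)) {
            while (in.hasNextLine()) {
                String[] fields = in.nextLine().trim().split("\\s+");
                if (fields.length < 2) continue;
                double uptime = Double.parseDouble(fields[0]);
                double pause = Double.parseDouble(fields[1]);
                if (uptime > 60) pauses.add(pause); // skip the first 60 s of warmup
            }
        }
        if (pauses.isEmpty()) return;
        Collections.sort(pauses);
        System.out.printf("p50=%.4f p95=%.4f p99.9=%.4f max=%.4f%n",
                percentile(pauses, 50), percentile(pauses, 95),
                percentile(pauses, 99.9), pauses.get(pauses.size() - 1));
    }
}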
Standard JVM JMX MBeans
If your monitoring system already tracks the Java application heap utilization over JMX, for example to detect application memory leaks, the standard Java MXBeans such as the following can continue to be used on Azul Platform Prime:
- java.lang.management -> MemoryPool
- java.lang.management -> Memory
- java.lang.GarbageCollector -> GPGC New Cycles: CollectionTime
- java.lang.GarbageCollector -> GPGC New Pauses: CollectionTime
- java.lang.GarbageCollector -> GPGC Old Cycles: CollectionTime
- java.lang.GarbageCollector -> GPGC Old Pauses: CollectionTime
The entries with Cycles and Pauses in their names are standard OpenJDK names since Java 11 and are also available on Azul Platform Prime. Cycles tracks the concurrent GC activity, which does not pause the application, while Pauses tracks the application-affecting pauses. To get this differentiation between Pauses and Cycles on Java 8 with Azul Platform Prime as well, add -XX:-GPGCReportLegacyGarbageCollectorMXBean.
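For a quick local check of these values without a full monitoring system, the standard java.lang.management API can be queried from inside the application. The following is a minimal sketch (the class name GcBeanDump is made up for illustration):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Minimal sketch: print the standard GarbageCollectorMXBean values from inside
// the running JVM. On Azul Platform Prime (Java 11+, or Java 8 with
// -XX:-GPGCReportLegacyGarbageCollectorMXBean) the beans are named
// "GPGC New/Old Cycles" (concurrent work) and "GPGC New/Old Pauses" (pauses).
public class GcBeanDump {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%-20s collectionCount=%d collectionTime_ms=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}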
Because Azul Platform Prime's garbage collector always runs concurrently with the application, the application continues to allocate memory while the GC is running. That is normal and desired behavior for any concurrent GC, as it increases application performance. On the monitoring side, however, it means you will see higher values for the MXBean attribute "used" on Azul Platform Prime than on GC implementations that pause the application. These metrics are still valuable for detecting application memory leaks and for long-term trend analysis; you might only need to raise the warning threshold percentage to avoid false alerts.
An additional note applies when Azul Platform Prime's ZST component is enabled, where the Contingency Memory feature is available by default: under high memory pressure, the "used" attribute can grow above 100% of -Xmx, because Contingency Memory covers situations where the application temporarily needs more heap than assigned by -Xmx. If your monitoring calculates percentages, this detail is important to avoid calculation errors or unintended negative values. Further details: on MemoryPoolMXBean the attribute "max" stays constant and equals -Xmx, but on MemoryMXBean the attribute "max" also grows above -Xmx while Contingency Memory is actively used under high memory pressure.
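If your monitoring code computes the heap utilization percentage itself, basing the calculation on the heap MemoryPoolMXBeans avoids this caveat, since their "max" attribute stays constant at -Xmx. The following is a minimal sketch (the class name HeapPercent is made up for illustration); it reports per-pool utilization rather than a sum across pools:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Minimal sketch: per-pool heap utilization in percent based on the
// MemoryPoolMXBean "max" attribute, which stays constant and equals -Xmx,
// so the value stays meaningful while Contingency Memory is in use
// (it may then exceed 100% instead of causing calculation errors).
public class HeapPercent {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) continue;
            MemoryUsage usage = pool.getUsage();
            if (usage.getMax() > 0) { // -1 means the pool reports no maximum
                System.out.printf("%-25s used=%.1f%%%n",
                        pool.getName(), 100.0 * usage.getUsed() / usage.getMax());
            }
        }
    }
}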
Azul Platform Prime-Specific Metrics from JMX MBeans
To monitor the CPU activity of Azul Platform Prime's concurrent GC in more detail over JMX, or to view the stress on the GC, special MXBeans can be enabled by adding the following command-line option:
-XX:+UseZingMXBeans
The most interesting metrics to monitor there are the following:
- com.azul.zing -> GarbageCollector -> GPGC New: PercentageOfTimeCollectorIsRunning and GPGC Old: PercentageOfTimeCollectorIsRunning: These two are useful for analyzing the CPU activity caused by the GC and the stress on the GC. A warning threshold when the value exceeds 30% for more than 5 minutes is a good starting point. Occasional short spikes to 100% are not problematic.
- com.azul.zing -> Memory: PercentJavaHeapOccupiedAfterCollection: This is useful for analyzing the actual memory requirements of the application and for detecting application memory leaks. A warning threshold when the value exceeds 70% for 5 minutes is a good starting point.
- com.azul.zing -> GarbageCollector -> GPGC New: CollectionCount: This is useful as a sanity check to avoid false alarms in monitoring systems that poll at fixed intervals. Only if this count has increased are new values available for all the other metrics; if the count did not increase, the values of the other metrics are outdated and should not be used for warning notifications. This is a rare situation, but it can happen when a previously active Java application goes idle: in an idle state the interval between GCs can be as long as 10 minutes, and only a GC makes new data available in the metrics. A polling sketch that applies this check follows the list.
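The following is a minimal remote-polling sketch that applies the CollectionCount sanity check from the last item. The JMX service URL and the ObjectNames (com.azul.zing:type=GarbageCollector,name=GPGC New and com.azul.zing:type=Memory) are assumptions made for illustration; verify the exact names with a JMX browser such as JConsole before relying on them:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Minimal sketch of a JMX poller for the MXBeans enabled by -XX:+UseZingMXBeans.
// The ObjectNames below are assumptions for illustration -- verify them with a
// JMX browser (e.g. JConsole) against your Azul Platform Prime JVM.
public class ZingMXBeanPoller {
    public static void main(String[] args) throws Exception {
        String url = "service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi"; // adjust host and port
        try (JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(url))) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName gcNew = new ObjectName("com.azul.zing:type=GarbageCollector,name=GPGC New");
            ObjectName gcOld = new ObjectName("com.azul.zing:type=GarbageCollector,name=GPGC Old");
            ObjectName memory = new ObjectName("com.azul.zing:type=Memory");
            long lastCount = -1;
            while (true) {
                long count = ((Number) mbs.getAttribute(gcNew, "CollectionCount")).longValue();
                if (count > lastCount) { // only report when a GC has produced fresh values
                    Object gcNewPct = mbs.getAttribute(gcNew, "PercentageOfTimeCollectorIsRunning");
                    Object gcOldPct = mbs.getAttribute(gcOld, "PercentageOfTimeCollectorIsRunning");
                    Object heapPct = mbs.getAttribute(memory, "PercentJavaHeapOccupiedAfterCollection");
                    System.out.printf("gcNew=%s%% gcOld=%s%% heapAfterGC=%s%%%n", gcNewPct, gcOldPct, heapPct);
                    lastCount = count;
                }
                Thread.sleep(60_000); // poll once per minute
            }
        }
    }
}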
For more details about ZingMXBeans and their most important metrics, start with the section Monitoring the ZVM Using JMX and Zing MXBeans on the following page: https://docs.azul.com/prime/MXBeans
An implementation using ZingMXBeans that can be attached as a -javaagent to existing applications is available at https://github.com/giltene/ZingGCReporter.
Azul Platform Prime Metrics from the GC Log File
The same metrics offered over JMX are also available in Azul Platform Prime's garbage collection log file, which is activated by adding the standard JDK parameter -Xloggc:gc.log. Azul Platform Prime includes the example script /opt/zing/zing-jdk8/etc/gcLogScraper to extract metrics from that log file.
Example usage to extract the current heap utilization in MBytes (NewGen + OldGen + PermGen), the CodeCache usage, the Linux memory usage outside of the heap in kBytes, and the application thread delays:
$ tail -n+0 -f gc.log | /opt/zing/zing-jdk17/etc/gcLogScraper -l GC -c 'jvmuptime_s=%{end#end} curheap_MB=%{heap#cur} maxheap_MB=%{heap#max} codecache_MB=%{heap#codec} rss_kB=%{heap#rss} maxdelay_s=%{delay#max}'
jvmuptime_s=29.701548 curheap_MB=6968 maxheap_MB=20480 codecache_MB=1 rss_kB=117817 maxdelay_s=-
jvmuptime_s=34.316951 curheap_MB=8834 maxheap_MB=20480 codecache_MB=1 rss_kB=134111 maxdelay_s=0.000
jvmuptime_s=36.048295 curheap_MB=9588 maxheap_MB=20480 codecache_MB=1 rss_kB=139652 maxdelay_s=0.000
jvmuptime_s=38.341703 curheap_MB=10538 maxheap_MB=20480 codecache_MB=1 rss_kB=147521 maxdelay_s=0.000
Example usage to extract the percentage of time the concurrent GC is active in parallel to the application, together with the live Java objects in MBytes. These are tracked separately for the NewGen and OldGen memory pools and GC threads; the total resource usage is the sum of both metric pairs:
$ tail -n+0 -f gc.log | /opt/zing/zing-jdk17/etc/gcLogScraper -l GC -c '%{end#end} %{type} %{intracycle#sec} %{intercycle#sec} %{heap#live}' | grep New | awk '{ printf("jvmuptime_ms=%d type=%s gctime_%%=%.0f liveheap_MB=%d\n", $1 * 1000, $2, $3 / ($3 + $4) * 100, $5) }'
jvmuptime_ms=62664 type=New gctime_%=44 liveheap_MB=4242
jvmuptime_ms=65855 type=New gctime_%=33 liveheap_MB=4454
jvmuptime_ms=67806 type=New gctime_%=38 liveheap_MB=4373
$ tail -n+0 -f gc.log | /opt/zing/zing-jdk17/etc/gcLogScraper -l GC -c '%{end#end} %{type} %{intracycle#sec} %{intercycle#sec} %{heap#live}' | grep Old | awk '{ printf("jvmuptime_ms=%d type=%s gctime_%%=%.0f liveheap_MB=%d\n", $1 * 1000, $2, $3 / ($3 + $4) * 100, $5) }'
jvmuptime_ms=21032 type=Old gctime_%=31 liveheap_MB=2043
jvmuptime_ms=25274 type=Old gctime_%=28 liveheap_MB=1667
jvmuptime_ms=29701 type=Old gctime_%=27 liveheap_MB=1599
Besides the GC, another component of any JVM that uses CPU resources in parallel to the application is the JIT compiler. The Azul Platform Prime GC log also records metrics about its activity. Here is an example to track the number of active compiler threads per compiler tier and the CPU time spent in JIT compilation:
$ tail -n+0 -f gc.log | /opt/zing/zing-jdk17/etc/gcLogScraper -l COMPSTATS -c 'jvmuptime_s=%{logtime} jitcomp_tier=%{Tier} jitcomp_activethreads=%{Active} jitcomp_cumulated_cputime_ms=%{CPUTime}'
jvmuptime_s=12.939 jitcomp_tier=1 jitcomp_activethreads=0 jitcomp_cumulated_cputime_ms=129
jvmuptime_s=12.939 jitcomp_tier=2 jitcomp_activethreads=3 jitcomp_cumulated_cputime_ms=17569
jvmuptime_s=16.288 jitcomp_tier=1 jitcomp_activethreads=2 jitcomp_cumulated_cputime_ms=197
jvmuptime_s=16.288 jitcomp_tier=1 jitcomp_activethreads=1 jitcomp_cumulated_cputime_ms=197
jvmuptime_s=16.544 jitcomp_tier=1 jitcomp_activethreads=0 jitcomp_cumulated_cputime_ms=198
jvmuptime_s=16.544 jitcomp_tier=2 jitcomp_activethreads=2 jitcomp_cumulated_cputime_ms=18923
jvmuptime_s=21.036 jitcomp_tier=1 jitcomp_activethreads=0 jitcomp_cumulated_cputime_ms=245
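If these key=value lines are forwarded to a monitoring system that has no native key=value support, a small parser can turn each line into a metric map first. The following is a minimal sketch assuming only the output format shown above (the class name ScraperLineParser is made up for illustration):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch: parse the key=value lines produced by the gcLogScraper
// pipelines above (read from standard input) into one map per line.
public class ScraperLineParser {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String line;
        while ((line = in.readLine()) != null) {
            Map<String, String> metrics = new LinkedHashMap<>();
            for (String token : line.trim().split("\\s+")) {
                int eq = token.indexOf('=');
                if (eq > 0) metrics.put(token.substring(0, eq), token.substring(eq + 1));
            }
            // Replace this with a call to your monitoring client.
            System.out.println(metrics);
        }
    }
}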
Further details: https://docs.azul.com/prime/GC-Log-Scraper