Monitoring Azul Zing

An important metric to monitor for many applications is the application response time. For others it is throughput or both combined. Those need to be measured from the outside with the perspective of the user. Many components and factors have an influence on those metrics and the JVM is only one of them, but an important one.

In addition, monitoring the JVM helps to detect trends early, before the application metrics are affected.

This article talks about ways to monitor Azul Zing.

 

Graphical Interactive Tools

Azul Zing includes the following tools for analyzing the JVM metrics and performance:

Those tools are intended for interactive analysis or analyzing recorded data or logs.

For continuous monitoring of production systems, the methods described below exists:

 

Standard JVM Pausetime Logging

The following OpenJDK standard command line option for monitoring application pauses caused by the GC and Safepoints also works on Azul Zing. On Azul Zing no problematic pauses are expected as with its special GC algorithm the short synchronization steps for phase transitions inside the JVM are independent of the heap size and in most practical use cases are not exceeding single digit ms (1/1000 s).

On Java 8:

-XX:+PrintGCApplicationStoppedTime
-Xloggc:gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=20M

Note, that -XX:+PrintGCDetails is not needed for Azul Zing as all important information is already provided without it. Only for OpenJDK that additional flag is needed.

Example line from the Java 8 gc.log:

$ grep -E 'Total time.*stopped' gc.log
12.836: Total time for which application threads were stopped: 0.0000602 seconds, Stopping threads took: 0.0000074 seconds

On Java 11 and newer, just the following is used, which includes log rotation by default:

-Xlog:gc,safepoint:gc.log

Note, that the detail logging by using the keyword gc* instead of gc is not needed for Azul Zing as all important information is already provided without it. Only for OpenJDK that additional setting is needed.

Details about changing the log rotation defaults: https://docs.azul.com/prime/Unified-GC-Logging

Example line from the Java 11 gc.log:

grep safepoint.*Total gc.log \
| sed -E 's/.*\[([0-9\.]+)s\].*Total[a-z ]*: ([0-9\.]+ [a-z]+).*/\1 \2/'

Often a statistical analysis of those times is interesting, for example the percentiles to compare with performance SLAs. Here an example using the open source datamash tool to collect the P50, P95, P99 and maximum of the pause times and skipping the first 60 seconds of warmup time:

grep safepoint.*Total gc.log \
| sed -E 's/.*\[([0-9\.]+)s\].*Total[a-z ]*: ([0-9\.]+ [a-z]+).*/\1 \2/' \
| awk '$1 > 60 { print $1,$2,$3 }' \
| datamash -t' ' perc:50 2 perc:95 2 perc:99 2 max 2

Output example:

0.0002 0.0034 0.0035 0.0160

Interpretation of this Example output:
50% of all GC pauses are below 0.2ms, 95% below 3.4ms, 99% below 3.5ms and all are below 16ms.

Note: The datamash tool won't calculate percentiles correctly when using decimal numbers as parameter. For example perc:99.9 results in the same as perc:99.

Standard OpenJDK and Azul Zing JMX MBeans

If your monitoring system is already tracking the Java application heap utilization over JMX, for example to detect application memory leaks, the standard Java MXBeans like the following can be continued to be used on Azul Zing:

  • java.lang -> type=MemoryPool
  • java.lang -> type=Memory
  • java.lang.GarbageCollector -> GPGC New Cycles: CollectionTime
  • java.lang.GarbageCollector -> GPGC New Pauses: CollectionTime
  • java.lang.GarbageCollector -> GPGC Old Cycles: CollectionTime
  • java.lang.GarbageCollector -> GPGC Old Pauses: CollectionTime

The entries with Cycles and Pauses in their names are standard OpenJDK names since Java 11 and are also available on Azul Zing. Cycles tracks the concurrent GC activity which is not pausing the application and Pauses tracks the application-affecting pauses. To receive this differentiation between Pauses and Cycles also on Java 8 with Azul Zing, add -XX:-GPGCReportLegacyGarbageCollectorMXBean.

As Azul Zing's Garbage Collector always runs concurrently to the application, the application continues to allocate memory while the GC is running. That's a normal and desired behavior for any concurrent GC as it increases application performance. But it means on the monitoring side that you'll notice higher values on the MXBean attribute "used" on Azul Zing than on other GC implementations which add pauses. But for detecting application memory leaks and long term trend analysis those metrics are still valuable. You might only need to increase the percentage of the warning threshold to avoid false alerts.

The native memory usage of the java process without the Java heap is also available and especially for Zing a good way to monitor the real memory usage, combined with the Java heap, as the Linux RSS metric seen by Linux tools will show wrong results due the memory multi-mapping used by Zing: java.lang,type=Memory,NonHeapMemoryUsage.committed

Additional note when enabling the ZST component where the feature Contingency Memory is available by default: Under high memory pressure, the attribute value "used" can grow larger than 100% on Azul Zing with ZST as there Contingency Memory is available to cover situations when applications temporarily need more heap than assigned by -Xmx. If your monitoring calculates percentages, that detail is important to avoid calculation errors or unintended negative values. Further details: on MemoryPoolMXBean the attribute "max" stays constant and equals -Xmx but on MemoryMXBean the attribute "max" will also grow above -Xmx when Contingency Memory is actively used under high memory pressure.

 

Azul Zing specific metrics from JMX MBeans

To monitor the CPU activity of Azul Zing's concurrent GC in more detail over JMX or to view the stress on the GC, special MXBeans are available by adding the following command line option:

-XX:+UseZingMXBeans

The primarily interesting metrics there to monitor are the following:

  • com.azul.zing -> GarbageCollector -> GPGC New -> PercentageOfTimeCollectorIsRunning and GPGC Old: PercentageOfTimeCollectorIsRunning: Those 2 are useful for analyzing the CPU activity caused by the GC and the stress on the GC. Setting a warning threshold when it exceeds 30% for more than 5 minutes is a good starting point. Occasional short spikes to 100% are not problematic.
  • com.azul.zing -> Memory: PercentJavaHeapOccupiedAfterCollection: This one is useful for analyzing the actual memory requirements of the application and to detect application memory leaks. Setting a warning threshold when it exceeds 70% for 5 minutes is a good starting point.
  • com.azul.zing -> GarbageCollector -> GPGC New: CollectionCount: This is useful as sanity check to avoid false alarms for monitoring systems polling in fixed intervals. Only if this count increased, new metric values are available for all the other metrics. If the count didn't increase, the provided metric values in all the other metrics are outdated and shouldn't be used for warning notifications. This is a rare situation, but can happen if a Java application was active and then suddenly goes into an idle state where the interval between GCs can be as long as 10 minutes and only with a GC new data will be available in the metrics.
  • In addition, the following metrics exist to monitor the GC pauses and the concurrent duration while the GC is running in parallel to the application without pausing it. For most use cases just the above listed PercentageOfTimeCollectorIsRunning is sufficient, but if you want to have a more detailed insight, add the following, which are cumulative metrics, not gauge metrics as the previous ones:
    com.azul.zing -> GarbageCollector -> GPGC New: CollectorCumulativePauseTimeSec
    com.azul.zing -> GarbageCollector -> GPGC Old: CollectorCumulativePauseTimeSec
    com.azul.zing -> GarbageCollector -> GPGC New: CollectorCumulativeRunningTimeSec
    com.azul.zing -> GarbageCollector -> GPGC Old: CollectorCumulativeRunningTimeSec

For more details regarding ZingMXBeans and its most important metrics please start reading in the section Monitoring the ZVM Using JMX and Zing MXBeans on the following page https://docs.azul.com/prime/MXBeans

An implementation using ZingMXBeans attachable as -javaagent to existing applications is available on https://github.com/giltene/ZingGCReporter.

 

Azul Zing Metrics from the GC.log file

The same metrics offered over JMX are also available in Azul Zing's garbage collection log file activated by adding the standard JDK parameter -Xloggc:gc.log. Azul Zing includes the example script /opt/zing/zing-jdk17/etc/gcLogScraper to extract metrics from that log file. 

Example usage to extract current heap utilization in MBytes (NewGen + OldGen + PermGen), CodeCache usage, Linux memory usage outside of the heap in kBytes and application thread delays: 

$ tail -n+0 -f gc.log | /opt/zing/zing-jdk17/etc/gcLogScraper -l GC -c 'logtime=%{logtime} curheap_MB=%{heap#cur} maxheap_MB=%{heap#max} codecache_MB=%{heap#codec} rss_kB=%{heap#rss} maxdelay_s=%{delay#max}'
logtime=29.701s curheap_MB=6968 maxheap_MB=20480 codecache_MB=1 rss_kB=117817 maxdelay_s=-
logtime=34.316s curheap_MB=8834 maxheap_MB=20480 codecache_MB=1 rss_kB=134111 maxdelay_s=0.000
logtime=36.048s curheap_MB=9588 maxheap_MB=20480 codecache_MB=1 rss_kB=139652 maxdelay_s=0.000
logtime=38.341s curheap_MB=10538 maxheap_MB=20480 codecache_MB=1 rss_kB=147521 maxdelay_s=0.000

The metric rss_KB shows the total java process memory usage. Here in the gc.log, and also over JMX MBean, this metric is correct, compared to the RSS shown by Linux tools like top or ps which will show wrong results for Zing due to memory multi-mapping.

Note, that the metric %{heap#cur} extracted above is the total Java heap utilization which includes actively used objects (live objects) and already unreferenced objects (garbage). Especially for concurrent Garbage Collectors, like the one used by Azul Zing, a different metric is practically more relevant, especially for monitoring real application Java heap usage, for example to detect memory leaks.

That metric, the live set, is extracted in the following example in MBytes. This example also extracts the percentage of time the concurrent GC is active in parallel to the application, which is next to the live set the other most important metric for concurrent GCs. Both metrics are separately tracked for the NewGen and OldGen memory pools and the total resource usage is the sum in both metric pairs. In this example the gc.log was written with the wallclock time instead of seconds since java process start as time stamps using -Xlog:gc,safepoint:gc.log:time,uptime,level,tags for writing it:

$ tail -n+0 -f gc.log | /opt/zing/zing-jdk17/etc/gcLogScraper -l GC -c '%{logtime} %{type} %{intracycle#sec} %{intercycle#sec} %{heap#live}' | awk '{ printf("logtime=%s type=%s gctime_%=%.0f liveheap_MB=%d\n", $1, $2, $3 / ($3 + $4) * 100, $5) }'
logtime=2023-11-01T16:08:37.853+0000 type=Old gctime_%=81 liveheap_MB=21
logtime=2023-11-01T16:08:37.900+0000 type=New gctime_%=89 liveheap_MB=764
logtime=2023-11-01T16:08:38.120+0000 type=New gctime_%=96 liveheap_MB=1030
logtime=2023-11-01T16:08:38.122+0000 type=Old gctime_%=81 liveheap_MB=23
logtime=2023-11-01T16:08:38.234+0000 type=New gctime_%=95 liveheap_MB=1086

Next to the GC another component of any JVM using CPU resources in parallel to the application is the JIT Compiler. The Azul Zing GC log also records metrics about its activity. Here an example to track the number of active compiler threads per compiler tier and CPU time spend in JIT compilation:

$ tail -n+0 -f gc.log | /opt/zing/zing-jdk17/etc/gcLogScraper -l COMPSTATS -c 'logtime=%{logtime} jitcomp_tier=%{Tier} jitcomp_activethreads=%{Active} jitcomp_cumulated_cputime_ms=%{CPUTime}'
logtime=12.939s jitcomp_tier=1 jitcomp_activethreads=0 jitcomp_cumulated_cputime_ms=129
logtime=12.939s jitcomp_tier=2 jitcomp_activethreads=3 jitcomp_cumulated_cputime_ms=17569
logtime=16.288s jitcomp_tier=1 jitcomp_activethreads=2 jitcomp_cumulated_cputime_ms=197
logtime=16.288s jitcomp_tier=1 jitcomp_activethreads=1 jitcomp_cumulated_cputime_ms=197
logtime=16.544s jitcomp_tier=1 jitcomp_activethreads=0 jitcomp_cumulated_cputime_ms=198
logtime=16.544s jitcomp_tier=2 jitcomp_activethreads=2 jitcomp_cumulated_cputime_ms=18923
logtime=21.036s jitcomp_tier=1 jitcomp_activethreads=0 jitcomp_cumulated_cputime_ms=245

Further details: https://docs.azul.com/prime/GC-Log-Scraper

 

Add Comment

Comments

0 comments

Please sign in to leave a comment.

Was this article helpful?
0 out of 0 found this helpful