I have a general question about the suitability of Zing JVM to our use case, rather than the trial itself, so feel free to move this post to another thread if needed.
We have an application, Gerrit (https://www.gerritcodereview.com), a Git server and code review tool, hosting a fairly high number of Git repositories and used by a large number of developers in our company.
Traffic on the server has been increasing slowly but steadily over the last few years due to the growing use of CI pipelines and automation, which brings increased load to the server.
We have been tuning the JVM GC process, as we have identified it as one of the sources of instability and performance degradation. Even though, in theory, our application does not have real-time requirements, the increase in automation has brought a new set of performance requirements, since scripts can perform sequential and parallel operations much faster than any human :).
Right now, we are running this application on a 48-CPU machine with 512 GB of RAM, the latest RHEL 6.x, Oracle JDK 8 (Gerrit is not yet compatible with Java 9), and the G1GC algorithm with the following parameters:
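(The actual flag list is not reproduced here. Purely as an illustration, and not the real values from our setup, a G1 configuration of the general shape described in this post, with a large fixed heap, the maximum region size, and GC logging enabled, might look like:)

```
java -Xms400g -Xmx400g \
     -XX:+UseG1GC \
     -XX:MaxGCPauseMillis=500 \
     -XX:G1HeapRegionSize=32m \
     -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
     ...
```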
These JVM parameters are the fruit of several "waves" of tuning and had been working acceptably well for us (average pauses of 400 ms, max pauses around 1-2 seconds, no full GCs, throughput around 95%) until now. In the last several weeks, however, we have noticed pauses skyrocketing (into the tens of seconds), throughput dropping to the low 90s (some days even to the mid-80s), and a general degradation of performance.
Among the factors we have identified as putting pressure on the G1GC algorithm are a high object-allocation rate (average 5-6 GB/sec, peaks of 12-15 GB/sec) and a high promotion rate (12-15 MB/sec).
Another problem we have identified is long young-generation pauses dominated by object copy, i.e., moving objects from one memory area to another (as expected), but sometimes spending a long time in kernel space: sys times can be 8-10x longer than user times. We are currently investigating this with our Linux team to troubleshoot and fix the issue.
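One way we spot these pauses is by scanning the "[Times: user=... sys=..., real=... secs]" lines that JDK 8 prints with -XX:+PrintGCDetails. A minimal sketch of that check (the log lines below are made up for illustration, not taken from our server):

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Flag GC events whose sys time exceeds user time, using the
// per-collection "Times" summary line from a JDK 8 GC log.
public class SysTimeCheck {
    private static final Pattern TIMES = Pattern.compile(
        "Times: user=([0-9.]+) sys=([0-9.]+), real=([0-9.]+) secs");

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
            "[Times: user=0.40 sys=0.05, real=0.12 secs]",
            "[Times: user=0.30 sys=2.70, real=1.90 secs]"); // sys >> user
        for (String line : log) {
            Matcher m = TIMES.matcher(line);
            if (m.find()) {
                double user = Double.parseDouble(m.group(1));
                double sys = Double.parseDouble(m.group(2));
                if (sys > user) {
                    // A sys/user ratio well above 1 points at kernel-side work
                    // (page faults, memory management) rather than GC copying.
                    System.out.printf("suspicious pause: sys/user = %.1fx (%s)%n",
                                      sys / user, line);
                }
            }
        }
    }
}
```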
Finally, this application uses Lucene (as a library) to build and maintain some heavily used indexes; we have identified that building and querying those indexes produce big objects (larger than 16 MB, sometimes around 20-40 MB) that cause humongous allocations; depending on the load, these can account for 25-30% of all GC causes. Since the heap is that large, we have verified that the maximum heap region size is already in use.
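For context: G1 treats an allocation as humongous when the object is larger than half a region, and G1's maximum region size is 32 MB, which is why the 16 MB threshold above is a hard ceiling for us. A small sketch of that arithmetic (the object sizes are the ones mentioned in this post):

```java
// G1 classifies an object as "humongous" when it is larger than half a
// region. With the maximum G1 region size (32 MB), the threshold is 16 MB,
// so the 20-40 MB Lucene index objects mentioned above all qualify.
public class HumongousCheck {
    static final long MB = 1024 * 1024;
    static final long REGION_SIZE = 32 * MB; // G1 maximum region size

    static boolean isHumongous(long objectBytes) {
        return objectBytes > REGION_SIZE / 2;
    }

    public static void main(String[] args) {
        for (long mb : new long[] {8, 16, 20, 40}) {
            System.out.printf("%d MB -> humongous: %b%n",
                              mb, isHumongous(mb * MB));
        }
    }
}
```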
We have been reading a bit about the C4 collector, and it seems it could help us improve the application's throughput and performance, but we have some questions related to the issues described above:
* How well can Zing and the C4 algorithm cope with high allocation and promotion rates?
* Would Zing be affected by the (still under investigation) issue of high sys times when copying memory objects?
* How does Zing handle the sizing of its memory areas? Could creating (and copying around) these big objects also be an issue? Put another way, can Zing and the C4 algorithm also be affected by these humongous allocations?
Sorry for the long post, and thanks for your attention,