Symptoms
The application's traffic was very low. How low? Single-digit QPM (queries per minute). Yet the application ran a Full GC roughly every 7 hours. That frequency is not high, but it was still unacceptable.
The key configuration of the machine at the time:
Machine: 16 GB of memory
JVM settings: -Xmx8G
Garbage collector: G1
Analysis Steps
1. From the business side
Approach: query the call-chain monitoring data to see whether some request was handled shortly before each Full GC, i.e. whether a single request was triggering it.
Result: nothing abnormal turned up in the requests during the periods leading up to those Full GCs, and response times were all normal.
2. From the logs
Approach: examine the GC logs.
Result: this did uncover problems; see the GC log below. (Before turning to the logs I had also tried a number of other things, such as dumping the heap several times and comparing class counts and memory footprints across dumps, but found nothing.)
Keywords: G1 Humongous Allocation and Evacuation Failure. These conditions can eventually lead to a Full GC.

2020-03-16T14:02:23.133+0800: 12148.221: [GC pause (G1 Humongous Allocation) (young) (initial-mark), 0.0185544 secs]
   [Parallel Time: 16.7 ms, GC Workers: 2]
      [GC Worker Start (ms): Min: 12148221.6, Avg: 12148221.6, Max: 12148221.6, Diff: 0.0]
      [Ext Root Scanning (ms): Min: 5.1, Avg: 5.1, Max: 5.1, Diff: 0.0, Sum: 10.1]
      [Update RS (ms): Min: 0.7, Avg: 0.7, Max: 0.8, Diff: 0.0, Sum: 1.5]
         [Processed Buffers: Min: 13, Avg: 14.5, Max: 16, Diff: 3, Sum: 29]
      [Scan RS (ms): Min: 0.2, Avg: 0.2, Max: 0.2, Diff: 0.0, Sum: 0.4]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
      [Object Copy (ms): Min: 10.1, Avg: 10.1, Max: 10.2, Diff: 0.1, Sum: 20.3]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 2]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [GC Worker Total (ms): Min: 16.2, Avg: 16.2, Max: 16.2, Diff: 0.0, Sum: 32.4]
      [GC Worker End (ms): Min: 12148237.8, Avg: 12148237.8, Max: 12148237.8, Diff: 0.0]
   [Code Root Fixup: 0.0 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.3 ms]
   [Other: 1.5 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 0.4 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.1 ms]
      [Humongous Register: 0.0 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 0.4 ms]
   [Eden: 834.0M(2448.0M)->0.0B(2446.0M) Survivors: 8192.0K->10.0M Heap: 1691.9M(4096.0M)->858.7M(4096.0M)]
 [Times: user=0.03 sys=0.00, real=0.02 secs] 
2020-03-16T14:02:23.152+0800: 12148.239: [GC concurrent-root-region-scan-start]
2020-03-16T14:02:23.166+0800: 12148.253: [GC concurrent-root-region-scan-end, 0.0136292 secs]
2020-03-16T14:02:23.166+0800: 12148.253: [GC concurrent-mark-start]
2020-03-16T14:02:23.484+0800: 12148.572: [GC concurrent-mark-end, 0.3187825 secs]

Root Cause
With those keywords in hand, we can look up the cause. What is it? See the explanation below.
A humongous object in G1 is an object larger than 50% of the region size. Frequent humongous allocations can cause performance problems. In a region that holds humongous objects, the space between the end of the last humongous object and the end of the region goes unused. When there are many such objects, this unused space can leave the heap fragmented.
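The 50% rule above can be sketched as a tiny predicate. This is purely illustrative (`isHumongous` is not a JVM API, just a restatement of the threshold described in this section):

```java
public class HumongousCheck {
    // G1 classifies an object as humongous when it is larger than
    // half the region size (the "larger than 50%" rule above).
    static boolean isHumongous(long objectBytes, long regionBytes) {
        return objectBytes > regionBytes / 2;
    }

    public static void main(String[] args) {
        long mb = 1024 * 1024;
        // An allocation just over 4 MB is humongous in an 8 MB region...
        System.out.println(isHumongous(4 * mb + 16, 8 * mb));  // true
        // ...but becomes a regular allocation in a 16 MB region.
        System.out.println(isHumongous(4 * mb + 16, 16 * mb)); // false
    }
}
```

This is exactly why the fix discussed below is to enlarge the region rather than the heap: the same object stops being humongous once the region is big enough.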

For details, see this article:
https://www.infoq.com/articles/tuning-tips-G1-GC/

This information is helpful, since you can not only tell how many humongous allocations were made by your application (and whether they were excessive or not), but also the size of each allocation. Moreover, if you deem the humongous allocations excessive, all you have to do is increase the G1 region size so that the H-objs fit as regular objects. In their example, the allocation size is just above 4 MB; hence, to make it a regular allocation, a region size of 16 MB is needed. So the recommendation was to set that explicitly on the command line: -XX:G1HeapRegionSize=16M

G1 regions can span from 1MB to 32MB in powers of two, and the allocation request is slightly over 4 MB. Hence, an 8MB region size is not quite large enough to avoid the humongous allocations. We need to go to the next power of 2, 16 MB.
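The rounding rule in that quote can be sketched as a small helper (`regionSizeFor` is illustrative, not a JVM API; it assumes the 1 MB to 32 MB power-of-two range and the 50% threshold described above):

```java
public class RegionSize {
    // Smallest power-of-two G1 region size, within the 1 MB..32 MB range,
    // for which an allocation of allocBytes is no longer humongous,
    // i.e. the allocation is at most half the region.
    static long regionSizeFor(long allocBytes) {
        long region = 1L << 20;   // 1 MB minimum
        long max = 32L << 20;     // 32 MB maximum
        while (allocBytes > region / 2 && region < max) {
            region <<= 1;         // try the next power of two
        }
        return region;
    }

    public static void main(String[] args) {
        long mb = 1024 * 1024;
        // "Slightly over 4 MB" needs 16 MB regions, as the quote concludes.
        System.out.println(regionSizeFor(4 * mb + 16) / mb); // prints 16
    }
}
```

Note that if an allocation exceeds 16 MB (half of the 32 MB maximum), no region size can make it regular, so the real fix in that case is to avoid the large allocation itself.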

Solution: increase the region size with -XX:G1HeapRegionSize=16M. After that, the Full GCs stopped occurring.

