性能测试持续学习 记一次应用 FullGc 问题解决过程

会飞的猪 · 2020年04月24日 · 最后由 会飞的猪 回复于 2020年04月26日 · 3982 次阅读

问题现象
应用的 QPS 很低,低到什么程度呢。 QPM 为个位数,但是应用确每隔 7 个小时左右进行一次 FullGc,虽然 FullGc 的频率不是很高,但也是不能接受的。
当时机器的一些关键配置为:
机器:16G 内存
JVM 设置:-Xmx8G
垃圾收集器:G1
分析步骤
1、业务角度
思路:查询监控调用链数据,查看 JVM 进行 FullGc 前期是否进行了某个请求,是否有单个请求触发
结果:事实证明并没有发现那个时刻的前多长时间内有不正常的请求触发。请求响应时间都很正常。
2、日志角度
思路:查看 GC 日志
结果:果然发现了一些问题。查看以下 GC 日志。(其实在查看日志前面也做了好多操作,比如 dump 堆内存去对比多次堆里面类的数量和内存的大小,但是没找到结果啊。)
关键词:G1 Humongous Allocation、Evacuation Failure。这些错误会导致 fullgc.

2020-03-16T14:02:23.133+0800: 12148.221: [GC pause (G1 Humongous Allocation) (young) (initial-mark), 0.0185544 secs]
   [Parallel Time: 16.7 ms, GC Workers: 2]
      [GC Worker Start (ms): Min: 12148221.6, Avg: 12148221.6, Max: 12148221.6, Diff: 0.0]
      [Ext Root Scanning (ms): Min: 5.1, Avg: 5.1, Max: 5.1, Diff: 0.0, Sum: 10.1]
      [Update RS (ms): Min: 0.7, Avg: 0.7, Max: 0.8, Diff: 0.0, Sum: 1.5]
         [Processed Buffers: Min: 13, Avg: 14.5, Max: 16, Diff: 3, Sum: 29]
      [Scan RS (ms): Min: 0.2, Avg: 0.2, Max: 0.2, Diff: 0.0, Sum: 0.4]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
      [Object Copy (ms): Min: 10.1, Avg: 10.1, Max: 10.2, Diff: 0.1, Sum: 20.3]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 2]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [GC Worker Total (ms): Min: 16.2, Avg: 16.2, Max: 16.2, Diff: 0.0, Sum: 32.4]
      [GC Worker End (ms): Min: 12148237.8, Avg: 12148237.8, Max: 12148237.8, Diff: 0.0]
   [Code Root Fixup: 0.0 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.3 ms]
   [Other: 1.5 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 0.4 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.1 ms]
      [Humongous Register: 0.0 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 0.4 ms]
   [Eden: 834.0M(2448.0M)->0.0B(2446.0M) Survivors: 8192.0K->10.0M Heap: 1691.9M(4096.0M)->858.7M(4096.0M)]
 [Times: user=0.03 sys=0.00, real=0.02 secs] 
2020-03-16T14:02:23.152+0800: 12148.239: [GC concurrent-root-region-scan-start]
2020-03-16T14:02:23.166+0800: 12148.253: [GC concurrent-root-region-scan-end, 0.0136292 secs]
2020-03-16T14:02:23.166+0800: 12148.253: [GC concurrent-mark-start]
2020-03-16T14:02:23.484+0800: 12148.572: [GC concurrent-mark-end, 0.3187825 secs]

问题原因
有了以上错误,就可以查原因了啊。什么原因呢?看看以下解释吧。
大型对象(Humongous )是大于 G1 中 region 大小 50%的对象。频繁大型对象分配会导致性能问题。如果 region 里面包含大量的大型对象,则该 region 中最后一个具有巨型对象的区域与区域末端之间的空间将不会使用。如果有多个这样的大型对象,这个未使用的空间可能导致堆碎片化。

具体可查看这个文章
https://www.infoq.com/articles/tuning-tips-G1-GC/

This information is helpful, since you not only can tell how many humongous allocations were made by your application (and whether they were excessive or not), but you can also tell the size of the allocation. Moreover, if you deem that there were excessive humongous allocations, all you have to do is increase the G1 region size to fit the H-objs as regular ones. Therefore, for example, the allocation size is just above 4M bytes. Hence, to make this allocation a regular allocation, we need a region size of 16M. So, the recommendation here was to set that explicitly on the command line: -XX:G1HeapRegionSize=16M

G1 regions can span from 1MB to 32MB in powers of two, and the allocation request is slightly over 4 MB. Hence, an 8MB region size is not quite large enough to avoid the humongous allocations. We need to go to the next power of 2, 16 MB.

解决办法:增大 rigion 区的大小,FullGc 现象消失了

如果觉得我的文章对您有用,请随意打赏。您的支持将鼓励我继续创作!
共收到 2 条回复 时间 点赞

7 个小时才一次 fullgc 都不能接受?
另外你 tps 低没看出来跟 fullgc 有啥关系,按你的说法,fullgc 7 小时才触发一次,那这 7 小时之间没有 fullgc tps 为啥也低。。。

Bob 回复

因为我们目前应用刚上线处于灰发阶段还未放量,目前的 QPM 为个位数。

目前我们解决的问题是 为啥在那么低的请求量的情况下,出现相对频繁的 FullGc 的问题。以防后续大批量的流量出现频繁 Fullgc 的情况。

需要 登录 后方可回复, 如果你还没有账号请点击这里 注册