Google 官方提供了 2 种方案，第一种可以利用 DDMS 或 ADM 手动抓取，第二张就是代码插桩行啦，Google 将第二种更多的是去精确的计算某个特定逻辑流程或函数耗时。
插桩 API：

public static void startMethodTracing () Added in API level 1
public static void startMethodTracing (String traceName, int bufferSize, int flags) Added in API level 1
public static void startMethodTracing (String traceName, int bufferSize) Added in API level 1
public static void startMethodTracingSampling (String traceName, int bufferSize, int intervalUs) Added in API level 21

在实践中发现startMethodTracing()会使得被插 app 变成卡的 1B，经查阅官方文档大致总结如下：
大致意思是如果使用startMethodTracing() 抓取 trace 会对 app 有一定性能影响，不能将得到的 trace 分析耗时当作实际结果，建议当做参考值，在今后的迭代中可以进行周期性的比对，从而分析 code 的较之前的耗时为增高或降低，在 Android4.4 及以后加入了startMethodTracingSampling()，其实是 API21 之后加入，即 Andoid5.0，不知是否是 Google 文档的错误，使用这个方法你可以定义采样间隔，单位是微秒，经实践，这个方法确实对性能的影响几乎感知不到，或者说几乎没有影响，但这也仍然建议标为参考值.

原文如下：
When your application calls startMethodTracing(), the system creates a file called .trace. This contains the binary method trace data and a mapping table with thread and method names.

The system then begins buffering the generated trace data, until your application calls stopMethodTracing(), at which time it writes the buffered data to the output file. If the system reaches the maximum buffer size before you callstopMethodTracing(), the system stops tracing and sends a notification to the console.

Interpreted code runs more slowly when profiling is enabled. Don't try to generate absolute timings from the profiler results (such as, "function X takes 2.5 seconds to run"). The times are only useful in relation to other profile output, so you can see if changes have made the code faster or slower relative to a previous profiling run.

In Android 4.4 and later, you can use sample-based profiling to profile with less runtime performance impact. To enable sample profiling, call startMethodTracingSampling() with a specified sampling interval. The system will then gather samples periodically until tracing is stopped via stopMethodTracing().

实践部分：
我们希望可以拿到所有方法的耗时，这一点实现很简单，但我们同时也希望对场景下所对应触发的方法进行做对应关系，即，一个场景 —> 可能涉及多个 activity ，一个 Activity —> 可能触发多个方法，那么我们就将 Activity 作为方法分类的最小单位，因此问题就变为，我们要抓取以 activity 为单位的方法集合耗时，即以 activity 为单位去记录 trace 文件。

考虑对后续插撞的可维护及扩展性考虑，我起了一个项目去开发 sdk jar，供开发今后调用，
其中封装了调用 TraceView 的启动结束方法，在项目中所有的基 Activity(例如：各种 BaseActivity) 的onCreate()和onPause()中分别加入启动和结束的代码桩，trace 文件以 activity 名命名。

插桩代码:

@Override
protected void onCreate(Bundle savedInstanceState) {
Log.d("5858", "startBaseFragmentActivity");
TraceViewController.startMethodTracingSampling(getApplicationContext(), 8000000, 10);
super.onCreate(savedInstanceState);
mBaseActivityWrapper = new BaseActivityWrapper(this);
mBaseActivityWrapper.onCreate(savedInstanceState);
}
@Override
protected void onPause() {
Log.d("5858", "stopBaseFragmentActivity");
TraceViewController.stopMethodTracing();
super.onPause();
isPause = true;
mBaseActivityWrapper.onPause();
}

为什么选择这两个位置插撞？:
例如对场景：ActivityA（简称 a）-> ActivityB（简称 b）的跳转分析如下：
a 展现：
a.onCreate() -> a.onStart() -> a.onResume()
a 跳转至 b 展现：
a.onPause() -> b.onCreate() -> b.onStart() -> b.onResume() -> a.onStop()
从上述 a activity 的展现过程可看出以onCreate()开始以onResume()结束，但是考虑到异步函数可能在onResume()前还未完成，不一定会统计的到，所以我们会将结束点往后移至onPause()，但这在极端条件下也会有可能统计不到异步函数的操作，至少我们是可控的，我们知道onPause()是在 a 跳转后立即执行，那么我们完全可以在 a 多等一会儿或者通过某种校验方式去完成等待，这样就会避免异步函数统计的丢失.
注意：当 a 跳转到 b 后，a.trace 完成收集，b.trace 还在记录中，如果你还需要记录 b.trace 文件，可以 back 到 a，或者点击 Home 键，总之离开当前页触发 onPaue() 即可
这样我们最终得到的对应关系: 业务场景 ActivityA 跳转至 ActivityB => "ActivityA -> a.trace -> a.methods" && "ActivityB -> b.trace -> b.methods"

这样我们就得到了每个 activity 的 trace 文件

对最终 trace 文件分析重点把握以下 3 点：
1.将 0（toplevel）下的 Incl Cpu Time 可当作 activity 加载整体耗时 (a.onCreate() -> a.onStart() -> a.onResume())
2.过滤对应 app 包名下的 method，将 Cpu Time/Call 当作对应的方法平均耗时参考
3.用于抓取 trace 的应用包最好去除 Proguard，即混淆，这样省去了解混淆的过程！

分析方法：
第一种可以通过命令行traceview xxx.trace人肉筛选记录重点数据
第二张自动分析，在这提供 2 中思路：
a. 利用dmtracedump xxx.trace，记录输出的结果数据流，自行写程序分析。
b. 改造traceview工具直接调取 api 进行分析，较第一种工作量较小。实现后可通过命令java -jar traceview.jar xxx.trace pkgname过滤输出所需内容。
Traceview 部分修改后代码，非完整版：，在启动添加了过来包名的命令行参数pkgFilterName, null 为抓取全部，指定后则过滤对于包的函数

        reader.getTraceUnits().setTimeScale(TraceUnits.TimeScale.MilliSeconds);
MethodData[] mds = reader.getMethods();
for(MethodData m : mds){
    if(pkgFilterName == null){
        System.out.println("=========");
        System.out.println("Index:" + m.getRank());
        System.out.println("全名:" + m.getName());
        System.out.println("类名:" + m.getClassName());
        System.out.println("函数名:" + m.getMethodName());
        System.out.println("函数参数:" + m.getSignature());
        System.out.println("总参考耗时:" + m.getElapsedInclusiveCpuTime());
        System.out.println("总调度次数:" + m.getTotalCalls());
    }else{
        if (m.getClassName().contains(pkgFilterName)){
            System.out.println("=========");
            System.out.println("Index:" + m.getRank());
            System.out.println("全名:" + m.getName());
            System.out.println("类名:" + m.getClassName());
            System.out.println("函数名:" + m.getMethodName());
            System.out.println("函数参数:" + m.getSignature());
            System.out.println("总参考耗时:" + m.getElapsedInclusiveCpuTime());
            System.out.println("总调度次数:" + m.getTotalCalls());
            BigDecimal a = new BigDecimal(m.getElapsedInclusiveCpuTime());
            BigDecimal b = new BigDecimal(m.getTotalCalls());
            System.out.println("平均参考耗时:" + a.divide(b,2,BigDecimal.ROUND_HALF_UP).doubleValue());
        }
    }
}

改造后部分结果如下（切记为参考值）：

=========
Index:1876
全名:com.wuba.update.HomeHtmlConnection.deleteHistoryTemplate ()Z
类名:com.wuba.update.HomeHtmlConnection
函数名:deleteHistoryTemplate
函数参数:()Z
总参考耗时:153
总调度次数:1
平均参考耗时:153.0
=========
Index:1877
全名:com.wuba.utils.DeviceUtils.<clinit> ()V
类名:com.wuba.utils.DeviceUtils
函数名:<clinit>
函数参数:()V
总参考耗时:153
总调度次数:1
平均参考耗时:153.0
=========
Index:1900
全名:com.wuba.service.UserAccountService.<clinit> ()V
类名:com.wuba.service.UserAccountService
函数名:<clinit>
函数参数:()V
总参考耗时:152
总调度次数:1
平均参考耗时:152.0

以上方案并非最终实现，我们的最终目标为：抓取对应数据，存放到数据库，这样在今后的迭代中可进行周期性比对，从而体现出函数优化情况

↙↙↙阅读原文可查看相关链接，并与作者交流