AI Testing: Don't Worry, You Won't Lose Your Job Yet. A Quick Trial of AppAgent

恒温 · December 24, 2023 · Last reply by guichuan on April 15, 2024 · 12771 views

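Every time a new automation tool comes out, a bunch of people cry that testers are about to lose their jobs. When AppAgent was released a few days ago, the quick-nosed content accounts started reposting it and then pointed the sword straight at test engineers, so once again we were all out of work. My team also has a need for app automation testing, so I got it running on my own machine right away to take a look.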

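Installation

Installation is simple. I'm on Windows 11 64-bit, with the Android environment already set up (really, having adb installed is enough) and Python already installed (I manage my Python environment with conda; you can look up how to set that up yourself). Then download the code and run pip install -r requirements.txt to install the dependencies, and it is ready to use.

If you read English comfortably, go straight to https://github.com/mnotgod96/AppAgent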

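Configuration before running

The only real requirement comes from its use of openAI's gpt-4-vision-preview model: you need a paid openAI account to get the corresponding OPENAI_API_KEY. The settings go into AppAgent's configuration file, config.yaml: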

...
OPENAI_API_BASE: "https://api.openai.com/v1/chat/completions"
OPENAI_API_KEY: "sk-xxxx"  # Set the value to sk-xxx if you host the openai interface for open llm model
OPENAI_API_MODEL: "gpt-4-vision-preview"  # The only OpenAI model by now that accepts visual input
...

These parameters are read in model.py.

The ask_gpt4v method: this is the method that talks to openAI.

def ask_gpt4v(content):
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {configs['OPENAI_API_KEY']}"
    }
    payload = {
        "model": configs["OPENAI_API_MODEL"],
        "messages": [
            {
                "role": "system",
                "content": content
            }
        ],
        "temperature": configs["TEMPERATURE"],
        "max_tokens": configs["MAX_TOKENS"]
    }
    response = requests.post(configs["OPENAI_API_BASE"], headers=headers, json=payload)
    print_with_color("resp: ", response)
    if "error" not in response.json():
        usage = response.json()["usage"]
        prompt_tokens = usage["prompt_tokens"]
        completion_tokens = usage["completion_tokens"]
        print_with_color(f"Request cost is "
                         f"${'{0:.2f}'.format(prompt_tokens / 1000 * 0.01 + completion_tokens / 1000 * 0.03)}",
                         "yellow")
    return response.json()
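
The cost line in ask_gpt4v can be pulled out into a small helper. This is only a minimal sketch, assuming the same per-1,000-token rates that are hard-coded above ($0.01 for prompt tokens, $0.03 for completion tokens); estimate_cost is a name made up for this illustration and is not part of AppAgent, and the token counts are invented.

```python
def estimate_cost(usage, prompt_rate=0.01, completion_rate=0.03):
    """Estimate the request cost in USD from an OpenAI-style `usage` dict.

    The default rates are the per-1,000-token figures hard-coded in
    ask_gpt4v above; `estimate_cost` itself is a hypothetical helper.
    """
    prompt_cost = usage["prompt_tokens"] / 1000 * prompt_rate
    completion_cost = usage["completion_tokens"] / 1000 * completion_rate
    return prompt_cost + completion_cost


# Token counts invented for the example.
usage = {"prompt_tokens": 200, "completion_tokens": 50}
print(f"Request cost is ${estimate_cost(usage):.4f}")
```

Note that with the original two-decimal format, any request that costs less than half a cent is printed as $0.00.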

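The data that comes back from openAI is parsed in parse_explore_rsp. I feel this method is the most important one: it relies on openAI's Thought/Action/Action Input/Observation mechanism and parses the structured response. Many agents of this kind are built on this mechanism. openai does this part well and returns output in this pattern every time, so for now openai's is also the only plugin ecosystem that has really taken off (per 挺神). There is an interesting side note here: I thought openAI was too expensive, since one AppAgent call costs about $0.02, and wanted to switch to Alibaba Cloud's Tongyi Qianwen (通义千问). I went through its documentation and it did not seem to have a Thought/Action/Action Input/Observation mechanism. I'm no expert on this, so anyone who knows better is welcome to correct me.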
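
To make the idea of parsing a structured reply concrete, here is a rough sketch. It is not AppAgent's parse_explore_rsp; the function names are hypothetical, and the field labels (Observation, Thought, Action, Summary) are taken from the run log later in this post. It assumes each label starts a line and is followed by a colon.

```python
import re

FIELDS = ("Observation", "Thought", "Action", "Summary")


def parse_structured_reply(text):
    """Split a reply into its labeled sections.

    Returns a dict keyed by field name, or None when any field is missing.
    """
    pattern = r"^(%s):[ \t]*" % "|".join(FIELDS)
    parts = re.split(pattern, text, flags=re.MULTILINE)
    # With one capture group, re.split yields [prefix, name, body, name, body, ...]
    sections = {name: body.strip() for name, body in zip(parts[1::2], parts[2::2])}
    if any(field not in sections for field in FIELDS):
        return None
    return sections


def parse_action(action):
    """Turn 'tap(25)' into ('tap', [25]) and 'FINISH' into ('FINISH', [])."""
    match = re.fullmatch(r"(\w+)(?:\((.*)\))?", action.strip())
    if match is None:
        return None
    name, raw_args = match.group(1), match.group(2)
    args = []
    if raw_args:
        # Naive comma split: good enough for numeric tags, not for quoted text
        # that itself contains commas.
        for arg in raw_args.split(","):
            arg = arg.strip().strip('"')
            args.append(int(arg) if arg.isdigit() else arg)
    return name, args


reply = """Observation:

The search page is open.

Thought:

I should tap the search box.

Action:

tap(25)

Summary:

I tapped the search box."""

sections = parse_structured_reply(reply)
print(parse_action(sections["Action"]))
```

Returning None for a reply that lacks a field gives the caller a clear signal that the model did not follow the expected pattern, which is the situation a different model might produce.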

So we come back to the same point: you still have to pay openAI, otherwise you have to heavily modify AppAgent's code.

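Running

Running it is simple: following the official documentation, learn first and then run. I use CSDN as the example here. First open CSDN on the phone, then run python .\learn.py

Here I chose human demonstration; I didn't have time to run autonomous exploration. Type 2 in the terminal and press Enter to go to the next step: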

What is the name of the target app?

CSDN
Warning! No module named 'sounddevice'
Warning! No module named 'matplotlib'
Warning! No module named 'keras'
List of devices attached:
['42954ffb']

Device selected: 42954ffb

Screen resolution of 42954ffb: 1440x3216

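Here the device information is fetched through adb commands. AppAgent wraps the adb commands itself, in the file and_controller.py. A tap, for example, is just adb shell input tap with coordinates, which is fairly primitive (at first I assumed it would wrap something like Appium). Once this information is printed, you are immediately asked to describe the actions you are about to demonstrate. I entered “search for testerhome” and pressed Enter, and a window popped up.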
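
For readers who have not driven a phone through raw adb before, a tap can be sketched like this. It is an illustration and not the code in and_controller.py; the function names are hypothetical and the coordinates are invented. Only the adb -s serial option and the shell input tap x y command are standard adb usage.

```python
import subprocess


def build_tap_command(x, y, device=None):
    """Build the argv for `adb shell input tap x y`, optionally for one device."""
    cmd = ["adb"]
    if device:
        cmd += ["-s", device]  # -s selects a device by its serial number
    cmd += ["shell", "input", "tap", str(int(x)), str(int(y))]
    return cmd


def tap(x, y, device=None, timeout=10):
    """Run the tap on a connected device; requires adb on PATH."""
    result = subprocess.run(
        build_tap_command(x, y, device),
        capture_output=True, text=True, timeout=timeout,
    )
    return result.returncode == 0


# Only builds and prints the command, so it runs without a phone attached.
print(" ".join(build_tap_command(720, 300, device="42954ffb")))
```

Keeping the command builder separate from the function that executes it makes the wrapper easy to check without a device.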

Please state the goal of your following demo actions clearly, e.g. send a message to John

search for testerhome
(Press Enter and a window pops up. As the English text says, red tags mark clickable elements and blue tags mark scrollable ones; see the screenshot below.) All interactive elements on the screen are labeled with red and blue numeric tags. Elements labeled with red tags are clickable elements; elements labeled with blue tags are scrollable elements.

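After focusing the image window with the mouse and pressing Enter, the image disappears, and the prompt then lets us act on the clickable elements. The search button here is tagged 25, for example, so I need to tap element 25.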
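
The post does not show how the numeric tags are produced, so the following is only a sketch of one way to do it. It assumes the UI hierarchy is available as XML in the format written by adb shell uiautomator dump, where each node carries clickable, scrollable, and bounds attributes. The function name and the sample XML are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

BOUNDS_RE = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")


def label_elements(xml_text):
    """Return a list of tagged interactive elements from a UI hierarchy dump."""
    elements = []
    for node in ET.fromstring(xml_text).iter("node"):
        clickable = node.get("clickable") == "true"
        scrollable = node.get("scrollable") == "true"
        match = BOUNDS_RE.fullmatch(node.get("bounds", ""))
        if not (clickable or scrollable) or match is None:
            continue
        x1, y1, x2, y2 = map(int, match.groups())
        elements.append({
            "tag": len(elements) + 1,  # numeric tag shown to the user
            "kind": "clickable" if clickable else "scrollable",
            "center": ((x1 + x2) // 2, (y1 + y2) // 2),  # point to tap
            "resource_id": node.get("resource-id", ""),
        })
    return elements


# Invented sample hierarchy: one scrollable list, one clickable button, one plain label.
SAMPLE = """<hierarchy>
  <node clickable="false" scrollable="true" bounds="[0,200][1440,3000]" resource-id="list">
    <node clickable="true" scrollable="false" bounds="[1200,100][1400,180]" resource-id="search"/>
    <node clickable="false" scrollable="false" bounds="[0,0][10,10]" resource-id="label"/>
  </node>
</hierarchy>"""

for element in label_elements(SAMPLE):
    print(element["tag"], element["kind"], element["center"], element["resource_id"])
```

Choosing tag 25 in the prompt would then map to the center point of the 25th element, which is what ends up as the coordinates of the tap.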

Choose one of the following actions you want to perform on the current screen:
tap, text, long press, swipe, stop

tap
Which element do you want to tap? Choose a numeric tag from 1 to 83:

25

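At this point the tap succeeds, and a screenshot of the screen after tapping the search button is shown again.

The following steps work the same way, 5 steps in total.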

Which element do you want to tap? Choose a numeric tag from 1 to 14:

3
Choose one of the following actions you want to perform on the current screen:
tap, text, long press, swipe, stop

text
Which element do you want to input the text string? Choose a numeric tag from 1 to 14:

3
Enter your input text below:

testerhome
Choose one of the following actions you want to perform on the current screen:
tap, text, long press, swipe, stop

tap
Which element do you want to tap? Choose a numeric tag from 1 to 15:

4
Choose one of the following actions you want to perform on the current screen:
tap, text, long press, swipe, stop

stop
Demonstration phase completed. 5 steps were recorded.

Then ChatGPT gets to work:

Warning! No module named 'sounddevice'
Warning! No module named 'matplotlib'
Warning! No module named 'keras'
Starting to generate documentations for the app CSDN based on the demo demo_CSDN_2023-12-24_20-46-47

Waiting for GPT-4V to generate documentation for the element net.csdn.csdnplus.id_ll_order_tag_net.csdn.csdnplus.id_iv_home_bar_search_1

resp:

Request cost is $0.00

Documentation generated and saved to ./apps\CSDN\demo_docs\net.csdn.csdnplus.id_ll_order_tag_net.csdn.csdnplus.id_iv_home_bar_search_1.txt

Waiting for GPT-4V to generate documentation for the element android.widget.LinearLayout_1008_144_net.csdn.csdnplus.id_et_search_content_1

resp:

Request cost is $0.00

Documentation generated and saved to ./apps\CSDN\demo_docs\android.widget.LinearLayout_1008_144_net.csdn.csdnplus.id_et_search_content_1.txt

Waiting for GPT-4V to generate documentation for the element android.widget.LinearLayout_1008_144_net.csdn.csdnplus.id_et_search_content_1

resp:

Request cost is $0.00

Documentation generated and saved to ./apps\CSDN\demo_docs\android.widget.LinearLayout_1008_144_net.csdn.csdnplus.id_et_search_content_1.txt

Waiting for GPT-4V to generate documentation for the element android.widget.LinearLayout_1440_176_net.csdn.csdnplus.id_tv_search_search_2

resp:

Request cost is $0.00

Documentation generated and saved to ./apps\CSDN\demo_docs\android.widget.LinearLayout_1440_176_net.csdn.csdnplus.id_tv_search_search_2.txt

Documentation generation phase completed. 4 docs generated.

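The final output looks like this:

Here task_desc is the search for testerhome we entered earlier, record is the merged list of the commands from each step, and then there are the labeled screenshots and so on.

At this point the learning phase is done, and the next step is to run it: python run.py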

Warning! No module named 'sounddevice'
Warning! No module named 'matplotlib'
Warning! No module named 'keras'
Welcome to the deployment phase of AppAgent!
Before giving me the task, you should first tell me the name of the app you want me to operate and what documentation base you want me to use. I will try my best to complete the task without your intervention. First, please enter the main interface of the app on your phone and provide the following information.

What is the name of the target app?

CSDN
Warning! No module named 'sounddevice'
Warning! No module named 'matplotlib'
Warning! No module named 'keras'
Documentations generated from human demonstration were found for the app CSDN. The doc base is selected automatically.

List of devices attached:
['42954ffb']

Device selected: 42954ffb

Screen resolution of 42954ffb: 1440x3216

Please enter the description of the task you want me to complete in a few sentences:

search for testerhome
Round 1

Documentations retrieved for the current interface:
Documentation of UI element labeled with the numeric tag '25':
This UI element is clickable. To assist you effectively, I'll need to see the screenshots of the mobile app you're referring to. Please provide the images showing the UI element before and after it has been tapped.



Thinking about what to do in the next step...

resp:

Request cost is $0.01

Observation:

The image shows a smartphone screen with a Google search bar at the top. There are various app icons on the screen, and the UI element labeled '25' appears to be the Google search bar.

Thought:

To complete the given task, I should tap on the Google search bar to bring up the keyboard and enter the search term "testerhome."

Action:

tap(25)

Summary:

I have tapped on the Google search bar to initiate a search for "testerhome."

Round 2

Documentations retrieved for the current interface:
Documentation of UI element labeled with the numeric tag '3':
This UI element is clickable. I'm sorry, but I can't provide assistance without the images being displayed. If you can provide the screenshots of the mobile app before and after tapping the UI element, I would be able to help you describe its functionality.

This UI element can receive text input. The text input is used for the following purposes: I'm sorry, but I can't provide assistance without the images being displayed. If you can provide the screenshots you mentioned, I'd be happy to help describe the functionality of the UI element labeled with the number 3.

Documentation of UI element labeled with the numeric tag '4':
This UI element is clickable. I'm sorry, but I can't provide assistance without the images being displayed. If you can provide the screenshots you mentioned, I'd be happy to help you with the description of the UI element.



Thinking about what to do in the next step...

resp:

Request cost is $0.01

Observation:

I'm sorry, but I cannot provide observations without an image being displayed. If you can provide the screenshot you mentioned, I'd be happy to assist you further.

Thought:

Since I cannot see the image, I am unable to determine the next step to complete the given task.

Action:

FINISH

Summary:

I have tapped on the Google search bar to initiate a search for "testerhome," but without the image, I cannot proceed further with the task.

Task completed successfully

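This process simply takes the information recorded during learn, assembles it into a prompt template, and calls ChatGPT. The code is in the screenshot below; the image_url in it is the labeled image. Each step's action is submitted to GPT together with its corresponding image.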
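
Since the image_url part is where the labeled screenshot enters the request, a sketch of building it may help. This assumes the OpenAI chat completions format for vision input, where content is a list of text and image_url parts and the image is sent as a base64 data URL. The function names are hypothetical and this is not AppAgent's code.

```python
import base64


def build_vision_content(prompt, image_bytes, mime="image/jpeg"):
    """Build a text-plus-image `content` list for a chat completions request."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return [
        {"type": "text", "text": prompt},
        {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
    ]


def has_image(content):
    """Check that at least one image part with a non-empty URL is present."""
    return any(
        part.get("type") == "image_url" and bool(part.get("image_url", {}).get("url"))
        for part in content
    )


# Placeholder bytes stand in for a real labeled screenshot.
content = build_vision_content("Describe the element labeled 25.", b"\xff\xd8placeholder")
messages = [{"role": "user", "content": content}]
print(has_image(content))
```

A check like has_image before sending is one cheap way to rule out a request that went out without its screenshot, which is what replies such as "I cannot provide observations without an image" would suggest.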

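In my run.py session above, the first step successfully passed the image and the tap action to ChatGPT, and GPT answered tap(25). But reading further down, you can see GPT starts talking nonsense, so unfortunately the actions I demonstrated during learn were not replayed during run.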

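Summary

That is pretty much one full pass through AppAgent. As I told people in my group chat, the demo looks great and reality falls short; ChatGPT clearly doesn't know CSDN well enough. As I see it, AppAgent at this stage is no more than a client-side record-and-replay tool, and a very crude one. The idea is very good, though, and my own team is preparing to adapt it to see whether it can really be put to use.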

17 replies

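When will China have a full replacement for ChatGPT...

Replying to Ouroboros:

Personally, I think China will only have a full replacement for chatgpt once it has full replacements for google and github.

I tried it, and it is nowhere near as magical and smooth as the official demo. It is practically unusable and heavily oversold. The principle is really just assembling prompts; the idea is worth borrowing.

Replying to PixelMatrixer:

Yes. The idea is still good. The model also needs to know the app under test fairly well. It still comes down to model tuning.

Never mind the testing team; even the developers forwarded this thing to us and asked us to look into it 😂

It feels more like a smarter version of monkey testing: set up a dedicated project and give the execution side of the cloud testing platform a “brain”. But there is no “brain” yet for judging bugs, deduplicating and filing them, tracking bugs across release lines, or tracking decisions in the process, so it is still far from hands-off operation.

On WeChat Moments I saw words like stunning, magical, and awe-inspiring used to describe this tool, so I downloaded it and played with it. My impression is about the same as 恒温's: it is still far from production use. Each project's prompts have to be modified by hand and adapted gradually.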

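hi, I'm one of the authors of appagent, and I'm very happy to see you trying appagent out.
I took a quick look at your errors, though, and the problem seems to be with the gpt4-v api: it is as if the agent never saw the screen content at all. A simple function like search is generally very easy for it, and there are examples online of other people using it. You are welcome to open an issue on our official repo, and we will do our best to help you.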

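Replying to icoz69:

Welcome to the appagent folks. I also found it odd when I ran it. In fact, the screenshots and labels were already there from the learn phase, but they were not recognized. I was short on time, so I didn't dig deeper.

Replying to icoz69:

😂 😭 I can't afford 4. Can gpt-3.5 be used as a substitute? 😭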

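We built a solution last year, before gpt4v existed. We gave gpt the page tree directly and let it traverse the app according to a goal. In testing it generalized fairly well: login, paging, tapping, and swiping basically all worked, and its behavior matched human reasoning reasonably well.
There were quite a few drawbacks, all related to gpt:
1. Too expensive, too expensive, too expensive. 3.5 performs poorly, and 4 really is too expensive.
2. With the 8k token limit at the time, even though I capped the loop of historical actions, once the budget ran out it still got dumber (because there was no history left to refer to). This heavily limited the number and depth of pages it could traverse.

As things stand, the results are still quite good. Traditional app traversal is either random or directed only to a limited degree, for example by writing your own scripts for key pages, such as inserting a login script on the login page so the traversal can reach the pages behind login. Traversal with gpt needs none of that: it recognizes most user scenarios and decides on its own how to take the next step.

But gpt's own limitations really do hurt the practical results. It may take a few more years before it is fully usable.

I tried the demo yesterday and it worked. I ran it against google play and only verified the app installation flow; I will try some longer flows later.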

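恒温 #13 · December 30, 2023 · Author
Replying to jwentest:

Try it with a Chinese app.

For the sake of trying it, I bit the bullet and upgraded to gpt-4 😂

With native apps you can dump the tree directly, then convert the tree into tags and action positions based on features. A tag might be input (a text field), for example, and the action position would be the field's x2 and y1 offset by 2-3px, and so on. Combining the tree with the screen to understand it, gpt hardly seems necessary...

恒温 #16 · March 19, 2024 · Author

The latest version supports Tongyi Qianwen (通义千问). Thumbs up~

It would be great if it supported the Web too.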
