A Conversation with the Founder of TalktoApps: Voice AI Has 5x'd My Productivity, and Voice Input Is the Future of Human-Computer Interaction

RTE Developer Community · February 7, 2025

What real-world problems are builders of Voice AI Agent products running into, and how are they thinking about and solving them?

Today's recommended piece is edited from a newly recorded episode of Vela's podcast, a conversation with Ebaad, founder of the voice app TalktoApps. Ebaad shares many of the challenges he has run into while building a voice-first product and how he thinks about them: How should voice interfaces and graphical interfaces be combined? What kind of human-computer interaction fits best, when and where? How should the technical architecture behind the product be designed and evolved? Listen in on their conversation; we hope it sparks some ideas for you.

If you already know me, you may know that I'm an AI product and voice AI enthusiast (and working hard on becoming a creator as well). As an audio and music lover, I find every kind of sound beautiful: not just music, but also the sounds of valleys and the conversations of a city.

I believe voice AI lets products interact with users in a more natural way, as natural as a conversation between two people. It can also be applied in many different scenarios and bring value to the users who truly care about this. So I decided to start this podcast series, "Voice Talk," to connect with more voice AI builders and to open-source our early voice AI experience.

In each episode I invite a founder in Silicon Valley who is building a cool voice product, and we talk about their personal story with voice, their journey building voice products, and the future of voice products.

For the first episode, I invited a good friend of mine at Founders, Inc. who is also a voice product enthusiast, Ebaad, who is building a voice app called TalktoApps. Beyond voice technology and the journey of building a voice product, we also talked about language, sound, and the future of voice-based human-computer interaction.

It is a conversation between two founders who are both building voice products and very much on the same wavelength.

The conversation was conducted in English; a lightly edited transcript follows.

The full video is at the end of this post.

1. My Story with Voice

Vela: Hello, everyone. I'm Vela.

Ebaad: I'm Ebaad.

Vela: Welcome, Ebaad.

Ebaad: Thank you for having me.

Vela: You are the first person I met at Founders, Inc. who is also doing something with voice.

Ebaad: Likewise.

Vela: Amazing. And I think you are also an audio person, in that you'd often prefer taking in information through audio.

Ebaad: Yeah, I prefer that in some cases.

Vela: Cool. Let's talk more about that. Can you tell us about your story with voice, and what role voice has played in your life?

Ebaad: Oh, voice has been really, really amazing. It has 5x'd my productivity, maybe even 10x'd it. In the past year or two, I've listened to 5 million words. I use this app called Speechify.

And I recently came across this other app called WhisperFlow, with which I've dictated a hundred thousand words in two months. I used to use the dictation function on Apple, but it was not that good. WhisperFlow makes it very easy.

So yeah, it's a big part of how I run my workflows. For coding, I just talk to Cursor as well.

I also basically talk to an LLM. I just blab on for 30 or 40 seconds about different things and give it as much context as I can. It does its thing, and then I use Speechify to listen to the answer as I'm walking around or doing something.

I like to walk around, and you can't read while walking around, so I'll just run the LLM, walk around the room, and hear it through my headphones.

Vela: So you like talking to apps?

Ebaad: I like talking to apps, yes.

Vela: So here comes your product.

Voice has 5x'd my productivity, maybe even 10x'd it. I also basically talk to an LLM: I ramble for 30 or 40 seconds about different things and give it as much context as I can. It does its thing, and then I use Speechify to listen to the answer while I'm walking around or doing something else.

2. Introducing the Product: TalktoApps

Ebaad: That's a nice transition. I'm building this thing called talktoapps.com, and the name pretty much says what it is.

Basically, you can interact with your favorite apps using natural language. It could be text or it could be voice, so both.

Instead of clicking a hundred times, you can just say something quite abstract, like "remove all my meetings on Wednesday," instead of going in and clicking through everything, and it will understand that. Or "make a meeting and assign it to Vela," or "invite Vela," or assign it to some person, instead of going and finding them with clicks.

So yeah, currently it has Todoist, but I'm going to be integrating Google Calendar today. And then Google Sheets, which I was testing out yesterday.
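To make that concrete, here is a minimal sketch of how a command like "remove all my meetings on Wednesday" could become a structured action: the LLM is given a tool (function) schema and returns a filled-in JSON object, which the backend maps onto concrete API calls. The tool name, parameters, and `calendar_api` client below are hypothetical illustrations, not TalktoApps' actual implementation.

```python
import json

# Hypothetical tool schema that an LLM with function calling is allowed to use.
DELETE_MEETINGS_TOOL = {
    "name": "delete_meetings",
    "description": "Delete all meetings on a given day for the current user.",
    "parameters": {
        "type": "object",
        "properties": {
            "day": {"type": "string", "description": "Day of week or ISO date, e.g. 'Wednesday'"},
        },
        "required": ["day"],
    },
}

def handle_tool_call(tool_call_json: str, calendar_api) -> None:
    """Map the LLM's structured output onto concrete API calls."""
    call = json.loads(tool_call_json)
    if call["name"] == "delete_meetings":
        day = call["arguments"]["day"]
        for meeting in calendar_api.list_meetings(day=day):
            calendar_api.delete_meeting(meeting["id"])

# For "remove all my meetings on Wednesday" the LLM might return:
# {"name": "delete_meetings", "arguments": {"day": "Wednesday"}}
```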

It looks pretty exciting: you can do things just from an extension, and talk to the extension. I'm considering WhatsApp as well; it could be cross-platform through WhatsApp, so you can basically just speak to a WhatsApp bot and it will do those things. So you don't have to worry about whether it's on your computer.

Yeah, I think the barrier to entry, in terms of getting a task or an idea out of your head and onto the computer or into your app, is mostly a design problem. And I think that barrier should be very low. So in the future, I would like to have it as a bracelet, where you just tap it and it does the thing. But that's a long way from now.

3. Voice AI Stories from Users

Vela: The basic point is, you want to make interaction with machines as natural as interaction between humans.

Ebaad: Possibly, yes. Cool.

Vela: Can you share with us your favorite user story so far?

Ebaad: I do have a favorite user story. I think it's this guy, Hadza. He uses WhisperFlow as well, and he built a nuclear reactor by talking to Claude.

So that's what he was doing. He was playing with these tools, and he would just be talking to Claude as he went, because you can do Space and Fn and then keep talking to Whisper as you're doing things.

So yeah, I think that's pretty good. You can just keep talking and give it the context of what you're seeing. Maybe you have something in front of you, you talk to it, take the output back, and use it for what you want to do.

4. Technical Challenges of Building a Voice AI Product

Vela: Let's dig into the technical side. What challenges have you faced when building TalktoApps?

Ebaad: That's an interesting question. I think there are two main problems, and I'll go over them chronologically.

The first one is, as we talked about, the design of how you communicate with it. It's going to be natural language, but the first question is: can you type as well? Like, "Cancel all my meetings on Wednesday." Sometimes that's necessary, because you might be outside, and from what I've heard, people don't want to talk that much; you're more careful about talking out loud. So typing could also be an interesting option. But it's still voice first.

The second is how you interact with the app: how does it give you feedback? There's a principle in UI design that the interface has to give feedback. Normally Alexa and Siri just say things, which I think is quite limited sometimes, because with voice you can only hear what is being said right now, but with graphics you can see the whole thing.

Vela: Yeah, like one dimension versus multiple dimensions.

Ebaad: Yeah, multiple dimensions. There's that quote: a picture is worth a thousand words.

If you can see things, how things are happening, that's very interesting as well, and then you want continuous feedback. Basically, if you're doing multi-step workflows, an example would be: "Okay, could you take this Twitter link, research this person, and put them into my investor doc in Sheets?"

So you basically need to see that text converted into something visual: okay, it's pulling from Twitter, there's an icon, and then it's feeding that into an LLM, with a spinner, maybe. And then you can also drag and drop things if it screws up, and it changes the text based on what you do. There's this tool called Zapier where you can tie multiple functions together. So it could be like that: it's text, but you can also drag and drop if something messes up, or if it selects the wrong function.

And I think the last part of the design question is what happens when you mess up. Maybe you're trying to add an investor to a doc that's actually meant for your friends, and there's no investor column. How does it give you that feedback? Does it tell you, "you gave the wrong command," or should it just give you the right command to run?

That comes back to the technical side: you need the context of their Excel sheet, or their Google Doc, or their Notion page, of what the structure is. So I'm working on that on the technical side, creating separate functions for each person's Google Sheets, so it can know exactly what a natural language command is missing and give feedback. Maybe it's like a sentence with blanks in it: "Oh, you added the investor, but there's also a Twitter column. Do you want to add their Twitter too? I can research that for you." Something like that.

So it could be interesting. That's the interaction and product side. And I'm happy to talk more about the technical side: the infrastructure, the LLM, and how state management will work.
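As a rough illustration of that "know what the command is missing" idea, here is a minimal sketch, assuming the app has already fetched the column names of the user's sheet: the fields in the parsed command are compared against the sheet's columns, and any column the user left blank becomes a follow-up suggestion. The function and field names are hypothetical.

```python
def suggest_missing_fields(parsed_command: dict, sheet_columns: list[str]) -> list[str]:
    """Compare the fields the user provided against the sheet's columns and
    return follow-up suggestions for anything that was left blank."""
    provided = set(parsed_command.get("fields", {}))
    missing = [col for col in sheet_columns if col not in provided]
    return [
        f"You added {', '.join(sorted(provided))}, but there's also a '{col}' column. "
        f"Do you want to fill that in too?"
        for col in missing
    ]

# "Add Vela as an investor" against a sheet with Name / Twitter / Check Size columns:
command = {"action": "append_row", "fields": {"Name": "Vela"}}
print(suggest_missing_fields(command, ["Name", "Twitter", "Check Size"]))
```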

Vela: Yeah, sure. But before that, I want to dig into the interaction, the design side. How would you like to handle this problem? How do you trade off or balance the interaction design between voice interaction and graphical interaction?

Ebaad: It depends on what problem you're solving. If you have a screen in front of you, I think graphics are better, because they have more information bandwidth. And I think you can do a lot of interesting things, like highlighting what you're showing at that moment in a specific way.

But in terms of interaction, graphics are very, very good because you can just see your whole calendar at once. You can't do that if voice is telling you, "this is your calendar, you have a meeting at 3, you have a meeting at 5"; it takes maybe 10 or 20 seconds, and then you forget as well. So I think voice is good for one-way interaction, and then the feedback should be graphical. Does that make sense?

Vela: Interesting.

Ebaad: Yeah, so that's where my focus is. I think interaction right now, if you're talking to Siri or something like that, is very computationally heavy, and then you're waiting for the answer while it's thinking. And then if you cut it off... you remember with Agenta? Cutting it off and, yeah. So I think for talking and giving instructions, voice is hands down the best, but for giving feedback, I think graphical things are interesting. But there are trade-offs, of course, depending on where you are.

Vela: Yeah, and that reminds me, that's probably your design principle for TalktoApps: the input is just voice, but the output is graphical, like a command.

Ebaad: Yes, and there are benefits to that. You can talk to it on a phone as well. It could be interesting. But I think I'll figure the design out more as I go.

Vela: Cool. Let's move on to the technical challenges.

Ebaad: I think the technical challenge is that currently it's a very, very simple infrastructure, but it has to get really complicated as the complexity of the tasks increases.

Currently it's basically this: if you know how these work, agents are not that crazy. You basically give a command and they convert it into objects, and those objects are normally JSON objects that you can run functions with. So for a to-do task, you say "create a task at three" and it creates a JSON object with these parameters, maybe content and due date, and that's it, right? And then there's an API, and you run that API call on the backend.

But complexity arises. It's easy with creating tasks, because there's no previous context you're dealing with, but updating is a little more challenging, because to update, you first need to find what you're updating. So if you say, "update 'get groceries' from three to five," it first has to search for the groceries task. And these APIs require you to have a primary key, a task ID, so you have to map the task ID to the update and then run it.
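Here is a minimal sketch of that create-versus-update asymmetry, assuming a hypothetical Todoist-like client with `create_task`, `search_tasks`, and `update_task` methods (placeholder names, not the real Todoist SDK): creating needs no prior context, while updating first searches for the task to recover its ID.

```python
def create_task(api, content: str, due: str) -> dict:
    # "Create a task: get groceries at three" -> one call, no prior context needed.
    return api.create_task({"content": content, "due": due})

def update_task_time(api, query: str, new_due: str) -> dict:
    # "Update 'get groceries' from three to five" -> search first, then update by ID.
    matches = api.search_tasks(query)   # e.g. tasks whose content mentions "groceries"
    if not matches:
        raise ValueError(f"No task matching {query!r} to update")
    task = matches[0]                   # naive choice: take the best match
    return api.update_task(task["id"], {"due": new_due})
```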

And then there's also the challenge of storing state. If you say "update it" and it updates it, and then you change your mind and want to undo it, you have to store what it was before.

Vela: A lot of function calls.

Ebaad: A lot of function calls, and a lot of state management as well. You have to be able to go back. Maybe you're changing things on the fly: "Oh, change it to five. Oh no, change it to six. Oh, you know what? Leave it where it was."
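A minimal sketch of that kind of state management, assuming the same hypothetical client: before each update, the previous values are pushed onto an undo stack, so "leave it where it was" simply replays the old state.

```python
class TaskUndoStack:
    """Remembers each task's previous fields so an update can be rolled back."""
    def __init__(self, api):
        self.api = api
        self._history = []  # list of (task_id, previous_fields)

    def update(self, task_id: str, new_fields: dict) -> None:
        current = self.api.get_task(task_id)             # snapshot before mutating
        previous = {k: current[k] for k in new_fields}   # only the fields we change
        self._history.append((task_id, previous))
        self.api.update_task(task_id, new_fields)

    def undo(self) -> None:
        if not self._history:
            return
        task_id, previous = self._history.pop()
        self.api.update_task(task_id, previous)          # restore the old values

# "Change it to five" ... "change it to six" ... "leave it where it was":
# stack.update(tid, {"due": "17:00"}); stack.update(tid, {"due": "18:00"})
# stack.undo(); stack.undo()   # back to the original time
```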

Vela: Ah, that reminds me: if I say "change it to seven," or "change it to six," that can probably also be handled with design. The point behind it is, how do you identify what users are actually saying and what they really mean? Maybe you need more time, say a few seconds, and only do the conversion after that.

Ebaad: Could be. It could be like that: maybe after you've said the sentence and completed it, then run it. That could be handled on the front end: when the sentence is complete, then run it. But it could also be done so that it displays the result and you can change it again. So I think both; it's kind of a question of where you pre-process it.

Because if you say "change my groceries to five," or "do groceries at five," that's a complete sentence, so it can basically wait for you to change your mind, maybe one or two seconds. But then there's also the case where you want it done immediately instead of waiting. So I think storing it in state makes more sense, but you can do some pre-processing on the front end before you make the function calls. It's a very interesting problem.
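One simple way to do that front-end pre-processing is a short debounce: hold the parsed command for a second or two after the sentence completes, and if a correction arrives in that window, replace the pending command instead of firing two API calls. Below is a minimal asyncio sketch of that idea, with hypothetical names; it must be used from inside a running event loop.

```python
import asyncio

class CommandDebouncer:
    """Holds the latest parsed command briefly so "change it to six" can
    replace "change it to five" before anything is executed."""
    def __init__(self, execute, delay_s: float = 1.5):
        self.execute = execute   # async callable that actually runs the command
        self.delay_s = delay_s
        self._pending = None     # asyncio.Task for the command still waiting

    def submit(self, command: dict) -> None:
        # A newer command cancels the one still waiting out its delay.
        if self._pending and not self._pending.done():
            self._pending.cancel()
        self._pending = asyncio.create_task(self._run_after_delay(command))

    async def _run_after_delay(self, command: dict) -> None:
        try:
            await asyncio.sleep(self.delay_s)
            await self.execute(command)
        except asyncio.CancelledError:
            pass  # superseded by a newer command
```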

Vela: Very interesting. Can I say it's not only a technical issue, but can probably also be handled by product design?

Ebaad: Yes, it could be handled by product design as well. Initially, yeah. It will be interesting to see, but the thing is, this is a very new field, a new field of human-computer interaction, so a lot of people are going to spend time developing it, and the best design will rise to the top. It will bubble up.

Vela: Yeah. Also, for evaluating your product and the voice interface, there's a brand-new kind of evaluation.

Ebaad: What do you mean by evaluation?

Vela: Product evaluation.

Ebaad: Okay.

Vela: For me, it comes in two parts. First is the product eval: how can you evaluate whether your product solved the user's problem? That's from the user side. Second, from the technical side: how are the quality, the latency, and so on? You do have some different metrics for your conversations.

Ebaad: One hundred percent.

Vela: How do you evaluate them?

Ebaad: Yeah, that's interesting. I think right now I don't have a very good framework; I basically look at what a good workflow feels like. I could do more user testing, but I recently implemented Groq, and their latency is very low. On the transcript side it can do a sentence in under 300 milliseconds, meaning about 0.3 seconds, so it's kind of real time. And then creating the JSON objects and running the API is about 0.8 seconds, so it's more or less real time.

So you say "add groceries at three" and it does it, you see it visually, and then: "oh, you know what, change that to five." So I think the latency with Groq has changed a lot of things, because previously with OpenAI it took two seconds, and then it's not that intuitive.

So I think you can basically watch the workflow and be intuitive enough to say, oh, this is good technology, and that seeps into the design as well. Because I think latency is very, very important in this case: you want to be able to change things on the fly.
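Since the conversation leans on those rough per-stage numbers (about 0.3 s for transcription, about 0.8 s for parsing plus the API call), one simple way to keep an eye on them is to time each stage of the pipeline. A minimal sketch, where the stage functions are placeholders rather than any particular SDK:

```python
import time

def timed(label: str, fn, *args):
    """Run one pipeline stage and report how long it took."""
    start = time.perf_counter()
    result = fn(*args)
    print(f"{label}: {time.perf_counter() - start:.3f}s")
    return result

def handle_utterance(audio, transcribe, parse_to_json, run_api_call):
    # transcribe / parse_to_json / run_api_call stand in for the real stages.
    text = timed("transcription", transcribe, audio)
    call = timed("llm parse", parse_to_json, text)
    return timed("api call", run_api_call, call)
```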

There's this company called AquaVoice, from YC, that's doing a really good job with editing on the fly. I'll show you after; the way they do voice editing is amazing. Basically you can just speak: you say, "Hey, can you create three items: GPUs, computers, and things," and then, "can you convert it to a list?" So it takes a sentence and converts it to a list. Then you can say, "oh, can you add GPUs to the top instead of computers?" and it does that. And "can you make it 200 GPUs?" instead of just GPUs, and it adds the 200 there. It's very interesting. We'll take a look after this.

Yeah, and I think it could be done with coding as well. Maybe you're just working on a function, and: "oh, could you take lines 150 to 160 and add two if statements in there," or "add one argument," or "move this to the function below." So you can basically edit code and run code with your voice.

I don't remember where this thread of our conversation started, but I think it was the evaluation side. I think you can basically see intuitively whether it's a good product or not. I don't know if that's taste, or if it's just that everyone can tell: if it's a bad product, you can just see it. It's like a movie: you see a bad movie and you go, ah, that's a bad movie.

Vela: Mm hmm. Yeah, there are the two parts. Okay, we've wandered pretty far. Let's move to the last question.

5. On the Future of Voice AI Products

Language is very flexible in its structure. So I think that's the future, and it's more intuitive for humans. The natural language browser is going to be a big thing. The natural language IDE. There's a company at South Park Commons (a well-known Silicon Valley incubator) that is working on a natural language browser.

Vela: How do you see the future of voice AI and voice-based products?

Ebaad: Oh, that's interesting. I think it will depend on two technologies that are coming in; I think it's taking two paths.

One is the computer vision side, where you can basically tell your phone to send a WhatsApp message to Vela about this meeting or something, and it can just do that. The interaction design there will be very different, because you're going to watch the AI do it on the screen, just the way you would do it yourself.

And the second one is like this one, where you're adding a layer of natural language on top of these functions. I'm more bullish on the second one, the latter one, because I think it's just faster: you're using CPU instead of GPU to make these API calls, and you're just using natural language to infer what those API calls should be.

And it's much better in terms of cost. I know your question was about how the interaction will look, but I want to talk a little about the technology as well. Scanning images and finding where to click is very expensive, and then you have to do something like one image per second. But with this approach it's just very, very quick, and you're using the human to scan the page and things like that.

In terms of the future, I think the future is going to be natural language.

Because I think technology tends to optimize for how efficiently you can do things. Not always; QWERTY keyboards are not very efficient, but that's another story. But it basically optimizes for whatever takes the least effort. And if you have to click through ten touchpoints to do one task, and you can do it in one sentence instead, efficiency will push you toward voice and toward natural language.

You can do way more with voice, in terms of communicating what you want to do, than with clicking. And it carries more information; natural language just carries more information. You don't have to specify a lot of things, because you can get a lot of context from the surrounding words as well. Like, we're recording an interview in a podcast room. This is not a job interview, it's a podcast, so it's probably a different type of interview.

So language is very flexible in its structure. I think that's the future, and it's more intuitive for humans.

So I think the natural language browser is going to be a big thing. Natural language IDEs. There's a company at South Park Commons, I think, that's working on a natural language browser. So basically you type in, "open this page and go somewhere," and it will do that. "Download this," and with code it's "run this," "deploy this," things like that. So I think natural language is going to be huge, especially with LLMs and the new tools everyone is building.

Vela: And here we come to the end. Looking ahead, I would love to say that what we are building today is using voice AI to make the interaction between humans and machines as natural as interaction between humans. That way, we can spend more time talking with humans.

Ebaad: Yeah, and talking to machines as well.

Vela: And talking to machines as well. And letting agents talk to other agents.

Ebaad: Yeah, that would be hilarious. Two Hinge agents talking to each other. Yeah, agents talking to each other.

Then there are going to be other problems that come with that. I think there are two sides, the technical side and the design side, and both of them interest me. So we'll see. I think the future looks... I don't know if it's bright, but it looks different. It's definitely bright, I think. I don't want to be clicking.

Vela: Yeah, very good vision, very good points. Thank you for digging into the technical side and walking us through the future of voice and interaction with machines.

Ebaad: Thank you for having me, Vela.

Vela: Thank you, Ebaad.

Referenced voice products:

• Speechify: https://speechify.com/

• WhisperFlow: https://wisprflow.ai/

• AquaVoice: https://withaqua.com/

Where to find TalkToApps:

• Website: https://www.talktoapps.com/

Original video:

About Vela: voice product explorer, Builder & Creator.
WeChat: la_vela (please include a note when adding)

