Stop opening ChatGPT. Start talking to your screen.

2026 has given us extraordinary AI agents. Claude Code. Manus. Genspark. OpenClaw. Hermes. They can reason, use tools, and solve problems that felt impossible just a year ago. But almost all of them live in the same three places: a command line, a chat window, or a browser tab. Each one asks you to stop what you are doing, go to a special place, and describe your world to an agent that cannot see it.

This post is about what happens when you stop making users walk to the agent, and bring the agent to the user instead. Voice is the front door. Ambient is the form. We call it an ambient voice agent, meaning one that lives on your screen next to whatever you're already doing, instead of in a separate window you have to open. It's the bet we're making with Cue.

The idea came from two moments, two months apart.

Two moments that changed my mind

Three months ago, I quit my job. I was running AI monetization at a top internet company, a role most people would hold onto. I left because I had seen something that made every other bet feel small. One person, one laptop, one voice. A factory of one. A company of one. I wanted to find out if that feeling was real, or just the high of a good weekend project. This post is about what I found.

Three months later, the numbers say the bet was real. 1,002 commits. 51 days. Three quarters of them written with an AI agent riding shotgun. I have never coded alone, and I have never typed faster. A single person, a handful of Claude sessions, and a voice. That is the factory.

For context: the last time I shipped production code, Obama was president.

It started with a post from Andrej Karpathy about the surprising efficiency of voice coding. I was curious. I plugged an early build of Cue into Claude Code and started talking instead of typing. The first thing I noticed wasn't speed. It was flow.

I was in the kind of deep focus I hadn't felt since my first year of programming. The keyboard had always been a latency layer between my thoughts and the screen. Voice just removed it. I could think, speak, and see the code appear, without the mechanical translation of fingers on keys.

Two weeks later, I was talking to a friend. She's a UX designer at a well-known AI agent company, and she told me she was designing the interaction patterns for their next product entirely by voice. She was using Google Stitch on her Mac and iPad, speaking her design decisions into the AI while she cooked dinner. Her hands were covered in flour. She never touched a keyboard.

At the end of the evening, the app screens were done. She was using one voice-AI product to design another voice-AI product, without thinking about it.

That was the moment I stopped thinking of voice as an engineer's shortcut. Dictation speed was never the point. The point was where the agent shows up.

She'd never open a terminal to do that. But almost every other agent in 2026 would force her to.

Where 2026's agents actually live

Take a quick look at the landscape. Claude Code lives in your terminal. Cursor lives in its own IDE. Manus and Genspark live in a browser tab. The powerful open-source loops like OpenClaw and Hermes are toolkits you embed into your own UI, which, for most developers, means another terminal or web app.

The pattern is clear. Almost all of them require the user to stop what they are doing and enter a dedicated environment. You leave your document, you open a terminal. You leave your design file, you open a browser tab. You context-switch to the agent's world.

This makes sense. The CLI and the browser are the native habitats of developers, and developers were the first users of these powerful new tools. But developers are not the default user anymore. They are the early case.

An agent that lives in a terminal is an agent for engineers. Fine for now. Not where this goes.

The obvious reaction is: fine, so let's replace the terminal with voice. But that misses the actual shift.

Voice isn't the opposite of CLI. It's the opposite of a learning curve.

We are not anti-CLI. We use the terminal every day. The point is not that the CLI is bad. The point is that the CLI has a learning curve, and voice does not.

You already know how to talk. There is no special syntax to learn, no flags to remember, no man page to consult. You state your intent directly.

This is why voice is already the default input on our most personal devices. We talk to our phones with Siri, our homes with Alexa, and our cars with CarPlay. The desktop is the last holdout, and only because the people who built the desktop built it for themselves.

Voice is the input layer with the shortest distance between what you want and what gets done.

But voice alone is just dictation. What turns voice into an agent is what happens after the words stop.

Ambient: the agent comes to you

An ambient voice agent is a voice-triggered agent that lives on your screen, not in a separate window. "Ambient" because it's just there, in the background, the way a lamp is in the room. You don't go to it. It might be a floating bar at the top of your display, or a dynamic panel that appears only when you need it. It is a companion, not a destination.
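For the technically curious, the form factor is less exotic than it sounds. On macOS it amounts to a non-activating, always-on-top panel. Here's a minimal AppKit sketch of that idea, with illustrative sizes and flags rather than Cue's actual configuration:

```swift
import AppKit

// A non-activating, always-on-top panel: visible next to your work,
// never stealing keyboard focus from the app you're in.
let panel = NSPanel(
    contentRect: NSRect(x: 0, y: 0, width: 360, height: 48),
    styleMask: [.nonactivatingPanel, .borderless],
    backing: .buffered,
    defer: false
)
panel.level = .floating                       // stays above normal windows
panel.collectionBehavior = [.canJoinAllSpaces, .fullScreenAuxiliary]
panel.isMovableByWindowBackground = true      // drag it anywhere
panel.orderFrontRegardless()                  // show without activating the app
```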

The key mechanism is how it gathers context. Cue automatically reads five layers of context the moment you speak: your voice, any selected text, a screenshot of your active window, the accessibility attributes of the UI element you're focused on, and the name of the active app. The user never has to describe their context. The agent already sees it.
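To make those five layers concrete, here's a rough Swift sketch of what a context snapshot can look like on macOS, using the public AppKit and Accessibility APIs. The names (ContextSnapshot, captureContext) are illustrative, not Cue's internals, and the code assumes the user has already granted Accessibility and Screen Recording permissions:

```swift
import AppKit
import ApplicationServices

// Illustrative container for the five context layers.
struct ContextSnapshot {
    var transcript: String      // layer 1: what the user said
    var selectedText: String?   // layer 2: the current selection
    var screenshot: CGImage?    // layer 3: what's on screen
    var focusedRole: String?    // layer 4: AX attributes of the focused element
    var appName: String?        // layer 5: the active app
}

func captureContext(transcript: String) -> ContextSnapshot {
    var snapshot = ContextSnapshot(transcript: transcript, selectedText: nil,
                                   screenshot: nil, focusedRole: nil, appName: nil)

    // Layer 5: name of the frontmost application.
    snapshot.appName = NSWorkspace.shared.frontmostApplication?.localizedName

    // Layers 2 and 4: walk from the system-wide element to the focused one.
    let systemWide = AXUIElementCreateSystemWide()
    var focusedRef: CFTypeRef?
    if AXUIElementCopyAttributeValue(systemWide,
            kAXFocusedUIElementAttribute as CFString, &focusedRef) == .success,
       let focused = focusedRef {
        let element = focused as! AXUIElement
        var selected: CFTypeRef?
        if AXUIElementCopyAttributeValue(element,
                kAXSelectedTextAttribute as CFString, &selected) == .success {
            snapshot.selectedText = selected as? String
        }
        var role: CFTypeRef?
        if AXUIElementCopyAttributeValue(element,
                kAXRoleAttribute as CFString, &role) == .success {
            snapshot.focusedRole = role as? String
        }
    }

    // Layer 3: a coarse on-screen capture as the vision fallback.
    // (Newer macOS versions would use ScreenCaptureKit instead.)
    snapshot.screenshot = CGWindowListCreateImage(.null, .optionOnScreenOnly,
                                                  kCGNullWindowID, [])
    return snapshot
}
```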

This is the difference between a standalone agent and a symbiotic one. A standalone agent, like Claude Code in a terminal, is separate from your work. You have to feed it context. A symbiotic agent is already in your context.

My friend in the kitchen doesn't open a terminal. She doesn't need to. Stitch knows she's in a design file because the design file is what's on screen. Voice says what. The screen says where. The agent does the rest.

Your agent doesn't need a better brain. It needs a better front door.

That sounds clean in theory. In practice, ambient works or fails on one thing: whether the agent can actually see what you're pointing at. Let me show you with Microsoft Office.

The Office case: why describing context kills agents

Imagine you are in an Excel spreadsheet. You select a cell containing a product description in French. You want to tell an agent: "Translate this, format it as a bulleted list, and summarize it in one sentence." This is a simple, multi-step task.

For an agent in a chat window, this is impossible without a painful back-and-forth. The agent has to ask: "Which cell? Which file? Which sheet? Do you mean cell B3 in 'Q2_Projections.xlsx' on Sheet1?" Nobody wants to talk like a spreadsheet coordinate system. This friction is what kills most voice agents before they even get started.

The ambient approach is different. When you speak, Cue is already looking. On macOS, it uses the Accessibility API to read the selected text, the window title, and the focused input attributes. It knows you're in Excel, it knows the filename, and it knows the content of the cell you selected. On Windows, the equivalent APIs are narrower, so we fall back to the clipboard when necessary. In browser-based apps like Google Sheets, we can use JavaScript injection to read the DOM structure directly. A screenshot with vision is the final fallback.

To be honest about the engineering, this is not a solved problem. To reliably identify a specific cell in Office for complex actions, you sometimes need plugin-level JavaScript injection. Without that, you fall back to the clipboard, which can be lossy. We built Cue to use the highest-fidelity context channel it can get, and to degrade gracefully when the best channel isn't available.
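In code, graceful degradation is just an ordered probe. This sketch captures the idea rather than our implementation; the probe helpers are hypothetical stand-ins for the real channels:

```swift
import AppKit

// Hypothetical probes standing in for the real channels: the AX tree,
// an injected script inside a browser app, the system clipboard.
func readViaAccessibility() -> String? { nil }
func isBrowserFrontmost() -> Bool { false }
func readViaInjectedScript() -> String? { nil }

enum ContextChannel {
    case accessibility(String)   // richest: structured text from the AX tree
    case domInjection(String)    // browser apps: read the DOM directly
    case clipboard(String)       // lossy fallback: whatever was last copied
    case screenshotVision        // the floor: hand a screenshot to a vision model
}

// Try channels in fidelity order and take the first one that answers.
func bestAvailableContext() -> ContextChannel {
    if let text = readViaAccessibility() {
        return .accessibility(text)
    }
    if isBrowserFrontmost(), let dom = readViaInjectedScript() {
        return .domInjection(dom)
    }
    if let copied = NSPasteboard.general.string(forType: .string) {
        return .clipboard(copied)
    }
    return .screenshotVision
}
```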

The agent's intelligence is bottlenecked by what it can see. Vision isn't a feature here. It's the floor everything else stands on.

So you have two problems, not one. The input problem (voice) and the output problem (context and execution). They sit on different axes, and almost nobody solves both.

Voice is hard. Agents are hard. They are hard in different ways.

The challenge of building a useful agent can be mapped onto two axes: input and output.

The input axis is about understanding the user's request. This is the domain of voice. The challenges are noise, accents, speed, and accuracy. We use Deepgram's Nova-2 in the cloud for primary transcription, with local Whisper as an option. We then run a polish pass through Google's Gemini Flash, correcting punctuation and casing to match the destination app. We're also experimenting with Gemma for lower latency, aiming to push more of this on-device.
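As a sketch, the input pipeline is two stages: a raw transcription call, then a polish pass. The endpoint shape below follows Deepgram's public REST API; the polish step is shown as a placeholder, since the prompt and model wiring are product-specific:

```swift
import Foundation

// Stage 1: raw speech-to-text via Deepgram's REST API
// (POST /v1/listen with the model as a query parameter).
func transcribe(audio: Data, apiKey: String) async throws -> String {
    var request = URLRequest(
        url: URL(string: "https://api.deepgram.com/v1/listen?model=nova-2")!)
    request.httpMethod = "POST"
    request.setValue("Token \(apiKey)", forHTTPHeaderField: "Authorization")
    request.setValue("audio/wav", forHTTPHeaderField: "Content-Type")
    request.httpBody = audio

    let (data, _) = try await URLSession.shared.data(for: request)

    // Deepgram nests the transcript under results.channels[].alternatives[].
    let json = try JSONSerialization.jsonObject(with: data) as? [String: Any]
    let results = json?["results"] as? [String: Any]
    let channels = results?["channels"] as? [[String: Any]]
    let alternatives = channels?.first?["alternatives"] as? [[String: Any]]
    return alternatives?.first?["transcript"] as? String ?? ""
}

// Stage 2: the polish pass, as a placeholder. The real version sends the
// raw transcript plus the active app's conventions to a low-latency model
// (Gemini Flash here) and gets back corrected punctuation and casing.
func polish(_ raw: String, for appName: String) -> String {
    return raw
}
```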

The output axis is about executing the user's request. This is the domain of the agent. The challenges are permissions, context acquisition, and environment-specific quirks. The Microsoft Office example is a perfect illustration. The browser sandbox, macOS's TCC permissions, and Windows UIA all have their own rules and limitations.

These two problems are rarely mastered by the same team. That's why some companies are world-class at input and do almost nothing with output. It's why others are world-class at output but barely touch the input layer. The middle, where voice turns into action across any app, is where Cue sits.

Everyone is wrapping models. Cue wraps harder.

If our value isn't the model, and isn't the voice engine, and isn't the agent loop, what is it?

Wrap harder: the philosophy

We wrap harder than anyone else. That's the philosophy.

"Wrapping" has become a dismissive term, but it's the entire game. The value isn't in the model. It's in the quality of the connection between the model and the real world. Wrapping isn't the insult people think it is. The wrapper is where the user actually lives.

We have immense respect for the work on agent loops and tool use by projects like OpenClaw and Hermes. They are building the right kind of brain. Cue doesn't try to rebuild it. We build the last mile: the interface between that brain and whatever the user is trying to do right now. That means using the permissions the user grants us fully, pulling the richest context we can from the active scene, and showing up in a form that feels like a companion instead of a destination.

Our core belief is that code, used as a medium for chain-of-thought reasoning and tool use, can solve almost anything. The question is not whether the brain works. It's whether the user can get to it without leaving what they're doing.

This is why we're not scared of better models. Every time Claude or Gemini ships something smarter, Cue gets sharper without us writing a line. Our value isn't the reasoning. It's the entry and exit points for that reasoning.

The deeper the model goes, the stronger Cue gets.

And because Cue is always there, it can learn. Over time, it will see the patterns you repeat, the apps you live in, the kinds of tasks you hand off. That accumulation of personal context is the asset we're building next. The base models don't have it. Cue will.

Which brings me back to my friend in the kitchen.

Back to the kitchen

It's the end of the evening. Dinner is on the table.

The app design is done. She never opened a terminal. She never touched a keyboard. She just talked.

And the app she was designing? Another one of these on-screen companions. A different team, a different product, the same bet. We're all building toward the same future from different directions. The agents that win are the ones that come to the user.

That future is not built in a terminal. It's built on your screen, next to whatever you're already doing.


Other agents hear what you said. Cue sees what you're doing. And does the thing, in any app.

Ready to try?

Cue your AI.

Voice-activated AI that lives on your screen, next to whatever you're already doing.

Download for Mac & Windows