Voice AI refers to software that chains speech recognition + a reasoning model (typically an LLM such as Claude or Gemini) + an action layer to translate spoken commands into executable tasks. The reasoning layer is what distinguishes voice AI from earlier voice products (the Siri era, roughly 2011-2024).
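Sketched as types, the composition looks like this. A minimal illustration only; the class and field names are hypothetical, not any product's API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class VoiceAI:
    """Hypothetical three-layer stack: each layer is just a function."""
    transcribe: Callable[[bytes], str]  # speech recognition: audio -> text
    reason: Callable[[str], str]        # reasoning model: transcript -> plan
    execute: Callable[[str], str]       # action layer: plan -> observable result

    def handle(self, audio: bytes) -> str:
        return self.execute(self.reason(self.transcribe(audio)))
```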
Dictation tools (Wispr Flow, SuperWhisper, Aqua) output only transcribed text. Voice AI agents (Cue, Highlight, Fazm) output actions: emails sent, calendar events created, code refactored, web searches completed. Both can sit on the same hotkey on the same machine, but they solve different problems.
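The split is visible in code. A hedged sketch, with every handler and helper below a hypothetical stub: the dictation path ends at inserted text, the agent path ends at an executed action.

```python
# Hypothetical stubs standing in for real STT, OS, and LLM integrations.
def transcribe(audio: bytes) -> str: ...
def insert_at_cursor(text: str) -> None: ...
def plan_action(text: str) -> dict: ...
def dispatch(action: dict) -> None: ...

def on_hotkey_dictation(audio: bytes) -> None:
    # Dictation tool: transcribe, insert at the cursor, stop.
    insert_at_cursor(transcribe(audio))

def on_hotkey_agent(audio: bytes) -> None:
    # Voice AI agent: transcribe, reason about intent, then act --
    # e.g. {"tool": "send_email", "args": {...}} actually gets executed.
    dispatch(plan_action(transcribe(audio)))
```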
Voice assistants like Siri, Alexa, and Google Assistant are closed-domain command parsers built before the LLM era. Voice AI uses general-purpose LLMs as the reasoning layer, supporting open-ended natural-language commands rather than predefined phrases.
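In code terms, the gap looks roughly like this. A hedged contrast; the intent table, prompt, and `call_llm` stub are all illustrative:

```python
import re

# Pre-LLM assistant: a fixed intent grammar. Anything outside it fails.
INTENTS = {
    r"set (?:a )?timer for (\d+) minutes": "timer.set",
    r"what'?s the weather": "weather.get",
}

def parse_command(utterance: str) -> str | None:
    for pattern, intent in INTENTS.items():
        if re.search(pattern, utterance, re.IGNORECASE):
            return intent
    return None  # "draft a reply to Sam and soften the tone" matches nothing

# Voice AI: no grammar. The raw utterance goes to a general-purpose LLM.
def call_llm(prompt: str) -> str: ...  # stub for any LLM client

def llm_reason(utterance: str) -> str:
    return call_llm(f"Turn this spoken request into an executable plan: {utterance}")
```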
A typical voice AI pipeline (sketched in code after the list):
1) Hotkey activation
2) Audio capture
3) Speech-to-text (Whisper, Deepgram)
4) Context gathering across 5 layers: voice transcript + selected text + screenshot + accessibility attributes + active app
5) LLM reasoning (Claude Sonnet, Gemini Pro)
6) Action execution (AppleScript, Apple Events, file I/O)
7) Result display (pill UI, dialog, in-app write)
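A compressed sketch of steps 3-7 on macOS, under stated assumptions: the Whisper and Anthropic calls follow those libraries' public Python SDKs, the model name is illustrative, context gathering is stubbed, steps 1-2 (hotkey, audio capture) are OS-specific and omitted, and a real product would validate the generated script rather than piping it straight to `osascript`.

```python
import subprocess
import whisper                   # openai-whisper package: step 3
from anthropic import Anthropic  # step 5 (any LLM client would do)

def gather_context() -> str:
    # Step 4 stub: a real agent reads selected text, a screenshot,
    # accessibility attributes, and the active app.
    return "frontmost app: Mail; selected text: (none)"

def run_pipeline(audio_path: str) -> str:
    # Step 3: speech-to-text.
    transcript = whisper.load_model("base").transcribe(audio_path)["text"]
    # Step 5: LLM turns transcript + context into an executable plan.
    reply = Anthropic().messages.create(
        model="claude-sonnet-4-5",  # illustrative model name
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Context:\n{gather_context()}\n\nCommand: {transcript}\n\n"
                       "Reply with only an AppleScript program that performs the command.",
        }],
    )
    script = reply.content[0].text
    # Step 6: execute via AppleScript. Unsafe as-is; validate before running.
    result = subprocess.run(["osascript", "-e", script],
                            capture_output=True, text=True)
    # Step 7: whatever the pill UI or dialog would display.
    return result.stdout.strip() or result.stderr.strip()
```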
The current players:
- Cue (heycue.io): ambient pill, voice → action across any app, $9.99/mo.
- Highlight AI: typed prompt + voice, $13/mo.
- Fazm: open source (MIT) + $9.99/mo, similar wedge.
- Wispr Flow: dictation-focused, $15/mo.
- SuperWhisper: local dictation, $20/mo.
- Apple Intelligence / Siri V2: system-level voice AI, delayed multiple times.
Voice is the input layer with the shortest distance between intent and outcome. As LLMs become capable of reasoning about complex multi-app workflows, voice as input + agent as executor becomes a viable interface for the age of AI leverage. Voice AI is the interface that gives non-technical users access to AI capabilities with a minimal learning curve.