How to Build an AI Voice Agent with a Speech-to-Text API
Voice agents need real-time STT, an LLM, and TTS working together. Here's how the architecture works, what to watch out for, and how to build one without overpaying.
Short answer: A production voice agent is four components wired together — voice activity detection, speech-to-text, an LLM, and text-to-speech — with end-to-end latency under 800 ms. Keep each hop cheap and fast; STT is usually the biggest cost line, which is why picking the right API matters.
Voice agents are everywhere right now. Customer support bots that actually understand you. Sales assistants that book meetings over the phone. Personal AI that responds in your voice.
Honestly, the hype is justified for once. Voice is the most natural way humans communicate — and we finally have the tech stack to make it work in production.
But here's the thing: most voice agent tutorials focus on no-code platforms or vendor-locked SDKs. That's fine for prototyping. For production? You need to understand what's actually happening under the hood.
This guide breaks down the architecture of a voice agent, the role speech-to-text plays in making it work, and how to build one that doesn't cost a fortune to run.
The Voice Agent Architecture
Every voice agent follows the same basic pipeline:
- Voice Activity Detection (VAD) — detect when the user starts and stops talking
- Speech-to-Text (STT) — convert the audio to text
- Language Model (LLM) — generate a response
- Text-to-Speech (TTS) — convert the response back to audio
That's it. Four components, wired together. The magic isn't in any single piece — it's in how fast you can move data through the pipeline.
```
[Microphone] → VAD → STT → LLM → TTS → [Speaker]
      ↑                                    │
      └───────── conversation loop ────────┘
```
There's a second approach — end-to-end models like OpenAI's Realtime API that take audio in and produce audio out in a single step. Lower latency in theory, but you lose control. No custom prompts between steps, no choosing your own STT provider, no swapping models. For most production use cases, the orchestrated pipeline wins.
Step 1: Voice Activity Detection
VAD answers one question: "Is the user done talking?"
Get this wrong and your agent either interrupts the user mid-sentence or waits awkwardly for seconds after they've stopped. Both kill the experience.
Most voice agents use Silero VAD — it's open source, lightweight, and runs on CPU. It works by analyzing audio chunks and classifying them as speech or silence.
Key parameters to tune:
- Silence duration threshold — how long to wait after the last detected speech before triggering (typically 300–800ms)
- Speech probability threshold — how confident the model needs to be that it's hearing speech (0.5 is a good starting point)
- Min speech duration — ignore ultra-short sounds like coughs or "um"
```python
# Basic VAD flow (a sketch -- the real silero-vad package exposes
# load_silero_vad() plus a VADIterator streaming helper, and exact
# call signatures may differ from this outline)
import time

from silero_vad import load_silero_vad

vad = load_silero_vad()
buffer = []
last_speech_time = time.monotonic()
SILENCE_THRESHOLD = 0.5  # seconds of silence that ends the utterance

for audio_chunk in microphone_stream:
    speech_prob = vad(audio_chunk, 16000)  # model also needs the sample rate
    if speech_prob > 0.5:
        buffer.append(audio_chunk)
        last_speech_time = time.monotonic()
    elif buffer and time.monotonic() - last_speech_time > SILENCE_THRESHOLD:
        # User is done talking — send buffer to STT
        process_utterance(buffer)
        buffer.clear()
```
Step 2: Speech-to-Text
This is where most voice agents succeed or fail. Your STT engine needs three things:
- Accuracy — garbage in, garbage out. If the transcription is wrong, the LLM responds to the wrong thing.
- Speed — every millisecond of transcription latency adds to the response time. Users notice anything over 1.5 seconds total round-trip.
- Streaming support — batch transcription is useless for real-time conversation. You need partial results as the user speaks.
The Two STT Approaches
Cloud APIs are the practical choice for most teams. You send audio, you get text back. No GPU provisioning, no model management, no scaling headaches. The tradeoff is latency — you're adding a network round trip.
Local/self-hosted models (like Whisper or Voxtral) eliminate network latency but require GPU infrastructure. Great for privacy-sensitive applications, but you're now managing ML infra instead of building your product.
For most developers building voice agents, a cloud STT API is the right call. The question is which one and how much it costs.
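Whichever provider you pick, streaming STT clients share the same client-side logic: partial transcripts keep refining, then a final transcript commits the utterance. A minimal sketch of that fold, assuming a hypothetical JSON wire format with `text` and `is_final` fields (real providers name these differently):

```python
import json

def handle_stt_messages(messages):
    """Fold a stream of STT result messages into committed text plus
    the latest uncommitted partial. Each message is a JSON string of
    the shape {"text": ..., "is_final": bool} (a hypothetical wire
    format; check your provider's docs for the real field names)."""
    partial = ""
    finals = []
    for raw in messages:
        msg = json.loads(raw)
        if msg["is_final"]:
            finals.append(msg["text"])  # commit the utterance
            partial = ""
        else:
            partial = msg["text"]       # newer partial replaces the old one
    return " ".join(finals), partial

# Typical stream: partials refine, then a final commits the utterance
stream = [
    '{"text": "book a", "is_final": false}',
    '{"text": "book a meeting", "is_final": false}',
    '{"text": "book a meeting for Tuesday", "is_final": true}',
]
final_text, pending = handle_stt_messages(stream)
```

The partials are what let you start LLM inference speculatively before the user finishes; the final is what you actually commit to the conversation history.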
What to Look For in a Speech-to-Text API
| Feature | Why It Matters |
|---|---|
| Streaming/WebSocket | Get partial transcripts as user speaks — critical for low-latency agents |
| Per-second billing | Short utterances dominate voice agents. Per-minute billing wastes 80%+ of what you pay |
| Low latency | Under 300ms for streaming results. Your LLM and TTS add their own latency on top |
| Language support | If your users speak multiple languages, your STT needs to handle that |
| Word-level timestamps | Useful for barge-in detection — knowing exactly when the user started speaking over the agent |
SpeakEasy charges $0.20/hour with per-second billing. For voice agents, where the average utterance is 3–8 seconds, per-second billing is the difference between a reasonable bill and an absurd one. A 5-second utterance on a per-minute provider costs you 60 seconds. On SpeakEasy, it costs you 5 seconds. That's a 12x difference.
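The 12x figure falls straight out of the rounding rules. A quick sketch of billed seconds under each scheme:

```python
import math

def billed_seconds_per_minute(utterance_s):
    """Per-minute billing: every utterance rounds up to a full 60 s."""
    return math.ceil(utterance_s / 60) * 60

def billed_seconds_per_second(utterance_s):
    """Per-second billing: pay for the audio you sent, rounded to whole seconds."""
    return math.ceil(utterance_s)

# A typical 5-second voice-agent utterance:
ratio = billed_seconds_per_minute(5) / billed_seconds_per_second(5)  # 60 / 5 = 12.0
```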
Step 3: The Language Model
Once you have the user's text, you feed it to an LLM. This is the brain of your agent.
For voice agents, model selection is about latency first, quality second:
- GPT-4o-mini / Claude Haiku — fast, cheap, good enough for most conversational tasks
- GPT-4o / Claude Sonnet — higher quality, slightly more latency
- Llama / Mistral (self-hosted) — ultimate control over latency and cost, but you're running inference infrastructure
Prompt Architecture for Voice Agents
Voice agent prompts need structure. A single blob of instructions falls apart in multi-turn conversations. Break it into sections:
IDENTITY: You are [name], a [role] for [company].
STYLE: Speak conversationally. Keep responses under 2 sentences.
CONSTRAINTS: Never make up information. If unsure, say so.
CONTEXT: [Dynamic context injected per conversation]
TOOLS: [Function definitions for booking, lookups, etc.]
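One way to assemble that prompt in code, keeping the static sections reusable and injecting context and tools per conversation (the agent name, company, and tool signatures below are illustrative, not from any SDK):

```python
def build_agent_prompt(identity, style, constraints, context, tools):
    """Assemble the sectioned system prompt described above.

    Identity, style, and constraints are static and can be defined once;
    context and tools are injected fresh for each conversation.
    """
    sections = {
        "IDENTITY": identity,
        "STYLE": style,
        "CONSTRAINTS": constraints,
        "CONTEXT": context,
        "TOOLS": tools,
    }
    return "\n".join(f"{name}: {body}" for name, body in sections.items())

prompt = build_agent_prompt(
    identity="You are Ava, a scheduling assistant for Acme Dental.",
    style="Speak conversationally. Keep responses under 2 sentences.",
    constraints="Never make up information. If unsure, say so.",
    context="Caller ID: +1 555 0100. Office hours: 9-5 Mon-Fri.",
    tools="book_appointment(date, time), lookup_patient(phone)",
)
```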
The "keep responses under 2 sentences" part is crucial. LLMs default to verbose responses. In text chat, that's fine. In voice, a 200-word response takes 60+ seconds to speak. Nobody wants that.
Streaming LLM Responses
Don't wait for the full LLM response before starting TTS. Stream the response token by token and start converting to speech as soon as you have a complete sentence or clause. This can cut perceived latency by 2–3 seconds.
```python
async for chunk in llm.stream(prompt):
    sentence_buffer += chunk
    if ends_with_sentence(sentence_buffer):
        await tts.speak(sentence_buffer)
        sentence_buffer = ""
```
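The `ends_with_sentence` check can be deliberately simple. One possible implementation flushes on terminal punctuation; abbreviations like "Dr." will trigger early flushes, which in TTS just sounds like a brief pause:

```python
import re

# Flush when the buffer ends with ., !, or ?, optionally followed by a
# closing quote, paren, or bracket. Crude, but good enough for TTS chunking.
SENTENCE_END = re.compile(r'[.!?]["\')\]]?\s*$')

def ends_with_sentence(text: str) -> bool:
    return bool(SENTENCE_END.search(text))
```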
Step 4: Text-to-Speech
TTS converts your LLM's response back to audio. The landscape here has gotten dramatically better in the last year.
Cloud TTS APIs (SpeakEasy, ElevenLabs, Deepgram) give you realistic voices with minimal setup. Most support streaming — send text in, get audio chunks back.
Open source models (Piper, Coqui, XTTS) run locally and give you full control. Quality varies, but Piper is surprisingly good for English.
For voice agents, the key metrics are:
- First byte latency — how fast does audio start playing after you send text?
- Voice quality — robotic voices break immersion instantly
- Streaming — can you start playing audio before the full response is synthesized?
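First-byte latency is easy to measure yourself. A small helper, assuming your TTS client exposes the streaming response as an async iterator of audio chunks (most streaming clients do):

```python
import time

async def measure_first_byte(tts_stream):
    """Return (seconds until the first audio chunk, that chunk).

    First-byte latency is the number to optimize: the user hears
    nothing at all until it elapses.
    """
    start = time.monotonic()
    async for chunk in tts_stream:
        return time.monotonic() - start, chunk
    return None, None  # stream ended without producing any audio
```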
SpeakEasy's TTS API supports streaming responses, starts at $0.20/hour (same rate as STT), and you can swap voices without changing your integration.
Putting It All Together
Here's the complete flow for a voice agent using SpeakEasy for STT and TTS:
```python
import asyncio

from openai import AsyncOpenAI
from speakeasy import SpeakEasyClient

se = SpeakEasyClient(api_key="sk-se-...")
llm = AsyncOpenAI()  # async client, so awaiting the call below works

async def handle_conversation():
    while True:
        # 1. Capture audio until silence detected
        audio = await capture_with_vad()

        # 2. Transcribe with SpeakEasy STT
        transcript = await se.transcribe(audio, stream=True)

        # 3. Generate response with LLM
        response = await llm.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": AGENT_PROMPT},
                {"role": "user", "content": transcript.text},
            ],
            stream=True,
        )

        # 4. Stream response through TTS
        async for chunk in response:
            text = chunk.choices[0].delta.content
            if text:
                audio = await se.speak(text, voice="alloy")
                await play_audio(audio)
```
The Cost Reality Check
Here's what a voice agent costs per hour of conversation, assuming roughly 50% of the time is user speaking (STT) and 50% is agent responding (TTS):
| Provider | STT Cost/hr | TTS Cost/hr | Total/hr |
|---|---|---|---|
| SpeakEasy | $0.10 | $0.10 | $0.20 |
| OpenAI Whisper + TTS | $0.36 | $0.90 | $1.26 |
| Deepgram + ElevenLabs | $0.25 | $0.30 | $0.55 |
| AssemblyAI + ElevenLabs | $0.37 | $0.30 | $0.67 |
At scale, these differences compound. A contact center handling 10,000 hours of voice agent conversations per month is looking at $2,000/month with SpeakEasy vs $12,600 with OpenAI. That's not a rounding error — that's a headcount.
And remember: with per-second billing, you're only paying for actual audio. The pauses between turns? Free. The silence while the LLM thinks? Free. With per-minute providers, you're paying for dead air.
What Can Go Wrong
Building voice agents is deceptively simple in concept and genuinely hard in practice. Watch out for:
Latency creep. Each component adds latency. VAD (50ms) + STT (200ms) + LLM (500ms) + TTS (200ms) = almost a full second before the user hears anything. Multiply that by network hops and you're at 1.5–2 seconds. Users start noticing at 1 second.
Barge-in handling. When a user starts speaking while the agent is still talking, you need to stop TTS playback immediately and start processing the new input. This requires coordinating between your audio output and VAD detection.
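One way to wire up barge-in with asyncio: playback runs as a cancellable task, and the VAD callback cancels it the instant speech is detected. A sketch with the speaker I/O stubbed out:

```python
import asyncio

class PlaybackController:
    """Coordinates TTS playback with VAD so user speech cancels the agent.

    play() runs as a cancellable task; on_user_speech() is the callback
    your VAD fires when it detects speech mid-playback.
    """
    def __init__(self):
        self._task = None

    async def play(self, audio_chunks):
        self._task = asyncio.current_task()
        try:
            for chunk in audio_chunks:
                await self.send_to_speaker(chunk)
        except asyncio.CancelledError:
            await self.flush_speaker()  # stop audible output immediately
            raise

    def on_user_speech(self):
        # Barge-in: kill playback so STT can take over the turn
        if self._task and not self._task.done():
            self._task.cancel()

    async def send_to_speaker(self, chunk):
        await asyncio.sleep(0.02)  # stand-in for real audio output

    async def flush_speaker(self):
        pass  # stand-in for draining/clearing the output buffer
```

The important detail is that cancellation also flushes the output buffer; otherwise the agent keeps talking for however much audio is already queued.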
Background noise. VAD models can be tricked by music, TV, traffic, and keyboard typing. Aggressive filtering means you might miss quiet speakers. Tune your thresholds per environment.
Accents and terminology. STT accuracy drops with heavy accents, industry jargon, and proper nouns. Pick an STT provider that lets you send custom vocabulary or context hints.
Getting Started
The fastest path from zero to working voice agent:
- Pick your STT/TTS provider. SpeakEasy gives you both in one API with per-second billing.
- Set up VAD. Silero VAD is free and works well. Start with 500ms silence threshold.
- Wire up the LLM. GPT-4o-mini or Claude Haiku for speed. Keep system prompts short and structured.
- Stream everything. STT streaming, LLM streaming, TTS streaming. Every non-streaming step adds latency.
- Test with real users. Synthetic tests miss the weird things real humans do — talking over the agent, changing topics mid-sentence, going silent for 10 seconds then continuing.
Voice agents aren't the future anymore. They're the present. The tooling is mature, the costs are dropping, and users expect it. The only question is whether you'll build one that works — or one that works and doesn't cost a fortune.
Related reading
- Python Speech-to-Text API: Transcribe Audio in 5 Lines of Code — wire up the STT step in your voice agent
- Open Source Text-to-Speech in 2026 — self-hosted TTS options for the TTS step
- The Real Cost of Speech-to-Text APIs — why per-second billing matters so much for voice agents