Build a Text-to-Speech App in 5 Minutes
Generate natural TTS audio in under 5 lines of Python via the OpenAI SDK pointed at SpeakEasy. 6 voices, streaming, HD mode, and 5 formats at $0.20/hour.
Want to add voice output to your application? With the SpeakEasy text-to-speech API, you can generate natural-sounding audio from text in just a few lines of code. This guide walks you through voice selection, API calls, streaming, HD quality, word-level timestamps, and output formats.
TL;DR
- Install openai, set base_url="https://www.tryspeakeasy.io/api/v1", and call client.audio.speech.create(model="tts-1", voice="nova", input="..."). That's a working TTS integration.
- 6 built-in voices (alloy, echo, fable, onyx, nova, shimmer), the same IDs as OpenAI's TTS API, so swapping providers takes one line.
- Pricing: $0.20/hour of generated audio (≈66,667 characters). HD quality via model="tts-1-hd"; streaming, 5 output formats (mp3/opus/aac/flac/wav), and word-level timestamps are all included.
Prerequisites
- A SpeakEasy API key (sign up — $1 first month)
- Python 3.8+ or curl
Which voice should I pick?
SpeakEasy offers a range of voices optimized for different use cases. The voice IDs are identical to OpenAI's TTS API, so any sample clips you've heard from OpenAI's nova or fable sound the same here. Here's how we think about each:
| Voice | Style | Best For |
|---|---|---|
| alloy | Neutral, balanced | General purpose, documentation |
| echo | Warm, conversational | Podcasts, narration |
| fable | Expressive, dynamic | Storytelling, audiobooks |
| onyx | Deep, authoritative | Presentations, announcements |
| nova | Friendly, upbeat | Customer-facing apps, assistants |
| shimmer | Calm, clear | Meditation, accessibility |
Browse all available voices in the text-to-speech documentation. In practice we default to nova for product UIs (friendly but not sugary), echo for long-form narration, and onyx for onboarding videos where authority matters.
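One way to carry the table into an application is a small lookup from use case to voice ID. The sketch below is illustrative only: the use-case labels and the pick_voice helper are hypothetical names chosen for this example, not part of the SpeakEasy API. Only the six voice IDs come from this guide.

```python
# Voice IDs listed in this guide (the same IDs as OpenAI's TTS API).
VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}

# Hypothetical use-case labels mapped to the suggestions in the table above.
VOICE_FOR_USE_CASE = {
    "documentation": "alloy",
    "narration": "echo",
    "audiobook": "fable",
    "announcement": "onyx",
    "assistant": "nova",
    "accessibility": "shimmer",
}


def pick_voice(use_case: str, default: str = "nova") -> str:
    """Return the suggested voice for a use case, falling back to a default."""
    voice = VOICE_FOR_USE_CASE.get(use_case.lower(), default)
    if voice not in VOICES:
        raise ValueError(f"Unknown voice ID: {voice!r}")
    return voice
```

Keeping the mapping in one place means a later change of mind about, say, the narration voice touches a single line.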
How do I generate speech?
Python (OpenAI SDK)
from pathlib import Path

from openai import OpenAI

# Point the OpenAI SDK at SpeakEasy instead of the default OpenAI endpoint
client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://www.tryspeakeasy.io/api/v1",
)

# Request speech for the input text
response = client.audio.speech.create(
    model="tts-1",
    voice="nova",
    input="Welcome to SpeakEasy. Let's build something great together.",
)

# Save the returned audio bytes to disk
Path("output.mp3").write_bytes(response.content)
print("Audio saved to output.mp3")
curl
curl -X POST https://www.tryspeakeasy.io/api/v1/audio/speech \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "voice": "nova",
    "input": "Welcome to SpeakEasy. Let us build something great together."
  }' \
  --output output.mp3
Both approaches produce an MP3 file ready to play. That's your text-to-speech integration done.
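For anything beyond a quick test, it may help to keep the key out of source code and wrap the call in a function. The sketch below assumes the key is stored in an environment variable named SPEAKEASY_API_KEY (a name chosen for this example), and make_client and synthesize_to_file are helpers of our own, not SDK functions.

```python
import os
from pathlib import Path

BASE_URL = "https://www.tryspeakeasy.io/api/v1"


def make_client():
    """Build an OpenAI SDK client pointed at SpeakEasy.

    Assumes the key lives in SPEAKEASY_API_KEY (a name chosen for this sketch).
    """
    from openai import OpenAI  # imported lazily so the helper below has no hard dependency

    return OpenAI(api_key=os.environ["SPEAKEASY_API_KEY"], base_url=BASE_URL)


def synthesize_to_file(client, text, path, voice="nova", model="tts-1"):
    """Generate speech for `text` and write the audio bytes to `path`."""
    if not text.strip():
        raise ValueError("input text must not be empty")
    response = client.audio.speech.create(model=model, voice=voice, input=text)
    out = Path(path)
    out.write_bytes(response.content)
    return out
```

A call would then look like synthesize_to_file(make_client(), "Hello from SpeakEasy.", "hello.mp3"). Passing the client in as an argument also makes the function easy to exercise with a stub in tests.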
How do I stream audio?
For real-time applications like voice assistants or live narration, stream the audio instead of waiting for the complete file:
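# with_streaming_response yields chunks as they arrive; a plain create()
# call in the OpenAI SDK downloads the whole body before returning.
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="nova",
    input="This audio is being streamed in real time.",
) as response:
    # Stream to file in chunks
    with open("streamed.mp3", "wb") as f:
        for chunk in response.iter_bytes(chunk_size=4096):
            f.write(chunk)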
Streaming lets you start playback while the rest of the audio is still being generated, dramatically reducing perceived latency.
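To see what streaming buys you, one option is to measure the time to the first chunk separately from the total time. The helper below is a sketch that works on any iterable of byte chunks, such as the one produced by iter_bytes. The name consume_stream and the keys of the returned dictionary are invented for this example.

```python
import time
from typing import BinaryIO, Iterable


def consume_stream(chunks: Iterable[bytes], sink: BinaryIO) -> dict:
    """Write chunks to `sink` and report first-chunk and total timings."""
    start = time.perf_counter()
    first_chunk_s = None
    total_bytes = 0
    for chunk in chunks:
        if first_chunk_s is None:
            # Playback could begin here, before the rest has arrived.
            first_chunk_s = time.perf_counter() - start
        sink.write(chunk)
        total_bytes += len(chunk)
    return {
        "first_chunk_s": first_chunk_s,  # None if the stream was empty
        "total_s": time.perf_counter() - start,
        "bytes": total_bytes,
    }
```

Note that the timer starts when iteration begins, so it does not include the time spent sending the request. A large gap between first_chunk_s and total_s suggests a player could have started well before the file was complete.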
When should I use HD quality?
For production content where audio quality matters — audiobooks, podcasts, or marketing material — use the HD model:
response = client.audio.speech.create(
    model="tts-1-hd",
    voice="fable",
    input="Premium audio quality for content that matters.",
)
The tts-1-hd model produces higher fidelity output with more natural prosody. It takes slightly longer to generate but the difference in quality is noticeable.
How do I get word-level timestamps?
Need to sync text highlighting with audio playback? Request word-level timestamps:
curl -X POST https://www.tryspeakeasy.io/api/v1/audio/speech \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "voice": "alloy",
    "input": "Each word gets a precise timestamp.",
    "response_format": "json"
  }'
The response includes timing data for each word, enabling karaoke-style text highlighting, subtitle synchronization, and accessible reading tools.
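As a rough illustration of karaoke-style highlighting, the sketch below finds which word is being spoken at a given playback time. The exact response schema is not shown in this guide, so the shape used here (a list of items with word, start, and end in seconds) is an assumption. Check the text-to-speech documentation for the real field names before relying on it.

```python
from bisect import bisect_right


def active_word_index(words, t):
    """Return the index of the word being spoken at time `t` (seconds), or None.

    `words` is assumed to be sorted by start time, each item shaped like
    {"word": str, "start": float, "end": float} (a hypothetical schema).
    """
    starts = [w["start"] for w in words]
    i = bisect_right(starts, t) - 1  # last word that started at or before t
    if i >= 0 and t < words[i]["end"]:
        return i
    return None  # before the first word, or in a pause between words
```

A player would call this on each time update and move the highlight whenever the returned index changes. For long texts, the starts list could be built once rather than on every call.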
Which output format should I use?
SpeakEasy supports five audio formats. Each strikes a different balance between file size, quality, and decoder compatibility, which follows from the underlying codec (Opus is specified in RFC 6716, FLAC in the IETF FLAC specification). Specify the one your application needs:
response = client.audio.speech.create(
    model="tts-1",
    voice="echo",
    input="Choose the format that fits your use case.",
    response_format="opus",  # Options: mp3, opus, aac, flac, wav
)
- mp3 — Best compatibility, good quality. Default.
- opus — Smallest file size, ideal for streaming and web.
- flac — Lossless, best for archival.
- wav — Uncompressed, best for further audio processing.
- aac — Good balance, preferred on Apple platforms.
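If the format is chosen at runtime, the list above can be turned into a small decision function. This is only a sketch: choose_format and its flags are hypothetical names, and the priority order (editing needs first, then lossless, then streaming, then Apple-only) is one reasonable reading of the trade-offs rather than a SpeakEasy recommendation.

```python
def choose_format(*, post_processing=False, lossless=False,
                  streaming=False, apple_only=False):
    """Pick a response_format value using the trade-offs listed above."""
    if post_processing:
        return "wav"   # uncompressed, suited to further audio processing
    if lossless:
        return "flac"  # lossless, suited to archival
    if streaming:
        return "opus"  # smallest files, suited to streaming and web
    if apple_only:
        return "aac"   # preferred on Apple platforms
    return "mp3"       # default, broadest compatibility
```

The result can be passed straight to the response_format parameter, for example response_format=choose_format(streaming=True).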
Example: Batch Processing
Generate audio for multiple text blocks, such as turning a blog into a podcast:
from pathlib import Path

# Reuses the `client` created in the first Python example
sections = [
    "Introduction. Welcome to today's episode.",
    "Chapter one. We begin with the fundamentals.",
    "Conclusion. Thanks for listening.",
]

for i, text in enumerate(sections):
    # One request per section, each saved to its own numbered file
    response = client.audio.speech.create(
        model="tts-1-hd",
        voice="echo",
        input=text,
    )
    Path(f"section_{i}.mp3").write_bytes(response.content)
    print(f"Generated section {i}")
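A real blog post rarely arrives pre-split into sections. The sketch below shows one way to cut long text into chunks at sentence boundaries before sending each chunk to the API. The split_into_sections helper is our own, and the 4,000-character default is an assumed limit chosen for illustration. This guide does not state SpeakEasy's maximum input length, so check the documentation for the actual value.

```python
import re
from typing import List


def split_into_sections(text: str, max_chars: int = 4000) -> List[str]:
    """Split text into chunks of at most `max_chars`, breaking at sentence ends."""
    # Split after ., !, or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    sections, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            sections.append(current)
        # A single over-long sentence is hard-split so no chunk exceeds the limit.
        while len(sentence) > max_chars:
            sections.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        current = sentence
    if current:
        sections.append(current)
    return sections
```

The returned list can replace the hand-written sections list in the loop above. Splitting on sentence ends keeps each audio file from starting or stopping mid-sentence, which tends to sound more natural when the files are played back to back.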
What's Next?
You now have everything you need to add text-to-speech to your application. Whether you're building a voice assistant, generating audiobooks, or adding accessibility features, SpeakEasy makes it straightforward.
Explore the full text-to-speech documentation for all options, or check out our speech-to-text API to complete the loop with transcription.
Frequently Asked Questions
How much does text-to-speech cost on SpeakEasy?
SpeakEasy bills TTS at $0.20 per hour of generated audio — equivalent to roughly 66,667 characters — with 50 hours included in the $10/month plan and $1 for the first month. Both tts-1 and tts-1-hd are billed at the same rate, so HD quality carries no price premium.
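To budget a project, the two figures above are enough for a back-of-the-envelope estimate. The sketch below uses the $0.20/hour rate and the approximate 66,667 characters per hour quoted in this guide. The function names are our own, and actual audio length will vary with the text and voice, so treat the result as a rough guide.

```python
PRICE_PER_HOUR_USD = 0.20  # rate quoted in this guide
CHARS_PER_HOUR = 66_667    # approximate characters per hour of audio, per this guide


def estimate_hours(num_chars: int) -> float:
    """Approximate hours of generated audio for a given character count."""
    return num_chars / CHARS_PER_HOUR


def estimate_cost_usd(num_chars: int) -> float:
    """Approximate cost in US dollars at the per-hour rate."""
    return estimate_hours(num_chars) * PRICE_PER_HOUR_USD
```

For example, a 400,000-character manuscript works out to about 6 hours of audio, or roughly $1.20 at this rate.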
Can I use the OpenAI Python SDK for SpeakEasy TTS?
Yes. SpeakEasy is a drop-in replacement for OpenAI's TTS endpoint. Point the OpenAI SDK at https://www.tryspeakeasy.io/api/v1 with your SpeakEasy API key, keep the same voice IDs (alloy, echo, fable, onyx, nova, shimmer) and models (tts-1, tts-1-hd), and existing integrations work with no other code changes.
What's the difference between tts-1 and tts-1-hd?
tts-1 is optimized for low latency — useful for voice agents and real-time applications where first-byte latency matters more than audio polish. tts-1-hd produces higher-fidelity output with more natural prosody at the cost of slightly longer generation time. For audiobooks, podcasts, and marketing material, prefer tts-1-hd.
Which audio format is smallest?
Opus produces the smallest files at equivalent perceived quality — RFC 6716 defines its hybrid SILK/CELT codec, which is the reason Opus outperforms MP3 at low bitrates. For streaming to web clients (HTML5 <audio> via .ogg or .webm containers), Opus is the right default. For universal compatibility across email clients and older devices, use MP3.
Does SpeakEasy TTS support streaming?
Yes. With the OpenAI Python SDK, open the request with client.audio.speech.with_streaming_response.create(...) and iterate response.iter_bytes(chunk_size=4096) to receive audio chunks as they're generated. This lets you start playback before the full file is synthesized, reducing perceived latency, which is critical for voice agents and live narration. All voices and both models support streaming at the same $0.20/hour rate.
Related reading
- How to Build an AI Voice Agent with a Speech-to-Text API — wire TTS into a full conversational pipeline
- Open Source Text-to-Speech in 2026 — how hosted TTS compares to Kokoro, Chatterbox, and Fish Audio
- Python Speech-to-Text API — close the loop with transcription
Get your API key — $1 first month, 50 hours included — and start generating speech in minutes.