Build a Text-to-Speech App in 5 Minutes

Want to add voice output to your application? With the SpeakEasy text-to-speech API, you can generate natural-sounding audio from text in just a few lines of code. This guide walks you through voice selection, API calls, streaming, output formats, and word-level timestamps.

TL;DR

Install openai, set base_url="https://www.tryspeakeasy.io/api/v1", call client.audio.speech.create(model="tts-1", voice="nova", input="..."). That's a working TTS integration.
6 built-in voices (alloy, echo, fable, onyx, nova, shimmer) — same IDs as OpenAI's TTS API, so swapping providers takes one line.
Pricing: $0.20/hour of generated audio (≈66,667 characters). HD quality via model="tts-1-hd"; streaming, 5 output formats (mp3/opus/aac/flac/wav), and word-level timestamps all included.

Prerequisites

A SpeakEasy API key (sign up — $1 first month)
Python 3.8+ or curl

Which voice should I pick?

SpeakEasy offers six built-in voices, each suited to different use cases. The voice IDs are identical to OpenAI's TTS API, so any sample clips you've heard from OpenAI's nova or fable sound the same here. Here's how we think about each:

Voice	Style	Best For
`alloy`	Neutral, balanced	General purpose, documentation
`echo`	Warm, conversational	Podcasts, narration
`fable`	Expressive, dynamic	Storytelling, audiobooks
`onyx`	Deep, authoritative	Presentations, announcements
`nova`	Friendly, upbeat	Customer-facing apps, assistants
`shimmer`	Calm, clear	Meditation, accessibility

Browse all available voices in the text-to-speech documentation. In practice we default to nova for product UIs (friendly but not sugary), echo for long-form narration, and onyx for onboarding videos where authority matters.
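
The recommendations above can be captured as a small lookup so the voice choice lives in one place. This is only a sketch: the use-case keys and the `pick_voice` helper are hypothetical names chosen for illustration, and only the voice IDs come from the table.

```python
# Hypothetical use-case keys mapped to the voice IDs from the table above.
VOICE_FOR_USE_CASE = {
    "documentation": "alloy",
    "narration": "echo",
    "storytelling": "fable",
    "announcements": "onyx",
    "assistant": "nova",
    "accessibility": "shimmer",
}

DEFAULT_VOICE = "nova"  # the default this guide suggests for product UIs


def pick_voice(use_case: str) -> str:
    """Return the suggested voice ID, falling back to the default."""
    return VOICE_FOR_USE_CASE.get(use_case.strip().lower(), DEFAULT_VOICE)


print(pick_voice("Narration"))  # echo
```

The returned string can be passed directly as the `voice` argument of `client.audio.speech.create`.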

How do I generate speech?

Python (OpenAI SDK)

from openai import OpenAI
from pathlib import Path

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://www.tryspeakeasy.io/api/v1",
)

response = client.audio.speech.create(
    model="tts-1",
    voice="nova",
    input="Welcome to SpeakEasy. Let's build something great together.",
)

Path("output.mp3").write_bytes(response.content)
print("Audio saved to output.mp3")

curl

curl -X POST https://www.tryspeakeasy.io/api/v1/audio/speech \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "voice": "nova",
    "input": "Welcome to SpeakEasy. Let us build something great together."
  }' \
  --output output.mp3

Both approaches produce an MP3 file ready to play. That's your text-to-speech integration done.
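
If you would rather not add the OpenAI SDK as a dependency, the curl call translates to Python's standard library. The sketch below assumes the endpoint behaves exactly as in the curl example; `build_speech_request` and `synthesize` are hypothetical helper names, not part of any SpeakEasy client.

```python
import json
import urllib.request

API_URL = "https://www.tryspeakeasy.io/api/v1/audio/speech"


def build_speech_request(text, api_key, voice="nova", model="tts-1"):
    """Build (but do not send) the same POST request the curl command makes."""
    body = json.dumps({"model": model, "voice": voice, "input": text})
    return urllib.request.Request(
        API_URL,
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(text, api_key, out_path="output.mp3"):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, api_key)
    with urllib.request.urlopen(request, timeout=60) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path
```

Calling `synthesize("Welcome to SpeakEasy.", "YOUR_API_KEY")` should write `output.mp3`, provided the key is valid.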

How do I stream audio?

For real-time applications like voice assistants or live narration, stream the audio instead of waiting for the complete file:

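# Reuses the client created in the Python example above.
# with_streaming_response leaves the HTTP body unread, so chunks arrive as they are generated.
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="nova",
    input="This audio is being streamed in real time.",
) as response:
    # Stream to file in chunks
    with open("streamed.mp3", "wb") as f:
        for chunk in response.iter_bytes(chunk_size=4096):
            f.write(chunk)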

Streaming lets you start playback while the rest of the audio is still being generated, dramatically reducing perceived latency.
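
To check how much streaming helps in your own setup, you can time the arrival of the first chunk. The helper below is a sketch that works with any iterable of byte chunks, such as `response.iter_bytes(chunk_size=4096)`; `relay_stream` is a hypothetical name, and the demo uses fake in-memory chunks instead of a real API response.

```python
import io
import time
from typing import BinaryIO, Iterable, Optional, Tuple


def relay_stream(
    chunks: Iterable[bytes], sink: BinaryIO
) -> Tuple[int, Optional[float]]:
    """Copy chunks into sink; return (bytes written, seconds until first chunk)."""
    start = time.perf_counter()
    first_chunk_after = None
    total = 0
    for chunk in chunks:
        if first_chunk_after is None:
            # Measured from the call to relay_stream, not from the HTTP request.
            first_chunk_after = time.perf_counter() - start
        sink.write(chunk)
        total += len(chunk)
    return total, first_chunk_after


# Demo with fake chunks standing in for a streamed API response.
buffer = io.BytesIO()
written, first_chunk_after = relay_stream([b"ID3", b"\x00" * 5], buffer)
print(written)  # 8
```

The sink can be an open file, a socket wrapper, or the stdin pipe of an audio player process, which is what makes early playback possible.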

When should I use HD quality?

For production content where audio quality matters — audiobooks, podcasts, or marketing material — use the HD model:

response = client.audio.speech.create(
    model="tts-1-hd",
    voice="fable",
    input="Premium audio quality for content that matters.",
)

The tts-1-hd model produces higher-fidelity output with more natural prosody. It takes slightly longer to generate, but the difference in quality is noticeable.

How do I get word-level timestamps?

Need to sync text highlighting with audio playback? Request word-level timestamps:

curl -X POST https://www.tryspeakeasy.io/api/v1/audio/speech \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "voice": "alloy",
    "input": "Each word gets a precise timestamp.",
    "response_format": "json"
  }'

The response includes timing data for each word, enabling karaoke-style text highlighting, subtitle synchronization, and accessible reading tools.
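
As one way to use the timing data, the sketch below turns word timings into SRT subtitle cues. The response schema is not shown in this guide, so the input shape is an assumption: a list of objects with `word`, `start`, and `end` fields, with times in seconds. Adapt the field names to whatever the API actually returns.

```python
from typing import Dict, List


def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words: List[Dict], words_per_cue: int = 4) -> str:
    """Group word timings into subtitle cues of up to words_per_cue words."""
    cues = []
    for i in range(0, len(words), words_per_cue):
        group = words[i:i + words_per_cue]
        text = " ".join(w["word"] for w in group)
        start = format_srt_time(group[0]["start"])
        end = format_srt_time(group[-1]["end"])
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Made-up timings in the assumed shape, not real API output.
sample = [
    {"word": "Each", "start": 0.0, "end": 0.25},
    {"word": "word", "start": 0.25, "end": 0.5},
    {"word": "gets", "start": 0.5, "end": 0.75},
]
print(words_to_srt(sample))
```

The same per-word start and end times can drive karaoke-style highlighting in a UI by toggling a CSS class as playback time passes each word's start.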

Which output format should I use?

SpeakEasy supports five audio formats. Each strikes a different balance between file size, quality, and decoder compatibility, which follows from the underlying codec (Opus is specified in RFC 6716, FLAC in RFC 9639). Specify the one your application needs:

response = client.audio.speech.create(
    model="tts-1",
    voice="echo",
    input="Choose the format that fits your use case.",
    response_format="opus",  # Options: mp3, opus, aac, flac, wav
)

mp3 — Best compatibility, good quality. Default.
opus — Smallest file size, ideal for streaming and web.
flac — Lossless, best for archival.
wav — Uncompressed, best for further audio processing.
aac — Good balance, preferred on Apple platforms.
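
When the format is configurable, it helps to keep the file extension and MIME type in sync with `response_format`. The table below is a sketch: `FORMATS` and `output_name` are hypothetical names, and the Ogg container assumed for Opus should be checked against what the API actually returns.

```python
# File extension and MIME type per response_format value.
# The Ogg container for Opus is an assumption, not confirmed by this guide.
FORMATS = {
    "mp3": (".mp3", "audio/mpeg"),
    "opus": (".opus", "audio/ogg"),
    "aac": (".aac", "audio/aac"),
    "flac": (".flac", "audio/flac"),
    "wav": (".wav", "audio/wav"),
}


def output_name(stem: str, response_format: str = "mp3") -> str:
    """Return a filename whose extension matches response_format."""
    try:
        extension, _mime = FORMATS[response_format]
    except KeyError:
        raise ValueError(f"unsupported format: {response_format!r}") from None
    return stem + extension


print(output_name("intro", "opus"))  # intro.opus
```

The MIME type is the value to send as `Content-Type` if you serve the generated audio from your own backend.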

Example: Batch Processing

Generate audio for multiple text blocks, such as turning a blog into a podcast:

from pathlib import Path

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://www.tryspeakeasy.io/api/v1",
)

sections = [
    "Introduction. Welcome to today's episode.",
    "Chapter one. We begin with the fundamentals.",
    "Conclusion. Thanks for listening.",
]

# Synthesize each section with the HD model and save it as its own MP3.
for i, text in enumerate(sections):
    response = client.audio.speech.create(
        model="tts-1-hd",
        voice="echo",
        input=text,
    )
    Path(f"section_{i}.mp3").write_bytes(response.content)
    print(f"Generated section {i}")
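
Real blog posts are usually longer than these sample sections, so you may need to split the text before sending it. The sketch below breaks text at sentence boundaries. The 4096-character default is an assumption borrowed from the input limit OpenAI documents for its speech endpoint; this guide does not state SpeakEasy's limit, and `split_text` is a hypothetical helper.

```python
import re
from typing import List


def split_text(text: str, max_chars: int = 4096) -> List[str]:
    """Split text into chunks of at most max_chars, breaking after sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that exceeds the limit on its own.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(split_text("One. Two. Three.", max_chars=10))  # ['One. Two.', 'Three.']
```

Each returned chunk can be used as one entry of `sections` in the loop above.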

What's Next?

You now have everything you need to add text-to-speech to your application. Whether you're building a voice assistant, generating audiobooks, or adding accessibility features, SpeakEasy makes it straightforward.

Explore the full text-to-speech documentation for all options, or check out our speech-to-text API to complete the loop with transcription.

Frequently Asked Questions

How much does text-to-speech cost on SpeakEasy?

SpeakEasy bills TTS at $0.20 per hour of generated audio — equivalent to roughly 66,667 characters — with 50 hours included in the $10/month plan and $1 for the first month. Both tts-1 and tts-1-hd are billed at the same rate, so HD quality carries no price premium.
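
For budgeting, the pricing figures above reduce to simple arithmetic. This sketch uses the guide's approximate conversion of 66,667 characters per hour; actual billing is based on the duration of the generated audio, so treat the result as an estimate. `estimate_cost` is a hypothetical helper.

```python
PRICE_PER_HOUR_USD = 0.20  # rate quoted in this guide
CHARS_PER_HOUR = 66_667    # the guide's approximate characters per audio hour


def estimate_cost(num_chars: int) -> float:
    """Estimate the TTS cost in USD for num_chars of input text."""
    hours = num_chars / CHARS_PER_HOUR
    return hours * PRICE_PER_HOUR_USD


# A 12,000-character blog post is roughly 0.18 hours of audio.
print(f"${estimate_cost(12_000):.3f}")  # $0.036
```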

Can I use the OpenAI Python SDK for SpeakEasy TTS?

Yes. SpeakEasy is a drop-in replacement for OpenAI's TTS endpoint. Point the OpenAI SDK at https://www.tryspeakeasy.io/api/v1 with your SpeakEasy API key, keep the same voice IDs (alloy, echo, fable, onyx, nova, shimmer) and models (tts-1, tts-1-hd), and existing integrations work with no other code changes.
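
One way to keep the provider swap to a single setting is to build the client arguments from configuration. In this sketch, `client_kwargs` and the `SPEAKEASY_API_KEY` variable name are hypothetical; the SpeakEasy base URL comes from this guide, and the OpenAI URL is that API's public default.

```python
import os

BASE_URLS = {
    "speakeasy": "https://www.tryspeakeasy.io/api/v1",
    "openai": "https://api.openai.com/v1",
}


def client_kwargs(provider: str, env=os.environ) -> dict:
    """Return keyword arguments for OpenAI(...) for the chosen provider."""
    # SPEAKEASY_API_KEY is a made-up variable name used for this sketch.
    key_var = "SPEAKEASY_API_KEY" if provider == "speakeasy" else "OPENAI_API_KEY"
    return {"api_key": env[key_var], "base_url": BASE_URLS[provider]}


kwargs = client_kwargs("speakeasy", env={"SPEAKEASY_API_KEY": "sk-demo"})
print(kwargs["base_url"])  # https://www.tryspeakeasy.io/api/v1
```

With the SDK installed, `client = OpenAI(**client_kwargs("speakeasy"))` would then create the client used throughout this guide.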

What's the difference between `tts-1` and `tts-1-hd`?

tts-1 is optimized for low latency — useful for voice agents and real-time applications where first-byte latency matters more than audio polish. tts-1-hd produces higher-fidelity output with more natural prosody at the cost of slightly longer generation time. For audiobooks, podcasts, and marketing material, prefer tts-1-hd.

Which audio format is smallest?

Opus produces the smallest files at equivalent perceived quality — RFC 6716 defines its hybrid SILK/CELT codec, which is the reason Opus outperforms MP3 at low bitrates. For streaming to web clients (HTML5 <audio> via .ogg or .webm containers), Opus is the right default. For universal compatibility across email clients and older devices, use MP3.
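
To get a feel for the size difference, file size for a constant-bitrate stream is bitrate times duration divided by eight. The bitrates below are illustrative assumptions, not SpeakEasy's actual encoder settings, and `file_size_mb` is a hypothetical helper.

```python
def file_size_mb(bitrate_kbps: float, duration_s: float) -> float:
    """Approximate size in megabytes of a constant-bitrate audio stream."""
    return bitrate_kbps * 1000 * duration_s / 8 / 1_000_000


one_hour = 3600  # seconds

# Illustrative bitrates only.
print(file_size_mb(32, one_hour))   # 14.4 (for example, Opus at 32 kbps)
print(file_size_mb(128, one_hour))  # 57.6 (for example, MP3 at 128 kbps)
```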

Does SpeakEasy TTS support streaming?

Yes. Open the request with client.audio.speech.with_streaming_response.create(...) and iterate response.iter_bytes(chunk_size=4096) to receive audio chunks as they're generated. This lets you start playback before the full file is synthesized, which reduces perceived latency and matters most for voice agents and live narration. All voices and both models support streaming at the same $0.20/hour rate.
