SpeakEasy Team · 4 min read

Build a Text-to-Speech App in 5 Minutes

Generate natural TTS audio in under 5 lines of Python via the OpenAI SDK pointed at SpeakEasy. 6 voices, streaming, HD mode, and 5 formats at $0.20/hour.

Text-to-Speech · Tutorial

Want to add voice output to your application? With the SpeakEasy text-to-speech API, you can generate natural-sounding audio from text in just a few lines of code. This guide walks you through voice selection, API calls, streaming, HD quality, word-level timestamps, and output formats.

TL;DR

  • Install openai, set base_url="https://www.tryspeakeasy.io/api/v1", call client.audio.speech.create(model="tts-1", voice="nova", input="..."). That's a working TTS integration.
  • 6 built-in voices (alloy, echo, fable, onyx, nova, shimmer) — same IDs as OpenAI's TTS API, so swapping providers takes one line.
  • Pricing: $0.20/hour of generated audio (≈66,667 characters). HD quality via model="tts-1-hd"; streaming, 5 output formats (mp3/opus/aac/flac/wav), and word-level timestamps all included.

Prerequisites

  • Python 3 with the openai package installed (pip install openai).
  • A SpeakEasy API key.

Which voice should I pick?

SpeakEasy offers a range of voices optimized for different use cases. The voice IDs are identical to OpenAI's TTS API, so any sample clips you've heard from OpenAI's nova or fable sound the same here. Here's how we think about each:

Voice   | Style               | Best for
alloy   | Neutral, balanced   | General purpose, documentation
echo    | Warm, conversational | Podcasts, narration
fable   | Expressive, dynamic | Storytelling, audiobooks
onyx    | Deep, authoritative | Presentations, announcements
nova    | Friendly, upbeat    | Customer-facing apps, assistants
shimmer | Calm, clear         | Meditation, accessibility

Browse all available voices in the text-to-speech documentation. In practice we default to nova for product UIs (friendly but not sugary), echo for long-form narration, and onyx for onboarding videos where authority matters.
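If you want that default-picking logic in code rather than in your head, a small lookup is enough. The sketch below is built from the table above; the use-case keys and the `pick_voice` helper are hypothetical names chosen for illustration, not part of the API.

```python
# Illustrative mapping built from the voice table; the use-case keys are
# hypothetical labels for this sketch, not values the API understands.
VOICE_BY_USE_CASE = {
    "documentation": "alloy",
    "narration": "echo",
    "storytelling": "fable",
    "announcements": "onyx",
    "assistant": "nova",
    "accessibility": "shimmer",
}

DEFAULT_VOICE = "nova"  # the default this article uses for product UIs


def pick_voice(use_case: str) -> str:
    """Return the voice ID for a use case, falling back to the default."""
    return VOICE_BY_USE_CASE.get(use_case.strip().lower(), DEFAULT_VOICE)


print(pick_voice("narration"))  # echo
print(pick_voice("unknown"))    # nova
```

Only the returned voice ID is ever sent to the API, so you can rename or extend the keys freely.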

How do I generate speech?

Python (OpenAI SDK)

from openai import OpenAI
from pathlib import Path

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://www.tryspeakeasy.io/api/v1",
)

response = client.audio.speech.create(
    model="tts-1",
    voice="nova",
    input="Welcome to SpeakEasy. Let's build something great together.",
)

Path("output.mp3").write_bytes(response.content)
print("Audio saved to output.mp3")

curl

curl -X POST https://www.tryspeakeasy.io/api/v1/audio/speech \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "voice": "nova",
    "input": "Welcome to SpeakEasy. Let us build something great together."
  }' \
  --output output.mp3

Both approaches produce an MP3 file ready to play. That's your text-to-speech integration done.
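Once the basic call works, you may want to wrap it in a small helper and keep the key out of your source code. This is a minimal sketch, not an official client: `synthesize_to_file`, `api_key_from_env`, and the `SPEAKEASY_API_KEY` variable name are all assumptions made for this example. The `client` argument is expected to be an OpenAI client configured with the SpeakEasy base_url as shown above.

```python
import os
from pathlib import Path


def api_key_from_env(name="SPEAKEASY_API_KEY"):
    """Read the API key from an environment variable.

    The variable name is a hypothetical choice for this sketch.
    """
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"set the {name} environment variable first")
    return key


def synthesize_to_file(client, text, path, voice="nova", model="tts-1"):
    """Generate speech for `text` and write the audio bytes to `path`."""
    if not text.strip():
        raise ValueError("input text must not be empty")
    response = client.audio.speech.create(model=model, voice=voice, input=text)
    out = Path(path)
    out.write_bytes(response.content)  # same attribute used in the example above
    return out
```

With the client created as `OpenAI(api_key=api_key_from_env(), base_url="https://www.tryspeakeasy.io/api/v1")`, a call like `synthesize_to_file(client, "Hello there.", "hello.mp3")` would replace the inline snippet.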

How do I stream audio?

For real-time applications like voice assistants or live narration, stream the audio instead of waiting for the complete file:

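# Request a streaming response so bytes arrive while audio is still being generated
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="nova",
    input="This audio is being streamed in real time.",
) as response:
    # Write each chunk to disk as soon as it is received
    with open("streamed.mp3", "wb") as f:
        for chunk in response.iter_bytes(chunk_size=4096):
            f.write(chunk)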

Streaming lets you start playback while the rest of the audio is still being generated, dramatically reducing perceived latency.
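To check how much streaming helps in your own setup, you can time the arrival of the first chunk. The helper below is a sketch with a hypothetical name (`save_stream`); it accepts any iterable of bytes, such as the result of response.iter_bytes(chunk_size=4096). The timer starts when the helper is called, so it approximates time to first audio rather than measuring it exactly.

```python
import time


def save_stream(chunks, path):
    """Write audio chunks to `path` as they arrive.

    Returns (total_bytes, seconds_until_first_chunk). The second value
    is None if the iterable produced no chunks.
    """
    started = time.perf_counter()
    first_chunk_after = None
    total = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if first_chunk_after is None:
                # Elapsed time from the call until the first bytes showed up
                first_chunk_after = time.perf_counter() - started
            f.write(chunk)
            total += len(chunk)
    return total, first_chunk_after
```

Comparing the second return value against the total request time gives a rough sense of how early playback could begin.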

When should I use HD quality?

For production content where audio quality matters — audiobooks, podcasts, or marketing material — use the HD model:

response = client.audio.speech.create(
    model="tts-1-hd",
    voice="fable",
    input="Premium audio quality for content that matters.",
)

The tts-1-hd model produces higher fidelity output with more natural prosody. It takes slightly longer to generate but the difference in quality is noticeable.

How do I get word-level timestamps?

Need to sync text highlighting with audio playback? Request word-level timestamps:

curl -X POST https://www.tryspeakeasy.io/api/v1/audio/speech \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "voice": "alloy",
    "input": "Each word gets a precise timestamp.",
    "response_format": "json"
  }'

The response includes timing data for each word, enabling karaoke-style text highlighting, subtitle synchronization, and accessible reading tools.
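For text highlighting, the core operation is finding which word is being spoken at a given playback time. The sketch below assumes each timing entry looks like {"word", "start", "end"} with times in seconds. That shape is an assumption made for illustration, so check the documentation for the actual field names before relying on it.

```python
from bisect import bisect_right


def word_index_at(words, t):
    """Return the index of the word being spoken at time `t` (seconds).

    `words` is assumed to be sorted by start time, each item shaped like
    {"word": str, "start": float, "end": float}. Returns None during
    silence between words or outside the audio.
    """
    starts = [w["start"] for w in words]
    i = bisect_right(starts, t) - 1  # last word that started at or before t
    if i >= 0 and t < words[i]["end"]:
        return i
    return None


# Hypothetical timing data, for illustration only.
words = [
    {"word": "Each", "start": 0.00, "end": 0.25},
    {"word": "word", "start": 0.30, "end": 0.55},
    {"word": "gets", "start": 0.60, "end": 0.80},
]
print(word_index_at(words, 0.40))  # 1
print(word_index_at(words, 0.27))  # None
```

A player would call this on every time update and highlight the returned index. For long texts, build the `starts` list once instead of on each call.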

Which output format should I use?

SpeakEasy supports five audio formats. Each strikes a different balance between file size, quality, and decoder compatibility, and those trade-offs follow from the underlying codec (Opus is specified in RFC 6716, FLAC in the IETF FLAC specification). Specify the one your application needs:

response = client.audio.speech.create(
    model="tts-1",
    voice="echo",
    input="Choose the format that fits your use case.",
    response_format="opus",  # Options: mp3, opus, aac, flac, wav
)

  • mp3: Best compatibility, good quality. Default.
  • opus: Smallest file size, ideal for streaming and web.
  • aac: Good balance, preferred on Apple platforms.
  • flac: Lossless, best for archival.
  • wav: Uncompressed, best for further audio processing.
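The guidance in the list can be written down as a small decision helper. This is a sketch: the flag names and their priority order are choices made for this example, and only the five format strings come from the article.

```python
SUPPORTED_FORMATS = ("mp3", "opus", "aac", "flac", "wav")


def choose_format(will_edit=False, lossless=False, smallest=False, apple=False):
    """Pick a response_format following the guidance in the list above."""
    if will_edit:
        return "wav"   # uncompressed, for further audio processing
    if lossless:
        return "flac"  # lossless, for archival
    if smallest:
        return "opus"  # smallest files, streaming and web
    if apple:
        return "aac"   # preferred on Apple platforms
    return "mp3"       # best compatibility, the default


def output_name(stem, fmt):
    """Build a file name, rejecting formats the article does not list."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return f"{stem}.{fmt}"


fmt = choose_format(smallest=True)
print(output_name("episode", fmt))  # episode.opus
```

The chosen value goes straight into the `response_format` argument shown above.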

Example: Batch Processing

Generate audio for multiple text blocks, such as turning a blog into a podcast:

from pathlib import Path

sections = [
    "Introduction. Welcome to today's episode.",
    "Chapter one. We begin with the fundamentals.",
    "Conclusion. Thanks for listening.",
]

for i, text in enumerate(sections):
    response = client.audio.speech.create(
        model="tts-1-hd",
        voice="echo",
        input=text,
    )
    Path(f"section_{i}.mp3").write_bytes(response.content)
    print(f"Generated section {i}")
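A real blog post usually arrives as one long string rather than a tidy list of sections, so a splitting step before the loop helps. The sketch below splits at sentence boundaries. The `max_chars` limit of 4000 is an assumed value for illustration, since this article does not state a per-request input limit.

```python
import re


def split_into_chunks(text, max_chars=4000):
    """Split text at sentence boundaries into chunks of about max_chars.

    max_chars is an assumed limit for this sketch. A single sentence
    longer than the limit is kept whole rather than cut mid-sentence.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(split_into_chunks("One. Two! Three?", max_chars=9))
# ['One. Two!', 'Three?']
```

The returned list can stand in for `sections` in the loop above, producing one audio file per chunk.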

What's Next?

You now have everything you need to add text-to-speech to your application. Whether you're building a voice assistant, generating audiobooks, or adding accessibility features, SpeakEasy makes it straightforward.

Explore the full text-to-speech documentation for all options, or check out our speech-to-text API to complete the loop with transcription.

Frequently Asked Questions

How much does text-to-speech cost on SpeakEasy?

SpeakEasy bills TTS at $0.20 per hour of generated audio, which corresponds to roughly 66,667 characters of input. The $10/month plan includes 50 hours, and the first month costs $1. Both tts-1 and tts-1-hd are billed at the same rate, so HD quality carries no price premium.
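To budget a project, you can turn those figures into a quick estimate. The sketch below uses only the numbers quoted above. The characters-per-hour figure is the article's approximation, so treat the result as a ballpark: actual audio length depends on the text and the voice.

```python
PRICE_PER_HOUR_USD = 0.20  # per hour of generated audio, from the pricing above
CHARS_PER_HOUR = 66_667    # the article's approximate equivalence
INCLUDED_HOURS = 50        # included in the $10/month plan


def estimate_hours(char_count):
    """Approximate hours of audio for a given number of input characters."""
    return char_count / CHARS_PER_HOUR


def estimate_cost_usd(char_count):
    """Approximate cost at the per-hour rate, ignoring plan allowances."""
    return estimate_hours(char_count) * PRICE_PER_HOUR_USD


chars = 500_000
print(f"{estimate_hours(chars):.1f} h, ${estimate_cost_usd(chars):.2f}")
# 7.5 h, $1.50
```

At these figures the rate works out to about $3 per million characters, and the 50 included hours cover roughly 3.3 million characters.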

Can I use the OpenAI Python SDK for SpeakEasy TTS?

Yes. SpeakEasy is a drop-in replacement for OpenAI's TTS endpoint. Point the OpenAI SDK at https://www.tryspeakeasy.io/api/v1 and supply your SpeakEasy API key. The voice IDs (alloy, echo, fable, onyx, nova, shimmer) and models (tts-1, tts-1-hd) are the same, so existing integrations need no other code changes.

What's the difference between tts-1 and tts-1-hd?

tts-1 is optimized for low latency — useful for voice agents and real-time applications where first-byte latency matters more than audio polish. tts-1-hd produces higher-fidelity output with more natural prosody at the cost of slightly longer generation time. For audiobooks, podcasts, and marketing material, prefer tts-1-hd.

Which audio format is smallest?

Opus produces the smallest files at equivalent perceived quality — RFC 6716 defines its hybrid SILK/CELT codec, which is the reason Opus outperforms MP3 at low bitrates. For streaming to web clients (HTML5 <audio> via .ogg or .webm containers), Opus is the right default. For universal compatibility across email clients and older devices, use MP3.

Does SpeakEasy TTS support streaming?

Yes. Make the request through client.audio.speech.with_streaming_response.create(...) and iterate over response.iter_bytes(chunk_size=4096) to receive audio chunks as they're generated. This lets you start playback before the full file is synthesized, which reduces perceived latency for voice agents and live narration. All voices and both models support streaming at the same $0.20/hour rate.

Get your API key — $1 first month, 50 hours included — and start generating speech in minutes.