Speech-to-Text API
Accurate. Fast. Affordable.
Transcribe audio files in 99+ languages with state-of-the-art accuracy. Speaker diarization, word-level timestamps, and translation built in.
Code examples
OpenAI-compatible. Use your existing SDK and just change the base URL and API key.
curl -X POST https://www.tryspeakeasy.io/api/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@interview.mp3" \
  -F "response_format=verbose_json" \
  -F "speaker_labels=true"

from openai import OpenAI

# Point the standard OpenAI client at the SpeakEasy base URL.
client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://www.tryspeakeasy.io/api/v1",
)

# Open the file in a context manager so the handle is closed after the upload.
with open("meeting.mp3", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio,
    )
print(transcript.text)

// audioFile is a File or Blob; apiKey holds your SpeakEasy key.
const formData = new FormData();
formData.append('file', audioFile);
formData.append('language', 'en');
const res = await fetch('https://www.tryspeakeasy.io/api/v1/audio/transcriptions', {
  method: 'POST',
  headers: { Authorization: `Bearer ${apiKey}` },
  body: formData,
});
const { text } = await res.json();

What is the SpeakEasy Speech-to-Text API?
SpeakEasy is a pay-as-you-go speech-to-text API that transcribes audio in 99+ languages using OpenAI's Whisper large-v3 model for $0.20 per hour: 44% cheaper than OpenAI's own Whisper endpoint ($0.36/hr), 46% cheaper than AssemblyAI ($0.37/hr), and 79% cheaper than Google Cloud Speech-to-Text ($0.96/hr+). Accuracy is identical because the underlying model is identical: Whisper large-v3, 1.55 billion parameters, released by OpenAI in November 2023.
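The percentages above follow directly from the hourly prices quoted in this section. A quick check, using only those figures:

```python
# Hourly prices quoted above, in US dollars.
SPEAKEASY = 0.20
COMPETITORS = {
    "OpenAI Whisper": 0.36,
    "AssemblyAI": 0.37,
    "Google Cloud STT": 0.96,
}


def savings_pct(ours: float, theirs: float) -> int:
    """Percentage saved relative to the competitor's price, rounded to a whole number."""
    return round((theirs - ours) / theirs * 100)


savings = {name: savings_pct(SPEAKEASY, price) for name, price in COMPETITORS.items()}
for name, pct in savings.items():
    print(f"{name}: {pct}% cheaper")
```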
The API is drop-in compatible with the OpenAI Python and Node.js SDKs. Switching from OpenAI takes two lines of code: change base_url and api_key. Your existing transcription pipeline keeps working, and your bill drops by roughly half.
How the API works
You POST an audio file to /api/v1/audio/transcriptions as multipart form data. We stream the file to a GPU-backed Whisper large-v3 inference pool, run transcription, and return the result as JSON, plain text, SRT, or WebVTT — you pick via the response_format parameter.
For files under about 30 minutes the entire round-trip typically completes in 10–15 seconds. Longer files are processed asynchronously: supply a callback_url and we POST the result back when processing finishes. No polling, no SSE wiring to maintain.
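To receive asynchronous results you need an HTTP endpoint that accepts a POST. The sketch below uses only the Python standard library. It assumes the callback body is JSON with a top-level "text" field, like the synchronous response; the actual payload shape is not documented here, so treat the field name, the port, and the helper names as placeholders.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_callback(body: bytes) -> str:
    """Extract the transcript from a callback body.

    Assumption: the body is JSON with a top-level "text" field.
    Inspect a real callback before relying on this.
    """
    payload = json.loads(body.decode("utf-8"))
    return payload.get("text", "")


class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        transcript = parse_callback(self.rfile.read(length))
        print(f"Received transcript ({len(transcript)} characters)")
        # Reply 200 so the sender knows the result was accepted.
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8000) -> None:
    """Block and listen for callbacks on the given port."""
    HTTPServer(("", port), CallbackHandler).serve_forever()
```

Call serve() on a publicly reachable host and pass that host's URL as callback_url when you submit a long file.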
Every request is authenticated with a Bearer token (sk-… format, same shape as OpenAI) generated from your dashboard. Keys are hashed at rest with HMAC-SHA256, so we never store raw keys. Usage is metered per second of audio, billed monthly, and shown on your dashboard with a per-day breakdown.
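For readers unfamiliar with the technique, here is a general illustration of hashing API keys with HMAC-SHA256 so that only a digest is stored. It is not SpeakEasy's actual implementation; the function names, secret, and example key are hypothetical.

```python
import hashlib
import hmac


def hash_api_key(raw_key: str, server_secret: bytes) -> str:
    """Return the hex HMAC-SHA256 digest of an API key."""
    return hmac.new(server_secret, raw_key.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_api_key(presented_key: str, stored_digest: str, server_secret: bytes) -> bool:
    """Re-hash the presented key and compare digests in constant time."""
    candidate = hash_api_key(presented_key, server_secret)
    return hmac.compare_digest(candidate, stored_digest)


# Hypothetical values for illustration only.
secret = b"server-side-secret"
stored = hash_api_key("sk-example-key", secret)
print(verify_api_key("sk-example-key", stored, secret))
```

Only the digest is persisted, so a leaked database row cannot be replayed as a working key.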
Rate limit: 10 concurrent requests per account on the default plan. File size cap: 100 MB per upload. Supported formats: mp3, wav, flac, aac, opus, ogg, m4a, mp4, mpeg, mov, webm.
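Checking a file against these limits before uploading saves a failed round trip. A minimal client-side sketch: the function name is hypothetical, and it assumes "100 MB" means 100 * 1024 * 1024 bytes, which the service may count differently.

```python
from pathlib import Path

# Assumption: 100 MB is counted as 100 * 1024 * 1024 bytes.
MAX_BYTES = 100 * 1024 * 1024
SUPPORTED = {"mp3", "wav", "flac", "aac", "opus", "ogg", "m4a", "mp4", "mpeg", "mov", "webm"}


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    extension = Path(filename).suffix.lower().lstrip(".")
    if extension not in SUPPORTED:
        problems.append(f"unsupported format: {extension or 'none'}")
    if size_bytes > MAX_BYTES:
        problems.append("file exceeds 100 MB")
    return problems


print(check_upload("interview.mp3", 12_000_000))
```

The check looks only at the file extension, not the actual container or codec, so the server may still reject a mislabeled file.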
What developers build with it
Every use case below is live in production across the SpeakEasy customer base.
Meeting & interview transcripts
Enable speaker_labels=true and get per-speaker transcripts of up to 4 voices. A typical 45-minute Zoom recording transcribes in ~18 seconds for $0.15.
Podcast & video subtitles
Request response_format=srt or vtt and drop the output directly into YouTube, Vimeo, or any HTML5 video element. Word-level timestamps keep subtitles tight.
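If you want to inspect or adjust subtitle timing before publishing, SRT output is easy to parse. The sketch below reads standard SRT text into (start, end, caption) tuples. The sample is hand-written for illustration, not real API output.

```python
import re

# SRT timestamps look like 00:01:02,500 (hours:minutes:seconds,milliseconds).
TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


def to_seconds(stamp: str) -> float:
    hours, minutes, seconds, millis = map(int, TIMESTAMP.fullmatch(stamp.strip()).groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(srt: str) -> list[tuple[float, float, str]]:
    """Parse SRT text into (start_seconds, end_seconds, caption) tuples."""
    cues = []
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # skip malformed blocks
        start, end = lines[1].split("-->")
        cues.append((to_seconds(start), to_seconds(end), "\n".join(lines[2:])))
    return cues


sample = """1
00:00:00,000 --> 00:00:02,500
Welcome to the show.

2
00:00:02,500 --> 00:00:05,000
Today we talk about speech APIs."""

cues = parse_srt(sample)
print(cues)
```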
Voice notes & dictation apps
Mobile clients upload short clips and get back clean text. 99+ languages are covered, so a single endpoint handles English, Spanish, Mandarin, Hindi, Arabic, and anything else your users throw at it.
Call-center QA & compliance
Transcribe full call recordings with speaker diarization for sentiment analysis, compliance review, and searchable call archives. 10 concurrent slots on default plans cover most production call volumes.
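When batch-processing recordings, it helps to cap client-side parallelism at the plan's concurrency limit. A minimal sketch with a thread pool: transcribe_all and fake_transcribe are hypothetical names, and fake_transcribe stands in for a real API call so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

MAX_CONCURRENT = 10  # default-plan limit quoted on this page


def transcribe_all(paths: list[str], transcribe: Callable[[str], str]) -> dict[str, str]:
    """Run `transcribe` over every path with at most MAX_CONCURRENT calls in flight."""
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
        # pool.map preserves input order, so results line up with paths.
        return dict(zip(paths, pool.map(transcribe, paths)))


def fake_transcribe(path: str) -> str:
    """Stand-in for a real upload; replace with your API call."""
    return f"transcript of {path}"


results = transcribe_all([f"call_{i}.mp3" for i in range(25)], fake_transcribe)
print(len(results))
```

The pool size bounds how many uploads are in flight at once, so a queue of any length drains without exceeding the limit from this client.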
Multilingual content translation
Set translate=true to transcribe any of 99+ languages directly into English. One API call replaces a Whisper + GPT translation pipeline.
Audio search & indexing
Verbose JSON output includes word-level timestamps — index every word in a vector DB and you can jump users to the exact second a phrase was spoken in a 2-hour podcast.
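As a simpler starting point than a vector index, word timestamps also support plain exact-match lookup. The sketch below assumes the verbose JSON contains a list of word entries with "word", "start", and "end" fields, as in OpenAI's verbose_json format; confirm the shape against a real response. The sample data is hand-written.

```python
def find_phrase(words: list[dict], phrase: str) -> list[float]:
    """Return the start time (seconds) of every occurrence of `phrase`."""
    def norm(token: str) -> str:
        # Lowercase and drop surrounding punctuation so "cheap." matches "cheap".
        return token.strip().lower().strip(".,!?;:\"'")

    target = [norm(t) for t in phrase.split()]
    if not target:
        return []
    tokens = [norm(w["word"]) for w in words]
    hits = []
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            hits.append(words[i]["start"])
    return hits


# Hand-written sample in the assumed shape, not real API output.
words = [
    {"word": "Speech", "start": 0.0, "end": 0.4},
    {"word": "to", "start": 0.4, "end": 0.5},
    {"word": "text", "start": 0.5, "end": 0.9},
    {"word": "is", "start": 0.9, "end": 1.0},
    {"word": "cheap.", "start": 1.0, "end": 1.5},
]

print(find_phrase(words, "to text"))
```

Pass a returned start time to your player's seek control to jump to the moment the phrase was spoken.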
Supported audio formats
mp3, wav, flac, aac, opus, ogg, m4a, mp4, mpeg, mov, webm
Up to 100 MB per file
Built for production
99+ Languages
Detect the language automatically, or specify it for faster processing. Supports English, Chinese, Spanish, French, German, Japanese, and 93+ more.
Speaker Diarization
Identify who said what. Our API labels up to 4 speakers automatically, perfect for meetings, interviews, and podcasts.
Word Timestamps
Get precise start and end times for each word. Essential for subtitle generation, search indexing, and audio editing.
Translation
Transcribe audio in any language and translate the output to English in a single API call.
Multiple Formats
Get results as plain text, JSON, verbose JSON with timestamps, SRT subtitles, or WebVTT.
Async Processing
For long files, provide a callback URL and we'll POST the results when processing is complete.
How SpeakEasy Compares
Same Whisper large-v3 accuracy. A fraction of the cost.
| Provider | Price / hour | Diarization | OpenAI SDK |
|---|---|---|---|
| SpeakEasy | $0.20 | ✓ | ✓ |
| OpenAI Whisper | $0.36 | ✗ | ✓ |
| AssemblyAI | $0.37 | ✓ | ✗ |
| Deepgram | $0.25 | ✓ | ✗ |
| Google Cloud STT | $0.96+ | ✓ | ✗ |
All providers use equivalent Whisper large-v3 accuracy. Pricing as of April 2026.
Speech-to-Text Pricing
Just $0.20 per hour of audio at the plan rate. No hidden fees.
$10/month includes 50 hours. Additional usage is billed at $0.25 per hour. The first month costs $1.
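To estimate a monthly bill from these numbers, the sketch below applies the plan fee plus overage. It assumes overage is prorated by the second, in line with per-second metering, and it ignores the $1 first-month offer. The function name is hypothetical.

```python
PLAN_FEE = 10.00       # dollars per month
INCLUDED_HOURS = 50    # hours covered by the plan fee
OVERAGE_RATE = 0.25    # dollars per additional hour


def monthly_cost(audio_seconds: int) -> float:
    """Plan fee plus overage for audio beyond the included hours."""
    hours = audio_seconds / 3600
    overage_hours = max(0.0, hours - INCLUDED_HOURS)
    return round(PLAN_FEE + overage_hours * OVERAGE_RATE, 2)


print(monthly_cost(50 * 3600))  # exactly the included hours
print(monthly_cost(60 * 3600))  # ten hours of overage
```

At exactly 50 hours the effective rate is $10 / 50 = $0.20 per hour, which matches the plan rate above.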
Frequently asked questions
What audio formats are supported?
How accurate is the transcription?
Do you support speaker diarization?
Can I translate audio to English?
Is there a file size limit?
How fast is the transcription?
$1. 50 hours. Both STT and TTS.
Your current speech API provider is charging you too much. Switch in two lines of code.