OpenAI Whisper Compatible

Speech-to-Text API
Accurate. Fast. Affordable.

Transcribe audio files in 99+ languages with state-of-the-art accuracy. Speaker diarization, word-level timestamps, and translation built in.

Code examples

OpenAI-compatible. Use your existing SDK — just change the base URL.

cURL
curl -X POST https://www.tryspeakeasy.io/api/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@interview.mp3" \
  -F "response_format=verbose_json" \
  -F "speaker_labels=true"

Python (OpenAI SDK)
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://www.tryspeakeasy.io/api/v1"
)

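# Open the file in a context manager so the handle is closed after the upload.
with open("meeting.mp3", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio,
    )
print(transcript.text)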
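
JavaScript
// audioFile is a File or Blob (for example from an <input type="file">); apiKey is your API key.
const formData = new FormData();
formData.append('file', audioFile);
formData.append('language', 'en');

const res = await fetch('https://www.tryspeakeasy.io/api/v1/audio/transcriptions', {
  method: 'POST',
  headers: { Authorization: `Bearer ${apiKey}` },
  body: formData,
});
// Surface HTTP errors instead of trying to parse an error body as a transcript.
if (!res.ok) throw new Error(`Transcription failed: HTTP ${res.status}`);
const { text } = await res.json();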

What is the SpeakEasy Speech-to-Text API?

SpeakEasy is a pay-as-you-go speech-to-text API that transcribes audio in 99+ languages using OpenAI's Whisper large-v3 model for $0.20 per hour. That is 44% cheaper than OpenAI's own Whisper endpoint ($0.36/hr), 46% cheaper than AssemblyAI ($0.37/hr), and 79% cheaper than Google Cloud Speech-to-Text ($0.96/hr+). Accuracy is identical because the underlying model is identical: Whisper large-v3, 1.55 billion parameters, released by OpenAI in November 2023.
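The percentages above follow directly from the quoted per-hour prices. A quick sketch of the arithmetic, using only the figures on this page (the helper name percent_cheaper is ours, for illustration):

```python
# Per-hour list prices quoted on this page (USD).
COMPETITOR_PRICES = {
    "OpenAI Whisper": 0.36,
    "AssemblyAI": 0.37,
    "Google Cloud STT": 0.96,
}
SPEAKEASY_PRICE = 0.20


def percent_cheaper(ours: float, theirs: float) -> int:
    """Return how much cheaper `ours` is than `theirs`, as a whole percent."""
    return round((theirs - ours) / theirs * 100)


savings = {
    name: percent_cheaper(SPEAKEASY_PRICE, price)
    for name, price in COMPETITOR_PRICES.items()
}
print(savings)
```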

The API is drop-in compatible with the OpenAI Python and Node.js SDKs. Switching from OpenAI takes two lines of code: change base_url and api_key. Your existing transcription pipeline keeps working, and your bill drops by roughly half.

How the API works

You POST an audio file to /api/v1/audio/transcriptions as multipart form data. We stream the file to a GPU-backed Whisper large-v3 inference pool, run transcription, and return the result as JSON, plain text, SRT, or WebVTT — you pick via the response_format parameter.
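Because response_format decides whether the body comes back as JSON or as plain text, client code usually branches on it. The sketch below is one way to do that, not official client code. It assumes the value names "json" and "text" follow OpenAI's convention (this page only shows verbose_json, srt, and vtt) and that boolean options are sent as the strings "true" and "false", as in the cURL example. build_fields and parse_body are illustrative names.

```python
import json

# "verbose_json", "srt" and "vtt" appear on this page; "json" and "text"
# are assumed to follow OpenAI's naming.
JSON_FORMATS = {"json", "verbose_json"}
TEXT_FORMATS = {"text", "srt", "vtt"}


def build_fields(response_format: str = "json", **flags: bool) -> dict:
    """Build the non-file multipart form fields for a transcription request."""
    if response_format not in JSON_FORMATS | TEXT_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format}")
    fields = {"response_format": response_format}
    for name, enabled in flags.items():
        # Form fields are strings, so booleans are spelled out.
        fields[name] = "true" if enabled else "false"
    return fields


def parse_body(response_format: str, body: bytes):
    """Decode a response body: a dict for JSON formats, a str otherwise."""
    text = body.decode("utf-8")
    return json.loads(text) if response_format in JSON_FORMATS else text
```

Pass the returned fields alongside the file part in your HTTP client of choice, then hand the raw response body to parse_body with the same response_format.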

For files under about 30 minutes the entire round-trip typically completes in 10–15 seconds. Longer files are processed asynchronously: supply a callback_url and we POST the result back when processing finishes. No polling, no SSE wiring to maintain.
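On your side, the callback is just an HTTP endpoint that accepts a POST. Here is a minimal receiver sketch using only the Python standard library. The payload shape is an assumption: we assume a JSON body with a "text" field, mirroring the synchronous JSON response, so check a real callback before relying on it. extract_text, CallbackHandler, and serve are illustrative names.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def extract_text(raw_body: bytes) -> str:
    """Pull the transcript out of a callback body (assumed JSON with a "text" field)."""
    payload = json.loads(raw_body.decode("utf-8"))
    return payload["text"]


class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes the sender declared.
        length = int(self.headers.get("Content-Length", 0))
        transcript = extract_text(self.rfile.read(length))
        print(f"received transcript with {len(transcript)} characters")
        self.send_response(204)  # acknowledge with an empty response
        self.end_headers()


def serve(port: int = 8000) -> None:
    """Block and listen for callbacks on the host behind your callback_url."""
    HTTPServer(("", port), CallbackHandler).serve_forever()
```

Call serve() on a host reachable at the callback_url you supply. A production receiver would also handle malformed bodies and verify that the request really came from the API.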

Every request is authenticated with a Bearer token (sk-… format, same shape as OpenAI) generated from your dashboard. Keys are hashed at rest with HMAC-SHA256, so we never store raw keys. Usage is metered per second of audio, billed monthly, and shown on your dashboard with a per-day breakdown.
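For readers unfamiliar with keyed hashing at rest, the general technique looks like the sketch below. This illustrates HMAC-SHA256 in general and is not SpeakEasy's actual implementation; SERVER_SECRET, hash_key, and verify_key are hypothetical names.

```python
import hashlib
import hmac
import secrets

# Hypothetical server-side secret. In practice it would live in a secrets
# manager, not be regenerated on every start.
SERVER_SECRET = secrets.token_bytes(32)


def hash_key(api_key: str, secret: bytes = SERVER_SECRET) -> str:
    """Return the hex HMAC-SHA256 digest that gets stored instead of the raw key."""
    return hmac.new(secret, api_key.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_key(presented: str, stored_digest: str, secret: bytes = SERVER_SECRET) -> bool:
    """Re-hash the presented key and compare in constant time."""
    return hmac.compare_digest(hash_key(presented, secret), stored_digest)
```

Only the digest is persisted, so a leaked database row cannot be replayed as a Bearer token without also knowing the server secret.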

Rate limit: 10 concurrent requests per account on the default plan. File size cap: 100 MB per upload. Supported formats: mp3, wav, flac, aac, opus, ogg, m4a, mp4, mpeg, mov, webm.
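You can catch most rejected uploads before they leave your machine. The sketch below checks the format list and size cap stated above and holds a semaphore sized to the default concurrency limit. It assumes the 100 MB cap means 100,000,000 bytes, which is the stricter reading. check_upload and run_with_slot are illustrative names.

```python
import os
import threading

# Limits as stated on this page. The byte interpretation of "100 MB" is an assumption.
MAX_UPLOAD_BYTES = 100 * 1000 * 1000
MAX_CONCURRENT = 10
SUPPORTED_FORMATS = {
    "mp3", "wav", "flac", "aac", "opus", "ogg", "m4a", "mp4", "mpeg", "mov", "webm",
}

# Never more than MAX_CONCURRENT uploads in flight from this process.
upload_slots = threading.BoundedSemaphore(MAX_CONCURRENT)


def check_upload(filename: str, size_bytes: int) -> list:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    extension = os.path.splitext(filename)[1].lower().lstrip(".")
    if extension not in SUPPORTED_FORMATS:
        problems.append(f"unsupported format: {extension or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append(f"file is {size_bytes} bytes, above the {MAX_UPLOAD_BYTES} byte cap")
    return problems


def run_with_slot(send, *args):
    """Call `send(*args)` while holding one of the concurrency slots."""
    with upload_slots:
        return send(*args)
```

Wrap each upload call in run_with_slot when you fan out across threads, so a burst of work waits locally instead of exceeding the account limit.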

What developers build with it

Every use case below is live in production across the SpeakEasy customer base.

Meeting & interview transcripts

Enable speaker_labels=true and get per-speaker transcripts for up to 4 voices. A typical 45-minute Zoom recording transcribes in ~18 seconds for $0.15.
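Once you have labeled output, turning it into a readable transcript is a small grouping step. The response shape here is an assumption, since this page does not document it: the sketch expects a list of segments, each with a "speaker" label and a "text" string. speaker_turns is an illustrative name.

```python
def speaker_turns(segments):
    """Merge consecutive segments from the same speaker into (speaker, text) turns.

    Assumes each segment is a dict with "speaker" and "text" keys.
    """
    turns = []
    for segment in segments:
        speaker = segment["speaker"]
        text = segment["text"].strip()
        if turns and turns[-1][0] == speaker:
            # Same speaker kept talking: extend the current turn.
            turns[-1] = (speaker, turns[-1][1] + " " + text)
        else:
            turns.append((speaker, text))
    return turns


example = [
    {"speaker": "A", "text": "Thanks for joining."},
    {"speaker": "A", "text": "Shall we start?"},
    {"speaker": "B", "text": "Yes, go ahead."},
]
for speaker, text in speaker_turns(example):
    print(f"{speaker}: {text}")
```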

Podcast & video subtitles

Request response_format=srt or vtt and drop the output directly into YouTube, Vimeo, or any HTML5 video element. Word-level timestamps keep subtitles tight.

Voice notes & dictation apps

Mobile clients upload short clips, get back clean text. 99+ languages covered, so a single endpoint handles English, Spanish, Mandarin, Hindi, Arabic, and anything else your users throw at it.

Call-center QA & compliance

Transcribe full call recordings with speaker diarization for sentiment analysis, compliance review, and searchable call archives. 10 concurrent slots on default plans cover most production call volumes.

Multilingual content translation

Set translate=true to transcribe any of 99+ languages directly into English. One API call replaces a Whisper + GPT translation pipeline.

Audio search & indexing

Verbose JSON output includes word-level timestamps — index every word in a vector DB and you can jump users to the exact second a phrase was spoken in a 2-hour podcast.
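As a simpler starting point than a vector DB, exact phrase lookup over the word list already gets you jump-to-timestamp behavior. The sketch assumes each word entry has "word", "start", and "end" keys, the shape OpenAI uses for word timestamps; confirm against a real verbose_json response. find_phrase is an illustrative name.

```python
import string


def _normalize(token: str) -> str:
    """Lowercase a token and trim surrounding whitespace and punctuation."""
    return token.strip().strip(string.punctuation).lower()


def find_phrase(words, phrase):
    """Return the start time (seconds) of every exact occurrence of `phrase`.

    `words` is assumed to be a list of dicts with "word", "start" and "end" keys.
    """
    target = [_normalize(t) for t in phrase.split()]
    if not target:
        return []
    tokens = [_normalize(w["word"]) for w in words]
    hits = []
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            hits.append(words[i]["start"])
    return hits


sample = [
    {"word": "Welcome", "start": 0.0, "end": 0.4},
    {"word": "to", "start": 0.4, "end": 0.5},
    {"word": "the", "start": 0.5, "end": 0.6},
    {"word": "show.", "start": 0.6, "end": 1.0},
]
print(find_phrase(sample, "the show"))
```

Seek your player to the returned start time to land on the phrase. Swap the exact match for embeddings when you need fuzzy or semantic search.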

Supported audio formats

.mp3, .wav, .flac, .aac, .opus, .ogg, .m4a, .mp4, .mpeg, .mov, .webm

Up to 100MB per file

Built for production

99+ Languages

Automatic language detection or specify the language for faster processing. Supports English, Chinese, Spanish, French, German, Japanese, and 95+ more.

Speaker Diarization

Identify who said what. Our API labels up to 4 speakers automatically, perfect for meetings, interviews, and podcasts.

Word Timestamps

Get precise start and end times for each word. Essential for subtitle generation, search indexing, and audio editing.

Translation

Transcribe audio in any language and translate the output to English in a single API call.

Multiple Formats

Get results as plain text, JSON, verbose JSON with timestamps, SRT subtitles, or WebVTT.

Async Processing

For long files, provide a callback URL and we'll POST the results when processing is complete.

How SpeakEasy Compares

Same Whisper large-v3 accuracy. A fraction of the cost.

Provider            Price / hour    Diarization    OpenAI SDK
SpeakEasy           $0.20
OpenAI Whisper      $0.36
AssemblyAI          $0.37
Deepgram            $0.25
Google Cloud STT    $0.96+

All providers use equivalent Whisper large-v3 accuracy. Pricing as of April 2026.

Speech-to-Text Pricing

Just $0.20 per hour of audio at the plan rate. No hidden fees.

$10/month includes 50 hours. Additional usage at $0.25 per hour. $1 first month.
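To estimate a monthly bill from these numbers, the sketch below applies the plan fee, the included hours, and the overage rate. It ignores the $1 first-month offer and assumes overage is charged proportionally for partial hours, which matches per-second metering but is not spelled out here. monthly_cost and effective_rate are illustrative names.

```python
PLAN_FEE = 10.00       # USD per month
INCLUDED_HOURS = 50    # hours of audio covered by the plan fee
OVERAGE_RATE = 0.25    # USD per additional hour


def monthly_cost(hours_used: float) -> float:
    """Plan fee plus overage for any audio beyond the included hours."""
    extra_hours = max(0.0, hours_used - INCLUDED_HOURS)
    return round(PLAN_FEE + extra_hours * OVERAGE_RATE, 2)


def effective_rate(hours_used: float) -> float:
    """Average USD per hour actually paid for the month."""
    return round(monthly_cost(hours_used) / hours_used, 4)


for hours in (10, 50, 70):
    print(hours, monthly_cost(hours), effective_rate(hours))
```

The $0.20 plan rate is reached when you use all 50 included hours; lighter months pay more per hour, and heavier months trend toward the $0.25 overage rate.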

View full pricing →

Frequently asked questions

What audio formats are supported?
We support mp3, wav, flac, aac, opus, ogg, m4a, mp4, mpeg, mov, webm, and more. Each file can be up to 100 MB.
How accurate is the transcription?
We use OpenAI's Whisper large-v3 model. In OpenAI's published Whisper benchmarks (2022), which predate the November 2023 release of large-v3, the large model achieves 2.7% WER on LibriSpeech clean audio. Real-world WER is typically under 5% for clear recordings across 99+ languages.
Do you support speaker diarization?
Yes! Enable speaker_labels in your request to automatically identify and label up to 4 different speakers in your audio.
Can I translate audio to English?
Yes. Set the translate parameter to true and we'll transcribe any language and translate the output to English.
Is there a file size limit?
Yes. Uploads are capped at 100 MB per file.
How fast is the transcription?
Most audio files are transcribed within 10–15 seconds. Longer files (30+ minutes) may take longer. We also support async processing via callback URLs.

$1. 50 hours. Both STT and TTS.

Your current speech API provider is charging you too much. Switch in two lines of code.