OpenAI Whisper Compatible

Speech-to-Text API
Accurate. Fast. Affordable.

Transcribe audio files in 99+ languages with state-of-the-art accuracy. Speaker diarization, word-level timestamps, and translation built in.

By SpeakEasy Team · Last updated

Supported audio formats

.mp3 .wav .flac .aac .opus .ogg .m4a .mp4 .mpeg .mov .webm

Up to 25MB via upload · Up to 1GB via URL
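A minimal sketch of a pre-flight check based on the formats and limits above. The function name, the return values, and the decimal reading of MB and GB are assumptions made for illustration; they are not part of the SpeakEasy API.

```python
from pathlib import Path

# Extensions listed above; the service may accept more.
SUPPORTED_EXTENSIONS = {
    ".mp3", ".wav", ".flac", ".aac", ".opus", ".ogg",
    ".m4a", ".mp4", ".mpeg", ".mov", ".webm",
}

# Assumption: limits use decimal units (1 MB = 1,000,000 bytes).
MAX_UPLOAD_BYTES = 25 * 1000 * 1000
MAX_URL_BYTES = 1000 * 1000 * 1000


def choose_delivery(filename: str, size_bytes: int) -> str:
    """Return "upload" or "url", or raise ValueError if the file cannot be sent."""
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported extension: {extension or '(none)'}")
    if size_bytes <= MAX_UPLOAD_BYTES:
        return "upload"
    if size_bytes <= MAX_URL_BYTES:
        return "url"
    raise ValueError("file exceeds the 1 GB URL limit")
```

Running a check like this before sending a request avoids a round trip for files that would be rejected anyway.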

Code examples

OpenAI-compatible. Use your existing SDK — just change the base URL.

cURL
curl -X POST https://api.tryspeakeasy.io/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@interview.mp3" \
  -F "response_format=verbose_json" \
  -F "speaker_labels=true"

Python (OpenAI SDK)
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.tryspeakeasy.io/v1"
)

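# Open the file in a context manager so the handle is closed after the upload.
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
    )
print(transcript.text)

JavaScript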
// audioFile is a File or Blob (for example, from a file input); apiKey holds your API key.
const formData = new FormData();
formData.append('file', audioFile);
formData.append('language', 'en');

const res = await fetch('https://api.tryspeakeasy.io/v1/audio/transcriptions', {
  method: 'POST',
  headers: { Authorization: `Bearer ${apiKey}` },
  body: formData,
});
const { text } = await res.json();

Built for production

99+ Languages

Use automatic language detection, or specify the language for faster processing. Supports English, Chinese, Spanish, French, German, Japanese, and more than 90 others.

Speaker Diarization

Identify who said what. Our API labels up to 4 speakers automatically, perfect for meetings, interviews, and podcasts.
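As a sketch of how labeled output might be consumed, the helper below merges consecutive segments from the same speaker into readable turns. The segment fields (`speaker`, `start`, `end`, `text`) are an assumed response shape, not a documented one; adjust them to whatever the verbose JSON response actually contains.

```python
from typing import Dict, List


def merge_speaker_turns(segments: List[Dict]) -> List[Dict]:
    """Merge consecutive segments spoken by the same speaker into one turn."""
    turns: List[Dict] = []
    for segment in segments:
        text = segment["text"].strip()
        if turns and turns[-1]["speaker"] == segment["speaker"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["text"] += " " + text
            turns[-1]["end"] = segment["end"]
        else:
            # Speaker changed: start a new turn.
            turns.append({
                "speaker": segment["speaker"],
                "start": segment["start"],
                "end": segment["end"],
                "text": text,
            })
    return turns


# Hypothetical segments, shaped as assumed above.
example_segments = [
    {"speaker": "A", "start": 0.0, "end": 1.2, "text": "Thanks for joining."},
    {"speaker": "A", "start": 1.3, "end": 2.0, "text": "Shall we start?"},
    {"speaker": "B", "start": 2.1, "end": 2.8, "text": "Yes, go ahead."},
]
for turn in merge_speaker_turns(example_segments):
    print(f'{turn["speaker"]}: {turn["text"]}')
```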

Word Timestamps

Get precise start and end times for each word. Essential for subtitle generation, search indexing, and audio editing.
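For subtitle generation, word timestamps can be grouped into SRT cues. The sketch below assumes each word is a dictionary with `word`, `start`, and `end` keys (times in seconds), as in OpenAI's verbose JSON format; the fixed cue size of seven words is an arbitrary choice for illustration.

```python
from typing import Dict, List


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Dict], words_per_cue: int = 7) -> str:
    """Group words into fixed-size cues and render them as SRT text."""
    cues = []
    for index in range(0, len(words), words_per_cue):
        chunk = words[index:index + words_per_cue]
        start = srt_timestamp(chunk[0]["start"])
        end = srt_timestamp(chunk[-1]["end"])
        text = " ".join(w["word"].strip() for w in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    # Cues are separated by a blank line.
    return "\n".join(cues)
```

A production subtitle generator would usually also break cues at pauses and sentence boundaries rather than at a fixed word count.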

Translation

Transcribe audio in any language and translate the output to English in a single API call.
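A small sketch of assembling the optional form fields for a request. The field names `response_format`, `language`, and `speaker_labels` come from the code examples on this page, and `translate` from the FAQ; the helper itself and the `"true"` string encoding (mirroring the cURL example) are assumptions.

```python
from typing import Dict, Optional


def build_form_fields(
    response_format: str = "json",
    language: Optional[str] = None,
    speaker_labels: bool = False,
    translate: bool = False,
) -> Dict[str, str]:
    """Build the optional multipart form fields for a transcription request."""
    fields = {"response_format": response_format}
    if language is not None:
        # Omit the field entirely to rely on automatic language detection.
        fields["language"] = language
    if speaker_labels:
        fields["speaker_labels"] = "true"
    if translate:
        # Assumed encoding: boolean flags are sent as the string "true".
        fields["translate"] = "true"
    return fields
```

The resulting dictionary can be passed as the `data` argument of `requests.post`, alongside the audio in `files={"file": ...}` and the `Authorization` header.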

Multiple Formats

Get results as plain text, JSON, verbose JSON with timestamps, SRT subtitles, or WebVTT.

Async Processing

For long files, provide a callback URL and we'll POST the results when processing is complete.
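A minimal sketch of a callback receiver using only the Python standard library. The payload format is not documented here, so the code assumes the callback body is JSON with a `text` field, like the synchronous response; treat that shape and the handler name as hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler
from typing import Dict


def parse_callback(body: bytes) -> Dict:
    """Decode a callback body; assumes JSON with a top-level "text" field."""
    payload = json.loads(body.decode("utf-8"))
    if not isinstance(payload, dict) or "text" not in payload:
        raise ValueError("callback payload has no 'text' field")
    return payload


class CallbackHandler(BaseHTTPRequestHandler):
    """Accept POSTed results and acknowledge them with 200, or 400 if malformed."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = parse_callback(self.rfile.read(length))
        except ValueError:
            # Covers invalid UTF-8, invalid JSON, and a missing "text" field.
            self.send_response(400)
            self.end_headers()
            return
        print(payload["text"])
        self.send_response(200)
        self.end_headers()
```

To try it locally, serve the handler with `http.server.HTTPServer(("", 8000), CallbackHandler).serve_forever()` and expose the port through a publicly reachable URL.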

How SpeakEasy Compares

Same Whisper large-v3 accuracy. A fraction of the cost.

Provider            Price / hour    Diarization    OpenAI SDK
SpeakEasy           $0.20
OpenAI Whisper      $0.36
AssemblyAI          $0.37
Deepgram            $0.25
Google Cloud STT    $0.96+

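The comparison assumes accuracy equivalent to Whisper large-v3 across providers; AssemblyAI, Deepgram, and Google Cloud STT run their own models, so results may differ. Pricing as of April 2026.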

Speech-to-Text Pricing

Just $0.20 per hour of audio at the plan rate. No hidden fees.

The $10/month plan includes 50 hours. Additional usage is billed at $0.25 per hour. The first month costs $1.
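The pricing arithmetic can be sketched as a small function. It assumes the $1 first month includes the same 50 hours and the same $0.25 overage rate, which the pricing text does not state explicitly.

```python
def monthly_cost(hours: float, first_month: bool = False) -> float:
    """Estimate the monthly bill in dollars for a given number of audio hours."""
    included_hours = 50
    overage_rate = 0.25  # dollars per additional hour
    # Assumption: the first month costs $1 and includes the same 50 hours.
    base = 1.00 if first_month else 10.00
    extra_hours = max(0.0, hours - included_hours)
    return round(base + extra_hours * overage_rate, 2)


print(monthly_cost(50))   # 10.0, the plan rate of $0.20 per hour
print(monthly_cost(60))   # 12.5, with 10 extra hours at $0.25
```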

View full pricing →

Frequently asked questions

What audio formats are supported?
We support mp3, wav, flac, aac, opus, ogg, m4a, mp4, mpeg, mov, webm, and more. Files up to 25MB via upload, or up to 1GB via URL.
How accurate is the transcription?
We use OpenAI's Whisper large-v3 model. OpenAI's original Whisper benchmarks (2022) report 2.7% WER on LibriSpeech clean audio for the large model; large-v3 is a later release in the same family. Real-world WER is typically under 5% for clear recordings, though results vary across the 99+ supported languages.
Do you support speaker diarization?
Yes! Enable speaker_labels in your request to automatically identify and label up to 4 different speakers in your audio.
Can I translate audio to English?
Yes. Set the translate parameter to true and we'll transcribe any language and translate the output to English.
Is there a file size limit?
Direct uploads support files up to 25MB. For larger files, pass a URL and we support up to 1GB.
How fast is the transcription?
Most audio files are transcribed in under a second. Longer files (30+ minutes) may take a few seconds. We also support async processing via callback URLs.

$1. 50 hours. Both STT and TTS.

Your current speech API provider is charging you too much. Switch in one line of code.
