# Benchmarks
## How we measure accuracy
100 clips per condition, scored against reference transcripts, no retries. These are the numbers we actually measured, not numbers we copied from a model card. Last run 2026-04-20.
## Results
SpeakEasy runs Whisper large-v3. We sample 100 clips per condition with a fixed random seed (20260420), transcribe each one through our public API, and score word-error-rate against the official LibriSpeech references.
| Condition | Clips | Audio | Corpus WER |
|---|---|---|---|
| LibriSpeech test-clean (English, clean read speech from audiobooks; CC BY 4.0) | 100/100 | 11.7 min | 3.41% |
| LibriSpeech test-other (English, noisier and harder read speech from audiobooks; CC BY 4.0) | 100/100 | 10.2 min | 5.52% |
## Error breakdown
Corpus WER = (substitutions + deletions + insertions) / reference words. Weighting by words (rather than averaging per-clip WER) prevents short clips from dominating the score.
| Condition | Ref words | Subs | Dels | Ins | Median clip WER |
|---|---|---|---|---|---|
| LibriSpeech test-clean | 1,938 | 56 | 5 | 5 | 0.00% |
| LibriSpeech test-other | 1,649 | 69 | 11 | 11 | 2.38% |
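The weighting choice matters. A toy example (made-up clip sizes, not our data) shows how a single short clip can skew a per-clip average while corpus WER stays stable:

```python
# Corpus WER weights every reference word equally; averaging per-clip
# WER lets one short clip with a couple of errors dominate the score.
clips = [
    {"ref_words": 100, "errors": 3},   # long clip, 3% WER
    {"ref_words": 5,   "errors": 2},   # short clip, 40% WER
]

corpus_wer = sum(c["errors"] for c in clips) / sum(c["ref_words"] for c in clips)
mean_clip_wer = sum(c["errors"] / c["ref_words"] for c in clips) / len(clips)

print(f"corpus WER:    {corpus_wer:.2%}")     # 5/105 = 4.76%
print(f"per-clip mean: {mean_clip_wer:.2%}")  # (3% + 40%)/2 = 21.50%
```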
## How this compares to published figures
Our test-clean number (3.41%) and test-other number (5.52%) are within the band you'd expect from Whisper large-v3 on these corpora. The original Whisper paper and model card report WER figures for LibriSpeech using different normalization than ours, so exact comparisons to their published numbers aren't meaningful — but the shape (test-other being noisier than test-clean, roughly 1.6× higher error) matches every other published measurement of this model.
We deliberately don't post head-to-head comparisons with competitor APIs on this page. Those numbers are easy to fake and hard to audit. If you want an apples-to-apples comparison, run the benchmark script against any two providers with your own API keys.
## Methodology
Scoring uses jiwer's word-error-rate implementation (Levenshtein edit distance over reference words). Normalization applied to both reference and hypothesis before scoring:
- NFKC unicode fold
- casefold
- strip punctuation (Unicode category P)
- collapse whitespace
- spell out standalone digits 0-9
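The normalization steps above can be sketched in a few lines of standard-library Python (an illustrative sketch, not the exact code in `run.py`):

```python
import unicodedata

DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def normalize(text: str) -> str:
    # NFKC unicode fold, then casefold
    text = unicodedata.normalize("NFKC", text).casefold()
    # strip punctuation (any character whose Unicode category starts with "P")
    text = "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))
    # collapse whitespace; spell out standalone single digits 0-9
    words = [DIGITS.get(w, w) for w in text.split()]
    return " ".join(words)

print(normalize("Chapter 1:  “HELLO,  World!”"))  # chapter one hello world
```

Note that only standalone single digits are spelled out; a token like `10` passes through unchanged.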
No hand-picked corrections. No retries on failed API calls. The first response from SpeakEasy is what goes into the score.
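jiwer handles the alignment for us. For intuition, the substitution/deletion/insertion counting can be sketched in pure Python as a word-level Levenshtein alignment (illustrative only, not jiwer's actual implementation):

```python
def wer_counts(ref, hyp):
    """Word-level Levenshtein alignment; returns (subs, dels, ins)."""
    R, H = len(ref), len(hyp)
    # dp[i][j] = (edit_cost, subs, dels, ins) for ref[:i] vs hyp[:j]
    dp = [[None] * (H + 1) for _ in range(R + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, R + 1):
        dp[i][0] = (i, 0, i, 0)            # delete every reference word
    for j in range(1, H + 1):
        dp[0][j] = (j, 0, 0, j)            # insert every hypothesis word
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            if ref[i - 1] == hyp[j - 1]:
                match_or_sub = dp[i - 1][j - 1]              # exact match
            else:
                c, s, d, n = dp[i - 1][j - 1]
                match_or_sub = (c + 1, s + 1, d, n)          # substitution
            c, s, d, n = dp[i - 1][j]
            deletion = (c + 1, s, d + 1, n)
            c, s, d, n = dp[i][j - 1]
            insertion = (c + 1, s, d, n + 1)
            dp[i][j] = min(match_or_sub, deletion, insertion)
    _, s, d, n = dp[R][H]
    return s, d, n

ref = "the cat sat on the mat".split()
hyp = "the cat sit on mat".split()
subs, dels, ins = wer_counts(ref, hyp)
print(subs, dels, ins)                 # 1 1 0
print((subs + dels + ins) / len(ref))  # 2/6, a 33.3% WER on this one clip
```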
## Reproducibility
The runner is a single Python script in the repo at benchmark/run.py. It downloads LibriSpeech test-clean and test-other directly from OpenSLR (CC BY 4.0, no license paperwork), samples 100 clips deterministically with seed=20260420, and transcribes each through the public /api/v1/audio/transcriptions endpoint:
```shell
git clone https://github.com/rara-cyber/speakeasy-benchmark
cd speakeasy-benchmark
pip install -r requirements.txt
export SPEAKEASY_API_KEY=sk-se-...
python run.py --clips 100
```

Same seed, same clips, same normalization: you should land within ±0.3 points of our numbers. If you don't, email support@tryspeakeasy.io and we'll debug it together.
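The deterministic sampling step works roughly like this (a sketch; `all_clip_ids` is a hypothetical stand-in for the clip list the runner builds from the downloaded corpus):

```python
import random

# Stand-in for the clip IDs collected from OpenSLR.
# LibriSpeech test-clean contains 2,620 utterances.
all_clip_ids = [f"clip-{i:04d}" for i in range(2620)]

rng = random.Random(20260420)                      # fixed seed from this page
sample = rng.sample(sorted(all_clip_ids), 100)     # sort first for a stable order

# Re-seeding reproduces the exact same 100 clips.
assert sample == random.Random(20260420).sample(sorted(all_clip_ids), 100)
```

Sorting before sampling matters: `random.sample` is only deterministic if the input order is too.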
## What this benchmark is not
- English only, for now. We're running LibriSpeech test-clean and test-other. Multilingual WER (Spanish, French, Mandarin) will land here once we have licensed corpora wired up — we'd rather show two honest numbers than six numbers with question marks on provenance.
- Audiobook speech is easier than production audio. LibriSpeech is recorded with decent microphones by fluent readers. Your real-world WER on phone calls, meeting audio, or user-generated content will be higher than the numbers on this page. Budget a 1–3 percentage point gap for clean business audio, more for noisy.
- WER is not the only thing that matters. Latency, diarization quality, file-size limits, and API ergonomics matter too. See the comparison post for the full multi-axis breakdown.
- This page is our own measurement. SpeakEasy ran the benchmark against its own API. We publish the script so you can verify the numbers yourself — but it's not a blind third-party audit.
## Run details
- Run date: 2026-04-20
- Model: whisper-large-v3
- Total clips scored: 200
- Total audio: 21.9 min
- Wall-clock runtime: 49.7 min
- Sampling: random seed=20260420, n=100 per condition, no rejections, first API response scored
## Try it yourself
Run your own audio against SpeakEasy for $1. The first month includes 50 hours of audio, plenty for reproducing any number on this page or scoring your own corpus.
$1. 50 hours. Both STT and TTS.
Your current speech API provider is charging you too much. Switch in one line of code.