# Benchmarks
## How we measure accuracy
100 clips per condition, scored against reference transcripts, no retries. These are the numbers we actually measured, not numbers we copied from a model card. Last run 2026-04-20.
## Results
SpeakEasy runs Whisper large-v3. We sample 100 clips per condition with a fixed random seed (20260420), transcribe each one through our public API, and score word-error-rate against the official LibriSpeech references.
| Condition | Clips | Audio | Corpus WER |
|---|---|---|---|
| LibriSpeech test-clean (English, clean read speech from audiobooks; CC BY 4.0) | 100/100 | 11.7 min | 3.41% |
| LibriSpeech test-other (English, noisier and harder read speech from audiobooks; CC BY 4.0) | 100/100 | 10.2 min | 5.52% |
## Error breakdown
Corpus WER = (substitutions + deletions + insertions) / reference words. Weighting by words (rather than averaging per-clip WER) prevents short clips from dominating the score.
| Condition | Ref words | Subs | Dels | Ins | Median clip WER |
|---|---|---|---|---|---|
| LibriSpeech test-clean | 1,938 | 56 | 5 | 5 | 0.00% |
| LibriSpeech test-other | 1,649 | 69 | 11 | 11 | 2.38% |
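The weighting choice matters. A toy example (made-up clip sizes, not our data) shows how a single short clip can skew a per-clip average while corpus WER stays stable:

```python
# Corpus WER weights every reference word equally; averaging per-clip
# WER lets one short clip with a couple of errors dominate the score.
clips = [
    {"ref_words": 100, "errors": 3},   # long clip, 3% WER
    {"ref_words": 5,   "errors": 2},   # short clip, 40% WER
]

corpus_wer = sum(c["errors"] for c in clips) / sum(c["ref_words"] for c in clips)
mean_clip_wer = sum(c["errors"] / c["ref_words"] for c in clips) / len(clips)

print(f"corpus WER:    {corpus_wer:.2%}")     # 5/105 = 4.76%
print(f"per-clip mean: {mean_clip_wer:.2%}")  # (3% + 40%)/2 = 21.50%
```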
## How this compares to published figures
Our test-clean number (3.41%) and test-other number (5.52%) are within the band you'd expect from Whisper large-v3 on these corpora. The original Whisper paper and model card report WER figures for LibriSpeech using different normalization than ours, so exact comparisons to their published numbers aren't meaningful — but the shape (test-other being noisier than test-clean, roughly 1.6× higher error) matches every other published measurement of this model.
We deliberately don't post head-to-head comparisons with competitor APIs on this page. Those numbers are easy to fake and hard to audit. If you want an apples-to-apples comparison, run the benchmark script against any two providers with your own API keys.
## Methodology
Scoring uses jiwer's word-error-rate implementation (Levenshtein edit distance over reference words). Normalization applied to both reference and hypothesis before scoring:
- NFKC unicode fold
- casefold
- strip punctuation (Unicode category P)
- collapse whitespace
- spell out standalone digits 0-9
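The normalization steps above can be sketched in a few lines of standard-library Python (an illustrative sketch, not the exact code in `run.py`):

```python
import unicodedata

DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def normalize(text: str) -> str:
    # NFKC unicode fold, then casefold
    text = unicodedata.normalize("NFKC", text).casefold()
    # strip punctuation (any character whose Unicode category starts with "P")
    text = "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))
    # collapse whitespace; spell out standalone single digits 0-9
    words = [DIGITS.get(w, w) for w in text.split()]
    return " ".join(words)

print(normalize("Chapter 1:  “HELLO,  World!”"))  # chapter one hello world
```

Note that only standalone single digits are spelled out; a token like `10` passes through unchanged.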
No hand-picked corrections. No retries on failed API calls. The first response from SpeakEasy is what goes into the score.
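jiwer handles the alignment for us. For intuition, the substitution/deletion/insertion counting can be sketched in pure Python as a word-level Levenshtein alignment (illustrative only, not jiwer's actual implementation):

```python
def wer_counts(ref, hyp):
    """Word-level Levenshtein alignment; returns (subs, dels, ins)."""
    R, H = len(ref), len(hyp)
    # dp[i][j] = (edit_cost, subs, dels, ins) for ref[:i] vs hyp[:j]
    dp = [[None] * (H + 1) for _ in range(R + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, R + 1):
        dp[i][0] = (i, 0, i, 0)            # delete every reference word
    for j in range(1, H + 1):
        dp[0][j] = (j, 0, 0, j)            # insert every hypothesis word
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            if ref[i - 1] == hyp[j - 1]:
                match_or_sub = dp[i - 1][j - 1]              # exact match
            else:
                c, s, d, n = dp[i - 1][j - 1]
                match_or_sub = (c + 1, s + 1, d, n)          # substitution
            c, s, d, n = dp[i - 1][j]
            deletion = (c + 1, s, d + 1, n)
            c, s, d, n = dp[i][j - 1]
            insertion = (c + 1, s, d, n + 1)
            dp[i][j] = min(match_or_sub, deletion, insertion)
    _, s, d, n = dp[R][H]
    return s, d, n

ref = "the cat sat on the mat".split()
hyp = "the cat sit on mat".split()
subs, dels, ins = wer_counts(ref, hyp)
print(subs, dels, ins)                 # 1 1 0
print((subs + dels + ins) / len(ref))  # 2/6, a 33.3% WER on this one clip
```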
## Reproducibility
The runner is a single Python script in the repo at benchmark/run.py. It downloads LibriSpeech test-clean and test-other directly from OpenSLR (CC BY 4.0, no license paperwork), samples 100 clips deterministically with seed=20260420, and transcribes each through the public /api/v1/audio/transcriptions endpoint:
```shell
git clone https://github.com/rara-cyber/speakeasy-benchmark
cd speakeasy-benchmark
pip install -r requirements.txt
export SPEAKEASY_API_KEY=sk-se-...
python run.py --clips 100
```

Same seed, same clips, same normalization: you should land within ±0.3 points of our numbers. If you don't, email support@tryspeakeasy.io and we'll debug it together.
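The deterministic sampling step works roughly like this (a sketch; `all_clip_ids` is a hypothetical stand-in for the clip list the runner builds from the downloaded corpus):

```python
import random

# Stand-in for the clip IDs collected from OpenSLR.
# LibriSpeech test-clean contains 2,620 utterances.
all_clip_ids = [f"clip-{i:04d}" for i in range(2620)]

rng = random.Random(20260420)                      # fixed seed from this page
sample = rng.sample(sorted(all_clip_ids), 100)     # sort first for a stable order

# Re-seeding reproduces the exact same 100 clips.
assert sample == random.Random(20260420).sample(sorted(all_clip_ids), 100)
```

Sorting before sampling matters: `random.sample` is only deterministic if the input order is too.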
## What this benchmark is not
- English only, for now. We're running LibriSpeech test-clean and test-other. Multilingual WER (Spanish, French, Mandarin) will land here once we have licensed corpora wired up — we'd rather show two honest numbers than six numbers with question marks on provenance.
- Audiobook speech is easier than production audio. LibriSpeech is recorded with decent microphones by fluent readers. Your real-world WER on phone calls, meeting audio, or user-generated content will be higher than the numbers on this page. Budget a 1–3 percentage point gap for clean business audio, more for noisy.
- WER is not the only thing that matters. Latency, diarization quality, file-size limits, and API ergonomics matter too. See the comparison post for the full multi-axis breakdown.
- This page is our own measurement. SpeakEasy ran the benchmark against its own API. We publish the script so you can verify the numbers yourself — but it's not a blind third-party audit.
## Run details
- Run date: 2026-04-20
- Model: whisper-large-v3
- Total clips scored: 200
- Total audio: 21.9 min
- Wall-clock runtime: 49.7 min
- Sampling: random seed=20260420, n=100 per condition, no rejections, first API response scored
## Try it yourself
Run your own audio against SpeakEasy for $1. The first month includes 50 hours of audio, plenty for reproducing any number on this page or scoring your own corpus.
$1. 50 hours. Both STT and TTS.
Your current speech API provider is charging you too much. Switch in one line of code.