Benchmarks

How we measure accuracy

100 clips per condition, scored against reference transcripts, no retries. These are the numbers we actually measured, not numbers we copied from a model card. Last run 2026-04-20.

Results

SpeakEasy runs Whisper large-v3. We sample 100 clips per condition with a fixed random seed (20260420), transcribe each one through our public API, and score word-error-rate against the official LibriSpeech references.

Condition                Clips    Audio     Corpus WER
LibriSpeech test-clean   100/100  11.7 min  3.41%
LibriSpeech test-other   100/100  10.2 min  5.52%

LibriSpeech test-clean: English, clean read speech sourced from audiobooks. CC-BY 4.0.
LibriSpeech test-other: English, noisier/harder read speech sourced from audiobooks. CC-BY 4.0.

Error breakdown

Corpus WER = (substitutions + deletions + insertions) / reference words. Weighting by words (rather than averaging per-clip WER) prevents short clips from dominating the score.

Condition                Ref words  Subs  Dels  Ins  Median clip WER
LibriSpeech test-clean   1,938      56    5     5    0.00%
LibriSpeech test-other   1,649      69    11    11   2.38%
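The word-weighted aggregation can be sketched like this (function and variable names are ours, not from the benchmark script): edit counts are summed across the whole corpus before dividing, which is what keeps short clips from dominating.

```python
def corpus_wer(clips):
    """clips: list of (substitutions, deletions, insertions, ref_words) per clip."""
    # Sum edit operations and reference words across the corpus first,
    # then divide once -- each clip contributes in proportion to its length.
    subs = sum(c[0] for c in clips)
    dels = sum(c[1] for c in clips)
    ins = sum(c[2] for c in clips)
    ref_words = sum(c[3] for c in clips)
    return (subs + dels + ins) / ref_words

# Plugging in the test-clean totals: 56 + 5 + 5 edits over 1,938 words.
print(round(corpus_wer([(56, 5, 5, 1938)]), 4))  # -> 0.0341
```

Averaging per-clip WER instead would give a 10-word clip the same vote as a 300-word clip, which is why the corpus-level formula is used.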

How this compares to published figures

Our test-clean number (3.41%) and test-other number (5.52%) are within the band you'd expect from Whisper large-v3 on these corpora. The original Whisper paper and model card report WER figures for LibriSpeech using different normalization than ours, so exact comparisons to their published numbers aren't meaningful — but the shape (test-other being noisier than test-clean, roughly 1.6× higher error) matches every other published measurement of this model.

We deliberately don't post head-to-head comparisons with competitor APIs on this page. Those numbers are easy to fake and hard to audit. If you want an apples-to-apples comparison, run the benchmark script against any two providers with your own API keys.

Methodology

Scoring uses jiwer's word-error-rate implementation (Levenshtein edit distance over reference words). Normalization applied to both reference and hypothesis before scoring:

  • NFKC unicode fold
  • casefold
  • strip punctuation (Unicode category P)
  • collapse whitespace
  • spell out standalone digits 0-9
Corpus WER is then (substitutions + deletions + insertions) / reference_words, as above.
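The normalization steps can be approximated in a few lines of Python (an illustrative sketch, not the exact code in benchmark/run.py):

```python
import unicodedata

# Standalone digits are spelled out so "5" and "five" score as a match.
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def normalize(text: str) -> str:
    # NFKC unicode fold, then casefold
    text = unicodedata.normalize("NFKC", text).casefold()
    # strip punctuation: any character whose Unicode category starts with "P"
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    # collapse whitespace, then spell out standalone digits 0-9
    tokens = [DIGIT_WORDS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)

print(normalize("Chapter 1: 'Hello,  World!'"))  # -> chapter one hello world
```

Applying the same function to both reference and hypothesis means the scorer never penalizes casing, punctuation, or single-digit formatting differences.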

No hand-picked corrections. No retries on failed API calls. The first response from SpeakEasy is what goes into the score.

Reproducibility

The runner is a single Python script in the repo at benchmark/run.py. It downloads LibriSpeech test-clean and test-other directly from OpenSLR (CC BY 4.0, no license paperwork), samples 100 clips deterministically with seed=20260420, and transcribes each through the public /api/v1/audio/transcriptions endpoint:

git clone https://github.com/rara-cyber/speakeasy-benchmark
cd speakeasy-benchmark
pip install -r requirements.txt
export SPEAKEASY_API_KEY=sk-se-...
python run.py --clips 100
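Deterministic sampling is what makes reruns comparable. The behavior can be sketched like this (illustrative names; the real script may differ in detail):

```python
import random

def sample_clips(clip_ids, n=100, seed=20260420):
    # A fresh Random(seed) makes the draw independent of global RNG state,
    # so every run with the same seed selects exactly the same n clips.
    rng = random.Random(seed)
    return sorted(rng.sample(clip_ids, n))
```

Because the selection depends only on the seed and the clip list, anyone pointing the script at the same OpenSLR download scores the same 100 clips per condition.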

Same seed, same clips, same normalization — you should land within ±0.3 points of our numbers. If you don't, email support@tryspeakeasy.io and we'll debug it together.

What this benchmark is not

  • English only, for now. We're running LibriSpeech test-clean and test-other. Multilingual WER (Spanish, French, Mandarin) will land here once we have licensed corpora wired up — we'd rather show two honest numbers than six numbers with question marks on provenance.
  • Audiobook speech is easier than production audio. LibriSpeech is recorded with decent microphones by fluent readers. Your real-world WER on phone calls, meeting audio, or user-generated content will be higher than the numbers on this page. Budget a 1–3 percentage point gap for clean business audio, more for noisy.
  • WER is not the only thing that matters. Latency, diarization quality, file-size limits, and API ergonomics matter too. See the comparison post for the full multi-axis breakdown.
  • This page is our own measurement. SpeakEasy ran the benchmark against its own API. We publish the script so you can verify the numbers yourself — but it's not a blind third-party audit.

Run details

  • Run date: 2026-04-20
  • Model: whisper-large-v3
  • Total clips scored: 200
  • Total audio: 21.9 min
  • Wall-clock runtime: 49.7 min
  • Sampling: random seed=20260420, n=100 per condition, no rejections, first API response scored

Try it yourself

Run your own audio against SpeakEasy for $1. 50 hours included in the first month — plenty for reproducing any number on this page, or scoring your own corpus.

$1 for 50 hours. Both STT and TTS.

Your current speech API provider is charging you too much. Switch in one line of code.