Python Speech-to-Text API: Transcribe Audio in 5 Lines of Code

Q: Is there a free Python speech-to-text library?

Yes. Libraries like `openai-whisper` and `vosk` run locally for free. `openai-whisper` needs a GPU for acceptable speed with its larger models, while `vosk` runs on a CPU but is less accurate. For production use, a hosted API like SpeakEasy delivers fast results without infrastructure overhead, at $0.20/hr.

Last updated: April 17, 2026

Need a reliable Python speech-to-text API? In this tutorial, you'll learn how to transcribe audio files using the SpeakEasy API in just a few lines of Python code. Since SpeakEasy is fully compatible with the OpenAI SDK, there's no new library to learn: just point your existing client at a different URL.

TL;DR

Install openai, point base_url at https://www.tryspeakeasy.io/api/v1, call client.audio.transcriptions.create(model="whisper-large-v3", file=...). That's a working Python transcription.
Want to see the exact request running before you write any code? audiotranscribe.app/python is the live playground for this snippet — drop a file, see the transcript, copy the Python.
Runs Whisper large-v3 — the same model as OpenAI's Transcription API, achieving 4.2% WER on LibriSpeech test-clean (Radford et al., 2022) — at $0.20/hour vs. $0.36, a 44% cost reduction.
Speaker diarization (diarize=True), word-level timestamps, SRT/VTT subtitles, 99+ languages, and async webhook delivery are all included at the base rate.

What is a Python speech-to-text API?

A Python speech-to-text API is a hosted service that converts audio recordings into written text, accessed via a Python SDK. You send an audio file (or URL) to a remote endpoint and receive a JSON transcript, usually within 10 to 15 seconds for a 60-second clip. Unlike local libraries such as openai-whisper or vosk, a hosted API eliminates GPU management and scales automatically, and it typically returns results much faster than running a large model on a CPU-only machine.

Why use SpeakEasy's Python speech-to-text API?

SpeakEasy runs the same Whisper large-v3 model as OpenAI's Transcription API — which achieves 4.2% word error rate on the LibriSpeech test-clean benchmark (Radford et al., 2022) — at $0.20 per hour of audio instead of $0.36. That's a 44% cost reduction for identical accuracy, with speaker diarization included at no extra charge.

Key advantages for Python developers:

Zero new SDK needed. SpeakEasy is 100% compatible with the OpenAI Python library. Change one line and you're done.
Speaker diarization built in. Identify who said what without a separate service.
Word-level timestamps. Get precise timing data for subtitles, search indexing, and video sync.
99+ languages supported. Automatic language detection or explicit ISO 639-1 codes.

How does SpeakEasy compare to other Python STT options?

Option	Model	Accuracy (WER)	Price/hr	Diarization	Async
SpeakEasy	Whisper large-v3	4.2%	$0.20	Included	Yes (webhook)
OpenAI Whisper API	Whisper large-v3	4.2%	$0.36	No	No
Local `openai-whisper`	Whisper large-v3	4.2%	GPU cost	DIY (pyannote)	N/A
`vosk` (local)	Kaldi	8–15%	CPU cost	Limited	N/A
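
As a rough worked example of the Price/hr column, the sketch below computes a monthly bill at each hosted rate and the relative saving. Only the two per-hour prices come from the table above; the function names and the 500-hour workload are illustrative assumptions.

```python
# Per-hour prices taken from the comparison table above (USD).
SPEAKEASY_RATE = 0.20
OPENAI_RATE = 0.36


def monthly_cost(hours: float, rate_per_hour: float) -> float:
    """Return the cost in USD of transcribing `hours` of audio."""
    return hours * rate_per_hour


def savings_percent(cheaper: float, pricier: float) -> float:
    """Return how much cheaper `cheaper` is than `pricier`, in percent."""
    return (pricier - cheaper) / pricier * 100


if __name__ == "__main__":
    hours = 500  # hypothetical monthly workload
    print(f"SpeakEasy: ${monthly_cost(hours, SPEAKEASY_RATE):.2f}")
    print(f"OpenAI:    ${monthly_cost(hours, OPENAI_RATE):.2f}")
    print(f"Saving:    {savings_percent(SPEAKEASY_RATE, OPENAI_RATE):.1f}%")
```

The saving works out to (0.36 - 0.20) / 0.36, or roughly 44%, regardless of volume.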

See the full speech-to-text documentation for the complete parameter reference, or the deeper comparison in our best speech-to-text APIs in 2026 roundup.

Prerequisites

Python 3.8 or later
pip for package installation
A SpeakEasy API key (get one on the pricing page)
An audio file in a supported format (MP3, WAV, FLAC, M4A, OGG, WEBM, and more)

Step 1: Install the OpenAI SDK

No new SDK needed — SpeakEasy is 100% compatible with the OpenAI Python library. If you've used OpenAI's transcription API before, you already know how to use SpeakEasy.

pip install openai

That's the only dependency. No extra packages, no custom clients.

How do I transcribe an audio file in Python?

A minimal Python transcription call takes three steps after installing the OpenAI SDK: import OpenAI, instantiate the client with your API key and the SpeakEasy base_url (https://www.tryspeakeasy.io/api/v1), then call client.audio.transcriptions.create with model="whisper-large-v3" and an open file handle.

Create a file called transcribe.py and add the following:

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://www.tryspeakeasy.io/api/v1",
)

with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
    )

print(transcript.text)

Run it with python transcribe.py and you'll see your transcript printed to the console. For a 60-second audio file, expect a result within 10–15 seconds.
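
The script above hard-codes the key as a placeholder. One common alternative, sketched here, is to read it from an environment variable so it never lands in version control. The variable name SPEAKEASY_API_KEY and the helper load_api_key are assumptions made for this example, not names defined by SpeakEasy.

```python
import os


def load_api_key(env_var: str = "SPEAKEASY_API_KEY") -> str:
    """Read the API key from the environment instead of hard-coding it.

    The default variable name is a hypothetical choice for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key
```

You would then pass api_key=load_api_key() when constructing the client, leaving base_url unchanged.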

Example output:

Thanks for joining today's call. Let's get started with the quarterly review.

That's all it takes for basic transcription. The next steps add timestamps, speaker labels, subtitles, and more.

How do I get word-level timestamps?

Whisper large-v3 exposes two timestamp granularities through SpeakEasy's API: segment (typical 2–5 second chunks) and word (per-word timing accurate to roughly 20 ms). Pass both via timestamp_granularities=["word", "segment"] with response_format="verbose_json" to enable click-to-seek players, karaoke subtitles, and searchable transcripts.

Timestamps are essential for subtitles, search indexing, and syncing text to video. They let you jump directly to a specific word in a recording, build click-to-seek players, or create time-coded transcripts for legal or medical use cases.

Add the timestamp_granularities parameter:

transcript = client.audio.transcriptions.create(
    model="whisper-large-v3",
    file=audio_file,
    response_format="verbose_json",
    timestamp_granularities=["word", "segment"],
)

for segment in transcript.segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}")

Example output:

[0.00s - 2.40s]  Thanks for joining today's call.
[2.40s - 5.10s]  Let's get started with the quarterly review.
[5.10s - 9.80s]  First, I want to walk through the numbers from last month.

Use timestamp_granularities=["word"] if you need per-word timing rather than per-segment.
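
To illustrate the click-to-seek use case, here is a minimal sketch that looks up where a given word is spoken. It assumes each entry of transcript.words exposes word, start, and end attributes; the Word dataclass and find_word_times helper are stand-ins invented for this example so it runs without an API call.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Word:
    """Stand-in for one entry of transcript.words (hypothetical sample type)."""
    word: str
    start: float  # seconds
    end: float    # seconds


def find_word_times(words: Iterable[Word], query: str) -> List[float]:
    """Return the start time of every occurrence of `query`,
    ignoring case and surrounding punctuation."""
    target = query.lower()
    return [
        w.start
        for w in words
        if w.word.strip().strip(".,!?;:\"'").lower() == target
    ]


if __name__ == "__main__":
    sample = [
        Word("Thanks", 0.00, 0.42),
        Word("for", 0.42, 0.60),
        Word("joining", 0.60, 1.05),
        Word("today's", 1.05, 1.50),
        Word("call.", 1.50, 2.40),
    ]
    print(find_word_times(sample, "call"))  # [1.5]
```

A player can seek to any of the returned offsets. With a real response you would pass transcript.words in place of the sample list.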

How do I identify speakers in multi-speaker audio?

Speaker diarization labels each segment with a speaker identifier (Speaker 0, Speaker 1, …) in order of first appearance. SpeakEasy includes diarization in its base $0.20/hour rate via extra_body={"diarize": True} — unlike OpenAI's Whisper API, which does not offer diarization and typically requires a separate service such as pyannote.audio.

Identify who said what with SpeakEasy's built-in speaker diarization. This is invaluable for meeting notes, interviews, podcast transcripts, and any multi-speaker recording where you need to attribute statements to individuals.

transcript = client.audio.transcriptions.create(
    model="whisper-large-v3",
    file=audio_file,
    response_format="verbose_json",
    extra_body={"diarize": True},
)

for segment in transcript.segments:
    # Attribute access matches segment.text below; .get() only works on dicts.
    speaker = getattr(segment, "speaker", "Unknown")
    print(f"Speaker {speaker}: {segment.text}")

Example output showing two speakers in a recorded interview:

Speaker 0: Welcome to the podcast. Can you tell us a bit about your background?
Speaker 1: Sure, I've been working in machine learning for about eight years now.
Speaker 0: That's a great foundation. What drew you to the field originally?
Speaker 1: Honestly, it started with a course on natural language processing in college.
Speaker 0: And how has the field changed since then?
Speaker 1: The shift to transformer architectures changed everything. Especially in the last three years.

Speakers are labeled Speaker 0, Speaker 1, and so on in order of first appearance.
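
Diarized output often splits one person's turn across several segments. As a sketch of one way to tidy that up for meeting notes, the helper below merges consecutive segments from the same speaker. The Segment dataclass is a hypothetical stand-in that mirrors the speaker, start, end, and text fields used above; merge_speaker_turns is likewise an illustrative name.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Segment:
    """Stand-in for a diarized segment (hypothetical sample type)."""
    speaker: int
    start: float
    end: float
    text: str


def merge_speaker_turns(segments: Iterable[Segment]) -> List[Segment]:
    """Collapse consecutive segments from the same speaker into one turn."""
    turns: List[Segment] = []
    for seg in segments:
        if turns and turns[-1].speaker == seg.speaker:
            last = turns[-1]
            last.end = seg.end
            last.text = f"{last.text} {seg.text.strip()}"
        else:
            # Copy so the caller's segment objects are not mutated.
            turns.append(Segment(seg.speaker, seg.start, seg.end, seg.text.strip()))
    return turns
```

Each returned turn spans from the first merged segment's start to the last one's end, which keeps timestamps usable for navigation.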

How do I transcribe audio from a URL?

For large files, uploading the raw bytes can be slow. Instead, pass a publicly accessible URL using the extra_body parameter and SpeakEasy will fetch the file directly:

transcript = client.audio.transcriptions.create(
    model="whisper-large-v3",
    file=("audio.mp3", b""),  # placeholder required by the SDK
    extra_body={"url": "https://your-storage-bucket.com/recordings/meeting.mp3"},
)

print(transcript.text)

This is the recommended approach for files stored in S3, Google Cloud Storage, or any public CDN. Your application never has to download and re-upload the audio: SpeakEasy fetches it server-side from the URL and returns the transcript.
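
A small pre-flight check can decide between a direct upload and the URL route before any request is sent. The sketch below uses the 100 MB upload limit and 1 GB URL limit quoted in this article's error-code table; treating those as decimal units is an assumption, and choose_transfer_mode is a hypothetical helper name.

```python
import os

# Limits quoted in this article's error-code table (413 file_too_large).
# Interpreting "MB" and "GB" as decimal units is an assumption.
MAX_UPLOAD_BYTES = 100 * 1000 * 1000
MAX_URL_BYTES = 1000 * 1000 * 1000


def choose_transfer_mode(size_bytes: int) -> str:
    """Return "upload", "url", or "split" for a file of the given size."""
    if size_bytes <= MAX_UPLOAD_BYTES:
        return "upload"  # small enough to send the raw bytes
    if size_bytes <= MAX_URL_BYTES:
        return "url"     # host the file and pass extra_body={"url": ...}
    return "split"       # too large for either route; cut into segments


def mode_for_file(path: str) -> str:
    """Convenience wrapper that reads the size from disk."""
    return choose_transfer_mode(os.path.getsize(path))
```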

How do I generate SRT subtitles?

SRT is the most widely supported subtitle format, compatible with YouTube, VLC, Final Cut Pro, and virtually every video platform. Setting response_format="srt" returns a ready-to-use subtitle file instead of plain text:

with open("audio.mp3", "rb") as audio_file:
    srt_output = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
        response_format="srt",
    )

# srt_output is a string, so save it directly to a file.
# UTF-8 keeps accented and non-Latin characters intact on every platform.
with open("subtitles.srt", "w", encoding="utf-8") as f:
    f.write(srt_output)

print("SRT file saved to subtitles.srt")

The output is standard SRT format:

1
00:00:00,000 --> 00:00:02,400
Thanks for joining today's call.

2
00:00:02,400 --> 00:00:05,100
Let's get started with the quarterly review.

See the SRT subtitles guide for VTT format and additional subtitle options.
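
If you need custom cues, for example built from verbose_json segments rather than taken from the ready-made SRT response, the timestamp layout shown above is straightforward to produce yourself. This is a minimal sketch; srt_timestamp and srt_cue are illustrative helper names, not part of any SDK.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a non-negative number of seconds as HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def srt_cue(index: int, start: float, end: float, text: str) -> str:
    """Build one numbered SRT cue: index, time range, then the caption text."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"


if __name__ == "__main__":
    print(srt_cue(1, 0.0, 2.4, "Thanks for joining today's call."))
```

Joining cues with a blank line between them yields a file in the same format as the example output above.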

How do I transcribe long audio files without blocking?

For audio files longer than 10 minutes, synchronous HTTP requests often hit load-balancer or serverless-function timeouts — most platforms default to 30 to 120 seconds (AWS Lambda caps at 15 minutes, Vercel functions at 300s). Passing extra_body={"callback_url": "…"} switches the request to fire-and-forget mode: SpeakEasy returns 200 immediately and POSTs the completed transcript to your webhook when processing finishes.

Add callback_url to an otherwise ordinary upload request, and SpeakEasy will POST the completed transcript to your endpoint when it's ready:

with open("long_recording.mp3", "rb") as audio_file:
    client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
        extra_body={"callback_url": "https://your-server.com/webhook/transcript"},
    )

print("Transcription started. Result will be delivered to your webhook.")

Your webhook endpoint receives a POST request with the full transcript object in the same format as a synchronous response. This keeps your application responsive regardless of audio length.
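
On the receiving side, a webhook endpoint only has to accept a POST and parse the JSON body. The sketch below uses Python's standard http.server module and assumes the payload carries a "text" field, as a synchronous JSON response does. The handler, helper names, and port are hypothetical, and a production deployment would normally sit behind a proper web framework with request authentication.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def extract_transcript_text(body: bytes) -> str:
    """Parse a webhook body and return its "text" field.

    Assumes the payload mirrors a synchronous JSON response;
    raises ValueError for anything else.
    """
    try:
        payload = json.loads(body.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        raise ValueError("webhook body is not valid JSON") from exc
    if not isinstance(payload, dict) or "text" not in payload:
        raise ValueError('webhook payload has no "text" field')
    return str(payload["text"])


class TranscriptWebhookHandler(BaseHTTPRequestHandler):
    """Accepts POSTed transcripts and acknowledges them."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            text = extract_transcript_text(body)
        except ValueError:
            self.send_response(400)  # reject malformed payloads
            self.end_headers()
            return
        print(f"Transcript received ({len(text)} characters)")
        self.send_response(204)  # acknowledge with no response body
        self.end_headers()


def run_server(port: int = 8000) -> None:
    """Serve until interrupted; the port is an arbitrary local-testing choice."""
    HTTPServer(("0.0.0.0", port), TranscriptWebhookHandler).serve_forever()
```

Call run_server() to start listening, then expose the port publicly (or through a tunnel) so the callback_url is reachable.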

How do I handle errors in production?

Production code needs proper error handling. Here's a robust pattern that covers the most common failure modes:

from openai import OpenAI, APIError, APIConnectionError

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://www.tryspeakeasy.io/api/v1",
)

try:
    with open("audio.mp3", "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-large-v3",
            file=audio_file,
        )
    print(transcript.text)
except FileNotFoundError:
    print("Audio file not found. Check the file path.")
except APIConnectionError:
    print("Could not connect to SpeakEasy API. Check your network.")
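except APIError as e:
    # Only HTTP status errors carry status_code, so fall back for other API errors.
    status = getattr(e, "status_code", "n/a")
    print(f"API error: {status} - {e.message}")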

Common error codes and fixes

Status	Error	Fix
`401`	`invalid_api_key`	Verify the key starts with `sk-` and isn't revoked in the SpeakEasy dashboard. Regenerate if unsure.
`413`	`file_too_large`	File exceeds 100 MB. Use `extra_body={"url": "…"}` for files up to 1 GB, or split into segments.
`415`	`unsupported_media_type`	Ensure the file is one of: MP3, WAV, FLAC, M4A, OGG, WEBM. Re-encode with `ffmpeg -i input -c:a libmp3lame output.mp3` if needed.
`429`	`rate_limit_exceeded`	You hit concurrency limits. Retry with exponential backoff, or upgrade the plan for higher concurrency.
`503`	`upstream_unavailable`	SpeakEasy's inference upstream is momentarily busy. Safe to retry after 2–3 seconds.
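
The 429 and 503 rows both recommend retrying. One way to do that is a small exponential-backoff wrapper, sketched below. It assumes the raised exception exposes a status_code attribute, as the OpenAI SDK's HTTP status errors do; call_with_backoff and its default delays are illustrative choices rather than SpeakEasy recommendations.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Status codes the table above marks as safe to retry.
RETRYABLE_STATUS = frozenset({429, 503})


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, retrying on 429/503 with exponential backoff plus jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:
            status = getattr(exc, "status_code", None)
            is_last_attempt = attempt == max_attempts - 1
            if status not in RETRYABLE_STATUS or is_last_attempt:
                raise
            # Waits 2s, 4s, 8s, ... plus up to 1s of random jitter.
            sleep(base_delay * 2 ** attempt + random.uniform(0, 1))
    raise RuntimeError("unreachable")  # loop always returns or raises
```

Wrap the transcription call in a zero-argument function or lambda and pass it as func. The OpenAI client also accepts a max_retries option that performs its own retries, so a wrapper like this is mainly useful when you want explicit control over the schedule.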

Which languages are supported?

Whisper large-v3 — and by extension SpeakEasy's Python API — supports 99+ languages with automatic detection. Passing an ISO 639-1 code (e.g., language="es") improves accuracy on short clips where detection is unreliable. The Whisper paper reports sub-15% WER on the FLEURS benchmark across roughly 60 of those languages (Radford et al., 2022).

Specify language='auto' for automatic detection, or use the ISO 639-1 code (e.g., 'es' for Spanish, 'fr' for French):

transcript = client.audio.transcriptions.create(
    model="whisper-large-v3",
    file=audio_file,
    language="es",  # Spanish — omit or use "auto" for automatic detection
)

Explicit language codes improve accuracy when the audio language is known. For multilingual recordings, omit the parameter entirely and let the model detect automatically.

For domain-specific vocabulary (brand names, technical terms, acronyms), also pass a prompt parameter — see the Whisper prompt guide for patterns that work well.

Complete Example: Meeting Transcriber

Here's a complete, runnable Python script that combines diarization and timestamps to produce formatted meeting notes from any audio file:

from openai import OpenAI, APIError, APIConnectionError
import sys

def transcribe_meeting(file_path: str, api_key: str) -> None:
    client = OpenAI(
        api_key=api_key,
        base_url="https://www.tryspeakeasy.io/api/v1",
    )

    print(f"Transcribing {file_path}...")

    try:
        with open(file_path, "rb") as audio_file:
            transcript = client.audio.transcriptions.create(
                model="whisper-large-v3",
                file=audio_file,
                response_format="verbose_json",
                timestamp_granularities=["segment"],
                extra_body={"diarize": True},
            )
    except FileNotFoundError:
        print(f"Error: File not found — {file_path}")
        sys.exit(1)
    except APIConnectionError:
        print("Error: Could not connect to SpeakEasy API.")
        sys.exit(1)
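    except APIError as e:
        # Only HTTP status errors carry status_code, so fall back for other API errors.
        status = getattr(e, "status_code", "n/a")
        print(f"API error {status}: {e.message}")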
        sys.exit(1)

    print("\n--- Meeting Transcript ---\n")

    for segment in transcript.segments:
        start = segment.start
        # Attribute access matches segment.start above; .get() only works on dicts.
        speaker = getattr(segment, "speaker", "Unknown")
        text = segment.text.strip()

        minutes_start = int(start // 60)
        seconds_start = int(start % 60)
        print(f"[{minutes_start:02d}:{seconds_start:02d}] Speaker {speaker}: {text}")

    print("\n--- End of Transcript ---")

if __name__ == "__main__":
    transcribe_meeting("meeting.mp3", "YOUR_API_KEY")

Example output:

Transcribing meeting.mp3...

--- Meeting Transcript ---

[00:00] Speaker 0: Welcome everyone. Let's get the weekly sync started.
[00:05] Speaker 1: Thanks. I wanted to start with the product update from this week.
[00:12] Speaker 0: Go ahead.
[00:13] Speaker 1: We shipped the new dashboard and early feedback is positive.
[00:20] Speaker 0: Great. Any blockers going into next week?
[00:23] Speaker 1: Nothing critical, but we need a design review before the next release.

--- End of Transcript ---

Frequently Asked Questions

Is there a free Python speech-to-text library?

Yes. Libraries like openai-whisper and vosk run locally for free. openai-whisper needs a GPU for acceptable speed with its larger models, while vosk runs on a CPU but is less accurate. For production use, a hosted API like SpeakEasy delivers fast results without infrastructure overhead, at $0.20/hr.

Can I use OpenAI's Whisper Python library instead?

Yes, but it runs locally and requires significant hardware to match cloud speed. OpenAI's hosted API also lacks built-in speaker diarization. SpeakEasy runs the same Whisper large-v3 model with diarization included, at a lower per-hour rate.

What audio formats does the Python API support?

SpeakEasy accepts MP3, WAV, FLAC, M4A, OGG, WEBM, and several other common formats. For the complete list, see the speech-to-text documentation. Most standard audio and video container formats work without pre-conversion.

How fast is the Python speech-to-text API?

Most files under 60 seconds are transcribed within 10–15 seconds. Longer files scale roughly linearly with duration. For files over 10 minutes, use the async callback_url approach described under "How do I transcribe long audio files without blocking?" to avoid blocking your application while the transcript is generated.

Ready to get started? Create your SpeakEasy account — $1 first month, 50 hours included — and start transcribing in minutes. Check out the full speech-to-text documentation for the complete parameter reference.
