Python Speech-to-Text API: Transcribe Audio in 5 Lines of Code
A complete guide to using a speech-to-text API in Python. Install the OpenAI SDK, point it at SpeakEasy, and get transcripts with speaker labels, timestamps, and async processing.
Last updated: April 3, 2026
Need a reliable Python speech-to-text API? In this tutorial, you'll learn how to transcribe audio files using the SpeakEasy API in just a few lines of Python code. Since SpeakEasy is fully compatible with the OpenAI SDK, there's no new library to learn — just point your existing client at a different URL.
What Is a Speech-to-Text API?
A speech-to-text API converts spoken audio into written text by sending an audio file to a remote service, which returns a transcript. Unlike local libraries, a hosted API handles model updates, GPU infrastructure, and scaling automatically — so developers get accurate transcription without managing hardware.
Why Use SpeakEasy's Python Speech-to-Text API?
SpeakEasy runs OpenAI's Whisper large-v3 model — the same model behind OpenAI's own transcription API — at 44% lower cost. At $0.20 per hour of audio (plan rate), it's one of the most affordable options available without sacrificing accuracy.
Key advantages for Python developers:
- Zero new SDK needed. SpeakEasy is 100% compatible with the OpenAI Python library. Change one line and you're done.
- Speaker diarization built in. Identify who said what without a separate service.
- Word-level timestamps. Get precise timing data for subtitles, search indexing, and video sync.
- 99+ languages supported. Automatic language detection or explicit ISO 639-1 codes.
See the full speech-to-text documentation for the complete parameter reference.
Prerequisites
- Python 3.8 or later
- pip for package installation
- A SpeakEasy API key (get one on the pricing page)
- An audio file in a supported format (MP3, WAV, FLAC, M4A, OGG, WEBM, and more)
Step 1: Install the OpenAI SDK
No new SDK needed — SpeakEasy is 100% compatible with the OpenAI Python library. If you've used OpenAI's transcription API before, you already know how to use SpeakEasy.
pip install openai
That's the only dependency. No extra packages, no custom clients.
Step 2: Basic Transcription
Create a file called transcribe.py and add the following:
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.tryspeakeasy.io/v1",
)

with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
    )

print(transcript.text)
Run it with python transcribe.py and you'll see your transcript printed to the console. For a 60-second audio file, expect a result in under a second.
Example output:
Thanks for joining today's call. Let's get started with the quarterly review.
That's all it takes for basic transcription. The next steps add timestamps, speaker labels, subtitles, and more.
Step 3: Word-Level Timestamps
Timestamps are essential for subtitles, search indexing, and syncing text to video. They let you jump directly to a specific word in a recording, build click-to-seek players, or create time-coded transcripts for legal or medical use cases.
Add the timestamp_granularities parameter:
transcript = client.audio.transcriptions.create(
    model="whisper-large-v3",
    file=audio_file,
    response_format="verbose_json",
    timestamp_granularities=["word", "segment"],
)

for segment in transcript.segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}")
Example output:
[0.00s - 2.40s] Thanks for joining today's call.
[2.40s - 5.10s] Let's get started with the quarterly review.
[5.10s - 9.80s] First, I want to walk through the numbers from last month.
Use timestamp_granularities=["word"] if you need per-word timing rather than per-segment.
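With word-level granularity, each entry in the response's words array carries the word plus its start and end times. One common use is a click-to-seek index mapping each word to the moment it first occurs. A minimal sketch, using plain dicts in the shape of the verbose_json words array (build_seek_index and the sample data are illustrative, not part of the API):

```python
def build_seek_index(words):
    """Map each word (lowercased) to the time it first occurs."""
    index = {}
    for w in words:
        key = w["word"].strip().lower()
        # setdefault keeps the earliest occurrence if a word repeats
        index.setdefault(key, w["start"])
    return index

# Illustrative data in the shape of a verbose_json "words" array:
words = [
    {"word": "Thanks", "start": 0.00, "end": 0.32},
    {"word": "for", "start": 0.32, "end": 0.45},
    {"word": "joining", "start": 0.45, "end": 0.90},
]
print(build_seek_index(words)["joining"])  # 0.45
```

A player UI can then seek the audio element to the returned offset when a word in the transcript is clicked.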
Step 4: Speaker Diarization
Identify who said what with SpeakEasy's built-in speaker diarization. This is invaluable for meeting notes, interviews, podcast transcripts, and any multi-speaker recording where you need to attribute statements to individuals.
transcript = client.audio.transcriptions.create(
    model="whisper-large-v3",
    file=audio_file,
    response_format="verbose_json",
    extra_body={"diarize": True},
)

for segment in transcript.segments:
    speaker = getattr(segment, "speaker", "Unknown")
    print(f"Speaker {speaker}: {segment.text}")
Example output showing two speakers in a recorded interview:
Speaker 0: Welcome to the podcast. Can you tell us a bit about your background?
Speaker 1: Sure, I've been working in machine learning for about eight years now.
Speaker 0: That's a great foundation. What drew you to the field originally?
Speaker 1: Honestly, it started with a course on natural language processing in college.
Speaker 0: And how has the field changed since then?
Speaker 1: The shift to transformer architectures changed everything. Especially in the last three years.
Speakers are labeled Speaker 0, Speaker 1, and so on in order of first appearance.
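Diarized segments are often shorter than a natural speaking turn, so for readable notes you may want consecutive segments from the same speaker merged. A minimal sketch, using plain dicts to stand in for the diarized segments (group_into_turns is an illustrative helper, not part of the API):

```python
def group_into_turns(segments):
    """Merge consecutive segments from the same speaker into single turns."""
    turns = []
    for seg in segments:
        speaker = seg.get("speaker", "Unknown")
        text = seg["text"].strip()
        if turns and turns[-1][0] == speaker:
            # Same speaker as the previous turn: append the text
            turns[-1] = (speaker, turns[-1][1] + " " + text)
        else:
            turns.append((speaker, text))
    return turns

segments = [
    {"speaker": 0, "text": "Welcome to the podcast."},
    {"speaker": 0, "text": "Can you tell us a bit about your background?"},
    {"speaker": 1, "text": "Sure, I've been working in machine learning for about eight years now."},
]
for speaker, text in group_into_turns(segments):
    print(f"Speaker {speaker}: {text}")
```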
Step 5: Transcribe from a URL
For large files over 25 MB, uploading the raw bytes can be slow or hit request size limits. Instead, pass a publicly accessible URL using the extra_body parameter and SpeakEasy will fetch the file directly:
transcript = client.audio.transcriptions.create(
    model="whisper-large-v3",
    file=("audio.mp3", b""),  # placeholder required by the SDK
    extra_body={"url": "https://your-storage-bucket.com/recordings/meeting.mp3"},
)
print(transcript.text)
This is the recommended approach for files stored in S3, Google Cloud Storage, or any public CDN. The file stays in your infrastructure — SpeakEasy fetches it server-side and returns the transcript.
Step 6: Generate SRT Subtitles
SRT is the most widely supported subtitle format, compatible with YouTube, VLC, Final Cut Pro, and virtually every video platform. Setting response_format="srt" returns a ready-to-use subtitle file instead of plain text:
with open("audio.mp3", "rb") as audio_file:
    srt_output = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
        response_format="srt",
    )

# srt_output is a string — save it directly to a file
with open("subtitles.srt", "w") as f:
    f.write(srt_output)

print("SRT file saved to subtitles.srt")
The output is standard SRT format:
1
00:00:00,000 --> 00:00:02,400
Thanks for joining today's call.
2
00:00:02,400 --> 00:00:05,100
Let's get started with the quarterly review.
See the SRT subtitles guide for VTT format and additional subtitle options.
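If you need WebVTT rather than SRT and would prefer not to make a second API call, the conversion is mechanical: add a WEBVTT header and swap the comma millisecond separator for a dot on timestamp lines. A minimal sketch (srt_to_vtt is an illustrative helper, not part of the API):

```python
def srt_to_vtt(srt_text: str) -> str:
    """Convert an SRT subtitle string to WebVTT.

    VTT requires a WEBVTT header and uses '.' rather than ','
    as the millisecond separator in cue timings.
    """
    lines = []
    for line in srt_text.splitlines():
        # Only rewrite timestamp lines, so commas in subtitle text survive
        if "-->" in line:
            line = line.replace(",", ".")
        lines.append(line)
    return "WEBVTT\n\n" + "\n".join(lines)

srt = """1
00:00:00,000 --> 00:00:02,400
Thanks for joining today's call.
"""
print(srt_to_vtt(srt))
```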
Step 7: Async Transcription for Long Files
For recordings longer than 10 minutes — full meetings, lectures, or long-form interviews — synchronous requests can time out. Use the callback_url parameter to process the file asynchronously. SpeakEasy will POST the completed transcript to your endpoint when it's ready:
with open("long_recording.mp3", "rb") as audio_file:
    client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
        extra_body={"callback_url": "https://your-server.com/webhook/transcript"},
    )
print("Transcription started. Result will be delivered to your webhook.")
Your webhook endpoint receives a POST request with the full transcript object in the same format as a synchronous response. This keeps your application responsive regardless of audio length.
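A sketch of the receiving side, assuming the callback body matches the synchronous response shape described above (handle_transcript_payload is an illustrative helper; wire it into whatever web framework you use):

```python
import json

def handle_transcript_payload(body: bytes) -> dict:
    """Parse a transcript callback body and pull out the common fields.

    Assumes the POST body is the same JSON object a synchronous
    request would return: a "text" field plus optional "segments".
    """
    payload = json.loads(body)
    return {
        "text": payload.get("text", ""),
        "segments": payload.get("segments", []),
    }

# With Flask, for example, the route might look like:
# @app.route("/webhook/transcript", methods=["POST"])
# def webhook():
#     result = handle_transcript_payload(request.data)
#     save_transcript(result)  # your persistence logic
#     return "", 200           # ack promptly so the callback isn't retried

body = b'{"text": "Thanks for joining.", "segments": []}'
print(handle_transcript_payload(body)["text"])  # Thanks for joining.
```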
Step 8: Error Handling
Production code needs proper error handling. Here's a robust pattern that covers the most common failure modes:
from openai import OpenAI, APIError, APIConnectionError

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.tryspeakeasy.io/v1",
)

try:
    with open("audio.mp3", "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-large-v3",
            file=audio_file,
        )
    print(transcript.text)
except FileNotFoundError:
    print("Audio file not found. Check the file path.")
except APIConnectionError:
    print("Could not connect to SpeakEasy API. Check your network.")
except APIError as e:
    print(f"API error: {e.status_code} - {e.message}")
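Connection errors are often transient, so retrying with exponential backoff is a reasonable next step. A generic sketch (with_retries is an illustrative helper; in real code you would pass a lambda wrapping the transcription call and retry on APIConnectionError rather than the generic ConnectionError used in this demo):

```python
import time

def with_retries(fn, retry_on=(ConnectionError,), max_attempts=3, base_delay=1.0):
    """Call fn(), retrying transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# Demo with a stub that fails twice before succeeding:
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # ok
```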
Supported Languages
SpeakEasy supports 99+ languages powered by Whisper large-v3. Specify language='auto' for automatic detection, or use the ISO 639-1 code (e.g., 'es' for Spanish, 'fr' for French):
transcript = client.audio.transcriptions.create(
    model="whisper-large-v3",
    file=audio_file,
    language="es",  # Spanish — omit or use "auto" for automatic detection
)
Explicit language codes improve accuracy when the audio language is known. For multilingual recordings, omit the parameter entirely and let the model detect automatically.
Complete Example: Meeting Transcriber
Here's a complete, runnable Python script that combines diarization and timestamps to produce formatted meeting notes from any audio file:
from openai import OpenAI, APIError, APIConnectionError
import sys

def transcribe_meeting(file_path: str, api_key: str) -> None:
    client = OpenAI(
        api_key=api_key,
        base_url="https://api.tryspeakeasy.io/v1",
    )

    print(f"Transcribing {file_path}...")

    try:
        with open(file_path, "rb") as audio_file:
            transcript = client.audio.transcriptions.create(
                model="whisper-large-v3",
                file=audio_file,
                response_format="verbose_json",
                timestamp_granularities=["segment"],
                extra_body={"diarize": True},
            )
    except FileNotFoundError:
        print(f"Error: File not found — {file_path}")
        sys.exit(1)
    except APIConnectionError:
        print("Error: Could not connect to SpeakEasy API.")
        sys.exit(1)
    except APIError as e:
        print(f"API error {e.status_code}: {e.message}")
        sys.exit(1)

    print("\n--- Meeting Transcript ---\n")

    for segment in transcript.segments:
        start = segment.start
        speaker = getattr(segment, "speaker", "Unknown")
        text = segment.text.strip()
        minutes_start = int(start // 60)
        seconds_start = int(start % 60)
        print(f"[{minutes_start:02d}:{seconds_start:02d}] Speaker {speaker}: {text}")

    print("\n--- End of Transcript ---")

if __name__ == "__main__":
    transcribe_meeting("meeting.mp3", "YOUR_API_KEY")
Example output:
Transcribing meeting.mp3...
--- Meeting Transcript ---
[00:00] Speaker 0: Welcome everyone. Let's get the weekly sync started.
[00:05] Speaker 1: Thanks. I wanted to start with the product update from this week.
[00:12] Speaker 0: Go ahead.
[00:13] Speaker 1: We shipped the new dashboard and early feedback is positive.
[00:20] Speaker 0: Great. Any blockers going into next week?
[00:23] Speaker 1: Nothing critical, but we need a design review before the next release.
--- End of Transcript ---
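To transcribe a whole folder of recordings, you only need a small discovery step on top of transcribe_meeting. A sketch (find_audio_files is an illustrative helper; the extension set matches the formats listed in the prerequisites):

```python
from pathlib import Path
from typing import List

SUPPORTED = {".mp3", ".wav", ".flac", ".m4a", ".ogg", ".webm"}

def find_audio_files(directory: str) -> List[str]:
    """Return supported audio files directly under a directory, sorted by name."""
    return sorted(
        str(p) for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )

# Each discovered path can then be fed to the transcriber, e.g.:
# for path in find_audio_files("recordings"):
#     transcribe_meeting(path, "YOUR_API_KEY")
```

Sorting keeps output order stable across runs, and lowercasing the suffix catches files named like RECORDING.WAV.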
Frequently Asked Questions
Is there a free Python speech-to-text library?
Yes — libraries like openai-whisper and vosk run locally for free, but require a GPU for acceptable speed and produce lower accuracy than hosted models. For production use, a hosted API like SpeakEasy delivers faster, more accurate results without infrastructure overhead, at $0.20/hr.
Can I use OpenAI's Whisper Python library instead?
Yes, but it runs locally and requires significant hardware to match cloud speed. OpenAI's hosted API also lacks built-in speaker diarization. SpeakEasy runs the same Whisper large-v3 model with diarization included, at a lower per-hour rate.
What audio formats does the Python API support?
SpeakEasy accepts MP3, WAV, FLAC, M4A, OGG, WEBM, and several other common formats. For the complete list, see the speech-to-text documentation. Most standard audio and video container formats work without pre-conversion.
How fast is the Python speech-to-text API?
Most files under 60 seconds are transcribed in under a second. Longer files scale roughly linearly with duration. For files over 10 minutes, use the async callback_url approach in Step 7 to avoid blocking your application while the transcript is generated.
Ready to get started? Create your free SpeakEasy account and start transcribing in minutes. Check out the full speech-to-text documentation for the complete parameter reference.