SpeakEasy

Translate Audio to English in One API Call

Pass audio in any of 99+ languages, get back English text. No intermediate translation step, no extra API, no extra cost. Just set translate=true.

speech-to-text · translation · multilingual · tutorial

Building a product that handles audio from users around the world? The usual path is: transcribe in the source language, then call a translation API, then process the English text. That's two API calls, two billing accounts, and latency on top of latency.

SpeakEasy collapses it to one. Set translate=true on any transcription request and you get English output, regardless of what language was spoken.

Basic usage

curl -X POST https://api.tryspeakeasy.io/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@spanish-meeting.mp3" \
  -F "translate=true"

Response:

{
  "text": "Good morning everyone. Today we're going to review the quarterly results.",
  "duration": 45.2,
  "language": "es"
}

The language field tells you what was detected. The text is always English.
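If you're calling the endpoint without an SDK, that response is plain JSON and easy to unpack. A minimal sketch, using the field names from the example above (`parse_transcription` is a hypothetical helper, not part of the SpeakEasy API):

```python
import json


def parse_transcription(raw: str) -> tuple[str, str, float]:
    """Unpack a translation response into (english_text, detected_language, duration)."""
    body = json.loads(raw)
    return body["text"], body["language"], body["duration"]


raw = '{"text": "Good morning everyone.", "duration": 45.2, "language": "es"}'
text, lang, duration = parse_transcription(raw)
```

Because the text is always English, you can route it straight into downstream processing without a language check.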

Supported languages

Translation works from any of Whisper's 99+ supported languages, including:

  • Spanish, French, German, Italian, Portuguese
  • Japanese, Chinese (Simplified + Traditional), Korean
  • Arabic, Hebrew, Hindi, Bengali
  • Russian, Ukrainian, Polish, Czech
  • Turkish, Indonesian, Vietnamese, Thai
  • And 90+ more

Language detection is automatic — you don't need to specify the source language, though you can with the language parameter to improve accuracy on shorter clips.

Python example

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_SPEAKEASY_KEY",
    base_url="https://api.tryspeakeasy.io/v1"
)

def transcribe_and_translate(audio_path: str) -> dict:
    with open(audio_path, "rb") as f:
        # Use verbose_json to get language detection + segments
        result = client.audio.transcriptions.create(
            model="whisper-large-v3",
            file=f,
            response_format="verbose_json",
            extra_body={"translate": True},
        )
    return {
        "english_text": result.text,
        "source_language": result.language,
        "duration_seconds": result.duration,
    }

# Works with any language
data = transcribe_and_translate("interview-in-mandarin.mp3")
print(f"Detected language: {data['source_language']}")
print(f"English text: {data['english_text']}")

Getting English subtitles from foreign-language video

Combine translate=true with response_format=srt to generate English subtitles from any video:

curl -X POST https://api.tryspeakeasy.io/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@french-documentary.mp4" \
  -F "translate=true" \
  -F "response_format=srt" \
  > english-subtitles.srt

One call. English SRT file. No external translation API.
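The SRT you get back is plain text, so post-processing is straightforward. For example, if you trimmed a few seconds off the start of the video after transcribing, you can shift every cue to match (a standalone sketch, not part of the SpeakEasy API):

```python
import re


def shift_srt(srt: str, offset_seconds: float) -> str:
    """Shift every HH:MM:SS,mmm timestamp in an SRT document by offset_seconds."""

    def to_ms(h: str, m: str, s: str, ms: str) -> int:
        return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)

    def fmt(total_ms: int) -> str:
        total_ms = max(0, total_ms)  # clamp so cues never go negative
        h, rem = divmod(total_ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    def repl(match: re.Match) -> str:
        return fmt(to_ms(*match.groups()) + int(offset_seconds * 1000))

    return re.sub(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})", repl, srt)


cue = "1\n00:00:01,500 --> 00:00:04,000\nGood morning everyone.\n"
shifted = shift_srt(cue, 2.0)  # cue now starts at 00:00:03,500
```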

Real-world use cases

Customer support transcription: Your support team speaks English but customers call in multiple languages. Translate on ingest and your agents see English transcripts without extra tooling.

import httpx

def transcribe_support_call(audio_url: str, ticket_id: str) -> dict:
    """Submit a support call for async translation; the English
    transcript is delivered to the ticket's webhook when ready."""
    response = httpx.post(
        "https://api.tryspeakeasy.io/v1/audio/transcriptions",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        data={
            "url": audio_url,
            "translate": "true",
            "callback_url": f"https://yourapp.com/webhooks/transcripts/{ticket_id}",
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()  # job acknowledgement; the transcript arrives via webhook

Multilingual podcast transcription: Transcribe and translate episodes from international podcasters for English-speaking audiences.

Video localization pipeline: Get English text first, then hand it to your localization team to adapt and re-record — skipping an intermediate translation round trip.

Meeting notes for global teams: Participants join in their native language, the transcript arrives in English.

How it differs from separate transcribe + translate

The translate=true approach uses Whisper's built-in translation capability, which was trained end-to-end on speech-to-English pairs. This means:

  • No error accumulation — the model never produces an intermediate transcript that a second model then translates
  • Better handling of idiomatic speech that would confuse a text-based translation model
  • Single API call, single latency budget, single billing line

For production accuracy on specific language pairs, compare results against a dedicated translation API on your data — for most use cases, Whisper's translation is more than sufficient and significantly cheaper.

Async translation for long files

For long recordings, use callback_url to avoid holding a connection open:

curl -X POST https://api.tryspeakeasy.io/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "url=https://your-storage.com/lecture-in-german.mp3" \
  -F "translate=true" \
  -F "callback_url=https://yourapp.com/webhooks/transcript"

The webhook payload will contain the English transcript when processing completes.
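On the receiving side, a minimal handler might look like this. The payload field names here are assumptions, mirroring the synchronous response shape shown earlier, so verify them against your actual webhook deliveries:

```python
def handle_transcript_webhook(payload: dict) -> str:
    """Extract the English text from an async webhook delivery.

    Assumes the payload mirrors the synchronous response
    ("text", "language", "duration"); check against real deliveries.
    """
    if "error" in payload:
        raise RuntimeError(f"Transcription failed: {payload['error']}")
    return payload["text"]


english = handle_transcript_webhook(
    {"text": "Welcome to today's lecture.", "duration": 3600.0, "language": "de"}
)
```

Remember to return a 2xx status quickly from your webhook endpoint and do any heavy processing out of band, since most callback systems retry on slow or failed responses.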


Start translating today — $1 for your first month, 50 hours included. 99+ languages supported.
