Speech-to-Text API

Transcribe audio files into text with state-of-the-art accuracy. Supports 99+ languages, speaker diarization, word-level timestamps, and translation to English. Transcribes 30 minutes of audio in under one minute.

Endpoint

POST https://api.tryspeakeasy.io/v1/audio/transcriptions

Authentication

All requests require a Bearer token in the Authorization header. You can generate API keys from your dashboard.

Authorization: Bearer YOUR_API_KEY

Request Parameters

The request body must be sent as multipart/form-data.

file (binary, required): The audio file to transcribe. Maximum file size is 25 MB. See supported formats below.

model (string, optional): The transcription model to use. Default: whisper-large-v3.

language (string, optional): The language of the audio as a full language name (e.g., english, french, german). If omitted, the language is auto-detected. Supplying the language improves accuracy and latency. See supported languages below.

response_format (string, optional): The output format. Default: json. Accepted values: json, verbose_json, text, srt, vtt.

speaker_labels (boolean, optional): Enable speaker diarization. When true, the response includes a speakers array identifying who said what. Diarization supports up to 4 speakers. Default: false.

word_timestamps (boolean, optional): Include word-level timing information. When true, the response includes a words array with start and end times for each word. Default: false.

translate (boolean, optional): Translate the transcription output to English. The source audio can be in any supported language. Default: false.

url (string, optional): A publicly accessible URL pointing to the audio file. Use this instead of file for larger files. Maximum file size via URL is 1 GB.

prompt (string, optional): A text hint to guide the transcription style. Useful for fixing acronyms (e.g., NFT, DeFi, DAO), preserving punctuation, or keeping filler words. The prompt should be in the same language as the audio.

callback_url (string, optional): A URL that receives the transcription result via POST when processing is complete. Useful for long audio files or asynchronous workflows. When provided, the API returns 202 Accepted immediately.

Code Examples

cURL

curl -X POST https://api.tryspeakeasy.io/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@recording.mp3" \
  -F "model=whisper-large-v3" \
  -F "response_format=verbose_json" \
  -F "speaker_labels=true" \
  -F "word_timestamps=true" \
  -F "language=english"

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.tryspeakeasy.io/v1"
)

# Basic transcription (the context manager ensures the file handle is closed)
with open("recording.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
        response_format="verbose_json"
    )

print(transcript.text)
print(f"Duration: {transcript.duration}s")
print(f"Language: {transcript.language}")

JavaScript

const formData = new FormData();
formData.append('file', audioFile);
formData.append('model', 'whisper-large-v3');
formData.append('response_format', 'verbose_json');
formData.append('speaker_labels', 'true');
formData.append('word_timestamps', 'true');

const response = await fetch('https://api.tryspeakeasy.io/v1/audio/transcriptions', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${apiKey}`,
  },
  body: formData,
});

if (!response.ok) {
  throw new Error(`Transcription request failed: ${response.status}`);
}

const result = await response.json();
console.log(result.text);
console.log(result.segments);
console.log(result.words);

Transcription via URL

Instead of uploading a file directly, you can pass a publicly accessible URL pointing to the audio. This is especially useful for large files — URL-based transcription supports files up to 1 GB (compared to 25 MB for direct uploads).

curl -X POST https://api.tryspeakeasy.io/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "url=https://example.com/files/meeting-recording.mp3" \
  -F "model=whisper-large-v3" \
  -F "response_format=verbose_json" \
  -F "speaker_labels=true"

Async Transcription (Callback URL)

For long audio files that take a while to transcribe, you can provide a callback_url instead of waiting for the API to finish processing. The API returns 202 Accepted immediately, then sends a POST request to your callback URL with the transcription result when it is ready.

This frees your client from holding a connection open and avoids timeouts on very long recordings.

Example Request

curl -X POST https://api.tryspeakeasy.io/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@long-recording.mp3" \
  -F "model=whisper-large-v3" \
  -F "response_format=verbose_json" \
  -F "callback_url=https://your-server.com/webhooks/transcription"

Callback Payload

When processing is complete, the API sends a POST request to your callback URL. The request body contains the transcription result in the same format as a synchronous response:

{
  "text": "Hello, welcome to the meeting...",
  "language": "en",
  "duration": 1823.45,
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 2.16,
      "text": "Hello, welcome to the meeting."
    }
  ]
}
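
On the receiving side, the payload can be parsed like any other webhook body. A minimal sketch: handle_transcription_callback is a hypothetical handler you would wire to the POST route behind your callback_url, and the validation shown is illustrative.

```python
import json

def handle_transcription_callback(body: bytes) -> dict:
    """Parse a callback payload and sanity-check the required field.

    Returns the fields in the same shape as a synchronous verbose_json
    response; raises ValueError on a malformed body."""
    payload = json.loads(body)
    if "text" not in payload:
        raise ValueError("callback payload missing 'text'")
    return {
        "text": payload["text"],
        "language": payload.get("language"),
        "duration": payload.get("duration"),
        "segments": payload.get("segments", []),
    }
```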

Using the Prompt Parameter

The prompt parameter lets you guide the transcription model. It is particularly useful when your audio contains domain-specific terms or when you want to control the output style. The prompt should be in the same language as the audio.

Fix domain-specific terms and acronyms

If the model misrecognizes specialized terms, provide them in the prompt:

-F "prompt=NFT, DeFi, DAO, HODL, Ethereum, Solana"

Preserve filler words

By default the model may omit filler words. Include them in the prompt to keep them in the output:

-F "prompt=Um, uh, like, you know"

Continue a previous transcript

When transcribing audio in chunks, pass the end of the previous transcript as the prompt. This helps the model maintain context and consistent punctuation:

-F "prompt=...and that concludes our discussion on the quarterly results."

Response Schema

When response_format is set to verbose_json, the response includes the full transcription result with segments, word-level timestamps, and speaker labels (if enabled).

text (string): The full transcription text.

language (string): The detected or specified language as an ISO 639-1 code.

duration (number): The duration of the audio file in seconds.

segments (array): An array of transcript segments. Each segment contains id, start, end, and text.

words (array): Included when word_timestamps is true. Each entry contains word, start, and end.

speakers (array): Included when speaker_labels is true. Each entry contains speaker, start, end, and text.

Example Response

A verbose_json response with word_timestamps and speaker_labels enabled:

{
  "text": "Hello, welcome to the meeting. Thank you for joining us today.",
  "language": "en",
  "duration": 5.42,
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 2.16,
      "text": "Hello, welcome to the meeting."
    },
    {
      "id": 1,
      "start": 2.48,
      "end": 5.42,
      "text": "Thank you for joining us today."
    }
  ],
  "words": [
    { "word": "Hello,", "start": 0.0, "end": 0.42 },
    { "word": "welcome", "start": 0.44, "end": 0.82 },
    { "word": "to", "start": 0.84, "end": 0.96 },
    { "word": "the", "start": 0.98, "end": 1.08 },
    { "word": "meeting.", "start": 1.10, "end": 1.62 },
    { "word": "Thank", "start": 2.48, "end": 2.78 },
    { "word": "you", "start": 2.80, "end": 2.96 },
    { "word": "for", "start": 2.98, "end": 3.14 },
    { "word": "joining", "start": 3.16, "end": 3.58 },
    { "word": "us", "start": 3.60, "end": 3.78 },
    { "word": "today.", "start": 3.80, "end": 4.32 }
  ],
  "speakers": [
    {
      "speaker": "SPEAKER_00",
      "start": 0.0,
      "end": 2.16,
      "text": "Hello, welcome to the meeting."
    },
    {
      "speaker": "SPEAKER_01",
      "start": 2.48,
      "end": 5.42,
      "text": "Thank you for joining us today."
    }
  ]
}
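
Because the segment shape above is stable, you can also render subtitle formats client-side from a verbose_json response rather than requesting srt directly. A sketch, assuming segments shaped as in the example; neither helper is part of any SDK.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Render verbose_json segments as an SRT document."""
    cues = []
    for index, segment in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(segment['start'])} --> {srt_timestamp(segment['end'])}\n"
            f"{segment['text']}\n"
        )
    return "\n".join(cues)
```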

Supported Audio Formats

The following audio and video formats are accepted:

.mp3, .wav, .flac, .aac, .opus, .ogg, .m4a, .mp4, .mpeg, .mov, .webm

Maximum 25 MB via direct file upload. Maximum 1 GB via URL.
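
A client-side pre-flight check can surface these limits before a doomed upload. A minimal sketch using the extension list and limits above; is_supported_format and check_upload are hypothetical helpers, not part of the API.

```python
import os

SUPPORTED_EXTENSIONS = {
    ".mp3", ".wav", ".flac", ".aac", ".opus", ".ogg",
    ".m4a", ".mp4", ".mpeg", ".mov", ".webm",
}
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # 25 MB direct-upload limit

def is_supported_format(path: str) -> bool:
    """True if the file extension is one the API accepts."""
    return os.path.splitext(path)[1].lower() in SUPPORTED_EXTENSIONS

def check_upload(path: str) -> None:
    """Raise ValueError if the file would be rejected on direct upload."""
    if not is_supported_format(path):
        raise ValueError(f"unsupported format: {path}")
    if os.path.getsize(path) > MAX_UPLOAD_BYTES:
        raise ValueError("file exceeds 25 MB; pass a url parameter instead")
```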

Supported Languages

Pass one of the following values as the language parameter. 99 languages are supported:

english, chinese, german, spanish, russian, korean, french, japanese, portuguese, turkish, polish, catalan, dutch, arabic, swedish, italian, indonesian, hindi, finnish, vietnamese, hebrew, ukrainian, greek, malay, czech, romanian, danish, hungarian, tamil, norwegian, thai, urdu, croatian, bulgarian, lithuanian, latin, maori, malayalam, welsh, slovak, telugu, persian, latvian, bengali, serbian, azerbaijani, slovenian, kannada, estonian, macedonian, breton, basque, icelandic, armenian, nepali, mongolian, bosnian, kazakh, albanian, swahili, galician, marathi, punjabi, sinhala, khmer, shona, yoruba, somali, afrikaans, occitan, georgian, belarusian, tajik, sindhi, gujarati, amharic, yiddish, lao, uzbek, faroese, haitian creole, pashto, turkmen, nynorsk, maltese, sanskrit, luxembourgish, myanmar, tibetan, tagalog, malagasy, assamese, tatar, hawaiian, lingala, hausa, bashkir, javanese, sundanese, cantonese, burmese

Error Responses

The API returns standard HTTP error codes with a JSON body describing the error. Common errors for this endpoint include:

400 Bad Request: Missing required file parameter, unsupported format, or invalid parameter value.

401 Unauthorized: Invalid or missing API key.

413 Payload Too Large: File exceeds the 25 MB upload limit (or 1 GB via URL).

429 Too Many Requests: Rate limit exceeded. See Rate Limits.

For a full list of error codes and troubleshooting guidance, see the Error Codes reference.
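
A common pattern for 429 responses is exponential backoff. A sketch: the call argument stands in for any transcription request, and the .status_code attribute is an assumption about how your HTTP client surfaces the status.

```python
import time

def with_retries(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run `call`, retrying with exponential backoff whenever it raises
    an exception carrying `.status_code == 429`. Any other error, or
    running out of attempts, re-raises."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            status = getattr(exc, "status_code", None)
            if status != 429 or attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

In production you would also honor a Retry-After header if the response includes one.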
