Speech-to-Text API

Transcribe audio files into text with state-of-the-art accuracy. Supports 99+ languages, speaker diarization, word-level timestamps, and translation to English. Transcribes 30 minutes of audio in under one minute.

Endpoint

POST https://api.tryspeakeasy.io/v1/audio/transcriptions

Authentication

All requests require a Bearer token in the Authorization header. You can generate API keys from your dashboard.

Authorization: Bearer YOUR_API_KEY

Request Parameters

The request body must be sent as multipart/form-data.

file (binary, required): The audio file to transcribe. Maximum file size is 25 MB. See supported formats below.

model (string, optional): The transcription model to use. Default: whisper-large-v3.

language (string, optional): The language of the audio as a full language name (e.g., english, french, german). If omitted, the language is auto-detected. Supplying the language improves accuracy and latency. See supported languages below.

response_format (string, optional): The output format. Default: json. Accepted values: json, verbose_json, text, srt, vtt.

speaker_labels (boolean, optional): Enable speaker diarization. When true, the response includes a speakers array identifying who said what. Diarization supports up to 4 speakers. Default: false.

word_timestamps (boolean, optional): Include word-level timing information. When true, the response includes a words array with start and end times for each word. Default: false.

translate (boolean, optional): Translate the transcription output to English. The source audio can be in any supported language. Default: false.

url (string, optional): A publicly accessible URL pointing to the audio file. Use this instead of file for larger files. Maximum file size via URL is 1 GB.

prompt (string, optional): A text hint to guide the transcription style. Useful for fixing acronyms (e.g., NFT, DeFi, DAO), preserving punctuation, or keeping filler words. The prompt should be in the same language as the audio.

callback_url (string, optional): A URL that receives the transcription result via POST when processing is complete. Useful for long audio files or asynchronous workflows. When provided, the API returns 202 Accepted immediately.

Code Examples

cURL

curl -X POST https://api.tryspeakeasy.io/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@recording.mp3" \
  -F "model=whisper-large-v3" \
  -F "response_format=verbose_json" \
  -F "speaker_labels=true" \
  -F "word_timestamps=true" \
  -F "language=english"

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.tryspeakeasy.io/v1"
)

# Basic transcription (the context manager ensures the file handle is closed)
with open("recording.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
        response_format="verbose_json"
    )

print(transcript.text)
print(f"Duration: {transcript.duration}s")
print(f"Language: {transcript.language}")

JavaScript

const formData = new FormData();
formData.append('file', audioFile);
formData.append('model', 'whisper-large-v3');
formData.append('response_format', 'verbose_json');
formData.append('speaker_labels', 'true');
formData.append('word_timestamps', 'true');

const response = await fetch('https://api.tryspeakeasy.io/v1/audio/transcriptions', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${apiKey}`,
  },
  body: formData,
});

if (!response.ok) {
  throw new Error(`Transcription request failed: ${response.status}`);
}

const result = await response.json();
console.log(result.text);
console.log(result.segments);
console.log(result.words);

Transcription via URL

Instead of uploading a file directly, you can pass a publicly accessible URL pointing to the audio. This is especially useful for large files — URL-based transcription supports files up to 1 GB (compared to 25 MB for direct uploads).

curl -X POST https://api.tryspeakeasy.io/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "url=https://example.com/files/meeting-recording.mp3" \
  -F "model=whisper-large-v3" \
  -F "response_format=verbose_json" \
  -F "speaker_labels=true"

Async Transcription (Callback URL)

For long audio files that take a while to transcribe, you can provide a callback_url instead of waiting for the API to finish processing. The API returns 202 Accepted immediately, then sends a POST request to your callback URL with the transcription result when it is ready.

This frees your client from holding a connection open and avoids timeouts on very long recordings.

Example Request

curl -X POST https://api.tryspeakeasy.io/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@long-recording.mp3" \
  -F "model=whisper-large-v3" \
  -F "response_format=verbose_json" \
  -F "callback_url=https://your-server.com/webhooks/transcription"

Callback Payload

When processing is complete, the API sends a POST request to your callback URL. The request body contains the transcription result in the same format as a synchronous response:

{
  "text": "Hello, welcome to the meeting...",
  "language": "en",
  "duration": 1823.45,
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 2.16,
      "text": "Hello, welcome to the meeting."
    }
  ]
}
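
On the receiving side, the payload can be parsed like any other webhook body. A minimal sketch: handle_transcription_callback is a hypothetical handler you would wire to the POST route behind your callback_url, and the validation shown is illustrative.

```python
import json

def handle_transcription_callback(body: bytes) -> dict:
    """Parse a callback payload and sanity-check the required field.

    Returns the fields in the same shape as a synchronous verbose_json
    response; raises ValueError on a malformed body."""
    payload = json.loads(body)
    if "text" not in payload:
        raise ValueError("callback payload missing 'text'")
    return {
        "text": payload["text"],
        "language": payload.get("language"),
        "duration": payload.get("duration"),
        "segments": payload.get("segments", []),
    }
```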

Using the Prompt Parameter

The prompt parameter lets you guide the transcription model. It is particularly useful when your audio contains domain-specific terms or when you want to control the output style. The prompt should be in the same language as the audio.

Fix domain-specific terms and acronyms

If the model misrecognizes specialized terms, provide them in the prompt:

-F "prompt=NFT, DeFi, DAO, HODL, Ethereum, Solana"

Preserve filler words

By default the model may omit filler words. Include them in the prompt to keep them in the output:

-F "prompt=Um, uh, like, you know"

Continue a previous transcript

When transcribing audio in chunks, pass the end of the previous transcript as the prompt. This helps the model maintain context and consistent punctuation:

-F "prompt=...and that concludes our discussion on the quarterly results."

Response Schema

When response_format is set to verbose_json, the response includes the full transcription result with segments, word-level timestamps, and speaker labels (if enabled).

text (string): The full transcription text.

language (string): The detected or specified language as an ISO 639-1 code.

duration (number): The duration of the audio file in seconds.

segments (array): An array of transcript segments. Each segment contains id, start, end, and text.

words (array): Included when word_timestamps is true. Each entry contains word, start, and end.

speakers (array): Included when speaker_labels is true. Each entry contains speaker, start, end, and text.

Example Response

A verbose_json response with word_timestamps and speaker_labels enabled:

{
  "text": "Hello, welcome to the meeting. Thank you for joining us today.",
  "language": "en",
  "duration": 5.42,
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 2.16,
      "text": "Hello, welcome to the meeting."
    },
    {
      "id": 1,
      "start": 2.48,
      "end": 5.42,
      "text": "Thank you for joining us today."
    }
  ],
  "words": [
    { "word": "Hello,", "start": 0.0, "end": 0.42 },
    { "word": "welcome", "start": 0.44, "end": 0.82 },
    { "word": "to", "start": 0.84, "end": 0.96 },
    { "word": "the", "start": 0.98, "end": 1.08 },
    { "word": "meeting.", "start": 1.10, "end": 1.62 },
    { "word": "Thank", "start": 2.48, "end": 2.78 },
    { "word": "you", "start": 2.80, "end": 2.96 },
    { "word": "for", "start": 2.98, "end": 3.14 },
    { "word": "joining", "start": 3.16, "end": 3.58 },
    { "word": "us", "start": 3.60, "end": 3.78 },
    { "word": "today.", "start": 3.80, "end": 4.32 }
  ],
  "speakers": [
    {
      "speaker": "SPEAKER_00",
      "start": 0.0,
      "end": 2.16,
      "text": "Hello, welcome to the meeting."
    },
    {
      "speaker": "SPEAKER_01",
      "start": 2.48,
      "end": 5.42,
      "text": "Thank you for joining us today."
    }
  ]
}
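
Because the segment shape above is stable, you can also render subtitle formats client-side from a verbose_json response rather than requesting srt directly. A sketch, assuming segments shaped as in the example; neither helper is part of any SDK.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Render verbose_json segments as an SRT document."""
    cues = []
    for index, segment in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(segment['start'])} --> {srt_timestamp(segment['end'])}\n"
            f"{segment['text']}\n"
        )
    return "\n".join(cues)
```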

Supported Audio Formats

The following audio and video formats are accepted:

.mp3, .wav, .flac, .aac, .opus, .ogg, .m4a, .mp4, .mpeg, .mov, .webm

Maximum 25 MB via direct file upload. Maximum 1 GB via URL.
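
A client-side pre-flight check can surface these limits before a doomed upload. A minimal sketch using the extension list and limits above; is_supported_format and check_upload are hypothetical helpers, not part of the API.

```python
import os

SUPPORTED_EXTENSIONS = {
    ".mp3", ".wav", ".flac", ".aac", ".opus", ".ogg",
    ".m4a", ".mp4", ".mpeg", ".mov", ".webm",
}
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # 25 MB direct-upload limit

def is_supported_format(path: str) -> bool:
    """True if the file extension is one the API accepts."""
    return os.path.splitext(path)[1].lower() in SUPPORTED_EXTENSIONS

def check_upload(path: str) -> None:
    """Raise ValueError if the file would be rejected on direct upload."""
    if not is_supported_format(path):
        raise ValueError(f"unsupported format: {path}")
    if os.path.getsize(path) > MAX_UPLOAD_BYTES:
        raise ValueError("file exceeds 25 MB; pass a url parameter instead")
```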

Supported Languages

Pass one of the following values as the language parameter. 99 languages are supported:

english, chinese, german, spanish, russian, korean, french, japanese, portuguese, turkish, polish, catalan, dutch, arabic, swedish, italian, indonesian, hindi, finnish, vietnamese, hebrew, ukrainian, greek, malay, czech, romanian, danish, hungarian, tamil, norwegian, thai, urdu, croatian, bulgarian, lithuanian, latin, maori, malayalam, welsh, slovak, telugu, persian, latvian, bengali, serbian, azerbaijani, slovenian, kannada, estonian, macedonian, breton, basque, icelandic, armenian, nepali, mongolian, bosnian, kazakh, albanian, swahili, galician, marathi, punjabi, sinhala, khmer, shona, yoruba, somali, afrikaans, occitan, georgian, belarusian, tajik, sindhi, gujarati, amharic, yiddish, lao, uzbek, faroese, haitian creole, pashto, turkmen, nynorsk, maltese, sanskrit, luxembourgish, myanmar, tibetan, tagalog, malagasy, assamese, tatar, hawaiian, lingala, hausa, bashkir, javanese, sundanese, cantonese, burmese

Error Responses

The API returns standard HTTP error codes with a JSON body describing the error. Common errors for this endpoint include:

400 Bad Request: Missing required file parameter, unsupported format, or invalid parameter value.

401 Unauthorized: Invalid or missing API key.

413 Payload Too Large: File exceeds the 25 MB upload limit (or 1 GB via URL).

429 Too Many Requests: Rate limit exceeded. See Rate Limits.

For a full list of error codes and troubleshooting guidance, see the Error Codes reference.
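
A common pattern for 429 responses is exponential backoff. A sketch: the call argument stands in for any transcription request, and the .status_code attribute is an assumption about how your HTTP client surfaces the status.

```python
import time

def with_retries(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run `call`, retrying with exponential backoff whenever it raises
    an exception carrying `.status_code == 429`. Any other error, or
    running out of attempts, re-raises."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            status = getattr(exc, "status_code", None)
            if status != 429 or attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

In production you would also honor a Retry-After header if the response includes one.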
