Speech-to-Text API in JavaScript: Complete Guide (2026)
The complete speech-to-text JavaScript guide — Node, browser, streaming, diarization, error handling, and production patterns. OpenAI SDK compatible via SpeakEasy.
In one sentence: in Node or in the browser, you send audio to a POST /v1/audio/transcriptions endpoint with a Bearer token, and get back text. Everything else — timestamps, diarization, streaming, language hints — is optional parameters on the same request.
This guide covers every real-world pattern you'll hit building a speech-to-text integration in JavaScript: Node backends, Next.js server routes, browser recorders, file uploads, URL-based jobs, streaming, diarization, and the production-grade error handling you won't find in a Stack Overflow answer. All examples use SpeakEasy's OpenAI-compatible API — if you already know the OpenAI SDK, you already know this API.
Prerequisites
- Node.js 18+ for native fetch and FormData (the global File class arrived in Node.js 20), or any modern browser
- A SpeakEasy API key (grab one for $1)
- Basic familiarity with async/await and Promise
Option 1: OpenAI Node SDK (recommended)
The shortest path. SpeakEasy is fully compatible with the OpenAI SDK — you change the baseURL and the rest of your code works.
npm install openai
import OpenAI from "openai";
import fs from "node:fs";
const client = new OpenAI({
apiKey: process.env.SPEAKEASY_API_KEY,
baseURL: "https://www.tryspeakeasy.io/api/v1",
});
const transcript = await client.audio.transcriptions.create({
model: "whisper-large-v3",
file: fs.createReadStream("meeting.mp3"),
});
console.log(transcript.text);
Three lines of meaningful code. If you already have an OpenAI-based integration, this is a literal search-and-replace.
TypeScript: the types come for free
The OpenAI SDK ships typings, so your IDE will autocomplete every parameter and transcript.text will be typed as string:
import fs from "node:fs";
import OpenAI from "openai";
import type { Transcription } from "openai/resources/audio/transcriptions";
const client = new OpenAI({ /* ... */ });
const transcript: Transcription = await client.audio.transcriptions.create({
model: "whisper-large-v3",
file: fs.createReadStream("meeting.mp3"),
});
For verbose_json responses with segments, use the TranscriptionVerbose type exported from the same module.
Option 2: Native fetch (no dependency)
If you don't want to install anything — a tiny Edge function, a browser utility, a throwaway script — native fetch works fine.
import fs from "node:fs";
const formData = new FormData();
formData.append("model", "whisper-large-v3");
formData.append("file", new Blob([await fs.promises.readFile("audio.mp3")]), "audio.mp3");
const res = await fetch("https://www.tryspeakeasy.io/api/v1/audio/transcriptions", {
method: "POST",
headers: {
Authorization: `Bearer ${process.env.SPEAKEASY_API_KEY}`,
},
body: formData,
});
if (!res.ok) throw new Error(`HTTP ${res.status}: ${await res.text()}`);
const { text } = await res.json();
console.log(text);
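Note the explicit filename ("audio.mp3") passed as the third argument to formData.append. Without it, FormData sends a Blob under the default filename "blob", and servers that infer the audio format from the file extension may reject it.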
Option 3: Browser (recording mic input)
The most common question: "how do I transcribe audio recorded from the user's microphone?" Pattern:
async function recordAndTranscribe() {
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const chunks = [];
const recorder = new MediaRecorder(stream, { mimeType: "audio/webm" });
recorder.ondataavailable = e => chunks.push(e.data);
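recorder.start();
await new Promise(r => setTimeout(r, 5000)); // record 5 seconds
// Register the "stop" listener before calling stop() so the event can't be missed.
const stopped = new Promise(r => recorder.addEventListener("stop", r, { once: true }));
recorder.stop();
await stopped; // the final dataavailable event fires before "stop"
stream.getTracks().forEach(t => t.stop()); // release the microphone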
const blob = new Blob(chunks, { type: "audio/webm" });
const fd = new FormData();
fd.append("model", "whisper-large-v3");
fd.append("file", blob, "recording.webm");
// Important: proxy this through YOUR backend, don't expose the API key in the browser.
const res = await fetch("/api/transcribe", { method: "POST", body: fd });
if (!res.ok) throw new Error(`Transcription failed: HTTP ${res.status}`);
const { text } = await res.json();
return text;
}
Never call the SpeakEasy API directly from a browser with your real API key. Proxy the request through your own server route (below) so the key stays on the server.
Next.js 15/16 route handler (the proxy)
// app/api/transcribe/route.ts
import OpenAI from "openai";
export async function POST(req: Request) {
const form = await req.formData();
const file = form.get("file");
if (!(file instanceof File)) return new Response("Missing file", { status: 400 });
const client = new OpenAI({
apiKey: process.env.SPEAKEASY_API_KEY!,
baseURL: "https://www.tryspeakeasy.io/api/v1",
});
const transcript = await client.audio.transcriptions.create({
model: "whisper-large-v3",
file,
});
return Response.json({ text: transcript.text });
}
This runs server-side, so the API key lives in your environment variables. You can add rate limiting, auth, usage metering, and logging in one place.
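As a rough illustration of the rate-limiting piece, here is a minimal in-memory fixed-window limiter you could call at the top of the route handler. createRateLimiter and allow are hypothetical names, not part of Next.js or the SpeakEasy API, and the sketch assumes a single long-lived server process.

```javascript
// Hypothetical helper: fixed-window rate limiter keyed by a client identifier.
// Assumption: one server process; the Map is not shared across instances.
function createRateLimiter({ limit, windowMs }) {
  const windows = new Map(); // key -> { start, count }

  return function allow(key, now = Date.now()) {
    const current = windows.get(key);
    // Open a fresh window if none exists or the previous one has expired.
    if (!current || now - current.start >= windowMs) {
      windows.set(key, { start: now, count: 1 });
      return true;
    }
    if (current.count < limit) {
      current.count += 1;
      return true;
    }
    return false; // over the limit for this window
  };
}
```

In the handler, a check such as `if (!allow(clientId)) return new Response("Too many requests", { status: 429 });` would sit before the upstream call. On serverless or multi-instance deployments you would likely swap the Map for a shared store, and evict old keys so it doesn't grow without bound.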
Getting timestamps
Subtitle generation, audio search, and video editors all need word- or segment-level timestamps. Request verbose_json:
const transcript = await client.audio.transcriptions.create({
model: "whisper-large-v3",
file: fs.createReadStream("interview.mp3"),
response_format: "verbose_json",
timestamp_granularities: ["segment", "word"],
});
for (const segment of transcript.segments) {
console.log(`[${segment.start.toFixed(2)}s] ${segment.text}`);
}
Each word object has { word, start, end, probability } — useful for highlight-as-you-listen UIs.
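For a highlight-as-you-listen UI, you need to map the player's current time to a word. One possible sketch is a binary search over the word list; findActiveWordIndex is a hypothetical helper, and it assumes the words are sorted by start time and don't overlap.

```javascript
// Returns the index of the word being spoken at `time` (in seconds), or -1
// if `time` falls in a gap between words or outside the transcript.
// Assumption: `words` is sorted by `start` and words do not overlap.
function findActiveWordIndex(words, time) {
  let lo = 0;
  let hi = words.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (time < words[mid].start) hi = mid - 1;
    else if (time >= words[mid].end) lo = mid + 1;
    else return mid; // start <= time < end
  }
  return -1;
}
```

In a browser you might call this from an audio element's timeupdate handler with audio.currentTime and toggle a CSS class on the matching word.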
Generating SRT or WebVTT subtitles
If you want subtitles directly, ask the API for them — no post-processing on your side:
const srt = await client.audio.transcriptions.create({
model: "whisper-large-v3",
file: fs.createReadStream("video-audio.mp3"),
response_format: "srt",
});
fs.writeFileSync("captions.srt", srt);
Works identically for "vtt" — useful for HTML5 <track> tags. Our subtitle API guide covers styling and timing tweaks.
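If you already hold a verbose_json response and want captions without a second request, you can also build WebVTT locally. The sketch below is one way to do it; toVttTimestamp and segmentsToVtt are hypothetical helpers, and they assume each segment carries start, end (in seconds), and text, as in the timestamps example.

```javascript
// Formats seconds as a WebVTT timestamp: HH:MM:SS.mmm
function toVttTimestamp(seconds) {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSeconds = Math.floor(totalMs / 1000);
  const s = totalSeconds % 60;
  const m = Math.floor(totalSeconds / 60) % 60;
  const h = Math.floor(totalSeconds / 3600);
  const pad = (n, width = 2) => String(n).padStart(width, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)}.${pad(ms, 3)}`;
}

// Builds a WebVTT document with one cue per segment.
function segmentsToVtt(segments) {
  const cues = segments.map(
    (seg) =>
      `${toVttTimestamp(seg.start)} --> ${toVttTimestamp(seg.end)}\n${seg.text.trim()}`
  );
  return `WEBVTT\n\n${cues.join("\n\n")}\n`;
}
```

Asking the API for "vtt" directly remains the simpler route; this is mainly useful when you need both the JSON and the captions from a single call.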
URL-based transcription (skip the upload)
Audio already sitting in S3, R2, GCS, or any public URL? Don't re-upload it — pass the URL:
const res = await fetch("https://www.tryspeakeasy.io/api/v1/audio/transcriptions", {
method: "POST",
headers: {
Authorization: `Bearer ${process.env.SPEAKEASY_API_KEY}`,
"Content-Type": "application/json",
},
body: JSON.stringify({
model: "whisper-large-v3",
url: "https://example.com/podcasts/ep42.mp3",
}),
});
const { text } = await res.json();
This pattern is a godsend for serverless. The SpeakEasy servers fetch the audio, so your Lambda or Edge function never holds the file in memory.
Streaming transcription (server-sent events)
For live captioning or long-running jobs, stream the response as server-sent events:
const res = await fetch("https://www.tryspeakeasy.io/api/v1/audio/transcriptions", {
method: "POST",
headers: {
Authorization: `Bearer ${process.env.SPEAKEASY_API_KEY}`,
"Content-Type": "application/json",
},
body: JSON.stringify({
model: "whisper-large-v3",
url: "https://example.com/long-recording.mp3",
stream: true,
}),
});
const reader = res.body.getReader();
const decoder = new TextDecoder();
let buffer = "";
// Label the outer loop so the [DONE] sentinel can exit both loops;
// a bare `return` is a syntax error at the top level of a module.
read: while (true) {
const { done, value } = await reader.read();
if (done) break;
buffer += decoder.decode(value, { stream: true });
// SSE frames are separated by \n\n
const frames = buffer.split("\n\n");
buffer = frames.pop() ?? ""; // last chunk may be incomplete
for (const frame of frames) {
if (!frame.startsWith("data: ")) continue;
const payload = frame.slice(6);
if (payload === "[DONE]") break read;
const { delta } = JSON.parse(payload);
process.stdout.write(delta);
}
}
This is the pattern for live-captioning webinars or typing-as-the-user-speaks UIs.
Speaker diarization (who said what)
Enable diarize: true for multi-speaker audio. The API returns speaker labels on each segment:
const transcript = await client.audio.transcriptions.create({
model: "whisper-large-v3",
file: fs.createReadStream("all-hands-meeting.mp3"),
response_format: "verbose_json",
// @ts-expect-error — SpeakEasy extension beyond OpenAI's type
diarize: true,
});
for (const segment of transcript.segments) {
console.log(`${segment.speaker}: ${segment.text}`);
// "Speaker 0: Hey, welcome to the call."
// "Speaker 1: Thanks, glad to be here."
}
SpeakEasy includes diarization on the $10/month plan; OpenAI doesn't offer it at any price. Full walkthrough in the speaker diarization guide.
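Diarized output often splits one person's turn across several segments. A small post-processing step can merge consecutive segments from the same speaker into readable turns. groupBySpeaker below is a hypothetical helper, and it assumes each segment has speaker, start, end, and text fields as in the example above.

```javascript
// Merges consecutive segments that share a speaker label into single turns.
function groupBySpeaker(segments) {
  const turns = [];
  for (const seg of segments) {
    const last = turns[turns.length - 1];
    if (last && last.speaker === seg.speaker) {
      // Same speaker keeps talking: extend the current turn.
      last.text += " " + seg.text.trim();
      last.end = seg.end;
    } else {
      turns.push({
        speaker: seg.speaker,
        start: seg.start,
        end: seg.end,
        text: seg.text.trim(),
      });
    }
  }
  return turns;
}
```

Calling groupBySpeaker(transcript.segments) would give you one entry per turn, which tends to be easier to render as a meeting transcript.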
Language hints (faster, more accurate)
Whisper auto-detects language, but if you know the language, pass it — the model skips detection and goes straight to transcription, saving a few hundred milliseconds:
const transcript = await client.audio.transcriptions.create({
model: "whisper-large-v3",
file: fs.createReadStream("spanish-interview.mp3"),
language: "es",
});
Language codes are ISO 639-1 (en, es, fr, de, ja, zh, …). 99 languages supported.
Translation to English
Any input language → English transcript in one call:
const englishText = await client.audio.translations.create({
model: "whisper-large-v3",
file: fs.createReadStream("japanese-podcast.mp3"),
});
console.log(englishText.text); // English output, regardless of source language
Full walkthrough with 99 source languages.
Production error handling
The quick pattern most guides stop at:
try {
const t = await client.audio.transcriptions.create({ /* ... */ });
console.log(t.text);
} catch (err) {
console.error(err);
}
The pattern you actually want in production:
import OpenAI, { APIError } from "openai";
async function transcribeWithRetry(filePath, maxAttempts = 3) {
for (let attempt = 1; attempt <= maxAttempts; attempt++) {
try {
return await client.audio.transcriptions.create({
model: "whisper-large-v3",
file: fs.createReadStream(filePath),
});
} catch (err) {
if (!(err instanceof APIError)) throw err;
if (err.status === 401) throw new Error("Invalid API key.");
if (err.status === 413) throw new Error("File > 100 MB. Split it first.");
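// Any other 4xx (bad audio format, forbidden, not found) is a client-side problem; don't retry.
if (err.status >= 400 && err.status < 500 && err.status !== 429) throw err;
// 429 (rate limit), 5xx, and connection errors are worth retrying with exponential backoff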
if (attempt === maxAttempts) throw err;
const retryAfter = Number(err.headers?.["retry-after"]) || 2 ** attempt;
await new Promise(r => setTimeout(r, retryAfter * 1000));
}
}
}
Key behaviors:
- Don't retry 4xx errors (except 429). They're client-side problems that retries won't fix.
- Honor Retry-After on 429. SpeakEasy sends this header, and your retry will be rejected again if you ignore it.
- Cap retries. Infinite retry loops bankrupt you if the error is actually deterministic.
- Log the request_id from the response headers so you can correlate with server logs when you file a support ticket.
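The retry rules above can be pulled into two small pure functions, which makes them easy to unit test. This is a sketch, not SpeakEasy's or the OpenAI SDK's own logic: isRetryable and retryDelayMs are hypothetical names, and the jitter, the 1-second base, and the 30-second cap are assumptions you may want to tune.

```javascript
// Decides whether a failed request is worth retrying.
// `status` is undefined for network-level failures with no HTTP response.
function isRetryable(status) {
  return status === undefined || status === 429 || status >= 500;
}

// Picks how long to wait before the next attempt, in milliseconds.
// A numeric Retry-After header (seconds) wins; otherwise use exponential
// backoff with full jitter so many clients don't retry in lockstep.
function retryDelayMs(
  attempt,
  retryAfterHeader,
  { baseMs = 1000, maxMs = 30000, random = Math.random } = {}
) {
  const retryAfterSeconds = Number(retryAfterHeader);
  if (Number.isFinite(retryAfterSeconds) && retryAfterSeconds > 0) {
    return retryAfterSeconds * 1000;
  }
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

Retry-After may also arrive as an HTTP date; this sketch treats that case as "no header" and falls back to backoff.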
Async + webhook for long files
Audio over ~30 min is a poor fit for sync requests — you hold a socket open forever. Submit async, get a webhook when it's done:
const job = await fetch("https://www.tryspeakeasy.io/api/v1/audio/transcriptions", {
method: "POST",
headers: {
Authorization: `Bearer ${process.env.SPEAKEASY_API_KEY}`,
"Content-Type": "application/json",
},
body: JSON.stringify({
model: "whisper-large-v3",
url: "https://s3.example.com/court-hearing.mp3",
async: true,
webhook_url: "https://your-app.com/webhooks/transcripts",
}),
});
const { id } = await job.json();
console.log(`Submitted ${id}, webhook will fire when ready.`);
Your webhook receives a POST with the full transcript. Perfect for batch pipelines processing 2-hour+ recordings.
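On the receiving side, webhook deliveries can be retried by the sender, so it helps to make the handler idempotent. The sketch below is framework-agnostic and heavily assumption-based: the payload shape ({ id, text }) is a guess based on the job id returned above, and createWebhookHandler and saveTranscript are hypothetical names. Check the real payload before relying on it.

```javascript
// Hypothetical idempotent webhook handler.
// Assumption: the POST body parses to { id: string, text: string }.
function createWebhookHandler(saveTranscript) {
  const seen = new Set(); // in-memory only; use a database in production

  return function handle(payload) {
    if (!payload || typeof payload.id !== "string") {
      return { status: 400, body: "Missing job id" };
    }
    if (seen.has(payload.id)) {
      // Same job delivered twice: acknowledge without saving again.
      return { status: 200, body: "Duplicate ignored" };
    }
    saveTranscript(payload.id, payload.text);
    seen.add(payload.id); // mark as processed only after a successful save
    return { status: 200, body: "OK" };
  };
}
```

You would call handle(await req.json()) inside your route and turn the returned status and body into the HTTP response.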
Rate limits & concurrency
Default limits on the $10/month plan:
- 500 requests per minute (10x OpenAI's default of 50 RPM)
- 10 concurrent requests per account
For higher volume, contact support; we can provision 2,000+ RPM within a day. If you're already hitting rate limits, see our guide to the real cost of speech-to-text APIs to work out whether the provider's pricing or your architecture is the bottleneck.
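To stay under the concurrent-request cap when processing a batch, you can run jobs through a small worker pool instead of firing them all at once. Here is a minimal sketch; mapWithConcurrency is a hypothetical helper, not part of any SDK.

```javascript
// Runs `worker` over `items` with at most `limit` calls in flight.
// Results come back in the same order as `items`.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  async function runner() {
    while (next < items.length) {
      const index = next++; // claim an item before awaiting
      results[index] = await worker(items[index], index);
    }
  }

  // Start up to `limit` runners that pull from the shared index.
  const runners = Array.from({ length: Math.min(limit, items.length) }, runner);
  await Promise.all(runners);
  return results;
}
```

With the limits listed above you might call mapWithConcurrency(files, 10, transcribeWithRetry). If a worker rejects, the returned promise rejects too, but the other runners are not cancelled, so wrap the worker in a try/catch if you want partial results.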
Supported file formats
.mp3, .wav, .flac, .aac, .opus, .ogg, .m4a, .mp4, .mpeg, .mov, .webm. 100 MB upload cap (vs OpenAI's 25 MB). For live-captured audio, encode to opus or webm — both are browser-native and small.
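Checking the extension and size before uploading saves a round trip that would end in a 400 or 413. A possible pre-flight check, using the format list and cap above, could look like this. checkUpload is a hypothetical helper, and reading "100 MB" as 100,000,000 bytes is a conservative assumption.

```javascript
const SUPPORTED_EXTENSIONS = new Set([
  "mp3", "wav", "flac", "aac", "opus", "ogg", "m4a", "mp4", "mpeg", "mov", "webm",
]);
// Assumption: "100 MB" interpreted as 100,000,000 bytes (the stricter reading).
const MAX_UPLOAD_BYTES = 100 * 1000 * 1000;

// Returns { ok: true } or { ok: false, reason } without touching the network.
function checkUpload(filename, sizeBytes) {
  const dot = filename.lastIndexOf(".");
  const ext = dot === -1 ? "" : filename.slice(dot + 1).toLowerCase();
  if (!SUPPORTED_EXTENSIONS.has(ext)) {
    return { ok: false, reason: `Unsupported format: .${ext || "(none)"}` };
  }
  if (sizeBytes > MAX_UPLOAD_BYTES) {
    return { ok: false, reason: "File exceeds the 100 MB upload cap" };
  }
  return { ok: true };
}
```

This only looks at the filename, so a mislabeled file can still be rejected by the server; treat it as a cheap first filter.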
Frequently Asked Questions
Does the openai npm package work with SpeakEasy?
Yes. Any version ≥ 4.0 of the openai package works. Set baseURL when constructing the client and your existing code runs unchanged.
Can I call the API from a browser directly?
You can, but you shouldn't. Exposing an API key in client-side JavaScript lets anyone with DevTools steal and burn your quota. Always proxy through a backend route.
What's the difference between transcriptions and translations?
transcriptions.create returns text in the source language. translations.create returns English regardless of input language. Both accept the same file/URL/model parameters.
Does SpeakEasy support real-time (WebSocket) streaming?
Not via WebSocket — we use HTTP server-sent events (SSE), same as OpenAI. For near-real-time UIs, chunk the audio into 2-5 second segments and transcribe each chunk individually.
How do I handle partial failures in batch pipelines?
Log the request ID from the response headers, capture failed inputs in a dead-letter queue, and retry after fixing the underlying issue (wrong format, corrupted file, exceeded rate limit). Don't retry blind — you'll burn quota.
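One way to structure that is to let every input settle and sort the outcomes into successes and dead letters. The sketch below assumes a transcribe function you supply; runBatch is a hypothetical helper, and reading request_id off the error is an assumption about the error object (adjust it to whatever your client exposes).

```javascript
// Runs `transcribe` over all inputs and separates successes from failures.
// Failed inputs land in `deadLetters` with enough context to triage later.
async function runBatch(inputs, transcribe) {
  const settled = await Promise.allSettled(
    inputs.map(async (input) => transcribe(input))
  );
  const succeeded = [];
  const deadLetters = [];
  settled.forEach((outcome, i) => {
    if (outcome.status === "fulfilled") {
      succeeded.push({ input: inputs[i], result: outcome.value });
    } else {
      deadLetters.push({
        input: inputs[i],
        error: String(outcome.reason?.message ?? outcome.reason),
        // Assumption: the thrown error carries a request_id property.
        requestId: outcome.reason?.request_id ?? null,
      });
    }
  });
  return { succeeded, deadLetters };
}
```

Persist deadLetters somewhere durable (a queue, a table, a file) and replay only after you've fixed the cause. For large batches, cap the number of requests in flight instead of starting them all at once.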
Related reading
- Best Speech-to-Text APIs in 2026 — full benchmark of 8 providers on WER and price
- Python Speech-to-Text API — the sibling guide for Python
- Speaker Diarization Guide — full walkthrough of diarize: true
- SRT/VTT Subtitle API — generating subtitles directly from audio
- Async Transcription API — the webhook pattern in depth
- Whisper API Alternative — the shopping-intent comparison with OpenAI
Start for $1 — 50 hours of audio, no credit-card trap, cancel from the dashboard whenever. Ship in an afternoon.