Rapha · 5 min read

Speech-to-Text API in JavaScript: Complete Guide (2026)

The complete speech-to-text JavaScript guide — Node, browser, streaming, diarization, error handling, and production patterns. OpenAI SDK compatible via SpeakEasy.

Tags: JavaScript, Speech-to-Text, Tutorial, Node.js

In one sentence: in Node or in the browser, you send audio to a POST /v1/audio/transcriptions endpoint with a Bearer token, and get back text. Everything else — timestamps, diarization, streaming, language hints — is optional parameters on the same request.

This guide covers every real-world pattern you'll hit building a speech-to-text integration in JavaScript: Node backends, Next.js server routes, browser recorders, file uploads, URL-based jobs, streaming, diarization, and the production-grade error handling you won't find in a Stack Overflow answer. All examples use SpeakEasy's OpenAI-compatible API — if you already know the OpenAI SDK, you already know this API.

Prerequisites

  • Node.js 20+ (global fetch, FormData, and File) or any modern browser. Node 18 has fetch and FormData too, but File has to be imported from node:buffer there
  • A SpeakEasy API key — grab one for $1
  • Basic familiarity with async/await and Promise

Option 1: OpenAI Node SDK (recommended)

The shortest path. SpeakEasy is fully compatible with the OpenAI SDK — you change the baseURL and the rest of your code works.

npm install openai

import OpenAI from "openai";
import fs from "node:fs";

const client = new OpenAI({
  apiKey: process.env.SPEAKEASY_API_KEY,
  baseURL: "https://www.tryspeakeasy.io/api/v1",
});

const transcript = await client.audio.transcriptions.create({
  model: "whisper-large-v3",
  file: fs.createReadStream("meeting.mp3"),
});

console.log(transcript.text);

Three lines of meaningful code. If you already have an OpenAI-based integration, this is a literal search-and-replace.

TypeScript: the types come for free

The OpenAI SDK ships typings, so your IDE will autocomplete every parameter and transcript.text will be typed as string:

import fs from "node:fs";
import OpenAI from "openai";
import type { Transcription } from "openai/resources/audio/transcriptions";

const client = new OpenAI({ /* ... */ });

const transcript: Transcription = await client.audio.transcriptions.create({
  model: "whisper-large-v3",
  file: fs.createReadStream("meeting.mp3"),
});

For verbose_json responses with segments, use the TranscriptionVerbose type exported from the same module.

Option 2: Native fetch (no dependency)

If you don't want to install anything — a tiny Edge function, a browser utility, a throwaway script — native fetch works fine.

import fs from "node:fs";

const formData = new FormData();
formData.append("model", "whisper-large-v3");
formData.append("file", new Blob([await fs.promises.readFile("audio.mp3")]), "audio.mp3");

const res = await fetch("https://www.tryspeakeasy.io/api/v1/audio/transcriptions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.SPEAKEASY_API_KEY}`,
  },
  body: formData,
});

if (!res.ok) throw new Error(`HTTP ${res.status}: ${await res.text()}`);

const { text } = await res.json();
console.log(text);

Note the explicit filename ("audio.mp3") passed as the third argument to formData.append. Without it, the part is sent with the default filename "blob", which some servers reject because they can't infer the audio format.
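If you build this multipart body in more than one place, it may be worth wrapping it in a small helper. The sketch below is one way to do that; buildTranscriptionForm and the extension-to-MIME table are hypothetical names for illustration, and the table is a small subset rather than an official list.

```javascript
// Hypothetical helper: builds the multipart body for a transcription request.
// The MIME table is an illustrative subset, not an official list.
const MIME_BY_EXTENSION = {
  mp3: "audio/mpeg",
  wav: "audio/wav",
  flac: "audio/flac",
  webm: "audio/webm",
  m4a: "audio/mp4",
};

function buildTranscriptionForm(bytes, filename, model = "whisper-large-v3") {
  const extension = filename.split(".").pop().toLowerCase();
  const type = MIME_BY_EXTENSION[extension] ?? "application/octet-stream";

  const form = new FormData();
  form.append("model", model);
  // The third argument sets the filename of the multipart part.
  form.append("file", new Blob([bytes], { type }), filename);
  return form;
}
```

Pass the result as the body of the fetch call above, for example buildTranscriptionForm(await fs.promises.readFile("audio.mp3"), "audio.mp3"). As in the original snippet, leave the Content-Type header unset so fetch can add the multipart boundary itself.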

Option 3: Browser (recording mic input)

The most common question: "how do I transcribe audio recorded from the user's microphone?" Pattern:

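async function recordAndTranscribe() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const chunks = [];

  // Not every browser records audio/webm, so fall back to the browser's default container.
  const mimeType = MediaRecorder.isTypeSupported("audio/webm") ? "audio/webm" : "";
  const recorder = new MediaRecorder(stream, mimeType ? { mimeType } : undefined);
  recorder.ondataavailable = e => chunks.push(e.data);

  // Attach the stop listener before stopping so the event can't be missed.
  const stopped = new Promise(r => recorder.addEventListener("stop", r, { once: true }));

  recorder.start();
  await new Promise(r => setTimeout(r, 5000)); // record 5 seconds
  recorder.stop();
  await stopped; // the final dataavailable event fires before "stop"
  stream.getTracks().forEach(t => t.stop());

  // recorder.mimeType reports the container that was actually used.
  const blob = new Blob(chunks, { type: recorder.mimeType });
  const extension = recorder.mimeType.includes("mp4") ? "mp4" : "webm";
  const fd = new FormData();
  fd.append("model", "whisper-large-v3");
  fd.append("file", blob, `recording.${extension}`);

  // Important: proxy this through YOUR backend, don't expose the API key in the browser.
  const res = await fetch("/api/transcribe", { method: "POST", body: fd });
  if (!res.ok) throw new Error(`HTTP ${res.status}: ${await res.text()}`);
  const { text } = await res.json();
  return text;
}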

Never call the SpeakEasy API directly from a browser with your real API key. Proxy the request through your own server route (below) so the key stays on the server.

Next.js 15/16 route handler (the proxy)

// app/api/transcribe/route.ts
import OpenAI from "openai";

export async function POST(req: Request) {
  const form = await req.formData();
  const file = form.get("file");
  if (!(file instanceof File)) return new Response("Missing file", { status: 400 });

  const client = new OpenAI({
    apiKey: process.env.SPEAKEASY_API_KEY!,
    baseURL: "https://www.tryspeakeasy.io/api/v1",
  });

  const transcript = await client.audio.transcriptions.create({
    model: "whisper-large-v3",
    file,
  });

  return Response.json({ text: transcript.text });
}

This runs server-side, so the API key lives in your environment variables. You can add rate limiting, auth, usage metering, and logging in one place.

Getting timestamps

Subtitle generation, audio search, and video editors all need word- or segment-level timestamps. Request verbose_json:

const transcript = await client.audio.transcriptions.create({
  model: "whisper-large-v3",
  file: fs.createReadStream("interview.mp3"),
  response_format: "verbose_json",
  timestamp_granularities: ["segment", "word"],
});

for (const segment of transcript.segments) {
  console.log(`[${segment.start.toFixed(2)}s] ${segment.text}`);
}

Each word object has { word, start, end, probability } — useful for highlight-as-you-listen UIs.
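For a highlight-as-you-listen UI, the core problem is mapping the player's current time to the word being spoken. Here is a minimal sketch of that lookup, assuming the words array is sorted by start time and uses seconds for start and end as described above. findActiveWordIndex is a hypothetical helper, not part of any SDK.

```javascript
// Returns the index of the word being spoken at `time` (seconds), or -1
// when playback sits in a gap between words. Assumes `words` is sorted by start.
function findActiveWordIndex(words, time) {
  let low = 0;
  let high = words.length - 1;
  while (low <= high) {
    const mid = (low + high) >> 1;
    if (time < words[mid].start) high = mid - 1;
    else if (time >= words[mid].end) low = mid + 1;
    else return mid; // start <= time < end
  }
  return -1;
}
```

In a browser you would typically call this from the audio element's timeupdate event with audio.currentTime and toggle a CSS class on the matching word. Binary search keeps the lookup cheap even for hour-long transcripts.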

Generating SRT or WebVTT subtitles

If you want subtitles directly, ask the API for them — no post-processing on your side:

const srt = await client.audio.transcriptions.create({
  model: "whisper-large-v3",
  file: fs.createReadStream("video-audio.mp3"),
  response_format: "srt",
});

fs.writeFileSync("captions.srt", srt);

Works identically for "vtt" — useful for HTML5 <track> tags. Our subtitle API guide covers styling and timing tweaks.

URL-based transcription (skip the upload)

Audio already sitting in S3, R2, GCS, or any public URL? Don't re-upload it — pass the URL:

const res = await fetch("https://www.tryspeakeasy.io/api/v1/audio/transcriptions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.SPEAKEASY_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "whisper-large-v3",
    url: "https://example.com/podcasts/ep42.mp3",
  }),
});

if (!res.ok) throw new Error(`HTTP ${res.status}: ${await res.text()}`);

const { text } = await res.json();

This pattern is a godsend for serverless. The SpeakEasy servers fetch the audio, so your Lambda or Edge function never holds the file in memory.

Streaming transcription (server-sent events)

For live captioning or long-running jobs, stream the response as server-sent events:

const res = await fetch("https://www.tryspeakeasy.io/api/v1/audio/transcriptions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.SPEAKEASY_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "whisper-large-v3",
    url: "https://example.com/long-recording.mp3",
    stream: true,
  }),
});

if (!res.ok || !res.body) throw new Error(`HTTP ${res.status}: ${await res.text()}`);

const reader = res.body.getReader();
const decoder = new TextDecoder();
let buffer = "";

// Labeled loop: `return` is not allowed at module top level, so [DONE] breaks out of both loops.
read: while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });

  // SSE frames are separated by \n\n
  const frames = buffer.split("\n\n");
  buffer = frames.pop() ?? ""; // last chunk may be incomplete

  for (const frame of frames) {
    if (!frame.startsWith("data: ")) continue;
    const payload = frame.slice(6);
    if (payload === "[DONE]") break read;
    const { delta } = JSON.parse(payload);
    process.stdout.write(delta);
  }
}

This is the pattern for live-captioning webinars or typing-as-the-user-speaks UIs.
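The frame-splitting logic is the part that tends to break, because network chunks can end in the middle of a frame. One option is to pull it into a small parser you can test without a network connection. The sketch below assumes the same { delta } event shape and [DONE] sentinel as the loop above; createSseParser is a hypothetical name.

```javascript
// Incremental SSE parser. Feed it decoded text chunks as they arrive; it
// buffers partial frames and calls onDelta for each complete "data:" line.
// push() returns true once the [DONE] sentinel has been seen.
function createSseParser(onDelta) {
  let buffer = "";
  let done = false;

  return {
    push(chunk) {
      if (done) return true;
      // Normalize CRLF so "\r\n\r\n" separators split correctly too.
      buffer = (buffer + chunk).replace(/\r\n/g, "\n");
      const frames = buffer.split("\n\n");
      buffer = frames.pop() ?? ""; // keep the trailing partial frame

      for (const frame of frames) {
        for (const line of frame.split("\n")) {
          if (!line.startsWith("data:")) continue; // skip comments, event:, id:
          const payload = line.slice(5).trim();
          if (payload === "[DONE]") {
            done = true;
            return true;
          }
          const { delta } = JSON.parse(payload);
          if (typeof delta === "string") onDelta(delta);
        }
      }
      return false;
    },
  };
}
```

Inside the read loop you would call parser.push(decoder.decode(value, { stream: true })) and stop reading when it returns true.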

Speaker diarization (who said what)

Enable diarize: true for multi-speaker audio. The API returns speaker labels on each segment:

const transcript = await client.audio.transcriptions.create({
  model: "whisper-large-v3",
  file: fs.createReadStream("all-hands-meeting.mp3"),
  response_format: "verbose_json",
  // @ts-expect-error — SpeakEasy extension beyond OpenAI's type
  diarize: true,
});

for (const segment of transcript.segments) {
  console.log(`${segment.speaker}: ${segment.text}`);
  // "Speaker 0: Hey, welcome to the call."
  // "Speaker 1: Thanks, glad to be here."
}

SpeakEasy includes diarization on the $10/month plan; OpenAI's whisper-1 endpoint doesn't return speaker labels. Full walkthrough in the speaker diarization guide.
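Segment-level output is often choppier than what you want to show users, since one speaker's turn can span several segments. A minimal sketch for merging consecutive segments from the same speaker follows. It assumes each segment carries speaker, start, end, and text fields as in the loop above; groupBySpeakerTurn is a hypothetical helper.

```javascript
// Merges consecutive segments from the same speaker into one turn.
// Assumes each segment has { speaker, start, end, text }.
function groupBySpeakerTurn(segments) {
  const turns = [];
  for (const segment of segments) {
    const last = turns[turns.length - 1];
    if (last && last.speaker === segment.speaker) {
      // Same speaker is still talking: extend the current turn.
      last.end = segment.end;
      last.text += " " + segment.text.trim();
    } else {
      turns.push({
        speaker: segment.speaker,
        start: segment.start,
        end: segment.end,
        text: segment.text.trim(),
      });
    }
  }
  return turns;
}
```

Calling groupBySpeakerTurn(transcript.segments) gives you one entry per turn, which maps more naturally onto a chat-style transcript view or meeting minutes.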

Language hints (faster, more accurate)

Whisper auto-detects language, but if you know the language, pass it — the model skips detection and goes straight to transcription, saving a few hundred milliseconds:

const transcript = await client.audio.transcriptions.create({
  model: "whisper-large-v3",
  file: fs.createReadStream("spanish-interview.mp3"),
  language: "es",
});

Language codes are ISO 639-1 (en, es, fr, de, ja, zh, …). 99 languages supported.

Translation to English

Any input language → English transcript in one call:

const englishText = await client.audio.translations.create({
  model: "whisper-large-v3",
  file: fs.createReadStream("japanese-podcast.mp3"),
});

console.log(englishText.text); // English output, regardless of source language

Full walkthrough with 99 source languages.

Production error handling

The quick pattern most guides stop at:

try {
  const t = await client.audio.transcriptions.create({ /* ... */ });
  console.log(t.text);
} catch (err) {
  console.error(err);
}

The pattern you actually want in production:

import fs from "node:fs";
import { APIError } from "openai";

// Assumes `client` is the OpenAI instance configured earlier.
async function transcribeWithRetry(filePath, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await client.audio.transcriptions.create({
        model: "whisper-large-v3",
        file: fs.createReadStream(filePath),
      });
    } catch (err) {
      if (!(err instanceof APIError)) throw err;

      if (err.status === 401) throw new Error("Invalid API key.");
      if (err.status === 413) throw new Error("File > 100 MB. Split it first.");
      // Any other 4xx except 429 (bad audio format, wrong params) won't be fixed by a retry.
      if (err.status >= 400 && err.status < 500 && err.status !== 429) throw err;

      // 429 (rate limit), 5xx, and connection errors are worth retrying with exponential backoff
      if (attempt === maxAttempts) throw err;
      // err.headers may be a plain object or a Headers instance depending on the SDK version.
      const header = err.headers?.get?.("retry-after") ?? err.headers?.["retry-after"];
      const retryAfter = Number(header) || 2 ** attempt;
      await new Promise(r => setTimeout(r, retryAfter * 1000));
    }
  }
}

Key behaviors:

  • Don't retry 4xx errors (except 429). They're client-side problems that retries won't fix.
  • Honor Retry-After on 429. SpeakEasy sends this header, and your retry will be rejected again if you ignore it.
  • Cap retries. Infinite retry loops bankrupt you if the error is actually deterministic.
  • Log the request_id from the response headers so you can correlate with server logs when you file a support ticket.
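The first two rules are easy to get subtly wrong, so it can help to isolate them in pure functions you can unit-test. The sketch below is one possible shape, not an SDK feature: isRetryable, retryAfterSeconds, and retryDelayMs are hypothetical helpers, and the backoff adds jitter so that many clients don't retry in lockstep. Handling both header shapes is a defensive assumption about how different SDK versions expose err.headers.

```javascript
// Retry only rate limits (429), server errors (5xx), and network failures
// (no HTTP status at all). Other 4xx responses are deterministic.
function isRetryable(status) {
  if (status === undefined) return true;
  return status === 429 || status >= 500;
}

// Reads Retry-After (in seconds) from a Headers instance or a plain object.
// An HTTP-date value is not parsed here and falls through to backoff.
function retryAfterSeconds(headers) {
  if (!headers) return undefined;
  const raw = typeof headers.get === "function"
    ? headers.get("retry-after")
    : headers["retry-after"];
  const seconds = Number(raw);
  return raw != null && Number.isFinite(seconds) && seconds >= 0 ? seconds : undefined;
}

// Delay before the next attempt: the server's Retry-After if present,
// otherwise capped exponential backoff with full jitter.
function retryDelayMs(attempt, headers, options = {}) {
  const { baseMs = 1000, capMs = 30000, random = Math.random } = options;
  const serverSeconds = retryAfterSeconds(headers);
  if (serverSeconds !== undefined) return serverSeconds * 1000;
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

In a retry loop you would throw immediately when isRetryable(err.status) is false and otherwise sleep for retryDelayMs(attempt, err.headers) milliseconds. The injectable random parameter exists only to make the function deterministic in tests.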

Async + webhook for long files

Audio over ~30 min is a poor fit for sync requests — you hold a socket open forever. Submit async, get a webhook when it's done:

const job = await fetch("https://www.tryspeakeasy.io/api/v1/audio/transcriptions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.SPEAKEASY_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "whisper-large-v3",
    url: "https://s3.example.com/court-hearing.mp3",
    async: true,
    webhook_url: "https://your-app.com/webhooks/transcripts",
  }),
});

const { id } = await job.json();
console.log(`Submitted ${id}, webhook will fire when ready.`);

Your webhook receives a POST with the full transcript. Perfect for batch pipelines processing 2-hour+ recordings.
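On the receiving side, a handler might look roughly like the sketch below. The payload shape is not specified in this guide, so treat everything here as an assumption: it expects a JSON body with id and text fields, and it tolerates repeated deliveries by keying on id. handleTranscriptWebhook and processedJobs are hypothetical names, and the in-memory Map stands in for a real database.

```javascript
// Hypothetical webhook receiver built on the standard Request/Response objects.
// ASSUMPTIONS: the body is JSON with string `id` and `text` fields, and the
// same job may be delivered more than once.
const processedJobs = new Map(); // swap for a database in production

async function handleTranscriptWebhook(req) {
  let payload;
  try {
    payload = await req.json();
  } catch {
    return new Response("Invalid JSON", { status: 400 });
  }

  if (typeof payload?.id !== "string" || typeof payload?.text !== "string") {
    return new Response("Unexpected payload", { status: 422 });
  }

  // Acknowledge duplicates without processing them twice.
  if (!processedJobs.has(payload.id)) {
    processedJobs.set(payload.id, payload.text);
  }

  return new Response(JSON.stringify({ received: true }), {
    status: 200,
    headers: { "Content-Type": "application/json" },
  });
}
```

In a Next.js route handler you could export this function as POST from app/webhooks/transcripts/route.ts. Before trusting the body in production, check the provider's documentation for a way to authenticate webhook deliveries, since this sketch accepts any caller.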

Rate limits & concurrency

Default limits on the $10/month plan:

  • 500 requests per minute (10x OpenAI's default of 50 RPM)
  • 10 concurrent requests per account

For higher volume, contact support; we can provision 2,000+ RPM within a day. If you're already hitting rate limits, see our guide to the real cost of speech-to-text APIs to work out whether the provider's pricing or your architecture is the bottleneck.
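To stay under the concurrent-request limit when processing a backlog, cap how many calls are in flight at once on your side. A minimal sketch with no dependencies follows; mapWithConcurrency is a hypothetical helper, and worker is whatever async function performs a single transcription.

```javascript
// Runs `worker` over `items` with at most `limit` calls in flight.
// Results keep the input order. A rejection propagates to the caller.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each lane pulls the next unclaimed index until the list is exhausted.
  async function runLane() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const lanes = Array.from({ length: Math.min(limit, items.length) }, () => runLane());
  await Promise.all(lanes);
  return results;
}
```

For example, mapWithConcurrency(files, 10, f => transcribeWithRetry(f)) would keep at most ten requests open. A lower limit leaves headroom if other parts of your system share the same account.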

Supported file formats

.mp3, .wav, .flac, .aac, .opus, .ogg, .m4a, .mp4, .mpeg, .mov, .webm. 100 MB upload cap (vs OpenAI's 25 MB). For live-captured audio, encode to opus or webm — both are browser-native and small.
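Checking the extension and size before uploading saves a round trip that would end in a 400 or 413. The sketch below encodes the list above; checkUpload is a hypothetical helper, and because the exact byte limit isn't stated here, it assumes the conservative reading of 100 MB as 100,000,000 bytes.

```javascript
const SUPPORTED_EXTENSIONS = new Set([
  "mp3", "wav", "flac", "aac", "opus", "ogg", "m4a", "mp4", "mpeg", "mov", "webm",
]);
// ASSUMPTION: "100 MB" is read as 100 * 1000 * 1000 bytes (the stricter reading).
const MAX_UPLOAD_BYTES = 100 * 1000 * 1000;

// Returns { ok: true } or { ok: false, reason } without touching the network.
function checkUpload(filename, sizeBytes) {
  const dot = filename.lastIndexOf(".");
  const extension = dot === -1 ? "" : filename.slice(dot + 1).toLowerCase();
  if (!SUPPORTED_EXTENSIONS.has(extension)) {
    return { ok: false, reason: `Unsupported format: .${extension || "(none)"}` };
  }
  if (sizeBytes > MAX_UPLOAD_BYTES) {
    return { ok: false, reason: "File exceeds the 100 MB upload cap" };
  }
  return { ok: true };
}
```

In a browser, call it with file.name and file.size from a file input. In Node, use the size from fs.promises.stat. The extension check is only a hint, since a renamed file can still contain unsupported audio.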

Frequently Asked Questions

Does the openai npm package work with SpeakEasy?

Yes. Any version ≥ 4.0 of the openai package works. Set baseURL when constructing the client and your existing code runs unchanged.

Can I call the API from a browser directly?

You can, but you shouldn't. Exposing an API key in client-side JavaScript lets anyone with DevTools steal and burn your quota. Always proxy through a backend route.

What's the difference between transcriptions and translations?

transcriptions.create returns text in the source language. translations.create returns English regardless of input language. Both accept the same file/URL/model parameters.

Does SpeakEasy support real-time (WebSocket) streaming?

Not via WebSocket — we use HTTP server-sent events (SSE), same as OpenAI. For near-real-time UIs, chunk the audio into 2-5 second segments and transcribe each chunk individually.

How do I handle partial failures in batch pipelines?

Log the request ID from the response headers, capture failed inputs in a dead-letter queue, and retry after fixing the underlying issue (wrong format, corrupted file, exceeded rate limit). Don't retry blind — you'll burn quota.
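As a rough illustration of the dead-letter idea, the sketch below runs every input and separates successes from failures instead of letting one rejection abort the batch. runBatch is a hypothetical helper, transcribe is a caller-supplied async function, and the in-memory deadLetter array stands in for a real queue.

```javascript
// Runs every input, then splits the outcomes. Failures keep enough
// context (input, HTTP status, message) to triage before any retry.
async function runBatch(inputs, transcribe) {
  // The async wrapper turns synchronous throws into rejections too.
  const settled = await Promise.allSettled(inputs.map(async input => transcribe(input)));
  const succeeded = [];
  const deadLetter = [];

  settled.forEach((result, index) => {
    if (result.status === "fulfilled") {
      succeeded.push({ input: inputs[index], transcript: result.value });
    } else {
      const err = result.reason;
      deadLetter.push({
        input: inputs[index],
        status: err?.status, // HTTP status when the error carries one
        message: err?.message ?? String(err),
      });
    }
  });

  return { succeeded, deadLetter };
}
```

Entries with a 400 status point at bad input and need a fix before resubmitting, while entries with 429 or 5xx can go back into the queue later. This version starts every request at once, so for large batches cap the number in flight to respect the concurrency limit.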


Start for $1 — 50 hours of audio, no credit-card trap, cancel from the dashboard whenever. Ship in an afternoon.
