SpeakEasy Team

Speaker Diarization: Identify Who Said What in Audio

Learn what speaker diarization is, why it matters, and how to use the SpeakEasy API to automatically identify speakers in audio recordings with code examples.

Speech-to-Text · Diarization · Tutorial


Standard transcription gives you the words. Speaker diarization tells you who said them. If you've ever needed to turn a meeting recording into attributed notes or analyze a customer support call, diarization is the feature that makes transcription truly useful.

What Is Speaker Diarization?

Speaker diarization is the process of partitioning an audio stream into segments based on who is speaking. The output labels each segment with a speaker identifier (Speaker 0, Speaker 1, etc.), so you can see the conversation flow attributed to each participant.

It answers the question: "Who spoke when?"

Use Cases

  • Meeting transcripts: Generate notes where each person's contributions are clearly labeled.
  • Interview processing: Separate interviewer questions from candidate answers automatically.
  • Podcast editing: Identify host vs. guest segments for editing or show notes.
  • Call center analytics: Distinguish agent from customer to analyze talk ratios and sentiment.
  • Legal depositions: Create attributed records of who said what during recorded proceedings.
  • Medical dictation: Separate doctor remarks from patient responses.

How to Enable Diarization with SpeakEasy

SpeakEasy's speech-to-text API includes diarization as a built-in feature. No second API call, no separate service, no extra SDK.

Python Example

from openai import OpenAI

# Point the OpenAI SDK at SpeakEasy's OpenAI-compatible endpoint
client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.tryspeakeasy.io/v1",
)

# Upload the recording and request a diarized transcript in one call
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
        response_format="verbose_json",  # required for segment-level output
        extra_body={"diarize": True},    # SpeakEasy's diarization flag
    )

curl Example

curl -X POST https://api.tryspeakeasy.io/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F model="whisper-large-v3" \
  -F file="@meeting.mp3" \
  -F response_format="verbose_json" \
  -F diarize="true"
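Before parsing the response, it helps to know its shape. As a rough sketch (the exact field set is defined in the API documentation; the values here are illustrative), a diarized verbose_json payload pairs each segment's text and timing with a speaker label:

```python
# Illustrative response shape for a diarized verbose_json transcript.
# Field values are made up; the structure mirrors the examples below.
example_response = {
    "text": "Welcome everyone. Let's review last quarter's results.",
    "segments": [
        {
            "id": 0,
            "start": 0.0,
            "end": 3.2,
            "text": " Welcome everyone. Let's review last quarter's results.",
            "speaker": 0,
        },
    ],
}

for segment in example_response["segments"]:
    print(f"Speaker {segment['speaker']}: {segment['text'].strip()}")
```

Each segment carries its own `speaker` label, so attribution is per segment rather than per word.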

Processing the Results

The API returns segments with speaker labels. Here's how to turn that into readable output:

# The SDK returns segments as model objects; convert to plain dicts
# so .get() works for the diarization-specific speaker field
segments = transcript.model_dump()["segments"]

# Group consecutive segments by speaker
current_speaker = None
current_text = []

for segment in segments:
    speaker = segment.get("speaker", "Unknown")

    if speaker != current_speaker:
        if current_speaker is not None:
            print(f"\n**Speaker {current_speaker}:** {' '.join(current_text)}")
        current_speaker = speaker
        current_text = [segment["text"].strip()]
    else:
        current_text.append(segment["text"].strip())

# Print the last speaker's text
if current_text:
    print(f"\n**Speaker {current_speaker}:** {' '.join(current_text)}")

Sample output:

**Speaker 0:** Welcome everyone. Let's review last quarter's results.

**Speaker 1:** Thanks. Revenue grew 15% quarter over quarter, driven mainly by the enterprise segment.

**Speaker 0:** That's great. What about customer retention?

**Speaker 2:** Retention held steady at 94%. We did see a slight dip in the SMB tier.
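Because each segment also carries `start` and `end` times, the same data supports the call-center analytics mentioned earlier. A minimal talk-ratio sketch, assuming plain-dict segments with `speaker`, `start`, and `end` fields:

```python
from collections import defaultdict

def talk_ratios(segments):
    """Return each speaker's share of total speaking time."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.get("speaker", "Unknown")] += seg["end"] - seg["start"]
    grand_total = sum(totals.values()) or 1.0
    return {spk: dur / grand_total for spk, dur in totals.items()}

# Toy example: Speaker 0 talks for 6s, Speaker 1 for 4s
segments = [
    {"speaker": 0, "start": 0.0, "end": 6.0},
    {"speaker": 1, "start": 6.0, "end": 10.0},
]
print(talk_ratios(segments))  # {0: 0.6, 1: 0.4}
```

A lopsided agent-to-customer ratio is a common early signal in support-call review.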

Combining Diarization with Timestamps

For applications like video editors or searchable archives, combine diarization with timestamps:

for segment in transcript.model_dump()["segments"]:
    speaker = segment.get("speaker", "?")
    start = segment["start"]
    end = segment["end"]
    text = segment["text"].strip()
    print(f"[{start:.1f}s - {end:.1f}s] Speaker {speaker}: {text}")

This gives you a time-coded, speaker-attributed transcript that you can use for subtitle generation, meeting summaries, or compliance records.
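For subtitle generation specifically, those same fields map directly onto the SRT format. A minimal sketch (prefixing the cue text with the speaker label is a readability convention, not part of the SRT spec):

```python
def to_srt_time(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """Render speaker-labeled segments as SRT cues."""
    cues = []
    for i, seg in enumerate(segments, start=1):
        cues.append(
            f"{i}\n"
            f"{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n"
            f"Speaker {seg.get('speaker', '?')}: {seg['text'].strip()}\n"
        )
    return "\n".join(cues)

segments = [
    {"speaker": 0, "start": 0.0, "end": 3.2, "text": " Welcome everyone."},
]
print(segments_to_srt(segments))
```

Write the result to a `.srt` file and most video players and editors will pick up the speaker-attributed captions directly.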

Tips for Better Results

  1. Use high-quality audio. Clear recordings with minimal background noise produce more accurate speaker separation.
  2. Minimize crosstalk. When speakers talk over each other, diarization accuracy drops. Directional microphones help.
  3. Longer is often better. The model needs enough speech per speaker to distinguish voices reliably. Very short clips (under 30 seconds) may produce less accurate results.
  4. Don't pre-split. Send the full recording. The model performs better with complete context.

Pricing

Speaker diarization is included in SpeakEasy's standard transcription pricing at no extra charge. Check our pricing page for current rates.

Next Steps

Speaker diarization transforms raw transcription into structured, actionable data. Pair it with your existing workflows for meeting notes, analytics dashboards, or compliance documentation.

Read the full speech-to-text documentation for all available parameters, or explore our text-to-speech API to generate audio from your content.

Create your free SpeakEasy account and start building with diarization today.
