SpeakEasy Team · 5 min read

Voice Activity Detection in Python: The Complete Guide

Learn how to implement voice activity detection in Python using webrtcvad, Silero VAD, and pyannote.audio — then pipe detected speech straight into a transcription API.

Python · Speech-to-Text · Tutorial

Most audio is silence. Meeting recordings, podcast episodes, call center logs — a huge chunk of every file is dead air. If you're sending all of that to a speech-to-text API, you're wasting time and money on nothing.

Voice Activity Detection (VAD) solves this. It tells you exactly which parts of an audio stream contain speech and which don't. Detect the speech segments first, then transcribe only what matters.

This guide covers the three most popular Python VAD approaches — webrtcvad, Silero VAD, and pyannote.audio — with working code examples. At the end, we'll wire it all together with SpeakEasy's STT API so you go from raw audio to transcript in one pipeline.

What Is Voice Activity Detection?

VAD is a binary classifier for audio. It takes small chunks of audio (typically 10–30ms frames) and answers one question: is someone speaking right now?

The output is simple — speech or not speech — but getting it right in real-world conditions is surprisingly hard. Background noise, HVAC hum, coughing, laughter, music — all of these can fool a naive detector.
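To see why a naive detector is so easy to fool, here is a minimal energy-threshold "VAD" sketch (not any of the libraries below): it flags every 30ms frame whose RMS energy crosses a fixed threshold, which means a cough, HVAC hum, or music crosses it just as readily as speech.

```python
import numpy as np

SR = 16000
FRAME = int(0.03 * SR)  # 30 ms frames, 480 samples at 16kHz

def naive_energy_vad(samples, threshold=0.01):
    """Flag a frame as 'speech' when its RMS energy exceeds a fixed threshold."""
    flags = []
    for start in range(0, len(samples) - FRAME + 1, FRAME):
        frame = samples[start:start + FRAME]
        rms = np.sqrt(np.mean(frame ** 2))
        flags.append(rms > threshold)
    return flags

# Synthetic clip: 0.3s silence, 0.3s of a loud 220 Hz tone, 0.3s silence
t = np.arange(int(0.3 * SR)) / SR
silence = np.zeros(int(0.3 * SR))
tone = 0.2 * np.sin(2 * np.pi * 220 * t)
clip = np.concatenate([silence, tone, silence])

flags = naive_energy_vad(clip)
print(flags)  # the middle third of frames is flagged, tone or not
```

The tone is flagged as confidently as real speech would be, which is exactly the failure mode the trained models below are built to avoid.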

Why Developers Need VAD

  • Cut transcription costs. A 1-hour recording might contain only 15 minutes of actual speech. VAD lets you skip the other 45 minutes.
  • Improve STT accuracy. Feeding silence and background noise into a speech model introduces hallucinated transcription errors. Clean input = clean output.
  • Enable real-time pipelines. VAD tells your app when to start and stop listening, so you don't buffer forever waiting for a timeout.
  • Build smarter features. Speaker turn detection, silence analysis, conversation pacing — all start with knowing when someone is talking.
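The cost arithmetic behind the first bullet is easy to make concrete. A quick sketch (the per-hour price here is a made-up placeholder, not SpeakEasy's actual rate):

```python
def transcription_savings(total_s, speech_s, price_per_hour=0.30):
    """Compare transcribing a full file vs. only the VAD-detected speech.
    price_per_hour is a hypothetical rate, for illustration only."""
    full_cost = total_s / 3600 * price_per_hour
    vad_cost = speech_s / 3600 * price_per_hour
    reduction = 1 - speech_s / total_s
    return full_cost, vad_cost, reduction

# A 1-hour recording with 15 minutes of actual speech
full, vad, saved = transcription_savings(3600, 900)
print(f"full: ${full:.2f}  speech-only: ${vad:.3f}  reduction: {saved:.0%}")
```

Whatever the actual rate, the reduction percentage carries straight through to the bill.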

Approach 1: webrtcvad (Lightweight and Fast)

py-webrtcvad is a Python wrapper around Google's WebRTC Voice Activity Detection engine. It uses a traditional Gaussian Mixture Model (GMM) — no neural network, no GPU, minimal dependencies.

Best for: Simple use cases, edge devices, when you need maximum speed with minimal overhead.

Install

pip install webrtcvad librosa numpy

Basic Usage

import numpy as np
import librosa
import webrtcvad

# Load audio at 16kHz (webrtcvad accepts 8, 16, 32, or 48kHz mono PCM)
y, sr = librosa.load("recording.wav", sr=16000)

# Convert float32 in [-1, 1] to 16-bit PCM bytes (webrtcvad requirement)
y_int = np.clip(y * 32768, -32768, 32767).astype(np.int16)
raw_samples = y_int.tobytes()

# Initialize VAD — aggressiveness 0 (least) to 3 (most)
vad = webrtcvad.Vad(2)

# Process in 30ms frames
frame_duration = 0.03  # seconds
samples_per_frame = int(frame_duration * sr)
bytes_per_sample = 2

speech_frames = []
for start in range(0, len(y_int), samples_per_frame):
    end = min(start + samples_per_frame, len(y_int))
    frame_bytes = raw_samples[start * bytes_per_sample : end * bytes_per_sample]
    
    if len(frame_bytes) < samples_per_frame * bytes_per_sample:
        break  # skip incomplete final frame
    
    is_speech = vad.is_speech(frame_bytes, sample_rate=sr)
    speech_frames.append({
        "start": start / sr,
        "end": end / sr,
        "is_speech": is_speech,
    })

# Get speech segments
speech_segments = [f for f in speech_frames if f["is_speech"]]
print(f"Found {len(speech_segments)} speech frames out of {len(speech_frames)} total")

Tuning the Aggressiveness Parameter

The aggressiveness parameter (0–3) controls the trade-off between false positives and false negatives:

  • 0 — Least aggressive. Catches more speech but also more noise. Good for quiet environments.
  • 3 — Most aggressive. Strict filtering. May miss some speech but rarely triggers on noise.

Start with 2 for most use cases. Bump to 3 for noisy recordings where you'd rather miss some speech than get false positives.

Limitations

webrtcvad is fast and battle-tested, but it uses older signal processing techniques. It struggles with:

  • Low signal-to-noise environments
  • Music or rhythmic background noise
  • Distant microphones or heavy reverb

For anything beyond simple, clean recordings, you'll want a deep learning approach.

Approach 2: Silero VAD (Best Balance of Speed and Accuracy)

Silero VAD is a deep learning model that runs on CPU in under 1ms per audio chunk. It's trained on 6,000+ languages, works at 8kHz or 16kHz, and the model is about 2MB. MIT licensed, no API key needed.

Here's the reality: for most Python developers, Silero VAD is the right choice. It's dramatically more accurate than webrtcvad with negligible speed difference on modern hardware.

Install

pip install silero-vad torchaudio

Basic Usage

from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

# Load model (downloads ~2MB on first run)
model = load_silero_vad()

# Read audio file
wav = read_audio("recording.wav")

# Get speech timestamps
speech_timestamps = get_speech_timestamps(
    wav,
    model,
    return_seconds=True,
)

for segment in speech_timestamps:
    print(f"Speech: {segment['start']:.2f}s - {segment['end']:.2f}s")

Output:

Speech: 0.10s - 3.42s
Speech: 5.68s - 12.15s
Speech: 14.20s - 18.90s

That's it. Three lines of real code to get speech timestamps from any audio file.

Real-Time Streaming with VADIterator

For real-time applications (live microphone input, streaming audio), use Silero's VADIterator:

import torch
from silero_vad import load_silero_vad, VADIterator

model = load_silero_vad()
vad_iterator = VADIterator(model, sampling_rate=16000)

# Simulate streaming; in production, read chunks from the microphone.
# At 16kHz the model expects exactly 512 samples per chunk (256 at 8kHz).
def process_audio_stream(audio_chunks):
    for chunk in audio_chunks:
        speech_dict = vad_iterator(chunk)
        if speech_dict:
            # yields {'start': sample_idx} when speech begins,
            # then {'end': sample_idx} when it ends
            print(speech_dict)

    vad_iterator.reset_states()  # reset after each audio session

One important note: always call model.reset_states() or vad_iterator.reset_states() between separate audio files. The model is stateful — it carries context between chunks for better accuracy, but that context needs to reset between different recordings.

Approach 3: pyannote.audio (Research-Grade, Feature-Rich)

pyannote.audio is a deep learning toolkit for speaker diarization that includes a powerful VAD pipeline. It's the heaviest option but also the most feature-rich — if you need speaker diarization alongside VAD, pyannote is the way to go.

Best for: Research, speaker diarization pipelines, when accuracy matters more than speed.

Install

pip install pyannote.audio

You'll also need a Hugging Face token and to accept the model terms on the pyannote model page.

Basic Usage

from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/voice-activity-detection",
    use_auth_token="YOUR_HF_TOKEN",
)

output = pipeline("recording.wav")

for speech_turn, _, _ in output.itertracks(yield_label=True):
    print(f"Speech: {speech_turn.start:.2f}s - {speech_turn.end:.2f}s")

pyannote loads larger models and typically requires a GPU for reasonable speed. For pure VAD without diarization, Silero is more practical. But if you're building a pipeline that needs both VAD and speaker identification, pyannote handles everything in one toolkit.

Which VAD Should You Use?

|              | webrtcvad                | Silero VAD                   | pyannote.audio             |
|--------------|--------------------------|------------------------------|----------------------------|
| Approach     | Signal processing (GMM)  | Deep learning (ONNX/PyTorch) | Deep learning (PyTorch)    |
| Accuracy     | Good in clean audio      | Excellent across conditions  | Excellent                  |
| Speed        | Fastest                  | < 1ms per chunk on CPU       | Slower, benefits from GPU  |
| Model size   | Tiny (C library)         | ~2MB                         | 100MB+                     |
| Dependencies | Minimal                  | PyTorch or ONNX Runtime      | PyTorch + Hugging Face     |
| Best for     | Edge, IoT, simple cases  | General-purpose production   | Research, diarization      |

The short answer: Use Silero VAD unless you have a specific reason not to. It hits the sweet spot of accuracy, speed, and ease of use.

The Full Pipeline: VAD + SpeakEasy Transcription

Here's where it gets useful. VAD detects the speech — now you need to transcribe it. The natural next step is to pipe those speech segments into a Speech-to-Text API.

SpeakEasy's STT API is 100% compatible with the OpenAI SDK, so if you've used OpenAI's transcription endpoint before, you already know how to use it. The only difference: point the base URL at SpeakEasy and pay a fraction of the price.

Here's a complete pipeline that uses Silero VAD to detect speech, extracts only the speech segments, and sends them to SpeakEasy for transcription:

import io
import torch
import torchaudio
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps
from openai import OpenAI

# Step 1: Load audio and detect speech
vad_model = load_silero_vad()
wav = read_audio("meeting.wav")

speech_timestamps = get_speech_timestamps(
    wav, vad_model, return_seconds=False  # return sample indices
)

# Step 2: Extract only the speech segments
SAMPLE_RATE = 16000
speech_chunks = []
for ts in speech_timestamps:
    speech_chunks.append(wav[ts["start"]:ts["end"]])

speech_only = torch.cat(speech_chunks)

total_duration = len(wav) / SAMPLE_RATE
speech_duration = len(speech_only) / SAMPLE_RATE
print(f"Original: {total_duration:.1f}s | Speech only: {speech_duration:.1f}s")
print(f"Reduced audio by {(1 - speech_duration/total_duration) * 100:.0f}%")

# Step 3: Save speech-only audio to a buffer
buffer = io.BytesIO()
torchaudio.save(buffer, speech_only.unsqueeze(0), SAMPLE_RATE, format="wav")
buffer.seek(0)

# Step 4: Transcribe with SpeakEasy
client = OpenAI(
    api_key="YOUR_SPEAKEASY_API_KEY",
    base_url="https://tryspeakeasy.io/api/v1",
)

transcript = client.audio.transcriptions.create(
    model="whisper-large-v3",
    file=("speech.wav", buffer),
)

print("\n--- Transcript ---")
print(transcript.text)

Output:

Original: 3600.0s | Speech only: 842.0s
Reduced audio by 77%

--- Transcript ---
Welcome everyone. Let's get the weekly sync started. I wanted to start
with the product update from this week...

A 77% reduction in audio sent to the API means faster transcription and lower cost. For a 1-hour meeting with typical talk-to-silence ratios, you're transcribing minutes instead of an hour.

Production Tips

Merge close segments. VAD sometimes splits a continuous sentence into multiple segments with tiny gaps. Add a minimum gap threshold (e.g., 300ms) — if two speech segments are closer than that, merge them.

Add padding. Trim speech segments too tightly and you'll clip the beginning and end of words. Add 100–200ms of padding before and after each segment.
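Both tips can be implemented in a few lines of plain Python over the {'start', 'end'} dicts Silero returns with return_seconds=True. A sketch (the 300ms gap and 150ms pad are starting points, not magic numbers):

```python
def merge_and_pad(segments, min_gap=0.3, pad=0.15, total=None):
    """Merge speech segments closer than min_gap seconds, then pad each
    segment by `pad` seconds, clamped to [0, total] when total is given."""
    merged = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        if merged and seg["start"] - merged[-1]["end"] < min_gap:
            # gap too small: extend the previous segment instead
            merged[-1]["end"] = max(merged[-1]["end"], seg["end"])
        else:
            merged.append(dict(seg))
    for seg in merged:
        seg["start"] = max(0.0, seg["start"] - pad)
        seg["end"] = seg["end"] + pad if total is None else min(total, seg["end"] + pad)
    return merged

segments = [
    {"start": 0.5, "end": 2.0},
    {"start": 2.1, "end": 4.0},   # 100 ms gap, merged with the previous segment
    {"start": 7.0, "end": 9.5},
]
print(merge_and_pad(segments, total=10.0))
```

Run the merge before the padding; padding first can close gaps that you then merge away twice.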

Batch your API calls. Instead of sending each speech segment as a separate request, concatenate them into a single audio file (with short silence gaps) and send one request. Fewer API calls, less overhead.
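The batching step, sketched with NumPy (the 200ms silence gap between segments is an arbitrary choice; any short gap works):

```python
import numpy as np

SR = 16000

def batch_segments(chunks, sr=SR, gap_s=0.2):
    """Join speech chunks into one array, separated by short silences,
    so they can be sent as a single transcription request."""
    gap = np.zeros(int(gap_s * sr), dtype=np.float32)
    parts = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            parts.append(gap)
        parts.append(chunk.astype(np.float32))
    return np.concatenate(parts)

# Three fake 1-second speech chunks
chunks = [np.full(SR, 0.1, dtype=np.float32) for _ in range(3)]
batched = batch_segments(chunks)
print(len(batched) / SR)  # 3 chunks + 2 gaps = 3.4 seconds
```

Write the result to a WAV buffer as in the pipeline above and send it as one request.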

Reset model state. If you're processing multiple files with Silero VAD, call model.reset_states() between files. The model is stateful and will carry context from the previous file otherwise.

Wrapping Up

Voice Activity Detection is the preprocessing step that most developers skip — and then wonder why their transcription pipeline is slow and expensive. The tools are mature, the code is straightforward, and the payoff is immediate.

For most Python developers, the stack is simple: Silero VAD for detection, SpeakEasy for transcription. Detect the speech, skip the silence, transcribe what matters.


Ready to build? Create your SpeakEasy account — $1 first month, 50 hours of transcription included — and start transcribing in minutes. Check out the speech-to-text documentation for the complete API reference.
