
If you’ve ever used a voice assistant and noticed it starts listening at the right moment, or looked at a podcast transcript that correctly separates speakers, you’ve seen two different audio skills working behind the scenes. I ran into this distinction the hard way while building transcription flows: I could get accurate words from Whisper, but the experience still felt “wrong” until I handled when to listen and who was speaking.
That’s where Voice Activity Detection (VAD) and Speaker Diarization come in. In this blog, I’ll break down both concepts clearly and show how I implement them using tools like Faster-Whisper and Resemblyzer, with practical code so you can detect speech segments and label speakers step-by-step.
| Aspect | Voice Activity Detection (VAD) | Speaker Diarization |
| --- | --- | --- |
| Primary purpose | Detects when speech occurs | Identifies who is speaking |
| Focus | Speech vs non-speech | Speaker identity |
| Handles silence | Yes | No |
| Handles multiple speakers | No | Yes |
| Output | Speech segments | Speaker-labeled segments |
| Works independently | Yes | No (requires speech segments) |
| Role in Whisper pipeline | Pre-processing | Post-processing |
| Typical use cases | Noise removal, latency reduction | Meetings, interviews, call analysis |
This comparison captures the difference I wish I had clarified earlier: VAD is about timing (speech vs non-speech), and diarization is about attribution (who is speaking). Whisper-based systems feel incomplete unless you treat these as separate steps with different outputs and failure modes.
In Whisper-based pipelines, VAD and speaker diarization play complementary roles, and I’ve found it helps to think of them as pre-processing vs post-processing. I apply VAD first to filter silence and background noise, so Whisper spends its computation only on meaningful speech. After transcription, I apply diarization to group and label segments by speaker identity, which is what makes meeting and podcast transcripts actually readable.
This VAD + diarization flow is common in meeting transcription, interviews, podcasts, and call analysis, where you need both clean text and speaker attribution.

Voice Activity Detection (VAD) is a technique that determines whether an audio signal contains human speech. In practical systems, it's the piece that prevents me from wasting transcription time on silence, background noise, or music. VAD distinguishes speech segments from non-speech segments (silence, background noise, music), so the pipeline can focus only on the parts of the audio that contain spoken words.
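To build intuition for what VAD does, here's a toy energy-based sketch. This is not what Faster-Whisper uses internally (its VAD filter is based on the trained Silero VAD model); it's only a minimal illustration of the same thresholding idea, with a synthetic signal standing in for real audio:

```python
import numpy as np

def energy_vad(signal, sr, frame_ms=30, threshold=0.01):
    """Flag each fixed-size frame as speech/non-speech by RMS energy.
    A toy heuristic for intuition only, not a production VAD."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    decisions = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        rms = np.sqrt(np.mean(frame ** 2))  # frame energy
        decisions.append(bool(rms > threshold))
    return decisions

# Synthetic "audio": 0.5s silence, 0.5s tone (stand-in for speech), 0.5s silence
sr = 16000
silence = np.zeros(sr // 2)
tone = 0.5 * np.sin(2 * np.pi * 220 * np.arange(sr // 2) / sr)
signal = np.concatenate([silence, tone, silence])

flags = energy_vad(signal, sr)
# Frames in the middle (the tone) are flagged True; the silent edges are False
```

A real neural VAD replaces the energy threshold with a model score, which is what makes it robust to background noise where raw energy alone would misfire.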
Eliminates Non-Speech Audio: Removes silence and background noise so only spoken content is processed.
Increased Accuracy: Reduces transcription errors by preventing non-speech from being interpreted as speech.
Increased Efficiency: Cuts compute and energy usage by processing only speech segments (especially important in real-time).
Reduced Latency: Helps systems respond faster because they detect speech boundaries early.
False Positives in Noisy Environments: In crowded audio, I’ve seen VAD trigger on non-speech sounds and waste processing.
Missed Speech in Low-Volume Input: Quiet speech can be treated as silence, which can drop words or entire phrases.
Speaker Awareness is Limited: VAD doesn’t identify who is speaking—it only detects whether speech exists. Diarization is still required for speaker labels.
Whisper models are highly accurate for ASR, but in raw form, they’ll still try to transcribe everything, including silence and non-speech noise. In real meeting or podcast audio, I’ve found that leads to messy outputs and wasted compute. Faster-Whisper helps here because it’s optimized for speed and supports built-in VAD filtering, so Whisper focuses on speech segments and produces cleaner, more efficient transcripts.
Let’s implement it.
Important Libraries
!pip install faster-whisper gradio soundfile resampy numpy

Code Below
```python
from faster_whisper import WhisperModel
import soundfile as sf
import resampy
import numpy as np

# Load the Faster-Whisper Large-v3 model
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Load and resample audio to 16kHz mono
audio_data, sr = sf.read("your_audio_file.wav")
if len(audio_data.shape) > 1:
    audio_data = np.mean(audio_data, axis=1)  # Convert stereo to mono
audio_16k = resampy.resample(audio_data, sr, 16000).astype(np.float32)

# Transcribe using Voice Activity Detection (VAD)
segments, info = model.transcribe(
    audio_16k,
    vad_filter=True,
    vad_parameters={
        "threshold": 0.5,
        "min_speech_duration_ms": 250,
        "max_speech_duration_s": 20,
        "min_silence_duration_ms": 200
    }
)

# Format and print the transcript with timestamps
for segment in segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s]: {segment.text.strip()}")
```

This is how I think about the stack used here: Faster-Whisper handles ASR, and the rest of the libraries make the audio Whisper-ready.
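If you want the timestamped segments as subtitles rather than console output, a small formatter is all you need. `format_timestamp` is a hypothetical helper I'm sketching here (not part of Faster-Whisper); it converts a segment's start/end in seconds to the `HH:MM:SS,mmm` form used by SRT files:

```python
def format_timestamp(seconds):
    """Convert seconds (float) to an SRT-style HH:MM:SS,mmm string."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)   # hours
    m, ms = divmod(ms, 60_000)      # minutes
    s, ms = divmod(ms, 1000)        # seconds / milliseconds
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

print(format_timestamp(2.34))    # -> 00:00:02,340
print(format_timestamp(3661.5))  # -> 01:01:01,500
```

Pairing this with the segment loop above gives you subtitle-ready output with no extra dependencies.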
I used large-v3 for accuracy, but I switched to base/medium/small when running on CPU or when latency matters more than perfect transcription.
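One way to express that tradeoff is a small config helper. The model names are real Whisper sizes and the compute types (`float16` on GPU, `int8` on CPU) are standard Faster-Whisper choices, but the selection rule itself is just my rule of thumb, not a library API:

```python
def pick_model(device):
    """My rule of thumb for pairing model size with hardware.
    GPU: accuracy-first; CPU: latency-first with int8 quantization."""
    if device == "cuda":
        return {"model": "large-v3", "device": "cuda", "compute_type": "float16"}
    return {"model": "base", "device": "cpu", "compute_type": "int8"}

cfg = pick_model("cpu")
# Then load it the same way as above:
# model = WhisperModel(cfg["model"], device=cfg["device"], compute_type=cfg["compute_type"])
```

Swapping `"base"` for `"small"` or `"medium"` moves you along the same accuracy/latency curve without touching the rest of the pipeline.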
These parameters are the knobs I tune most when VAD feels “too sensitive” or “not sensitive enough.” Small changes here can be the difference between missing soft speech and incorrectly capturing background noise as speech.
| Parameter | Meaning |
| --- | --- |
| threshold | Confidence threshold (0.0–1.0); higher = stricter speech detection |
| min_speech_duration_ms | Minimum duration (in ms) to treat as valid speech |
| max_speech_duration_s | Maximum duration (in sec) allowed in one segment |
| min_silence_duration_ms | Minimum silence between segments (in ms) |
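As a concrete example of tuning these knobs, here are two presets I might start from. The parameter names match Faster-Whisper's `vad_parameters`, but the specific values are my starting points for experimentation, not recommended defaults:

```python
# Illustrative starting points, not recommended defaults
VAD_PRESETS = {
    # Noisy room: demand more confidence and longer speech before triggering
    "noisy": {
        "threshold": 0.6,
        "min_speech_duration_ms": 400,
        "max_speech_duration_s": 20,
        "min_silence_duration_ms": 300,
    },
    # Quiet or soft speakers: lower the bar so faint speech isn't dropped
    "soft_speech": {
        "threshold": 0.35,
        "min_speech_duration_ms": 150,
        "max_speech_duration_s": 20,
        "min_silence_duration_ms": 150,
    },
}

# Usage with the transcribe call from earlier:
# segments, info = model.transcribe(audio_16k, vad_filter=True,
#                                   vad_parameters=VAD_PRESETS["noisy"])
```

The pattern to notice: for noise you raise `threshold` and the minimum durations; for soft speech you lower them. Tune one knob at a time and re-listen to what gets dropped.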
Now that VAD is handling when speech happens, the next real-world requirement is usually who said what. That’s where speaker diarization comes in.
Speaker diarization is the process of partitioning an audio stream into segments based on who spoke when. When I’m working with meetings, interviews, or calls, diarization is what turns a transcript from a block of text into something you can actually read, search, and analyze.
Here's why diarization matters in practice:
Identifies “Who Spoke When”: Adds structure for meetings, interviews, and conversations.
Improves Transcript Readability: Speaker labels make transcripts easier to follow and review.
Enables Speaker-Based Analytics: Supports speaking-time analysis and interaction insights for business/legal/support workflows.
Supports Multi-Speaker Applications: Useful for podcasts, collaborative tools, and call review systems.
Requires High-Quality Audio: Overlap and low-quality recordings can cause speaker merging or mislabeling.
May Struggle with Similar Voices: Speakers with similar tone/timbre can confuse clustering.
No Built-in Transcription: Diarization needs ASR (like Whisper) to produce the actual text.
Resemblyzer is a Python library that produces voice embeddings, fixed-length vectors that capture a speaker’s voice characteristics. I use embeddings like this when I want diarization without speaker-labelled data, because the pipeline becomes: extract embeddings per segment, then cluster them into speakers.
These embeddings can be used for speaker verification, voice similarity comparison, and, as here, unsupervised diarization by clustering per-segment embeddings.
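Since Resemblyzer's embeddings are fixed-length, L2-normalized vectors (256-dimensional), comparing two voices comes down to cosine similarity. The sketch below uses random vectors as stand-ins for real embeddings, so the numbers only illustrate the mechanics:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for real embeddings (Resemblyzer's are 256-dim, L2-normalized)
rng = np.random.default_rng(0)
emb_a = rng.normal(size=256)
emb_a /= np.linalg.norm(emb_a)

# "Same speaker": a small perturbation of the first vector
emb_b = emb_a + 0.01 * rng.normal(size=256)
emb_b /= np.linalg.norm(emb_b)

# "Different speaker": an unrelated random vector
emb_c = rng.normal(size=256)
emb_c /= np.linalg.norm(emb_c)

print(cosine_similarity(emb_a, emb_b))  # high, close to 1
print(cosine_similarity(emb_a, emb_c))  # near 0
```

This same-voices-are-close geometry is exactly what lets KMeans clustering (used below) group segments by speaker without any labeled data.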
This is the diarization flow I follow when I'm keeping things simple and unsupervised: transcribe with Whisper, extract a Resemblyzer embedding for each segment, cluster the embeddings with KMeans, and map clusters to speaker labels. Let's see how to implement this.
Important Libraries
!pip install openai-whisper resemblyzer librosa scikit-learn

Code Below:
```python
import whisper
import numpy as np
import librosa
from resemblyzer import VoiceEncoder
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Load models
whisper_model = whisper.load_model("large")
encoder = VoiceEncoder()

# Load audio and run transcription
audio_path = "your_audio_file.wav"
wav, sr = librosa.load(audio_path, sr=16000)
result = whisper_model.transcribe(audio_path)
segments = result.get("segments", [])

# Extract speaker embeddings from each segment
embeddings, valid_segments = [], []
for seg in segments:
    start, end = seg["start"], seg["end"]
    audio_seg = wav[int(start * sr):int(end * sr)]
    if len(audio_seg) > 0:
        emb = encoder.embed_utterance(audio_seg)
        embeddings.append(emb)
        valid_segments.append(seg)

# Check if enough segments were found
if len(embeddings) < 2:
    raise ValueError("Not enough speech segments detected for diarization.")

# Automatically determine the optimal number of speakers
# (silhouette_score requires fewer clusters than samples, hence the -1)
best_k, best_score = 2, -1
for k in range(2, min(10, len(embeddings) - 1) + 1):
    kmeans = KMeans(n_clusters=k, random_state=0, n_init="auto")
    labels = kmeans.fit_predict(embeddings)
    score = silhouette_score(embeddings, labels)
    if score > best_score:
        best_k, best_score = k, score

# Final clustering with optimal k
kmeans = KMeans(n_clusters=best_k, random_state=0, n_init="auto")
labels = kmeans.fit_predict(embeddings)

# Explicitly map the first detected speaker as Speaker 1
first_label = labels[0]
label_mapping = {first_label: 1}
current_speaker = 2
for lbl in set(labels):
    if lbl != first_label:
        label_mapping[lbl] = current_speaker
        current_speaker += 1

# Print diarized transcript with timestamps
for seg, label in zip(valid_segments, labels):
    speaker_id = label_mapping[label]
    print(f"Speaker {speaker_id} [{seg['start']:.2f}s - {seg['end']:.2f}s]: {seg['text']}")
```

This pipeline works for multi-speaker transcription tasks such as meeting recordings, interviews, podcasts, and customer calls.
Output:
Speaker 1 [0.00s - 2.34s]: Hello, how are you?
Speaker 2 [2.34s - 4.12s]: I'm good, thanks!

In production systems, I rarely use VAD or diarization alone. VAD keeps the pipeline efficient by filtering non-speech, and diarization makes results usable by labeling speakers. Together, they're common in meeting transcription, call center analytics, podcast editing, and assistants, anywhere you need both clean transcripts and speaker-aware structure.
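One readability step I often add on top of diarized output like the example above: merge consecutive segments from the same speaker into single conversational turns. `merge_turns` is a hypothetical helper, sketched here over plain tuples rather than the Whisper segment dicts:

```python
def merge_turns(labeled_segments):
    """Merge consecutive segments with the same speaker into one turn.
    Each segment is a (speaker, start, end, text) tuple."""
    turns = []
    for speaker, start, end, text in labeled_segments:
        if turns and turns[-1][0] == speaker:
            # Same speaker as the previous turn: extend it
            prev = turns[-1]
            turns[-1] = (speaker, prev[1], end, prev[3] + " " + text)
        else:
            turns.append((speaker, start, end, text))
    return turns

segments = [
    (1, 0.00, 2.34, "Hello,"),
    (1, 2.34, 3.10, "how are you?"),
    (2, 3.10, 4.12, "I'm good, thanks!"),
]
for spk, start, end, text in merge_turns(segments):
    print(f"Speaker {spk} [{start:.2f}s - {end:.2f}s]: {text}")
# Speaker 1 [0.00s - 3.10s]: Hello, how are you?
# Speaker 2 [3.10s - 4.12s]: I'm good, thanks!
```

Whisper tends to split one person's speech into several short segments, so without this step a transcript repeats the same speaker label line after line.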
VAD tells me when speech happens (speech vs non-speech). Diarization tells me who is speaking during those speech segments by assigning speaker labels.
Whisper supports Voice Activity Detection through VAD filtering to remove non-speech audio. Speaker diarization is not built into Whisper and is typically implemented using external tools like Resemblyzer or Pyannote.
In a typical speech processing pipeline, VAD runs first to detect speech segments. After transcription, speaker diarization is applied to group and label speakers within those segments.
VAD is not strictly required, but it significantly improves diarization accuracy by removing silence and background noise before speaker embedding and clustering.
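One concrete way that filtering helps diarization: very short segments (a cough, a door slam that slipped past VAD) rarely yield a stable speaker embedding, so dropping them before clustering reduces mislabeled speakers. This is a sketch of my own pre-filter; the 0.5s cutoff is a rule of thumb, not a requirement of Resemblyzer or Whisper:

```python
def filter_for_embedding(segments, min_duration=0.5):
    """Drop segments too short to produce a reliable speaker embedding.
    The 0.5s default is my heuristic, not a library constraint."""
    return [s for s in segments if s["end"] - s["start"] >= min_duration]

segs = [
    {"start": 0.0, "end": 0.2, "text": "(cough)"},      # likely noise
    {"start": 0.5, "end": 3.1, "text": "Hello there"},  # real speech
]
print(filter_for_embedding(segs))
```

In the diarization pipeline above, this would run on Whisper's segments right before the embedding loop, so KMeans only ever sees embeddings computed from substantial speech.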
VAD and speaker diarization are two building blocks I rely on whenever I want a Whisper pipeline to feel production-ready. VAD removes silence and noise so the system focuses on speech, and diarization adds structure by labeling who spoke when. Together, they turn raw audio into transcripts that are both accurate and usable for multi-speaker scenarios.
When I combine these with strong speech-to-text models, I get a pipeline that supports clean transcription, speaker attribution, and downstream analysis: exactly what you need for meetings, interviews, podcasts, and call review workflows.