
What is VAD and Diarization With Whisper Models (A Complete Guide)?

Jun 30, 2025 · 6 Min Read
Written by Shubhanshu Navadiya

Have you ever been impressed by how a voice assistant knows exactly when to start listening, neither too early nor too late? Or read a podcast transcript that correctly attributes each line to a different speaker, even though no speakers were labelled in advance?

These are everyday examples of audio processing where two core capabilities of audio AI come together: Voice Activity Detection (VAD) and Speaker Diarization.

In this blog, we’ll break down the core concepts of Voice Activity Detection (VAD) and Speaker Diarization. Using powerful tools like Faster-Whisper and Resemblyzer, we'll walk through basic code examples that show, step by step, how to detect speech segments and identify who is speaking.

What is Voice Activity Detection (VAD)?

Voice Activity Detection is a signal processing technique used to identify the presence or absence of human speech in an audio signal.

It detects whether someone is speaking or whether the audio is silent, distinguishing speech segments from non-speech segments (such as silence, background noise, or music) so that systems can focus only on the parts of the audio that contain spoken words.
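To make the idea concrete, here is a minimal, hand-rolled energy-based VAD sketch. It is purely illustrative (the function name, frame length, and threshold are arbitrary choices for this example); production systems such as Faster-Whisper rely on a trained neural VAD (Silero) rather than a simple energy rule.

import numpy as np
import soundfile as sf

def simple_energy_vad(audio, sr, frame_ms=30, energy_threshold=0.01):
    """Return (start_s, end_s) regions whose RMS energy exceeds the threshold."""
    frame_len = int(sr * frame_ms / 1000)
    regions, start = [], None
    for i in range(0, len(audio) - frame_len, frame_len):
        rms = np.sqrt(np.mean(audio[i:i + frame_len] ** 2))
        if rms > energy_threshold and start is None:
            start = i / sr                      # speech begins
        elif rms <= energy_threshold and start is not None:
            regions.append((start, i / sr))     # speech ends
            start = None
    if start is not None:
        regions.append((start, len(audio) / sr))
    return regions

audio, sr = sf.read("your_audio_file.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)                  # convert stereo to mono
print(simple_energy_vad(audio, sr))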

Advantages of VAD (Voice Activity Detection)

Eliminates Non-Speech Audio: Removes silence and background noise to ensure only spoken sounds are processed.

Increased Accuracy: Removes non-speech audio from the input to speech recognition systems, reducing the risk of transcription errors.

Increased Efficiency: Only the speech portions of the audio are processed, reducing computational load and energy usage, which is especially important for real-time systems.

Reduced Latency: Speech is detected almost immediately, enabling faster responses in voice assistants and other live applications.

Disadvantages of Voice Activity Detection (VAD)

False Positives in Noisy Environments: VAD can mistake non-speech sounds for speech in crowded or noisy environments, leading to wasted processing.

Missed Speech in Low-Volume or Whispered Input: Quiet or whispered speech can be confused with silence, resulting in parts of the speech being missed.

Limited Speaker Awareness: VAD does not identify who is speaking; it only detects that speech occurs. Tracking speakers requires additional techniques such as diarization.

VAD with Faster Whisper

Whisper, OpenAI's family of automatic speech recognition (ASR) models, delivers high accuracy and multilingual transcription. However, Whisper transcribes everything it is given, including silence and background noise. For real-world recordings, such as meetings or podcasts, this is often undesirable.

Faster-Whisper is an optimized reimplementation of Whisper that also offers built-in Voice Activity Detection (VAD), letting it skip non-speech audio and produce cleaner, more efficient transcription output.

Important Libraries

!pip install faster-whisper gradio soundfile resampy numpy

Code Below

from faster_whisper import WhisperModel
import soundfile as sf
import resampy
import numpy as np


# Load the Faster-Whisper Large-v3 model
model = WhisperModel("large-v3", device="cuda", compute_type="float16")


# Load and resample audio to 16kHz mono
audio_data, sr = sf.read("your_audio_file.wav")
if len(audio_data.shape) > 1:
    audio_data = np.mean(audio_data, axis=1)  # Convert stereo to mono
audio_16k = resampy.resample(audio_data, sr, 16000).astype(np.float32)


# Transcribe using Voice Activity Detection (VAD)
segments, info = model.transcribe(
    audio_16k,
    vad_filter=True,
    vad_parameters={
        "threshold": 0.5,
        "min_speech_duration_ms": 250,
        "max_speech_duration_s": 20,
        "min_silence_duration_ms": 200
    }
)


# Format and print the transcript with timestamps
for segment in segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s]: {segment.text.strip()}")

Key Libraries and Model Configuration Used in Faster-Whisper ASR

  • faster_whisper to load and run the ASR model.
  • soundfile and resampy to load and convert audio to 16kHz (required by Whisper).
  • numpy for basic audio processing.


We use the "large-v3" version of Whisper here for higher accuracy. You can switch to "base", "small", or "medium" if you are running on a CPU or have limited resources.
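For example, a CPU-friendly configuration might look like the sketch below (int8 quantization is one of the compute types faster-whisper supports; the exact choice depends on your hardware):

from faster_whisper import WhisperModel

# Smaller model with int8 quantization for CPU-only machines
cpu_model = WhisperModel("base", device="cpu", compute_type="int8")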

VAD Parameters

Parameter | Meaning
threshold | Confidence threshold (0.0–1.0); higher values are stricter about what counts as speech
min_speech_duration_ms | Minimum duration (in ms) for a segment to be treated as valid speech
max_speech_duration_s | Maximum duration (in seconds) allowed for a single speech segment
min_silence_duration_ms | Minimum silence (in ms) required between two speech segments
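As a quick sanity check, recent faster-whisper versions report how much audio the VAD filter removed through the returned info object (attribute names may vary slightly between versions). Reusing the model and audio_16k from the code above:

segments, info = model.transcribe(audio_16k, vad_filter=True)
print(f"Original duration: {info.duration:.1f}s")
print(f"Duration after VAD: {info.duration_after_vad:.1f}s")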

Now that we have covered the practical application of Voice Activity Detection, let's dive into Speaker Diarization.


What is Speaker Diarization?

Speaker Diarization is the process of partitioning an audio stream into segments according to who is speaking at each point in time. It is often referred to as the "who spoke when" problem.

This involves:

  • Identifying distinct speakers.
  • Assigning speaker labels like Speaker 1, Speaker 2, etc.
  • Outputting a transcript or timeline showing which speaker spoke at each point in time

Advantages of Speaker Diarization

Identifies "Who Spoke When": Provides structure by using labels to denote speakers in meetings, interviews, and conversations.

Improves the Readability of Transcriptions: Associates speech with speakers, so that transcripts are better organized and easier to follow.

Enables Speaker-Based Analytics: Permits analysis of participation, speaking time, and interaction types across different contexts (for business, legal, or customer service applications).

Enables Multi-Speaker Applications: Supports the development of smart tools for collaborative environments (group discussions, podcasts, courtrooms).

Disadvantages of Speaker Diarization

Requires High-Quality Audio: Diarization accuracy drops significantly when speakers overlap or the recording quality is poor, leading to misidentified speakers or merged segments.

May Struggle with Similar Voices: Diarization systems can confuse speakers with similar-sounding voices, leading to inaccurate speaker segmentation.

No Built-in Transcription: Diarization segments speakers but does not transcribe the speech uttered. Therefore, it must be paired with an ASR (Automatic Speech Recognition) system to achieve full transcription capabilities.

Speaker Diarization with Whisper and Resemblyzer

What is Resemblyzer?

Resemblyzer is a Python library designed to produce voice embeddings: fixed-length vectors that capture a speaker’s unique voice characteristics.

The embeddings can be used for the following:

  • Comparing voice samples from different audio recordings (see the sketch after this list).
  • Clustering speech by the identity of the speaker.
  • Carrying out unsupervised speaker diarization without the need for speaker-labelled data.
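As a quick illustration, the sketch below compares two recordings by the similarity of their embeddings. The file names are placeholders; Resemblyzer's embeddings are approximately unit-length, so a dot product behaves like a cosine similarity.

import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

# preprocess_wav loads and normalizes an audio file for the encoder
emb_a = encoder.embed_utterance(preprocess_wav("speaker_a.wav"))  # placeholder file
emb_b = encoder.embed_utterance(preprocess_wav("speaker_b.wav"))  # placeholder file

# Dot product of (near) unit-length embeddings acts as a cosine similarity
similarity = float(np.dot(emb_a, emb_b))
print(f"Voice similarity: {similarity:.3f} (closer to 1.0 = more likely the same speaker)")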

How Does Resemblyzer Work?

To understand the pipeline a little better, here is a simplified overview of speaker diarization with Whisper and Resemblyzer:


1. Audio Input: Load audio and convert to 16 kHz.

2. Transcription: Whisper transcribes the audio and outputs segment timestamps.

3. Embedding Generation: Each segment is sent through Resemblyzer, which extracts a voice embedding.

4. Clustering: The embeddings are clustered (e.g., using K-Means) to group segments of the same speaker together.

5. Labelling: The segments are labelled by speaker (e.g., Speaker 1, Speaker 2).

6. Output: You get a structured transcript that is attributed by the speaker. 

Important Libraries

!pip install openai-whisper resemblyzer librosa scikit-learn

Code Below:

import whisper
import numpy as np
import librosa
from resemblyzer import VoiceEncoder
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Load models
whisper_model = whisper.load_model("large")
encoder = VoiceEncoder()

# Load audio and run transcription
audio_path = "your_audio_file.wav"
wav, sr = librosa.load(audio_path, sr=16000)
result = whisper_model.transcribe(audio_path)
segments = result.get("segments", [])

# Extract speaker embeddings from each segment
embeddings, valid_segments = [], []
for seg in segments:
    start, end = seg["start"], seg["end"]
    audio_seg = wav[int(start * sr):int(end * sr)]
    if len(audio_seg) > 0:
        emb = encoder.embed_utterance(audio_seg)
        embeddings.append(emb)
        valid_segments.append(seg)

# Check if enough segments are found
if len(embeddings) < 2:
    raise ValueError("Not enough speech segments detected for diarization.")

# Automatically determine optimal number of speakers
best_k, best_score = 2, -1
# silhouette_score requires 2 <= k <= n_samples - 1 clusters
for k in range(2, min(10, len(embeddings) - 1) + 1):
    kmeans = KMeans(n_clusters=k, random_state=0, n_init="auto")
    labels = kmeans.fit_predict(embeddings)
    score = silhouette_score(embeddings, labels)
    if score > best_score:
        best_k, best_score = k, score

# Final clustering with optimal k
kmeans = KMeans(n_clusters=best_k, random_state=0, n_init="auto")
labels = kmeans.fit_predict(embeddings)

# Explicitly map the first detected speaker as Speaker 1
first_label = labels[0]
label_mapping = {first_label: 1}
current_speaker = 2
for lbl in set(labels):
    if lbl != first_label:
        label_mapping[lbl] = current_speaker
        current_speaker += 1

# Print diarized transcript with timestamps
for seg, label in zip(valid_segments, labels):
    speaker_id = label_mapping[label]
    print(f"Speaker {speaker_id} [{seg['start']:.2f}s - {seg['end']:.2f}s]: {seg['text']}")

Model Configuration and the Resemblyzer Diarization Approach

  • Used OpenAI Whisper (large) to transcribe audio and produce timestamped speech segments.
  • Used Resemblyzer to extract speaker embeddings from each speech segment.
  • Applied KMeans clustering on the embeddings in order to group speech segments by speaker identity (speaker diarization).
  • Labelled the first detected speaker as Speaker 1 and numbered the remaining speakers incrementally (Speaker 2, Speaker 3, and so on), so that speaker labels follow the order of appearance rather than being assigned arbitrarily.
  • librosa is a popular Python library for audio analysis and music/speech signal processing. It provides easy access to a rich set of audio features useful for modelling speaker identity and voice patterns.
  • The silhouette score measures how well data points fit within their assigned clusters compared to other clusters, which helps evaluate clustering quality. It is used here to choose a number of speakers that gives a clear, meaningful separation between voice embeddings, as the toy example below shows.
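As a toy illustration (the 2-D points below are made up for the example), well-separated clusters yield a silhouette score close to 1.0:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two clearly separated toy "embedding" groups (made-up 2-D points)
points = np.array([[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
                   [5.0, 5.1], [5.1, 5.0], [5.05, 5.05]])

labels = KMeans(n_clusters=2, random_state=0, n_init="auto").fit_predict(points)
print(silhouette_score(points, labels))  # close to 1.0 for well-separated clusters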

This pipeline works well for multi-speaker transcription tasks, such as:

  • Meeting transcriptions with labelled speakers
  • Podcast editing with host/guest distinctions
  • Customer service call review for quality monitoring

Output:

Speaker 1 [0.00s - 2.34s]: Hello, how are you?
Speaker 2 [2.34s - 4.12s]: I'm good, thanks!

Conclusion

Voice Activity Detection (VAD) and Speaker Diarization are crucial building blocks for intelligent audio applications. VAD eliminates silence and noise so applications can focus only on spoken content. Diarization determines who was speaking and when, organizing the conversation with a label for each speaker.

The integration of VAD and Speaker Diarization transforms raw audio into actionable data. By detecting when speech occurs and identifying who is speaking, these techniques form a powerful duo that enables accurate transcription, speaker tracking, and smarter audio analysis in any multi-speaker environment.

Author: Shubhanshu Navadiya

Passionate about AI and machine learning innovations, exploring the future of technology and its impact on society. Join me on this journey of discovery and growth.

