Have you ever interacted with a voice assistant and been impressed by how it knows exactly when to start listening to you, neither too early nor too late? Or read a podcast transcript that correctly labels multiple speakers without any of them being identified in advance?
These are typical audio-processing scenarios where two core capabilities of audio AI come together: Voice Activity Detection (VAD) and Speaker Diarization.
In this blog, we’ll break down the core concepts of Voice Activity Detection (VAD) and Speaker Diarization. Using powerful tools like Faster-Whisper and Resemblyzer, we'll implement basic code examples to show how you can detect speech segments and identify who is speaking step-by-step.
Voice Activity Detection is a signal processing technique used to identify the presence or absence of human speech in an audio signal.
It detects whether someone is speaking or whether there is silence, distinguishing speech segments from non-speech segments (like silence, background noise, or music) so that systems can focus only on the parts of the audio that contain spoken words.
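To build intuition before reaching for a full model, here is a minimal sketch of the idea behind VAD, assuming a simple energy threshold rather than the neural VAD used by production systems: each short frame is flagged as speech when its loudness crosses a cutoff. The file name and the threshold value are placeholders you would tune for your own recording.
import numpy as np
import soundfile as sf

# Toy energy-based VAD: split audio into 30 ms frames and flag a frame as
# speech when its RMS energy exceeds a threshold.
# "your_audio_file.wav" and the threshold are placeholders, not recommendations.
audio, sr = sf.read("your_audio_file.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)              # stereo -> mono

frame_len = int(0.03 * sr)                  # 30 ms per frame
threshold = 0.02                            # RMS energy cutoff (tune per recording)

flags = []
for start in range(0, len(audio) - frame_len, frame_len):
    frame = audio[start:start + frame_len]
    rms = np.sqrt(np.mean(frame ** 2))
    flags.append(rms > threshold)

print(f"{sum(flags)} of {len(flags)} frames flagged as speech")
Real VAD models learn far more robust features than raw energy, but the basic decision (speech vs. not speech, frame by frame) is the same.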
VAD matters for several reasons:
Eliminates Non-Speech Audio: Removes silence and background noise so that only spoken sounds are processed.
Increased Accuracy: Cleans non-speech audio from the input to speech recognition systems, reducing the risk of transcription errors.
Increased Efficiency: Only the speech portions of the audio are processed, reducing computational load and energy usage, which is especially important for real-time systems.
Reduced Latency: Speech is detected as soon as it starts, enabling faster responses in voice assistants and other live applications.
That said, VAD has some limitations:
False Positives in Noisy Environments: It can mistake non-speech for speech in crowded or noisy settings, leading to wasted processing.
Missed Speech in Low-Volume or Whispered Input: Quiet or whispered speech can be mistaken for silence, causing parts of the speech to be missed.
Limited Speaker Awareness: VAD does not identify who is speaking; it only detects that speech occurs. To track speakers, additional tools like diarization are needed.
Whisper by OpenAI is a powerful automatic speech recognition (ASR) model that offers high accuracy and multilingual transcription. However, Whisper will try to transcribe everything, including silence and background noise. For real-world scenarios like a meeting or a podcast, this may not be desirable.
Faster-Whisper is a fast, optimized reimplementation of the Whisper model, and it adds built-in Voice Activity Detection (VAD) to ignore non-speech audio and produce cleaner, more efficient transcription output.
Important Libraries
!pip install faster-whisper gradio soundfile resampy numpy
Code Below:
from faster_whisper import WhisperModel
import soundfile as sf
import resampy
import numpy as np
# Load the Faster-Whisper Large-v3 model
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
# Load and resample audio to 16kHz mono
audio_data, sr = sf.read("your_audio_file.wav")
if len(audio_data.shape) > 1:
    audio_data = np.mean(audio_data, axis=1)  # Convert stereo to mono
audio_16k = resampy.resample(audio_data, sr, 16000).astype(np.float32)
# Transcribe using Voice Activity Detection (VAD)
segments, info = model.transcribe(
    audio_16k,
    vad_filter=True,
    vad_parameters={
        "threshold": 0.5,
        "min_speech_duration_ms": 250,
        "max_speech_duration_s": 20,
        "min_silence_duration_ms": 200
    }
)
# Format and print the transcript with timestamps
for segment in segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s]: {segment.text.strip()}")
Here we use the "large-v3" version of Whisper for higher accuracy. You can also use "base", "small", or "medium" if you're on a CPU or have limited resources.
| Parameter | Meaning |
| --- | --- |
| threshold | Confidence threshold (0.0–1.0). Higher = stricter speech detection |
| min_speech_duration_ms | Minimum duration (in ms) to treat as valid speech |
| max_speech_duration_s | Maximum duration (in seconds) allowed in one segment |
| min_silence_duration_ms | Minimum silence between segments (in ms) |
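As a quick illustration of tuning these parameters, the snippet below (reusing model and audio_16k from the code above) shows a stricter configuration you might try for noisier recordings; the exact values are only examples, not recommendations.
# Illustrative, stricter VAD settings for noisier audio (values are examples to tune)
segments, info = model.transcribe(
    audio_16k,
    vad_filter=True,
    vad_parameters={
        "threshold": 0.7,               # require higher confidence before calling it speech
        "min_speech_duration_ms": 400,  # ignore very short blips
        "max_speech_duration_s": 15,    # split long stretches into smaller segments
        "min_silence_duration_ms": 500  # require a longer pause before ending a segment
    }
)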
Now that we have covered the practical application of Voice Activity Detection, let's dive into Speaker Diarization.
Speaker Diarization is the process of partitioning an audio stream into segments according to who is speaking and when. It is often referred to as the "who spoke when" problem.
Speaker Diarization matters because it:
Identifies "Who Spoke When": Provides structure by using labels to denote speakers in meetings, interviews, and conversations.
Improves the Readability of Transcriptions: Associates speech with speakers, so that the transcripts are more orderly and succinct.
Enables Speaker-Based Analytics: Permits analysis of participation, speaking time, and interaction types across different contexts (for business, legal, or customer service applications).
Assures Multi-Speakers Applications: Supports the development of smart tools in collaborative environments (group discussions, podcasts, courtrooms).
However, diarization comes with its own limitations:
Requires High-Quality Audio: Accuracy drops significantly when speech overlaps or the recording quality is poor, leading to misidentified speakers or merged segments.
May Struggle with Similar Voices: Diarization systems can confuse speakers with similar voices, producing inaccurate speaker segmentation.
No Built-in Transcription: Diarization segments the audio by speaker but does not transcribe what was said, so it must be paired with an ASR (Automatic Speech Recognition) system for full transcription.
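To make the speaker-based analytics point above concrete, here is a minimal sketch that sums speaking time per speaker. It assumes you already have diarized segments shaped as (speaker_id, start, end) tuples, like the output of the pipeline later in this post; the sample values are made up.
from collections import defaultdict

# Minimal speaking-time analytics over diarized segments.
# The (speaker_id, start_sec, end_sec) tuples below are illustrative placeholders.
diarized = [
    (1, 0.00, 2.34),
    (2, 2.34, 4.12),
    (1, 4.12, 7.80),
]

speaking_time = defaultdict(float)
for speaker, start, end in diarized:
    speaking_time[speaker] += end - start

total = sum(speaking_time.values())
for speaker, seconds in sorted(speaking_time.items()):
    print(f"Speaker {speaker}: {seconds:.2f}s ({100 * seconds / total:.1f}% of speech)")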
Resemblyzer is a Python library designed to produce voice embeddings: fixed-length vectors that capture a speaker's unique voice characteristics.
These embeddings can be used for tasks such as speaker verification, voice similarity comparison, and clustering audio segments by speaker, which is exactly what diarization needs.
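For a feel of how these embeddings behave, the short sketch below embeds three clips and compares them. Because Resemblyzer's embeddings are L2-normalized, a plain dot product acts as a cosine similarity, and clips from the same speaker should score noticeably higher than clips from different speakers. The file names are placeholders.
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

# Placeholder file names: two clips from one speaker, one clip from another
emb_a = encoder.embed_utterance(preprocess_wav("speaker_a_clip1.wav"))
emb_b = encoder.embed_utterance(preprocess_wav("speaker_a_clip2.wav"))
emb_c = encoder.embed_utterance(preprocess_wav("speaker_b_clip1.wav"))

# Embeddings are L2-normalized, so the dot product is the cosine similarity
print("same speaker:     ", float(np.dot(emb_a, emb_b)))
print("different speaker:", float(np.dot(emb_a, emb_c)))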
To understand the pipeline a little better, here is a simplified overview of speaker diarization with Whisper and Resemblyzer:
1. Audio Input: Load audio and convert to 16 kHz.
2. Transcription: Whisper transcribes the audio and outputs segment timestamps.
3. Embedding Generation: Each segment is sent through Resemblyzer, which extracts a voice embedding.
4. Clustering: The embeddings are clustered (e.g., using K-Means) to group segments of the same speaker together.
5. Labelling: The segments are labelled by speaker (e.g., Speaker 1, Speaker 2).
6. Output: You get a structured transcript attributed to each speaker.
Important Libraries
!pip install openai-whisper resemblyzer librosa scikit-learn
Code Below:
import whisper
import numpy as np
import librosa
from resemblyzer import VoiceEncoder
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Load models
whisper_model = whisper.load_model("large")
encoder = VoiceEncoder()
# Load audio and run transcription
audio_path = "your_audio_file.wav"
wav, sr = librosa.load(audio_path, sr=16000)
result = whisper_model.transcribe(audio_path)
segments = result.get("segments", [])
# Extract speaker embeddings from each segment
embeddings, valid_segments = [], []
for seg in segments:
    start, end = seg["start"], seg["end"]
    audio_seg = wav[int(start * sr):int(end * sr)]
    if len(audio_seg) > 0:
        emb = encoder.embed_utterance(audio_seg)
        embeddings.append(emb)
        valid_segments.append(seg)
# Check if enough segments are found
if len(embeddings) < 2:
    raise ValueError("Not enough speech segments detected for diarization.")
# Automatically determine optimal number of speakers
best_k, best_score = 2, -1
for k in range(2, min(10, len(embeddings) - 1) + 1):  # silhouette needs k < number of segments
    kmeans = KMeans(n_clusters=k, random_state=0, n_init="auto")
    labels = kmeans.fit_predict(embeddings)
    score = silhouette_score(embeddings, labels)
    if score > best_score:
        best_k, best_score = k, score
# Final clustering with optimal k
kmeans = KMeans(n_clusters=best_k, random_state=0, n_init="auto")
labels = kmeans.fit_predict(embeddings)
# Explicitly map the first detected speaker as Speaker 1
first_label = labels[0]
label_mapping = {first_label: 1}
current_speaker = 2
for lbl in set(labels):
    if lbl != first_label:
        label_mapping[lbl] = current_speaker
        current_speaker += 1
# Print diarized transcript with timestamps
for seg, label in zip(valid_segments, labels):
    speaker_id = label_mapping[label]
    print(f"Speaker {speaker_id} [{seg['start']:.2f}s - {seg['end']:.2f}s]: {seg['text'].strip()}")
This pipeline works well for multi-speaker transcription tasks such as meetings, interviews, podcasts, and customer support calls.
Sample output:
Speaker 1 [0.00s - 2.34s]: Hello, how are you?
Speaker 2 [2.34s - 4.12s]: I'm good, thanks!
Voice Activity Detection (VAD) and Speaker Diarization are crucial building blocks for intelligent, speech-aware audio applications. VAD filters out silence and noise so applications can pay attention only to spoken content, while diarization determines who was speaking and when, organizing the conversation with a label for each speaker.
The integration of VAD and Speaker Diarization transforms raw audio into actionable data. By detecting when speech occurs and identifying who is speaking, these techniques form a powerful duo that enables accurate transcription, speaker tracking, and smarter audio analysis in any multi-speaker environment.
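As a closing sketch, here is one way the two pieces could be wired together, reusing the model, encoder, and audio_16k objects from the earlier examples: Faster-Whisper's VAD filter yields clean speech segments, and Resemblyzer embeddings clustered with K-Means assign a speaker to each. The two-speaker assumption (n_clusters=2) is illustrative, not a general solution.
import numpy as np
from sklearn.cluster import KMeans

# Assumes `model` (Faster-Whisper), `encoder` (Resemblyzer VoiceEncoder) and
# `audio_16k` (16 kHz mono float32 array) from the earlier examples, and that
# the recording contains at least two segments and exactly two speakers.
segments, _ = model.transcribe(audio_16k, vad_filter=True)
segments = list(segments)

embeddings = []
for seg in segments:
    clip = audio_16k[int(seg.start * 16000):int(seg.end * 16000)]
    embeddings.append(encoder.embed_utterance(clip))

labels = KMeans(n_clusters=2, random_state=0, n_init="auto").fit_predict(np.vstack(embeddings))

for seg, label in zip(segments, labels):
    print(f"Speaker {label + 1} [{seg.start:.2f}s - {seg.end:.2f}s]: {seg.text.strip()}")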