
What is VAD and Diarization With Whisper Models (A Complete Guide)?

Jul 1, 2025 · 6 Min Read
Written by Shubhanshu Navadiya

Have you ever interacted with a voice assistant and been impressed by how it knows exactly when to start listening, neither too early nor too late? Or read a podcast transcript that correctly attributes lines to multiple speakers, even though none of them were identified in advance?

These are typical audio-processing scenarios where two core capabilities of audio AI come together: Voice Activity Detection (VAD) and Speaker Diarization.

In this blog, we’ll break down the core concepts of Voice Activity Detection (VAD) and Speaker Diarization. Using powerful tools like Faster-Whisper and Resemblyzer, we'll implement basic code examples to show how you can detect speech segments and identify who is speaking step-by-step.

What is Voice Activity Detection (VAD)?

Voice Activity Detection is a signal processing technique used to identify the presence or absence of human speech in an audio signal.

It detects whether a person is speaking or the audio contains only silence. By distinguishing between speech segments and non-speech segments (like silence, background noise, or music), it allows systems to focus only on the parts of the audio that contain spoken words.
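To build intuition for what a VAD does, here is a minimal, hypothetical energy-based sketch in Python. Real systems (such as the Silero VAD used by Faster-Whisper) rely on trained neural models; the frame size and threshold below are illustrative assumptions only.

import numpy as np

def simple_energy_vad(audio, sample_rate=16000, frame_ms=30, energy_threshold=0.01):
    """Label fixed-size frames as speech/non-speech by comparing RMS energy to a threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    flags = []
    for start in range(0, len(audio) - frame_len + 1, frame_len):
        frame = audio[start:start + frame_len]
        rms = np.sqrt(np.mean(frame ** 2))    # root-mean-square energy of the frame
        flags.append(rms > energy_threshold)  # True = likely speech, False = silence/noise
    return flags

# Example: 1 second of silence followed by 1 second of a louder tone-like signal
sr = 16000
silence = np.zeros(sr)
tone = 0.1 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
audio = np.concatenate([silence, tone]).astype(np.float32)
print(simple_energy_vad(audio, sr)[:5], "...")  # the first frames are non-speech (False)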

Advantages of VAD (Voice Activity Detection)

Eliminates Non-Speech Audio: Removes silence and background noise to ensure only spoken sounds are processed.

Increased Accuracy: Removes non-speech audio from the input to speech recognition systems, reducing the risk of transcription errors.

Increased Efficiency: Only the speech portions of the audio are processed, reducing computational load and energy usage, which is especially important for real-time systems.

Reduced Latency: Speech is detected as soon as it begins, enabling faster responses in voice assistants and other live applications.

Disadvantages of Voice Activity Detection (VAD)

False Positives in Noisy Environments: It can confuse non-speech with speech in crowded or noisy environments, thus leading to wasted processing.

Missed Speech in Low-Volume or Whispered Input: Quiet or whispered speech can be confused with silence, resulting in parts of the speech being missed.

Speaker Awareness is Limited: VAD does not identify who is speaking; it only detects that speech occurs. To track speakers, additional techniques such as diarization are needed.

VAD with Faster Whisper

Whisper by OpenAI is a family of powerful automatic speech recognition (ASR) models that offer high accuracy and multilingual transcription. However, Whisper transcribes everything it is given, including silence and background noise. For real-world audio such as meetings or podcasts, this is often undesirable.

Faster-Whisper is an optimized reimplementation of the Whisper model that adds built-in Voice Activity Detection (VAD) to skip non-speech audio, producing cleaner and more efficient transcription output.

Let's see how to implement this.

Important Libraries

!pip install faster-whisper gradio soundfile resampy numpy

Code Below

from faster_whisper import WhisperModel
import soundfile as sf
import resampy
import numpy as np


# Load the Faster-Whisper Large-v3 model
model = WhisperModel("large-v3", device="cuda", compute_type="float16")


# Load and resample audio to 16kHz mono
audio_data, sr = sf.read("your_audio_file.wav")
if len(audio_data.shape) > 1:
    audio_data = np.mean(audio_data, axis=1)  # Convert stereo to mono
audio_16k = resampy.resample(audio_data, sr, 16000).astype(np.float32)


# Transcribe using Voice Activity Detection (VAD)
segments, info = model.transcribe(
    audio_16k,
    vad_filter=True,
    vad_parameters={
        "threshold": 0.5,
        "min_speech_duration_ms": 250,
        "max_speech_duration_s": 20,
        "min_silence_duration_ms": 200
    }
)


# Format and print the transcript with timestamps
for segment in segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s]: {segment.text.strip()}")

Key Libraries and Model Configuration Used in Faster-Whisper ASR

  • faster_whisper to load and run the ASR model.
  • soundfile and resampy to load and convert audio to 16kHz (required by Whisper).
  • numpy for basic audio processing.


We used the "large-v3" version of Whisper here for higher accuracy. You can also use "base", "small", or "medium" if you're on a CPU or have limited resources (see the sketch below).
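For example, a lighter configuration for a CPU-only machine might look like the following (the choice of the "base" model and int8 quantization is just one reasonable option):

from faster_whisper import WhisperModel

# Smaller model with int8 quantization: slower hardware, lower memory footprint
model = WhisperModel("base", device="cpu", compute_type="int8")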

VAD Parameters

  • threshold: Confidence threshold (0.0–1.0). Higher values are stricter about what counts as speech.
  • min_speech_duration_ms: Minimum duration (in ms) for a segment to be treated as valid speech.
  • max_speech_duration_s: Maximum duration (in seconds) allowed for a single speech segment.
  • min_silence_duration_ms: Minimum silence (in ms) between segments before a split is made.
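As an illustration, for a noisier recording you might raise the threshold and require longer pauses before splitting segments. The snippet below reuses the model and audio_16k from the code above; the values are assumptions to tune against your own audio, not recommended defaults:

# Stricter VAD settings for noisy audio (illustrative values, tune per recording)
segments, info = model.transcribe(
    audio_16k,
    vad_filter=True,
    vad_parameters={
        "threshold": 0.7,                # demand higher confidence before treating a frame as speech
        "min_speech_duration_ms": 400,   # drop very short blips that are likely noise
        "min_silence_duration_ms": 500   # require a longer pause before closing a segment
    }
)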

Now that we have covered the practical application of Voice Activity Detection, let's dive into Speaker Diarization.


What is Speaker Diarization?

Speaker Diarization is the process of partitioning an audio stream into segments according to speaker identity. It is often referred to as the "who spoke when" problem.

This involves:

  • Identifying distinct speakers.
  • Assigning speaker labels like Speaker 1, Speaker 2, etc.
  • Outputting a transcript or timeline showing which speaker spoke at each point in time (a minimal sketch of such a structure follows below).
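Conceptually, a diarization result is just a timeline of labelled segments. Here is a minimal sketch of such an output structure; the timestamps and labels are made up for illustration:

# A diarization result is essentially "who spoke when": a list of labelled time ranges
diarization = [
    {"speaker": "Speaker 1", "start": 0.00, "end": 2.34},
    {"speaker": "Speaker 2", "start": 2.34, "end": 4.12},
    {"speaker": "Speaker 1", "start": 4.12, "end": 6.80},
]

for seg in diarization:
    print(f"{seg['speaker']}: {seg['start']:.2f}s - {seg['end']:.2f}s")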

Advantages of Speaker Diarization

Identifies "Who Spoke When": Provides structure by using labels to denote speakers in meetings, interviews, and conversations.

Improves the Readability of Transcriptions: Associates speech with speakers, so that the transcripts are more orderly and succinct.

Enables Speaker-Based Analytics: Permits analysis of participation, speaking time, and interaction types across different contexts (for business, legal, or customer service applications).

Supports Multi-Speaker Applications: Enables the development of smart tools for collaborative environments (group discussions, podcasts, courtrooms).

Disadvantages of Speaker Diarization

Requires High-Quality Audio: Diarization accuracy drops significantly with overlapping speech or poor-quality recordings, leading to misidentified speakers or merged segments.

May Struggle with Similar Voices: Diarization systems may become confused with speakers who have similar voices, leading to inaccurate segmentation of speakers. 

No Built-in Transcription: Diarization segments speakers but does not transcribe the speech uttered. Therefore, it must be paired with an ASR (Automatic Speech Recognition) system to achieve full transcription capabilities.

Speaker Diarization with Whisper and Resemblyzer

What is Resemblyzer?

Resemblyzer is a Python library designed to produce voice embeddings: fixed-length vectors that capture a speaker's unique voice characteristics.

The embeddings can be used for the following:

  • Comparing voice samples from different audio recordings (see the sketch after this list).
  • Clustering speech by the identity of the speaker.
  • Carrying out unsupervised speaker diarization without the need for speaker-labelled data.
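For instance, comparing two recordings boils down to embedding each one and taking the cosine similarity of the resulting vectors (Resemblyzer's embeddings are L2-normalized, so a dot product is enough). The file names below are placeholders:

import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

# preprocess_wav loads the file, resamples it to 16 kHz and trims long silences
emb_a = encoder.embed_utterance(preprocess_wav("speaker_a.wav"))
emb_b = encoder.embed_utterance(preprocess_wav("speaker_b.wav"))

# Embeddings are unit-length, so the dot product equals the cosine similarity
similarity = float(np.dot(emb_a, emb_b))
print(f"Voice similarity: {similarity:.3f}")  # closer to 1.0 = more likely the same speaker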

How Does Resemblyzer Work?

To understand the pipeline a little better, here is a simplified overview of speaker diarization with Whisper and Resemblyzer:


1. Audio Input: Load audio and convert to 16 kHz.

2. Transcription: Whisper transcribes the audio and outputs segment timestamps.

3. Embedding Generation: Each segment is sent through Resemblyzer, which extracts a voice embedding.

4. Clustering: The embeddings are clustered (e.g., using K-Means) to group segments of the same speaker together.

5. Labelling: The segments are labelled by speaker (e.g., Speaker 1, Speaker 2).

6. Output: You get a structured transcript attributed to each speaker.

Let's see how to implement this.

Important Libraries

!pip install openai-whisper resemblyzer librosa scikit-learn

Code Below:

import whisper
import numpy as np
import librosa
from resemblyzer import VoiceEncoder
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Load models
whisper_model = whisper.load_model("large")
encoder = VoiceEncoder()

# Load audio and run transcription
audio_path = "your_audio_file.wav"
wav, sr = librosa.load(audio_path, sr=16000)
result = whisper_model.transcribe(audio_path)
segments = result.get("segments", [])

# Extract speaker embeddings from each segment
embeddings, valid_segments = [], []
for seg in segments:
    start, end = seg["start"], seg["end"]
    audio_seg = wav[int(start * sr):int(end * sr)]
    if len(audio_seg) > 0:
        emb = encoder.embed_utterance(audio_seg)
        embeddings.append(emb)
        valid_segments.append(seg)

# Check if enough segments are found
if len(embeddings) < 3:  # silhouette_score needs at least 3 samples for k >= 2
    raise ValueError("Not enough speech segments detected for diarization.")

# Automatically determine optimal number of speakers
best_k, best_score = 2, -1
for k in range(2, min(10, len(embeddings) - 1) + 1):  # silhouette needs k <= n_samples - 1
    kmeans = KMeans(n_clusters=k, random_state=0, n_init="auto")
    labels = kmeans.fit_predict(embeddings)
    score = silhouette_score(embeddings, labels)
    if score > best_score:
        best_k, best_score = k, score

# Final clustering with optimal k
kmeans = KMeans(n_clusters=best_k, random_state=0, n_init="auto")
labels = kmeans.fit_predict(embeddings)

# Explicitly map the first detected speaker as Speaker 1
first_label = labels[0]
label_mapping = {first_label: 1}
current_speaker = 2
for lbl in set(labels):
    if lbl != first_label:
        label_mapping[lbl] = current_speaker
        current_speaker += 1

# Print diarized transcript with timestamps
for seg, label in zip(valid_segments, labels):
    speaker_id = label_mapping[label]
    print(f"Speaker {speaker_id} [{seg['start']:.2f}s - {seg['end']:.2f}s]: {seg['text']}")

Model Configuration and Speaker Diarization with Resemblyzer Approach

  • Used OpenAI Whisper (large) to transcribe audio and produce timestamped speech segments.
  • Used Resemblyzer to extract speaker embeddings from each speech segment.
  • Applied KMeans clustering on the embeddings in order to group speech segments by speaker identity (speaker diarization).
  • Labeled the first detected speaker as Speaker 1 and numbered the remaining speakers incrementally (Speaker 2, Speaker 3, and so on), so labels are assigned in order of appearance rather than arbitrarily.
  • librosa is a popular Python library for audio analysis and music/speech signal processing; here it is used to load the audio and resample it to 16 kHz.
  • The silhouette score measures how well data points fit within their clusters compared to others, helping evaluate clustering quality. It is used here to pick the number of clusters with the clearest separation, which is key for speaker diarization (see the short example below).
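As a small illustration of how the silhouette score guides the choice of the number of clusters, here is a toy example on made-up 2-D points standing in for speaker embeddings:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated toy clusters standing in for two speakers' embeddings
points = np.array([[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
                   [5.0, 5.1], [5.1, 5.0], [5.05, 5.05]])

for k in (2, 3):
    labels = KMeans(n_clusters=k, random_state=0, n_init="auto").fit_predict(points)
    print(f"k={k}: silhouette={silhouette_score(points, labels):.3f}")
# The higher score at k=2 matches the true number of "speakers" in this toy data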

This pipeline works well for multi-speaker transcription tasks, such as:

  • Meeting transcriptions with labelled speakers
  • Podcast editing with host/guest distinctions
  • Customer service call review for quality monitoring

Output:

Speaker 1 [0.00s - 2.34s]: Hello, how are you?
Speaker 2 [2.34s - 4.12s]: I'm good, thanks!

Conclusion

Voice Activity Detection (VAD) and Speaker Diarization are crucial building blocks for intelligent audio applications. VAD eliminates silence and noise so applications can focus only on spoken content, while diarization determines who was speaking and organizes the conversation with labels for each speaker.

The integration of VAD and Speaker Diarization transforms raw audio into actionable data. By detecting when speech occurs and identifying who is speaking, these techniques form a powerful duo that enables accurate transcription, speaker tracking, and smarter audio analysis in any multi-speaker environment.

Author: Shubhanshu Navadiya

Passionate about AI and machine learning innovations, exploring the future of technology and its impact on society. Join me on this journey of discovery and growth.

