Have you ever interacted with a voice assistant and been impressed by how it knows exactly when to start listening to you, neither too early nor too late? Or read a podcast transcript that correctly labels multiple speakers without any of them being identified in advance?
These are typical audio-processing scenarios where two core capabilities of audio AI come together: Voice Activity Detection (VAD) and Speaker Diarization.
In this blog, we’ll break down the core concepts of Voice Activity Detection (VAD) and Speaker Diarization. Using powerful tools like Faster-Whisper and Resemblyzer, we'll implement basic code examples to show how you can detect speech segments and identify who is speaking step-by-step.
Voice Activity Detection is a signal processing technique used to identify the presence or absence of human speech in an audio signal.
It detects whether someone is speaking or whether there is silence, distinguishing speech segments from non-speech segments (like silence, background noise, or music) so that systems can focus only on the parts of the audio that contain spoken words.
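To build intuition before reaching for a full model, here is a minimal sketch of the idea behind VAD, assuming a simple energy threshold rather than the neural VAD used by production systems: each short frame is flagged as speech when its loudness crosses a cutoff. The file name and the threshold value are placeholders you would tune for your own recording.
import numpy as np
import soundfile as sf

# Toy energy-based VAD: split audio into 30 ms frames and flag a frame as
# speech when its RMS energy exceeds a threshold.
# "your_audio_file.wav" and the threshold are placeholders, not recommendations.
audio, sr = sf.read("your_audio_file.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)              # stereo -> mono

frame_len = int(0.03 * sr)                  # 30 ms per frame
threshold = 0.02                            # RMS energy cutoff (tune per recording)

flags = []
for start in range(0, len(audio) - frame_len, frame_len):
    frame = audio[start:start + frame_len]
    rms = np.sqrt(np.mean(frame ** 2))
    flags.append(rms > threshold)

print(f"{sum(flags)} of {len(flags)} frames flagged as speech")
Real VAD models learn far more robust features than raw energy, but the basic decision (speech vs. not speech, frame by frame) is the same.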
VAD matters for several reasons:
Eliminates Non-Speech Audio: Removes silence and background noise so that only spoken sounds are processed.
Increased Accuracy: Cleans non-speech audio from the input to speech recognition systems, reducing the risk of transcription errors.
Increased Efficiency: Only the speech portions of the audio are processed, reducing computational load and energy usage, which is especially important for real-time systems.
Reduced Latency: Speech is detected as soon as it starts, enabling faster responses in voice assistants and other live applications.
That said, VAD has some limitations:
False Positives in Noisy Environments: It can mistake non-speech for speech in crowded or noisy settings, leading to wasted processing.
Missed Speech in Low-Volume or Whispered Input: Quiet or whispered speech can be mistaken for silence, causing parts of the speech to be missed.
Limited Speaker Awareness: VAD does not identify who is speaking; it only detects that speech occurs. To track speakers, additional tools like diarization are needed.
Whisper by OpenAI is a powerful automatic speech recognition (ASR) model that offers high accuracy and multilingual transcription. However, Whisper will try to transcribe everything, including silence and background noise. For real-world scenarios like a meeting or a podcast, this may not be desirable.
Faster-Whisper is a fast, optimized reimplementation of the Whisper model, and it adds built-in Voice Activity Detection (VAD) to ignore non-speech audio and produce cleaner, more efficient transcription output.
Important Libraries
!pip install faster-whisper gradio soundfile resampy numpy
Code Below:
from faster_whisper import WhisperModel
import soundfile as sf
import resampy
import numpy as np
# Load the Faster-Whisper Large-v3 model
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
# Load and resample audio to 16kHz mono
audio_data, sr = sf.read("your_audio_file.wav")
if len(audio_data.shape) > 1:
    audio_data = np.mean(audio_data, axis=1)  # Convert stereo to mono
audio_16k = resampy.resample(audio_data, sr, 16000).astype(np.float32)
# Transcribe using Voice Activity Detection (VAD)
segments, info = model.transcribe(
    audio_16k,
    vad_filter=True,
    vad_parameters={
        "threshold": 0.5,
        "min_speech_duration_ms": 250,
        "max_speech_duration_s": 20,
        "min_silence_duration_ms": 200
    }
)
# Format and print the transcript with timestamps
for segment in segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s]: {segment.text.strip()}")
Here we use the "large-v3" version of Whisper for higher accuracy. You can also use "base", "small", or "medium" if you're on a CPU or have limited resources.
| Parameter | Meaning |
| --- | --- |
| threshold | Confidence threshold (0.0–1.0). Higher = stricter speech detection |
| min_speech_duration_ms | Minimum duration (in ms) to treat as valid speech |
| max_speech_duration_s | Maximum duration (in seconds) allowed in one segment |
| min_silence_duration_ms | Minimum silence between segments (in ms) |
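As a quick illustration of tuning these parameters, the snippet below (reusing model and audio_16k from the code above) shows a stricter configuration you might try for noisier recordings; the exact values are only examples, not recommendations.
# Illustrative, stricter VAD settings for noisier audio (values are examples to tune)
segments, info = model.transcribe(
    audio_16k,
    vad_filter=True,
    vad_parameters={
        "threshold": 0.7,               # require higher confidence before calling it speech
        "min_speech_duration_ms": 400,  # ignore very short blips
        "max_speech_duration_s": 15,    # split long stretches into smaller segments
        "min_silence_duration_ms": 500  # require a longer pause before ending a segment
    }
)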
Now that we have covered the practical application of Voice Activity Detection, let's dive into Speaker Diarization.
Speaker Diarization is the process of partitioning an audio stream into segments according to who is speaking and when. It is often referred to as the "who spoke when" problem.
Speaker Diarization matters because it:
Identifies "Who Spoke When": Provides structure by using labels to denote speakers in meetings, interviews, and conversations.
Improves the Readability of Transcriptions: Associates speech with speakers, so that the transcripts are more orderly and succinct.
Enables Speaker-Based Analytics: Permits analysis of participation, speaking time, and interaction types across different contexts (for business, legal, or customer service applications).
Assures Multi-Speakers Applications: Supports the development of smart tools in collaborative environments (group discussions, podcasts, courtrooms).
However, diarization comes with its own limitations:
Requires High-Quality Audio: Accuracy drops significantly when speech overlaps or the recording quality is poor, leading to misidentified speakers or merged segments.
May Struggle with Similar Voices: Diarization systems can confuse speakers with similar voices, producing inaccurate speaker segmentation.
No Built-in Transcription: Diarization segments the audio by speaker but does not transcribe what was said, so it must be paired with an ASR (Automatic Speech Recognition) system for full transcription.
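To make the speaker-based analytics point above concrete, here is a minimal sketch that sums speaking time per speaker. It assumes you already have diarized segments shaped as (speaker_id, start, end) tuples, like the output of the pipeline later in this post; the sample values are made up.
from collections import defaultdict

# Minimal speaking-time analytics over diarized segments.
# The (speaker_id, start_sec, end_sec) tuples below are illustrative placeholders.
diarized = [
    (1, 0.00, 2.34),
    (2, 2.34, 4.12),
    (1, 4.12, 7.80),
]

speaking_time = defaultdict(float)
for speaker, start, end in diarized:
    speaking_time[speaker] += end - start

total = sum(speaking_time.values())
for speaker, seconds in sorted(speaking_time.items()):
    print(f"Speaker {speaker}: {seconds:.2f}s ({100 * seconds / total:.1f}% of speech)")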
Resemblyzer is a Python library designed to produce voice embeddings: fixed-length vectors that capture a speaker's unique voice characteristics.
These embeddings can be used for tasks such as speaker verification, voice similarity comparison, and clustering audio segments by speaker, which is exactly what diarization needs.
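For a feel of how these embeddings behave, the short sketch below embeds three clips and compares them. Because Resemblyzer's embeddings are L2-normalized, a plain dot product acts as a cosine similarity, and clips from the same speaker should score noticeably higher than clips from different speakers. The file names are placeholders.
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

# Placeholder file names: two clips from one speaker, one clip from another
emb_a = encoder.embed_utterance(preprocess_wav("speaker_a_clip1.wav"))
emb_b = encoder.embed_utterance(preprocess_wav("speaker_a_clip2.wav"))
emb_c = encoder.embed_utterance(preprocess_wav("speaker_b_clip1.wav"))

# Embeddings are L2-normalized, so the dot product is the cosine similarity
print("same speaker:     ", float(np.dot(emb_a, emb_b)))
print("different speaker:", float(np.dot(emb_a, emb_c)))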
To understand the pipeline a little better, here is a simplified overview of speaker diarization with Whisper and Resemblyzer:
1. Audio Input: Load audio and convert to 16 kHz.
2. Transcription: Whisper transcribes the audio and outputs segment timestamps.
3. Embedding Generation: Each segment is sent through Resemblyzer, which extracts a voice embedding.
4. Clustering: The embeddings are clustered (e.g., using K-Means) to group segments of the same speaker together.
5. Labelling: The segments are labelled by speaker (e.g., Speaker 1, Speaker 2).
6. Output: You get a structured transcript attributed to each speaker.
Important Libraries
!pip install openai-whisper resemblyzer librosa scikit-learn
Code Below:
import whisper
import numpy as np
import librosa
from resemblyzer import VoiceEncoder
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Load models
whisper_model = whisper.load_model("large")
encoder = VoiceEncoder()
# Load audio and run transcription
audio_path = "your_audio_file.wav"
wav, sr = librosa.load(audio_path, sr=16000)
result = whisper_model.transcribe(audio_path)
segments = result.get("segments", [])
# Extract speaker embeddings from each segment
embeddings, valid_segments = [], []
for seg in segments:
    start, end = seg["start"], seg["end"]
    audio_seg = wav[int(start * sr):int(end * sr)]
    if len(audio_seg) > 0:
        emb = encoder.embed_utterance(audio_seg)
        embeddings.append(emb)
        valid_segments.append(seg)
# Check if enough segments are found
if len(embeddings) < 2:
    raise ValueError("Not enough speech segments detected for diarization.")
# Automatically determine optimal number of speakers
best_k, best_score = 2, -1
for k in range(2, min(10, len(embeddings) - 1) + 1):  # silhouette needs k < number of segments
    kmeans = KMeans(n_clusters=k, random_state=0, n_init="auto")
    labels = kmeans.fit_predict(embeddings)
    score = silhouette_score(embeddings, labels)
    if score > best_score:
        best_k, best_score = k, score
# Final clustering with optimal k
kmeans = KMeans(n_clusters=best_k, random_state=0, n_init="auto")
labels = kmeans.fit_predict(embeddings)
# Explicitly map the first detected speaker as Speaker 1
first_label = labels[0]
label_mapping = {first_label: 1}
current_speaker = 2
for lbl in set(labels):
    if lbl != first_label:
        label_mapping[lbl] = current_speaker
        current_speaker += 1
# Print diarized transcript with timestamps
for seg, label in zip(valid_segments, labels):
    speaker_id = label_mapping[label]
    print(f"Speaker {speaker_id} [{seg['start']:.2f}s - {seg['end']:.2f}s]: {seg['text'].strip()}")
This pipeline works well for multi-speaker transcription tasks such as meetings, interviews, podcasts, and customer support calls.
Sample output:
Speaker 1 [0.00s - 2.34s]: Hello, how are you?
Speaker 2 [2.34s - 4.12s]: I'm good, thanks!
Voice Activity Detection (VAD) and Speaker Diarization are crucial building blocks for intelligent, speech-aware audio applications. VAD filters out silence and noise so applications can pay attention only to spoken content, while diarization determines who was speaking and when, organizing the conversation with a label for each speaker.
The integration of VAD and Speaker Diarization transforms raw audio into actionable data. By detecting when speech occurs and identifying who is speaking, these techniques form a powerful duo that enables accurate transcription, speaker tracking, and smarter audio analysis in any multi-speaker environment.
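As a closing sketch, here is one way the two pieces could be wired together, reusing the model, encoder, and audio_16k objects from the earlier examples: Faster-Whisper's VAD filter yields clean speech segments, and Resemblyzer embeddings clustered with K-Means assign a speaker to each. The two-speaker assumption (n_clusters=2) is illustrative, not a general solution.
import numpy as np
from sklearn.cluster import KMeans

# Assumes `model` (Faster-Whisper), `encoder` (Resemblyzer VoiceEncoder) and
# `audio_16k` (16 kHz mono float32 array) from the earlier examples, and that
# the recording contains at least two segments and exactly two speakers.
segments, _ = model.transcribe(audio_16k, vad_filter=True)
segments = list(segments)

embeddings = []
for seg in segments:
    clip = audio_16k[int(seg.start * 16000):int(seg.end * 16000)]
    embeddings.append(encoder.embed_utterance(clip))

labels = KMeans(n_clusters=2, random_state=0, n_init="auto").fit_predict(np.vstack(embeddings))

for seg, label in zip(segments, labels):
    print(f"Speaker {label + 1} [{seg.start:.2f}s - {seg.end:.2f}s]: {seg.text.strip()}")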