What Is Speaker Diarization and How Does It Work
Speaker diarization is the technology that answers the question "who spoke when." It segments an audio recording into portions belonging to different speakers and labels each segment accordingly. In this article, we will explore how speaker diarization works under the hood, what algorithms power it, where it is applied, and what limitations remain.
What Is Speaker Diarization
Imagine you have a one-hour recording of a meeting with five participants. A speech recognition service will turn the audio into text, but you will get a continuous stream of words with no indication of who said what. Speaker diarization solves precisely this problem — it determines who was speaking at each moment in time.
It is important to distinguish three related technologies:
- Automatic Speech Recognition (ASR) — converts sound into text. Answers the question "what was said?"
- Speaker diarization — splits audio by speaker. Answers the question "who spoke when?"
- Speaker identification — determines a specific person by their voice. Answers the question "is this Ivan Petrov's voice?"
Diarization does not know names — it simply assigns labels: Speaker 1, Speaker 2, Speaker 3. But combined with voice profiles (more on that below), labels can be replaced with real names.
A practical example: you recorded a meeting where a project budget was being discussed. Without diarization, you see just text. With diarization — a structured dialogue:
Speaker 1 (00:00–00:45): I suggest we increase the marketing budget by 20%.
Speaker 2 (00:46–01:12): I disagree. Let us look at the results of the current campaign first.
Speaker 3 (01:13–01:40): I can have the report ready by Friday.
Now it is clear not only what was discussed, but who took which position.
Why Speaker Diarization Matters
Speaker separation is critically important across dozens of scenarios. Here are the main ones:
Meeting Minutes
The most widespread use case. When 5–10 people join a meeting, it is impossible without diarization to tell who made a decision, who objected, or who took on a task. Minutes without names are little more than a raw transcript.
Interviews and Journalism
A journalist needs to clearly separate their own questions from the respondent's answers. Manually splitting a two-hour interview transcript takes hours. Diarization handles it automatically.
Podcasts
The host and guest (or multiple guests) must be clearly separated — for creating transcripts, subtitles, pull quotes, and SEO-optimized episode descriptions.
Court Proceedings
The judge, prosecutor, defense attorney, defendant, witnesses — every statement must be accurately attributed. A misattribution could affect a court ruling.
Medical Consultations
A conversation between doctor and patient: who described the symptoms, who prescribed the treatment. This is essential for medical documentation and insurance records.
Call Centers
Agent vs. customer. Diarization enables quality-of-service analysis, response time measurement, and script compliance monitoring. Companies process thousands of calls daily — manual annotation is not feasible.
Education
Lectures with student questions: separating the instructor's speech from audience questions. Useful for creating educational materials.
How Diarization Works: A Technical Deep Dive
Speaker diarization is a pipeline of several sequential stages. Each stage addresses its own task, and the quality of each affects the final result.
Stage 1: Voice Activity Detection (VAD)
The first step is to determine where speech actually exists in the audio. A recording contains silence, background noise, music, keyboard clicks, and other non-speech sounds. VAD (Voice Activity Detection) separates audio into segments with and without speech.
Modern approaches to VAD:
- Silero VAD — a neural network model that is compact and fast. Runs on CPU in real time. Used in most modern pipelines.
- WebRTC VAD — a classic algorithm from Google's WebRTC project. Fast but less accurate in noisy conditions.
- Energy-based methods — the simplest approach: if the signal amplitude is above a threshold, someone is speaking. Unreliable in real-world conditions.
The output of VAD is a set of timestamps for speech segments: [(0.5s–3.2s), (4.1s–7.8s), (8.5s–12.0s), ...].
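As a toy illustration of the simplest approach from the list above — not a production VAD, which is what Silero or WebRTC provide — an energy-based detector can be sketched in a few lines. The frame length and threshold here are arbitrary illustrative values:

```python
import numpy as np

def energy_vad(signal, sample_rate, frame_ms=30, threshold=0.01):
    """Naive energy-based VAD: mark a frame as speech when its RMS
    energy exceeds a fixed threshold, then merge adjacent speech
    frames into (start, end) timestamps in seconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    is_speech = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        rms = np.sqrt(np.mean(frame ** 2))
        is_speech.append(rms > threshold)
    # Merge consecutive speech frames into (start, end) segments
    segments, start = [], None
    for i, s in enumerate(is_speech):
        if s and start is None:
            start = i
        elif not s and start is not None:
            segments.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None
    if start is not None:
        segments.append((start * frame_ms / 1000, n_frames * frame_ms / 1000))
    return segments

# Synthetic demo: 1 s silence, 1 s of a 440 Hz tone, 1 s silence at 16 kHz
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
audio = np.concatenate([np.zeros(sr), 0.5 * np.sin(2 * np.pi * 440 * t), np.zeros(sr)])
print(energy_vad(audio, sr))  # one segment, roughly (1.0, 2.0)
```

This is exactly why the text calls energy-based methods unreliable: a fixed threshold works on a synthetic tone but fails once background noise raises the baseline energy.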
Stage 2: Segmentation
Next, the speech segments need to be divided into homogeneous chunks — so that each chunk belongs to a single speaker.
The key task is Speaker Change Detection. The algorithm looks for moments when one voice gives way to another. This is a challenging task because:
- The switch can be instantaneous (interruption)
- There may be a pause between turns
- A single speaker can change intonation, volume, and tempo
Modern systems (such as pyannote.audio) use neural models trained to detect segment boundaries to within 200–500 milliseconds.
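The idea behind change detection can be sketched with a toy example: slide a window over per-frame voice features, compare the average of the frames before and after each position, and flag positions where the two halves diverge. The 2-dimensional "embeddings", window size, and threshold below are illustrative, not what real systems use:

```python
import numpy as np

def change_points(embeddings, win=5, threshold=0.5):
    """Toy speaker-change detector: compare the mean embedding of the
    `win` frames before and after each position using cosine distance,
    and flag positions where the distance exceeds `threshold`."""
    def cosine_dist(a, b):
        return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    changes = []
    for i in range(win, len(embeddings) - win):
        left = embeddings[i - win:i].mean(axis=0)
        right = embeddings[i:i + win].mean(axis=0)
        if cosine_dist(left, right) > threshold:
            changes.append(i)
    return changes

# Synthetic frames: 20 frames of "speaker A", then 20 of "speaker B"
rng = np.random.default_rng(0)
a = rng.normal(loc=[1.0, 0.0], scale=0.05, size=(20, 2))
b = rng.normal(loc=[0.0, 1.0], scale=0.05, size=(20, 2))
frames = np.vstack([a, b])
print(change_points(frames))  # indices clustered around 20, the true boundary
```

Note that the detector fires on several neighboring positions around the boundary rather than a single frame — this blur is precisely the 200–500 ms localization uncertainty mentioned above.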
Stage 3: Embedding Extraction
This is the crucial stage. For each speech segment, a neural network computes a voice embedding — a numerical vector that serves as a kind of "voice fingerprint."
What an embedding encodes:
- Timbre — the unique "color" of the sound, determined by the anatomy of the vocal tract
- Pitch — the fundamental frequency (F0) of the voice
- Speaking style — speed, intonation patterns, pronunciation habits
- Acoustic characteristics — formant frequencies, spectral envelope
Neural networks for extracting embeddings:
- ECAPA-TDNN — one of the most popular architectures. Uses attention mechanisms and multi-level feature aggregation. The standard in pyannote.audio.
- TitaNet — developed by NVIDIA. High accuracy, optimized for GPUs.
- WavLM — a transformer-based model from Microsoft. Pretrained on a massive corpus, delivers state-of-the-art results.
- ResNet-based — classic convolutional networks adapted for audio.
A typical embedding is a vector of 192–512 numbers. Two segments from the same speaker will have similar embeddings (close vectors), while segments from different speakers will be far apart.
Stage 4: Clustering
With embeddings for all segments in hand, the next step is to group them by speaker. This is a clustering problem — a classic machine learning task.
Main algorithms:
- Agglomerative Clustering (hierarchical clustering) — starts with the assumption that each segment is a separate speaker, then progressively merges the most similar ones. The most common approach in diarization.
- Spectral Clustering — builds a similarity graph between segments and looks for an optimal partition. Works well when the number of speakers is known in advance.
- K-Means — fast, but requires the number of clusters to be specified upfront.
- HDBSCAN — automatically determines the number of clusters and is robust to noise.
A separate challenge is determining the number of speakers. If it is known in advance (e.g., "there were 2 participants on the call"), the task is simplified. If not, the algorithm must determine it on its own, using metrics like BIC (Bayesian Information Criterion) or silhouette score.
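The agglomerative approach with an unknown speaker count can be sketched with SciPy: build the cluster tree, then cut it at a distance threshold instead of requesting a fixed number of clusters, so the speaker count falls out of the data. The embeddings and threshold are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy embeddings: three segments each from two distinct speakers
embeddings = np.array([
    [1.0, 0.0], [0.95, 0.05], [0.9, 0.1],   # speaker A segments
    [0.0, 1.0], [0.05, 0.95], [0.1, 0.9],   # speaker B segments
])

# Agglomerative clustering on cosine distance; cutting the tree at a
# distance threshold (t=0.5 here, chosen for illustration) lets the
# algorithm infer the number of speakers on its own
Z = linkage(embeddings, method="average", metric="cosine")
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)  # two cluster labels, one per speaker, e.g. [1 1 1 2 2 2]
```

The threshold plays the same role as the BIC or silhouette criteria mentioned above: it decides when two clusters are "different enough" to be separate speakers.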
Stage 5: Final Labeling
In the final stage, each segment is assigned a speaker label. The result is a time-aligned annotation:
- 00:00–00:45 → Speaker 1
- 00:46–01:12 → Speaker 2
- 01:13–01:40 → Speaker 3
- 01:41–02:05 → Speaker 1
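Producing an annotation like the one above from clustered segments mostly amounts to merging consecutive chunks that received the same label. A minimal sketch (timestamps and labels are illustrative):

```python
def merge_turns(segments):
    """Merge consecutive segments carrying the same speaker label
    into single turns, producing the final time-aligned annotation."""
    merged = []
    for start, end, label in segments:
        if merged and merged[-1][2] == label:
            # Same speaker continues: extend the previous turn
            merged[-1] = (merged[-1][0], end, label)
        else:
            merged.append((start, end, label))
    return merged

# Clustered segments (seconds): two adjacent Speaker 1 chunks get merged
raw = [(0, 20, "Speaker 1"), (20, 45, "Speaker 1"),
       (46, 72, "Speaker 2"), (73, 100, "Speaker 3")]
print(merge_turns(raw))
# [(0, 45, 'Speaker 1'), (46, 72, 'Speaker 2'), (73, 100, 'Speaker 3')]
```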
A separate complexity is handling overlapping speech. When two people talk simultaneously, a single segment must carry two labels. Modern systems (pyannote.audio 3.x) handle this with overlap-aware segmentation models that can assign multiple speaker labels to the same time frame.
Diarization Quality Metrics
How do you evaluate how well diarization is performing? The standard metric is DER (Diarization Error Rate).
DER is composed of three components:
- Missed Speech — speech the system failed to detect (skipped)
- False Alarm — silence or noise incorrectly labeled as speech
- Speaker Confusion — speech correctly detected but attributed to the wrong speaker
Formula: DER = (missed + false alarm + confusion) / total speech duration
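In code, the formula is a one-liner over the three error durations (the numbers below are made up for illustration):

```python
def der(missed, false_alarm, confusion, total_speech):
    """Diarization Error Rate: the sum of the three error durations
    (in seconds) divided by the total duration of reference speech."""
    return (missed + false_alarm + confusion) / total_speech

# A 10-minute (600 s) recording: 12 s of missed speech,
# 6 s of false alarms, 30 s attributed to the wrong speaker
print(f"DER = {der(12, 6, 30, 600):.1%}")  # DER = 8.0%
```

Note that DER can exceed 100% in pathological cases, since false alarms are counted against the reference speech duration rather than the recording length.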
Current state-of-the-art results:
- Clean recordings (studio quality): DER 3–8%
- Meetings (single microphone): DER 8–15%
- Teleconferences: DER 12–25%
- Cocktail party (many speakers, noise): DER 20–40%
For most practical tasks, a DER below 10% is considered a good result. For a deeper look at accuracy benchmarks including WER (Word Error Rate), see our transcription market guide.
Speaker Profiles: The Next Level
Standard diarization assigns impersonal labels: Speaker 1, Speaker 2. But what if the system could recognize a familiar voice?
Voice embeddings extracted during diarization can be saved as a speaker profile. When processing a new recording, the system compares the embeddings of new segments against saved profiles and automatically substitutes names.
Diktovka supports this feature — voice profiles. During the first recording, the system creates an embedding for each new speaker and offers to assign a name. In subsequent recordings, Diktovka automatically recognizes the voice and fills in the saved name.
Embeddings are compared using cosine similarity. Two vectors are considered to belong to the same person if cosine similarity >= 0.75. This threshold provides a balance between precision (not confusing different people) and recall (recognizing the same person under different recording conditions).
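The matching logic described above can be sketched as a lookup against saved profiles: take the best-scoring profile if it clears the 0.75 threshold, otherwise treat the voice as a new speaker. The names and 3-dimensional vectors are hypothetical, purely for illustration:

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.75  # the threshold discussed in the text

def match_profile(embedding, profiles):
    """Compare a segment embedding against saved voice profiles and
    return the best-matching name, or None if no profile clears the
    threshold (i.e. this is a new, unknown speaker)."""
    def cos_sim(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    best_name, best_score = None, SIMILARITY_THRESHOLD
    for name, profile_emb in profiles.items():
        score = cos_sim(embedding, profile_emb)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name

# Hypothetical saved profiles (illustrative names and vectors)
profiles = {
    "Ivan":  np.array([0.9, 0.1, 0.3]),
    "Maria": np.array([0.1, 0.8, 0.4]),
}
print(match_profile(np.array([0.85, 0.15, 0.35]), profiles))  # Ivan
print(match_profile(np.array([-0.5, 0.1, -0.8]), profiles))   # None
```

Taking the best match above the threshold, rather than the first match, matters when profiles are acoustically close — the precision/recall trade-off mentioned above.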
Speaker profiles are especially useful for:
- Regular meetings — a team of 5–7 people meets every week. The system knows every participant.
- Podcasts — the host and regular co-hosts are recognized automatically; only guests are marked as new speakers.
- Medical practice — a doctor records appointments; their voice is recognized automatically, while patient voices are new every time.
Limitations and Challenges
Diarization is impressive technology, but it is far from perfect. Here are the main challenges:
Overlapping Speech
When two or more people speak at the same time, it is extremely difficult for the algorithm to separate voices. This is the most common source of errors in real meetings, especially during heated discussions.
Similar Voices
If a recording involves people with very similar voices (a same-gender group of similar age, twins), the embeddings may be too similar, and the algorithm will confuse the speakers.
Noisy Environments
Background noise (cafes, streets, ventilation) degrades embedding quality and complicates VAD. Non-stationary noises — claps, sirens, music — are especially problematic.
Telephone Audio
Telephone channels transmit frequencies only in the 300–3,400 Hz range, whereas wideband audio covers roughly 50–8,000 Hz and above. This narrow band strips away acoustic information and reduces embedding accuracy.
Unknown Number of Speakers
When the algorithm does not know in advance how many people participated in the recording, it can make mistakes: merging two similar speakers into one, or splitting a single speaker into two.
Short Utterances
A quality embedding requires at least 1–2 seconds of speech. Short utterances ("Yes," "No," "Agreed") do not contain enough information for reliable identification.
Tools with Diarization Support
| Tool | Technology | Max Speakers | Accuracy | Price |
|---|---|---|---|---|
| Diktovka | Whisper + pyannote | Unlimited | High (DER ~8–12%) | Free (beta) |
| Otter.ai | Proprietary | Up to 10 | High | From $16.99/mo |
| AssemblyAI | Proprietary | Unlimited | Very high | From $0.65/hr |
| Deepgram | Proprietary | Unlimited | High | From $0.25/hr |
| Rev | Human + AI | Unlimited | Highest | From $1.50/min |
| pyannote.audio | Open-source | Unlimited | High | Free |
Diktovka uses a combination of Whisper (for speech recognition) and pyannote (for diarization) with an additional voice profiles feature. This allows it not only to separate speakers but also to recognize them in new recordings — a unique capability among free tools. For a detailed review of transcription apps with diarization support, see our comparison of transcription applications.
The Future of Diarization
The technology is actively evolving. Here are the key directions:
Real-Time Diarization
Today, most systems work in batch mode — the entire recording is processed first, then the result is delivered. The future lies in streaming diarization in real time, where speaker labels appear with a delay of just 1–2 seconds. This is critically important for live subtitles at conferences and video calls.
Multimodal Diarization
Why rely on audio alone when video is available? Combining audio embeddings with visual information (face recognition, lip movement tracking) significantly improves accuracy. This is especially useful for overlapping speech — the camera shows who is moving their lips.
Personalization Through Profiles
Systems will store more and more profiles and use them not only for identification but also for adapting the model to specific speakers — accounting for their accent, speech tempo, and vocabulary.
Better Overlap Handling
The weakest point of modern diarization is overlapping speech. New models (multi-speaker ASR, target speaker extraction) are learning to separate overlaid voices with growing accuracy.
End-to-End Models
There is a trend toward unifying all stages (VAD, segmentation, embeddings, clustering) into a single model trained end to end. Such systems are simpler to deploy and potentially more accurate, because stages do not lose information when passing data between one another.
Conclusion
Speaker diarization transforms a faceless stream of text into a structured dialogue with attribution for every utterance. Behind the simple idea of "who spoke when" lies a sophisticated pipeline of speech detection, segmentation, voice fingerprint extraction, and clustering.
The technology is already mature enough for practical use — a DER of 5–15% covers most scenarios. And combined with speaker profiles, which Diktovka supports, the system does not just separate voices but also recognizes familiar people in new recordings.
If you work with recordings of meetings, interviews, or podcasts — diarization saves hours of manual annotation and turns audio into a truly useful document. If privacy of your audio data is a concern, read our guide on local vs cloud transcription.
FAQ
What is speaker diarization?
Speaker diarization is a technology that determines who was speaking at each moment of an audio recording. It segments the recording into portions belonging to different speakers and labels them — Speaker 1, Speaker 2, and so on.
How accurate is automatic diarization?
On clean studio recordings, DER (Diarization Error Rate) is 3–8%. On single-microphone meeting recordings — 8–15%. On teleconferences — 12–25%. For most practical tasks, a DER below 10% is considered a good result.
How many speakers can diarization detect?
Modern diarization systems (such as pyannote.audio) have no hard limit on the number of speakers. However, accuracy decreases with a large number of participants, especially if voices are similar or people speak simultaneously.
What tools support speaker diarization?
Free: Diktovka (Whisper + pyannote, with voice profiles) and pyannote.audio (open-source library). Paid: Otter.ai, AssemblyAI, Deepgram, Rev. Diktovka is the only free service with automatic familiar voice recognition.
How is diarization different from speech recognition?
Speech recognition (ASR) answers the question "what was said" — it converts audio to text. Diarization answers the question "who spoke when" — it splits audio by speaker. These are different technologies that work together to create structured transcripts.