What Is Speaker Diarization and How Does It Work

Speaker diarization is the technology that answers the question "who spoke when." It segments an audio recording into portions belonging to different speakers and labels each segment accordingly. In this article, we will explore how speaker diarization works under the hood, what algorithms power it, where it is applied, and what limitations remain.


What Is Speaker Diarization

Imagine you have a one-hour recording of a meeting with five participants. A speech recognition service will turn the audio into text, but you will get a continuous stream of words with no indication of who said what. Speaker diarization solves precisely this problem — it determines who was speaking at each moment in time.

It is important to distinguish three related technologies: speech recognition (ASR), which converts audio into text and answers "what was said"; speaker diarization, which splits the audio by speaker and answers "who spoke when"; and speaker identification, which matches a voice to a specific known person and answers "who exactly is speaking."

Diarization does not know names — it simply assigns labels: Speaker 1, Speaker 2, Speaker 3. But combined with voice profiles (more on that below), labels can be replaced with real names.

A practical example: you recorded a meeting where a project budget was being discussed. Without diarization, you see just text. With diarization — a structured dialogue:

Speaker 1 (00:00–00:45): I suggest we increase the marketing budget by 20%.
Speaker 2 (00:46–01:12): I disagree. Let us look at the results of the current campaign first.
Speaker 3 (01:13–01:40): I can have the report ready by Friday.

Now it is clear not only what was discussed, but who took which position.


Why Speaker Diarization Matters

Speaker separation is critically important across dozens of scenarios. Here are the main ones:

Meeting Minutes

The most widespread use case. When 5–10 people join a meeting, it is impossible without diarization to tell who made a decision, who objected, or who took on a task. Minutes without names are just a transcript, and a nearly useless one.

Interviews and Journalism

A journalist needs to clearly separate their own questions from the respondent's answers. Manually splitting a two-hour interview transcript takes hours. Diarization handles it automatically.

Podcasts

The host and guest (or multiple guests) must be clearly separated — for creating transcripts, subtitles, pull quotes, and SEO-optimized episode descriptions.

Court Proceedings

The judge, prosecutor, defense attorney, defendant, witnesses — every statement must be accurately attributed. A misattribution could affect a court ruling.

Medical Consultations

A conversation between doctor and patient: who described the symptoms, who prescribed the treatment. This is essential for medical documentation and insurance records.

Call Centers

Agent vs. customer. Diarization enables quality-of-service analysis, response time measurement, and script compliance monitoring. Companies process thousands of calls daily — manual annotation is not feasible.

Education

Lectures with student questions: separating the instructor's speech from audience questions. Useful for creating educational materials.


How Diarization Works: A Technical Deep Dive

Speaker diarization is a pipeline of several sequential stages. Each stage addresses its own task, and the quality of each affects the final result.

Stage 1: Voice Activity Detection (VAD)

The first step is to determine where speech actually exists in the audio. A recording contains silence, background noise, music, keyboard clicks, and other non-speech sounds. VAD (Voice Activity Detection) separates audio into segments with and without speech.

Modern approaches to VAD range from simple energy thresholds (flag frames whose signal energy rises above the noise floor) to neural classifiers such as Silero VAD, WebRTC VAD, and the segmentation models built into pyannote.audio, which distinguish speech from music, noise, and silence far more reliably.

The output of VAD is a set of timestamps for speech segments: [(0.5s–3.2s), (4.1s–7.8s), (8.5s–12.0s), ...].
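To make the idea concrete, here is a minimal energy-based VAD sketch. It is a deliberately naive illustration of the simplest approach, not what a neural VAD does; the signal, frame size, and threshold are made-up toy values.

```python
import math

def energy_vad(samples, sample_rate, frame_ms=30, threshold=0.01):
    """Naive VAD: mark 30 ms frames whose mean energy exceeds a threshold,
    then merge consecutive speech frames into (start, end) segments."""
    frame_len = int(sample_rate * frame_ms / 1000)
    segments, start = [], None
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        t = i / sample_rate
        if energy >= threshold:
            if start is None:
                start = t          # speech begins
        elif start is not None:
            segments.append((start, t))  # speech ends
            start = None
    if start is not None:
        segments.append((start, len(samples) / sample_rate))
    return segments

# Toy signal: 1 s of near-silence, 1 s of a 440 Hz tone, 1 s of near-silence.
sr = 8000
signal = ([0.001] * sr
          + [0.5 * math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]
          + [0.001] * sr)
print(energy_vad(signal, sr))  # one segment, roughly (1.0, 2.0)
```

Real systems replace the energy heuristic with a trained classifier, but the output shape is the same: a list of speech timestamps.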

Stage 2: Segmentation

Next, the speech segments need to be divided into homogeneous chunks — so that each chunk belongs to a single speaker.

The key task is Speaker Change Detection: the algorithm looks for moments when one voice gives way to another. This is challenging because turns can follow each other with little or no pause, a single speaker can shift pitch and volume mid-utterance, and speakers frequently interrupt or talk over one another.

Modern systems (such as pyannote.audio) use neural models trained to detect segment boundaries with an accuracy of 200–500 milliseconds.
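As a toy illustration of the principle (not pyannote's actual model), a naive change detector can compare the voice embeddings of adjacent analysis windows (embeddings are covered in Stage 3) and flag a boundary wherever consecutive windows diverge. The embeddings, times, and threshold below are made-up values.

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity: near 0 for same voice, near 1 for different."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def detect_changes(window_embeddings, window_times, threshold=0.4):
    """Flag a speaker change wherever adjacent window embeddings are
    farther apart than `threshold` in cosine distance."""
    return [window_times[i]
            for i in range(1, len(window_embeddings))
            if cosine_distance(window_embeddings[i - 1],
                               window_embeddings[i]) > threshold]

# Toy 2-dim embeddings: voice A for three windows, then voice B for two.
embs = [[1.0, 0.1], [0.95, 0.15], [0.9, 0.1], [0.1, 1.0], [0.15, 0.95]]
times = [0.0, 0.5, 1.0, 1.5, 2.0]
print(detect_changes(embs, times))  # [1.5], a change detected at 1.5 s
```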

Stage 3: Embedding Extraction

This is the crucial stage. For each speech segment, a neural network computes a voice embedding — a numerical vector that serves as a kind of "voice fingerprint."

What an embedding encodes: the physical characteristics of the voice, such as pitch range, timbre, and vocal tract resonances, along with speaking-style cues like tempo and articulation. Crucially, it is largely independent of what is being said, so the same speaker produces similar embeddings regardless of content.

Common neural architectures for extracting embeddings include x-vectors (TDNN-based), d-vectors, and ECAPA-TDNN, the architecture behind popular toolkits such as SpeechBrain.

A typical embedding is a vector of 192–512 numbers. Two segments from the same speaker will have similar embeddings (close vectors), while segments from different speakers will be far apart.
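The "close vs. far" comparison is usually done with cosine similarity. A minimal sketch, using made-up 4-dimensional vectors in place of real 192–512-dimensional embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: ~1.0 means very similar."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy "embeddings": two segments from speaker A, one from speaker B.
alice_1 = [0.9, 0.1, 0.3, 0.2]
alice_2 = [0.8, 0.2, 0.35, 0.15]
bob_1   = [0.1, 0.9, 0.1, 0.7]

print(cosine_similarity(alice_1, alice_2))  # close to 1.0, same speaker
print(cosine_similarity(alice_1, bob_1))    # much lower, different speakers
```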

Stage 4: Clustering

With embeddings for all segments in hand, the next step is to group them by speaker. This is a clustering problem — a classic machine learning task.

Main algorithms include agglomerative hierarchical clustering (AHC), which merges the closest segments step by step; spectral clustering, which partitions a similarity graph built from the embeddings; and k-means, which requires the number of speakers to be known in advance.

A separate challenge is determining the number of speakers. If it is known in advance (e.g., "there were 2 participants on the call"), the task is simplified. If not, the algorithm must determine it on its own, using metrics like BIC (Bayesian Information Criterion) or silhouette score.
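To show the clustering step end to end, here is a small average-linkage agglomerative clustering sketch over toy embeddings. The distance threshold of 0.3 is an illustrative value, not a tuned one; production systems use optimized implementations rather than this O(n³) loop.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def cluster_speakers(embeddings, threshold=0.3):
    """Average-linkage agglomerative clustering: keep merging the two
    closest clusters while their distance stays below `threshold`.
    The number of clusters left at the end is the estimated speaker count."""
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = sum(cosine_distance(embeddings[i], embeddings[j])
                        for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break  # no pair is close enough; stop merging
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

# Four toy segment embeddings: two voices, two segments each.
segs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
print(cluster_speakers(segs))  # two clusters: segments {0, 1} and {2, 3}
```

Note that the speaker count falls out of the threshold: the loop stops when no remaining clusters are similar enough to merge.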

Stage 5: Final Labeling

In the final stage, each segment is assigned a speaker label. The result is a time-aligned annotation, for example: Speaker 1: 0.5–3.2 s, Speaker 2: 4.1–7.8 s, Speaker 1: 8.5–12.0 s.

A further challenge is handling overlapping speech. When two people talk simultaneously, a single segment must carry two labels at once. Modern systems (pyannote.audio 3.x) can handle overlaps using specialized segmentation models trained on multi-channel microphone data.
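One small post-processing detail worth showing: consecutive segments that received the same label are usually merged into a single span before output. A sketch with a hypothetical helper (not any particular library's API):

```python
def merge_segments(labeled_segments):
    """Merge consecutive segments that share a speaker label.
    `labeled_segments`: [(start, end, label)] sorted by start time."""
    merged = []
    for start, end, label in labeled_segments:
        if merged and merged[-1][2] == label and abs(merged[-1][1] - start) < 1e-6:
            # Same speaker continues without a gap: extend the previous span.
            merged[-1] = (merged[-1][0], end, label)
        else:
            merged.append((start, end, label))
    return merged

raw = [(0.0, 1.0, "Speaker 1"), (1.0, 2.5, "Speaker 1"), (2.5, 4.0, "Speaker 2")]
print(merge_segments(raw))  # [(0.0, 2.5, 'Speaker 1'), (2.5, 4.0, 'Speaker 2')]
```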


Diarization Quality Metrics

How do you evaluate how well diarization is performing? The standard metric is DER (Diarization Error Rate).

DER is composed of three components: missed speech (speech that was present but not detected), false alarm (non-speech labeled as speech), and speaker confusion (speech attributed to the wrong speaker).

Formula: DER = (missed + false alarm + confusion) / total speech duration
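The formula is simple enough to sketch directly. The durations below are made-up illustrative numbers, all in minutes:

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER = (missed + false alarm + confusion) / total speech duration."""
    return (missed + false_alarm + confusion) / total_speech

# A 60-minute recording: 1.5 min missed, 1.0 min false alarm, 2.5 min confused.
print(round(diarization_error_rate(1.5, 1.0, 2.5, 60.0), 3))  # 0.083, i.e. DER of about 8.3%
```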

Current state-of-the-art results: roughly 3–8% DER on clean studio recordings, 8–15% on single-microphone meeting recordings, and 12–25% on teleconferences.

For most practical tasks, a DER below 10% is considered a good result. For a deeper look at accuracy benchmarks including WER (Word Error Rate), see our transcription market guide.


Speaker Profiles: The Next Level

Standard diarization assigns impersonal labels: Speaker 1, Speaker 2. But what if the system could recognize a familiar voice?

Voice embeddings extracted during diarization can be saved as a speaker profile. When processing a new recording, the system compares the embeddings of new segments against saved profiles and automatically substitutes names.

Diktovka supports this feature — voice profiles. During the first recording, the system creates an embedding for each new speaker and offers to assign a name. In subsequent recordings, Diktovka automatically recognizes the voice and fills in the saved name.

Embeddings are compared using cosine similarity. Two vectors are considered to belong to the same person if cosine similarity >= 0.75. This threshold provides a balance between precision (not confusing different people) and recall (recognizing the same person under different recording conditions).
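The matching logic described above can be sketched as follows. This is an illustration of the general technique, not Diktovka's actual implementation; the profile vectors are made-up toy values and real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def match_profile(embedding, profiles, threshold=0.75):
    """Return the best-matching saved profile name, or None if no
    profile reaches the similarity threshold."""
    best_name, best_sim = None, threshold
    for name, profile_emb in profiles.items():
        sim = cosine_similarity(embedding, profile_emb)
        if sim >= best_sim:
            best_name, best_sim = name, sim
    return best_name

profiles = {"Alice": [0.9, 0.1, 0.3], "Bob": [0.1, 0.8, 0.5]}
print(match_profile([0.85, 0.15, 0.25], profiles))  # Alice
print(match_profile([0.1, 0.1, 0.95], profiles))    # None, no profile clears 0.75
```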

Speaker profiles are especially useful for recurring meetings with the same team, podcasts with regular hosts and co-hosts, and call centers where agents' voices are known in advance.


Limitations and Challenges

Diarization is impressive technology, but it is far from perfect. Here are the main challenges:

Overlapping Speech

When two or more people speak at the same time, it is extremely difficult for the algorithm to separate voices. This is the most common source of errors in real meetings, especially during heated discussions.

Similar Voices

If a recording involves people with very similar voices (a same-gender group of similar age, twins), the embeddings may be too similar, and the algorithm will confuse the speakers.

Noisy Environments

Background noise (cafes, streets, ventilation) degrades embedding quality and complicates VAD. Non-stationary noises — claps, sirens, music — are especially problematic.

Telephone Audio

Telephone channels transmit frequencies only in the narrow 300–3,400 Hz band, whereas wideband audio covers 50–8,000 Hz and above. The missing frequencies strip away acoustic information and reduce embedding accuracy.

Unknown Number of Speakers

When the algorithm does not know in advance how many people participated in the recording, it can make mistakes: merging two similar speakers into one, or splitting a single speaker into two.

Short Utterances

A quality embedding requires at least 1–2 seconds of speech. Short utterances ("Yes," "No," "Agreed") do not contain enough information for reliable identification.


Tools with Diarization Support

| Tool | Technology | Max Speakers | Accuracy | Price |
|---|---|---|---|---|
| Diktovka | Whisper + pyannote | Unlimited | High (DER ~8–12%) | Free (beta) |
| Otter.ai | Proprietary | Up to 10 | High | From $16.99/mo |
| AssemblyAI | Proprietary | Unlimited | Very high | From $0.65/hr |
| Deepgram | Proprietary | Unlimited | High | From $0.25/hr |
| Rev | Human + AI | Unlimited | Highest | From $1.50/min |
| pyannote.audio | Open-source | Unlimited | High | Free |

Diktovka uses a combination of Whisper (for speech recognition) and pyannote (for diarization) with an additional voice profiles feature. This allows it not only to separate speakers but also to recognize them in new recordings — a unique capability among free tools. For a detailed review of transcription apps with diarization support, see our comparison of transcription applications.
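Combining ASR with diarization boils down to aligning word timestamps against speaker segments. A minimal sketch of that alignment step; the function and data shapes here are illustrative, not Diktovka's or Whisper's actual API:

```python
def assign_speakers(words, speaker_segments):
    """Attach a speaker label to each recognized word by checking which
    diarization segment contains the word's midpoint timestamp.
    `words`: [(start, end, text)]; `speaker_segments`: [(start, end, label)]."""
    labeled = []
    for start, end, text in words:
        mid = (start + end) / 2
        label = next((spk for s, e, spk in speaker_segments if s <= mid < e),
                     "unknown")
        labeled.append((label, text))
    return labeled

# Toy ASR output and diarization output for the same recording.
words = [(0.1, 0.4, "I"), (0.4, 0.9, "agree"), (1.2, 1.6, "Great")]
segments = [(0.0, 1.0, "Speaker 1"), (1.0, 2.0, "Speaker 2")]
print(assign_speakers(words, segments))
# [('Speaker 1', 'I'), ('Speaker 1', 'agree'), ('Speaker 2', 'Great')]
```

Using the word midpoint rather than its start makes the assignment robust to small timestamp disagreements between the ASR and diarization models.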


The Future of Diarization

The technology is actively evolving. Here are the key directions:

Real-Time Diarization

Today, most systems work in batch mode — the entire recording is processed first, then the result is delivered. The future lies in streaming diarization in real time, where speaker labels appear with a delay of just 1–2 seconds. This is critically important for live subtitles at conferences and video calls.

Multimodal Diarization

Why rely on audio alone when video is available? Combining audio embeddings with visual information (face recognition, lip movement tracking) significantly improves accuracy. This is especially useful for overlapping speech — the camera shows who is moving their lips.

Personalization Through Profiles

Systems will store more and more profiles and use them not only for identification but also for adapting the model to specific speakers — accounting for their accent, speech tempo, and vocabulary.

Better Overlap Handling

The weakest point of modern diarization is overlapping speech. New models (multi-speaker ASR, target speaker extraction) are learning to separate overlaid voices with growing accuracy.

End-to-End Models

There is a trend toward unifying all stages (VAD, segmentation, embeddings, clustering) into a single model trained end to end. Such systems are simpler to deploy and potentially more accurate, because stages do not lose information when passing data between one another.


Conclusion

Speaker diarization transforms a faceless stream of text into a structured dialogue with attribution for every utterance. Behind the simple idea of "who spoke when" lies a sophisticated pipeline of speech detection, segmentation, voice fingerprint extraction, and clustering.

The technology is already mature enough for practical use — a DER of 5–15% covers most scenarios. And combined with speaker profiles, which Diktovka supports, the system does not just separate voices but also recognizes familiar people in new recordings.

If you work with recordings of meetings, interviews, or podcasts — diarization saves hours of manual annotation and turns audio into a truly useful document. If privacy of your audio data is a concern, read our guide on local vs cloud transcription.

FAQ

What is speaker diarization?

Speaker diarization is a technology that determines who was speaking at each moment of an audio recording. It segments the recording into portions belonging to different speakers and labels them — Speaker 1, Speaker 2, and so on.

How accurate is automatic diarization?

On clean studio recordings, DER (Diarization Error Rate) is 3–8%. On single-microphone meeting recordings — 8–15%. On teleconferences — 12–25%. For most practical tasks, a DER below 10% is considered a good result.

How many speakers can diarization detect?

Modern diarization systems (such as pyannote.audio) have no hard limit on the number of speakers. However, accuracy decreases with a large number of participants, especially if voices are similar or people speak simultaneously.

What tools support speaker diarization?

Free: Diktovka (Whisper + pyannote, with voice profiles) and pyannote.audio (open-source library). Paid: Otter.ai, AssemblyAI, Deepgram, Rev. Diktovka is the only free service with automatic familiar voice recognition.

How is diarization different from speech recognition?

Speech recognition (ASR) answers the question 'what was said' — it converts audio to text. Diarization answers the question 'who spoke when' — it splits audio by speaker. These are different technologies that work together to create structured transcripts.