WER (Word Error Rate): How Speech Recognition Accuracy Is Measured


Word Error Rate (WER) is the gold standard metric for evaluating speech recognition quality. We break down the formula, walk through real examples, explain what different WER values mean in practice, and cover the factors that make or break transcription accuracy. If you have ever wondered why one transcription service produces near-perfect text while another delivers garbled nonsense, the answer almost always comes down to three letters: WER.


What Is WER

Word Error Rate (WER) is the standard metric used to measure the accuracy of automatic speech recognition (ASR) systems. In simple terms, WER tells you what percentage of words the system got wrong.

The concept is straightforward: take a reference transcript (what was actually said), compare it to the system output (what the ASR produced), and count the errors. The lower the WER, the better the recognition.

WER is used everywhere — in academic papers, API documentation for speech recognition services, model comparison benchmarks, and product evaluations. It is the lingua franca of the ASR industry, shared by researchers, developers, and end users alike.


The WER Formula

The WER formula is:

WER = (S + D + I) / N x 100%

Where:

- S is the number of substitutions (incorrectly recognized words)
- D is the number of deletions (words the system dropped)
- I is the number of insertions (extra words the system added)
- N is the total number of words in the reference transcript

Notice that the numerator contains three types of errors, while the denominator is only the reference word count. This means WER can theoretically exceed 100% (if there are many insertions), though this is rare in practice.
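To make the formula concrete, here is a minimal, self-contained Python sketch: the classic dynamic-programming edit distance applied at the word level, whose total cost is exactly S + D + I.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate in percent: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits (S + D + I) turning the first i reference
    # words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        d[i][0] = i                                # i deletions
    for j in range(1, len(hyp) + 1):
        d[0][j] = j                                # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref) * 100

ref = "I want to book a train ticket"
hyp = "I want to book a plane ticket"
print(f"{wer(ref, hyp):.1f}%")  # 14.3%, matching the worked example below
```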


How WER Is Calculated: A Worked Example

Let us walk through a concrete example.

Reference (what was actually said): "I want to book a train ticket"

ASR output: "I want to book a plane ticket"

Comparing word by word:

Position | Reference | Recognized | Error Type
---------|-----------|------------|-----------------
1        | I         | I          | Correct
2        | want      | want       | Correct
3        | to        | to         | Correct
4        | book      | book       | Correct
5        | a         | a          | Correct
6        | train     | plane      | Substitution (S)
7        | ticket    | ticket     | Correct

Result:

WER = (1 + 0 + 0) / 7 x 100% = 14.3%

Now consider a more complex example with all three error types:

Reference: "The meeting will take place tomorrow at ten in the morning"

ASR output: "The meeting will take place at ten o'clock in the morning"

Position | Reference | Recognized | Error Type
---------|-----------|------------|--------------
1        | The       | The        | Correct
2        | meeting   | meeting    | Correct
3        | will      | will       | Correct
4        | take      | take       | Correct
5        | place     | place      | Correct
6        | tomorrow  | —          | Deletion (D)
7        | at        | at         | Correct
8        | ten       | ten        | Correct
9        | —         | o'clock    | Insertion (I)
10       | in        | in         | Correct
11       | the       | the        | Correct
12       | morning   | morning    | Correct

WER = (0 + 1 + 1) / 11 x 100% ≈ 18.2%

Note that N counts the 11 reference words, not the 12 alignment rows, which include the inserted word.

Note something important: the system dropped "tomorrow" — a word that carries critical meaning — while the inserted "o'clock" is essentially harmless. WER treats both errors equally, which is one of its known limitations.
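In practice you would let a library do the alignment. Here is a quick check of this example with the jiwer package (a sketch assuming jiwer 3.x, where process_words reports per-type error counts):

```python
# pip install jiwer
import jiwer

ref = "The meeting will take place tomorrow at ten in the morning"
hyp = "The meeting will take place at ten o'clock in the morning"

out = jiwer.process_words(ref, hyp)
print(out.substitutions, out.deletions, out.insertions)  # 0 1 1
print(f"WER = {out.wer:.1%}")                            # WER = 18.2%
```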


What Different WER Values Mean

Not all WER values have the same practical impact. Here is a general scale:

WER       | Quality    | Practical Meaning
----------|------------|------------------------------------------------------------------------
Below 5%  | Excellent  | Professional-grade, usable without editing. Publish-ready
5–10%     | Good       | Minimal editing needed. Suitable for notes, meeting minutes, subtitles
10–20%    | Acceptable | Noticeable errors, but the core meaning is clear. Needs significant editing
20–30%    | Poor       | Requires re-listening and substantial corrections
Above 30% | Unusable   | Faster to type from scratch

Context matters enormously. For medical documentation, even 5% WER may be unacceptable — a single wrong drug name is a patient safety issue. For personal voice notes, 15% WER is perfectly fine as long as the main ideas come through.
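If you need to bucket scores programmatically, the scale above maps directly onto a small helper. The band names and thresholds follow the table; the function itself is just an illustration:

```python
def quality_band(wer_percent: float) -> str:
    """Map a WER value (in percent) to the quality bands in the table above."""
    if wer_percent < 5:
        return "excellent"
    if wer_percent < 10:
        return "good"
    if wer_percent < 20:
        return "acceptable"
    if wer_percent <= 30:
        return "poor"
    return "unusable"

print(quality_band(14.3))  # acceptable
```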


Factors That Affect WER

Transcription accuracy depends on many factors. Understanding them helps you choose the right tool and prepare your audio for the best possible results.

Audio Quality

This is the single biggest factor — often more impactful than which model you use.

Background noise is the most common accuracy killer. Air conditioning hum, conversations in the next room, street noise, background music — all of these add 5–20 percentage points to WER depending on intensity. A signal-to-noise ratio (SNR) below 10 dB makes transcription essentially useless for most systems.

Microphone quality makes a significant difference. A good external microphone placed close to the speaker can reduce WER by 3–10 percentage points compared to a laptop built-in mic at arm's length. Headsets and lavalier mics are a transcriptionist's best friends.

Reverberation and echo add 5–15 percentage points to WER. Recording in a large empty room or using speakerphone significantly degrades recognition. Soft surfaces, carpets, curtains — anything that absorbs sound — helps.

Speech Characteristics

Accent and dialect increase WER by 5–15 percentage points. Models are trained primarily on standard pronunciation. A strong regional accent, dialect, or non-native speaker accent noticeably reduces accuracy.

Speaking speed matters too: a fast pace adds 3–10 percentage points to WER. When people speak rapidly, words blur together, boundaries between them become fuzzy, and models struggle to segment them.

Overlapping speech is the hardest scenario for ASR systems. When two people talk simultaneously, WER can increase by 10–30 percentage points. Even models with diarization (speaker separation) handle crosstalk poorly.

Specialized vocabulary — technical jargon, abbreviations, company and product names — adds 5–15 percentage points to WER. The model may not know the word "decontamination" or the drug name "Amoxiclav" and will substitute something phonetically similar.
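One common mitigation is to bias the decoder toward your domain terms. A sketch using the initial_prompt parameter of the openai-whisper package (the terms are the examples above; the file name is a placeholder):

```python
# pip install openai-whisper  (also requires ffmpeg)
import whisper

model = whisper.load_model("small")
result = model.transcribe(
    "clinic_recording.wav",                       # placeholder file name
    initial_prompt="decontamination, Amoxiclav",  # nudges decoding toward these terms
)
print(result["text"])
```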

Language

Not all languages are recognized equally well.

English consistently shows the best results because it has the most training data. Whisper large-v3 achieves 3–4% WER on clean English audio.

Other major languages like Spanish, German, French, and Russian perform well but slightly worse — typically 5–8% WER on clean audio. On real-world recordings (meetings, phone calls), expect 12–20%.

Low-resource languages show significantly higher WER — from 15% to 40%+ even on clean audio, simply because the models were trained on far less data.


WER Across Different Models

Comparative results for popular models on standard benchmarks (clean speech, studio quality):

Model                      | English | Russian | Spanish | German
---------------------------|---------|---------|---------|-------
Whisper large-v3           | 3–4%    | 5–7%    | 4–5%    | 5–6%
Google Speech-to-Text (V2) | 4–5%    | 6–8%    | 5–7%    | 6–8%
Azure Speech               | 4–5%    | 6–9%    | 5–7%    | 5–7%
Deepgram Nova-2            | 3–4%    | 7–10%   | 5–7%    | 6–8%
GigaAM2 (Sber)             | —       | 4–6%    | —       | —

Important caveat: these numbers are for clean audio under controlled conditions. On real-world recordings, expect WER to be 1.5–3x higher. Different benchmarks also yield different results, so comparing numbers from different sources requires caution. For a detailed comparison of transcription models and services for the Russian language, see our market guide.


Limitations of WER

Despite its ubiquity, WER is far from a perfect metric. It has significant limitations.

Ignores punctuation. WER compares only words, disregarding commas, periods, and other punctuation marks. Yet punctuation can fundamentally change meaning: "Let's eat, grandma" versus "Let's eat grandma."

Ignores capitalization. "Paris" and "paris" are the same to WER, though this may matter in text output.

Does not distinguish error severity. Substituting "conference" with "conferences" (inflectional form) and substituting "approved" with "cancelled" both count as one substitution, even though the second completely changes the meaning.

Does not account for normalization. "15" and "fifteen," "Mr." and "Mister," "%" and "percent" — these are different strings to WER, despite being semantically identical.

WER can exceed 100%. If the system inserts many extra words, the numerator can exceed the denominator. Rare in practice, but formally possible.

Does not reflect readability. A transcript with 10% WER where errors are evenly distributed may read better than one with 5% WER where all errors are concentrated in a single critical paragraph.
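Several of these limitations (case, punctuation, unnormalized numbers) are commonly worked around by normalizing both texts before scoring. A toy sketch, assuming the jiwer package for the scoring step; the numeral mapping is illustrative, not a library feature:

```python
# pip install jiwer
import re

import jiwer

NUMERALS = {"15": "fifteen"}  # illustrative mapping; extend for your data

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and spell out known numerals."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return " ".join(NUMERALS.get(w, w) for w in text.split())

ref = "The price rose by 15 percent."
hyp = "the price rose by fifteen percent"

print(f"raw WER:        {jiwer.wer(ref, hyp):.0%}")
print(f"normalized WER: {jiwer.wer(normalize(ref), normalize(hyp)):.0%}")  # 0%
```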


Alternative Metrics

Due to WER's limitations, researchers and developers also use other metrics.

CER (Character Error Rate)

The character-level equivalent of WER. Same formula, but counting individual characters instead of words. CER is especially useful for languages that do not separate words with spaces (Chinese, Japanese, Thai) and for evaluating morphological errors in inflected languages: "book" vs "books" is a 100% error in WER but roughly 20% in CER (one character changed out of five).
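The "book"/"books" contrast is easy to verify: jiwer (assumed, as in the earlier sketches) ships a cer function alongside wer.

```python
import jiwer

print(jiwer.wer("books", "book"))  # 1.0 -> 100% at the word level
print(jiwer.cer("books", "book"))  # 0.2 -> 20% at the character level
```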

MER (Match Error Rate)

A normalized version of WER that accounts for the alignment between reference and hypothesis words. MER always stays in the 0–1 range, unlike WER which can exceed 100%.

WIL (Word Information Lost)

A metric that considers both precision and recall of recognition. WIL indicates what proportion of information was lost. It is considered a more balanced assessment than WER.
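Both metrics fall out of the same alignment as WER, so jiwer (again assumed) reports them together. The numbers below are for the meeting example from earlier:

```python
import jiwer

ref = "The meeting will take place tomorrow at ten in the morning"
hyp = "The meeting will take place at ten o'clock in the morning"

out = jiwer.process_words(ref, hyp)
print(f"WER: {out.wer:.3f}  MER: {out.mer:.3f}  WIL: {out.wil:.3f}")
# WER: 0.182  MER: 0.167  WIL: 0.174
```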

Subjective Evaluation

MOS (Mean Opinion Score) — an average human rating on a scale from 1 to 5. A group of evaluators rates the transcription quality and their scores are averaged. Expensive and slow, but the most accurate reflection of real-world quality.

Readability assessment — instead of word-by-word comparison, experts evaluate how well the text conveys the meaning of the original and how easy it is to read. Sometimes a transcript with higher WER can be more readable than one with lower WER but poor punctuation.


How to Improve WER for Your Tasks

If transcription quality is not meeting your needs, here is what you can do — in order of effectiveness.

1. Improve audio quality. This is the single most impactful step. Use an external microphone, minimize background noise, record in a quiet room. Simply switching from a laptop built-in mic to a lavalier can reduce WER by 5–10 percentage points.

2. Choose the right model. For maximum accuracy, use large models: Whisper large-v3 for multilingual tasks, specialized models like GigaAM for Russian. Smaller models (tiny, small) are faster but make more errors. A minimal usage sketch follows this list.

3. Apply post-processing. Automatic punctuation, number normalization, abbreviation expansion, correction of common errors — all of these improve readability even if they do not formally reduce WER.

4. Use fine-tuning. If you work with specialized vocabulary (medical, legal, IT), fine-tuning a model on your terminology can reduce WER by 20–40% relative for those terms.

5. Use an optimized service. Services like Diktovka combine Whisper large-v3 with speaker diarization, normalization, and AI summarization to deliver the best possible results without manual tuning. This is the simplest way to achieve low WER without technical expertise.
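For the self-hosted route from step 2, a minimal transcription sketch with the openai-whisper package (the file name is a placeholder):

```python
# pip install openai-whisper  (also requires ffmpeg)
import whisper

model = whisper.load_model("large-v3")  # "medium", "small", "tiny" trade accuracy for speed
result = model.transcribe("meeting.mp3")
print(result["text"])
```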


Key Takeaways

WER remains the gold standard for evaluating speech recognition quality, despite its limitations. Understanding this metric helps you compare models and services objectively, read vendor benchmarks with the right skepticism, and set realistic expectations for your own recordings.
Remember: 5% WER does not mean the text is perfect — it means roughly one in every 20 words will contain an error. For a short recording, that may be invisible. For an hour-long lecture, that is dozens of mistakes. Context, audio quality, and choosing the right tool make all the difference.

FAQ

What is a good WER for speech recognition?

WER below 5% is excellent quality — the text can be used without editing. 5–10% is good with minimal corrections needed. 10–20% is acceptable, with the core meaning clear. Above 20% is poor quality that requires re-listening.

How is WER calculated?

WER = (S + D + I) / N x 100%, where S is substitutions (incorrectly recognized words), D is deletions (missed words), I is insertions (extra words added), and N is the total number of words in the reference text.

What is the difference between WER and CER?

WER counts errors at the word level, while CER (Character Error Rate) counts errors at the individual character level. CER is more useful for evaluating morphological errors: changing 'book' to 'books' is a 100% error in WER but only about 20% in CER.

Why can WER exceed 100%?

WER can exceed 100% because the formula's numerator includes insertions — words the system added that were not in the original. If there are many insertions, the numerator becomes larger than the denominator. In practice, this is rare.

What WER do modern models achieve for major languages?

Whisper large-v3 achieves 3–4% WER on clean English audio and 5–7% for Russian. On real-world recordings (meetings, phone calls), expect 12–18% due to noise, accents, and overlapping speech.