WER (Word Error Rate): How Speech Recognition Accuracy Is Measured


Word Error Rate (WER) is the gold standard metric for evaluating speech recognition quality. We break down the formula, walk through real examples, explain what different WER values mean in practice, and cover the factors that make or break transcription accuracy. If you have ever wondered why one transcription service produces near-perfect text while another delivers garbled nonsense, the answer almost always comes down to three letters: WER.


What Is WER

Word Error Rate (WER) is the standard metric used to measure the accuracy of automatic speech recognition (ASR) systems. In simple terms, WER tells you what percentage of words the system got wrong.

The concept is straightforward: take a reference transcript (what was actually said), compare it to the system output (what the ASR produced), and count the errors. The lower the WER, the better the recognition.

WER is used everywhere — in academic papers, API documentation for speech recognition services, model comparison benchmarks, and product evaluations. It is the lingua franca of the ASR industry, shared by researchers, developers, and end users alike.


The WER Formula

The WER formula is:

WER = (S + D + I) / N x 100%

Where:

- S is the number of substitutions (incorrectly recognized words)
- D is the number of deletions (words the system dropped)
- I is the number of insertions (extra words the system added)
- N is the total number of words in the reference transcript

Notice that the numerator contains three types of errors, while the denominator is only the reference word count. This means WER can theoretically exceed 100% (if there are many insertions), though this is rare in practice.
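To make the formula concrete, here is a minimal, self-contained Python sketch: the classic dynamic-programming edit distance applied at the word level, whose total cost is exactly S + D + I.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate in percent: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits (S + D + I) turning the first i reference
    # words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        d[i][0] = i                                # i deletions
    for j in range(1, len(hyp) + 1):
        d[0][j] = j                                # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref) * 100

ref = "I want to book a train ticket"
hyp = "I want to book a plane ticket"
print(f"{wer(ref, hyp):.1f}%")  # 14.3%, matching the worked example below
```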


How WER Is Calculated: A Worked Example

Let us walk through a concrete example.

Reference (what was actually said): "I want to book a train ticket"

ASR output: "I want to book a plane ticket"

Comparing word by word:

Position | Reference | Recognized | Error Type
---------|-----------|------------|-----------------
1        | I         | I          | Correct
2        | want      | want       | Correct
3        | to        | to         | Correct
4        | book      | book       | Correct
5        | a         | a          | Correct
6        | train     | plane      | Substitution (S)
7        | ticket    | ticket     | Correct

Result:

WER = (1 + 0 + 0) / 7 x 100% = 14.3%

Now consider a more complex example with all three error types:

Reference: "The meeting will take place tomorrow at ten in the morning"

ASR output: "The meeting will take place at ten o'clock in the morning"

Position | Reference | Recognized | Error Type
---------|-----------|------------|--------------
1        | The       | The        | Correct
2        | meeting   | meeting    | Correct
3        | will      | will       | Correct
4        | take      | take       | Correct
5        | place     | place      | Correct
6        | tomorrow  | —          | Deletion (D)
7        | at        | at         | Correct
8        | ten       | ten        | Correct
9        | —         | o'clock    | Insertion (I)
10       | in        | in         | Correct
11       | the       | the        | Correct
12       | morning   | morning    | Correct

WER = (0 + 1 + 1) / 11 x 100% ≈ 18.2%

Note that N counts the 11 reference words, not the 12 alignment rows, which include the inserted word.

Note something important: the system dropped "tomorrow" — a word that carries critical meaning — while the inserted "o'clock" is essentially harmless. WER treats both errors equally, which is one of its known limitations.
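In practice you would let a library do the alignment. Here is a quick check of this example with the jiwer package (a sketch assuming jiwer 3.x, where process_words reports per-type error counts):

```python
# pip install jiwer
import jiwer

ref = "The meeting will take place tomorrow at ten in the morning"
hyp = "The meeting will take place at ten o'clock in the morning"

out = jiwer.process_words(ref, hyp)
print(out.substitutions, out.deletions, out.insertions)  # 0 1 1
print(f"WER = {out.wer:.1%}")                            # WER = 18.2%
```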


What Different WER Values Mean

Not all WER values have the same practical impact. Here is a general scale:

WER       | Quality    | Practical Meaning
----------|------------|------------------------------------------------------------------------
Below 5%  | Excellent  | Professional-grade, usable without editing. Publish-ready
5–10%     | Good       | Minimal editing needed. Suitable for notes, meeting minutes, subtitles
10–20%    | Acceptable | Noticeable errors, but the core meaning is clear. Needs significant editing
20–30%    | Poor       | Requires re-listening and substantial corrections
Above 30% | Unusable   | Faster to type from scratch

Context matters enormously. For medical documentation, even 5% WER may be unacceptable — a single wrong drug name is a patient safety issue. For personal voice notes, 15% WER is perfectly fine as long as the main ideas come through.
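If you need to bucket scores programmatically, the scale above maps directly onto a small helper. The band names and thresholds follow the table; the function itself is just an illustration:

```python
def quality_band(wer_percent: float) -> str:
    """Map a WER value (in percent) to the quality bands in the table above."""
    if wer_percent < 5:
        return "excellent"
    if wer_percent < 10:
        return "good"
    if wer_percent < 20:
        return "acceptable"
    if wer_percent <= 30:
        return "poor"
    return "unusable"

print(quality_band(14.3))  # acceptable
```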


Factors That Affect WER

Transcription accuracy depends on many factors. Understanding them helps you choose the right tool and prepare your audio for the best possible results.

Audio Quality

This is the single biggest factor — often more impactful than which model you use.

Background noise is the most common accuracy killer. Air conditioning hum, conversations in the next room, street noise, background music — all of these add 5–20 percentage points to WER depending on intensity. A signal-to-noise ratio (SNR) below 10 dB makes transcription essentially useless for most systems.

Microphone quality makes a significant difference. A good external microphone placed close to the speaker can reduce WER by 3–10 percentage points compared to a laptop built-in mic at arm's length. Headsets and lavalier mics are a transcriptionist's best friends.

Reverberation and echo add 5–15 percentage points to WER. Recording in a large empty room or using speakerphone significantly degrades recognition. Soft surfaces, carpets, curtains — anything that absorbs sound — helps.

Speech Characteristics

Accent and dialect increase WER by 5–15 percentage points. Models are trained primarily on standard pronunciation. A strong regional accent, dialect, or non-native speaker accent noticeably reduces accuracy.

Speaking speed matters too: a fast pace adds 3–10 percentage points to WER. When people speak rapidly, words blur together, boundaries between them become fuzzy, and models struggle to segment them.

Overlapping speech is the hardest scenario for ASR systems. When two people talk simultaneously, WER can increase by 10–30 percentage points. Even models with diarization (speaker separation) handle crosstalk poorly.

Specialized vocabulary — technical jargon, abbreviations, company and product names — adds 5–15 percentage points to WER. The model may not know the word "decontamination" or the drug name "Amoxiclav" and will substitute something phonetically similar.
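One common mitigation is to bias the decoder toward your domain terms. A sketch using the initial_prompt parameter of the openai-whisper package (the terms are the examples above; the file name is a placeholder):

```python
# pip install openai-whisper  (also requires ffmpeg)
import whisper

model = whisper.load_model("small")
result = model.transcribe(
    "clinic_recording.wav",                       # placeholder file name
    initial_prompt="decontamination, Amoxiclav",  # nudges decoding toward these terms
)
print(result["text"])
```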

Language

Not all languages are recognized equally well.

English consistently shows the best results because it has the most training data. Whisper large-v3 achieves 3–4% WER on clean English audio.

Other major languages like Spanish, German, French, and Russian perform well but slightly worse — typically 5–8% WER on clean audio. On real-world recordings (meetings, phone calls), expect 12–20%.

Low-resource languages show significantly higher WER — from 15% to 40%+ even on clean audio, simply because the models were trained on far less data.


WER Across Different Models

Comparative results for popular models on standard benchmarks (clean speech, studio quality):

Model                      | English | Russian | Spanish | German
---------------------------|---------|---------|---------|-------
Whisper large-v3           | 3–4%    | 5–7%    | 4–5%    | 5–6%
Google Speech-to-Text (V2) | 4–5%    | 6–8%    | 5–7%    | 6–8%
Azure Speech               | 4–5%    | 6–9%    | 5–7%    | 5–7%
Deepgram Nova-2            | 3–4%    | 7–10%   | 5–7%    | 6–8%
GigaAM2 (Sber)             | —       | 4–6%    | —       | —

Important caveat: these numbers are for clean audio under controlled conditions. On real-world recordings, expect WER to be 1.5–3x higher. Different benchmarks also yield different results, so comparing numbers from different sources requires caution. For a detailed comparison of transcription models and services for the Russian language, see our market guide.


Limitations of WER

Despite its ubiquity, WER is far from a perfect metric. It has significant limitations.

Ignores punctuation. WER compares only words, disregarding commas, periods, and other punctuation marks. Yet punctuation can fundamentally change meaning: "Let's eat, grandma" versus "Let's eat grandma."

Ignores capitalization. "Paris" and "paris" are the same to WER, though this may matter in text output.

Does not distinguish error severity. Substituting "conference" with "conferences" (inflectional form) and substituting "approved" with "cancelled" both count as one substitution, even though the second completely changes the meaning.

Does not account for normalization. "15" and "fifteen," "Mr." and "Mister," "%" and "percent" — these are different strings to WER, despite being semantically identical.

WER can exceed 100%. If the system inserts many extra words, the numerator can exceed the denominator. Rare in practice, but formally possible.

Does not reflect readability. A transcript with 10% WER where errors are evenly distributed may read better than one with 5% WER where all errors are concentrated in a single critical paragraph.
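Several of these limitations (case, punctuation, unnormalized numbers) are commonly worked around by normalizing both texts before scoring. A toy sketch, assuming the jiwer package for the scoring step; the numeral mapping is illustrative, not a library feature:

```python
# pip install jiwer
import re

import jiwer

NUMERALS = {"15": "fifteen"}  # illustrative mapping; extend for your data

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and spell out known numerals."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return " ".join(NUMERALS.get(w, w) for w in text.split())

ref = "The price rose by 15 percent."
hyp = "the price rose by fifteen percent"

print(f"raw WER:        {jiwer.wer(ref, hyp):.0%}")
print(f"normalized WER: {jiwer.wer(normalize(ref), normalize(hyp)):.0%}")  # 0%
```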


Alternative Metrics

Due to WER's limitations, researchers and developers also use other metrics.

CER (Character Error Rate)

The character-level equivalent of WER. Same formula, but counting individual characters instead of words. CER is especially useful for languages that do not separate words with spaces (Chinese, Japanese, Thai) and for evaluating morphological errors in inflected languages: "book" vs "books" is a 100% error in WER but roughly 20% in CER (one character changed out of five).
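The "book"/"books" contrast is easy to verify: jiwer (assumed, as in the earlier sketches) ships a cer function alongside wer.

```python
import jiwer

print(jiwer.wer("books", "book"))  # 1.0 -> 100% at the word level
print(jiwer.cer("books", "book"))  # 0.2 -> 20% at the character level
```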

MER (Match Error Rate)

A normalized version of WER that accounts for the alignment between reference and hypothesis words. MER always stays in the 0–1 range, unlike WER which can exceed 100%.

WIL (Word Information Lost)

A metric that considers both precision and recall of recognition. WIL indicates what proportion of information was lost. It is considered a more balanced assessment than WER.
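Both metrics fall out of the same alignment as WER, so jiwer (again assumed) reports them together. The numbers below are for the meeting example from earlier:

```python
import jiwer

ref = "The meeting will take place tomorrow at ten in the morning"
hyp = "The meeting will take place at ten o'clock in the morning"

out = jiwer.process_words(ref, hyp)
print(f"WER: {out.wer:.3f}  MER: {out.mer:.3f}  WIL: {out.wil:.3f}")
# WER: 0.182  MER: 0.167  WIL: 0.174
```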

Subjective Evaluation

MOS (Mean Opinion Score) — an average human rating on a scale from 1 to 5. A group of evaluators rates the transcription quality and their scores are averaged. Expensive and slow, but the most accurate reflection of real-world quality.

Readability assessment — instead of word-by-word comparison, experts evaluate how well the text conveys the meaning of the original and how easy it is to read. Sometimes a transcript with higher WER can be more readable than one with lower WER but poor punctuation.


How to Improve WER for Your Tasks

If transcription quality is not meeting your needs, here is what you can do — in order of effectiveness.

1. Improve audio quality. This is the single most impactful step. Use an external microphone, minimize background noise, record in a quiet room. Simply switching from a laptop built-in mic to a lavalier can reduce WER by 5–10 percentage points.

2. Choose the right model. For maximum accuracy, use large models: Whisper large-v3 for multilingual tasks, specialized models like GigaAM for Russian. Smaller models (tiny, small) are faster but make more errors. A minimal usage sketch follows this list.

3. Apply post-processing. Automatic punctuation, number normalization, abbreviation expansion, correction of common errors — all of these improve readability even if they do not formally reduce WER.

4. Use fine-tuning. If you work with specialized vocabulary (medical, legal, IT), fine-tuning a model on your terminology can reduce WER by 20–40% relative for those terms.

5. Use an optimized service. Services like Diktovka combine Whisper large-v3 with speaker diarization, normalization, and AI summarization to deliver the best possible results without manual tuning. This is the simplest way to achieve low WER without technical expertise.
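For the self-hosted route from step 2, a minimal transcription sketch with the openai-whisper package (the file name is a placeholder):

```python
# pip install openai-whisper  (also requires ffmpeg)
import whisper

model = whisper.load_model("large-v3")  # "medium", "small", "tiny" trade accuracy for speed
result = model.transcribe("meeting.mp3")
print(result["text"])
```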


Key Takeaways

WER remains the gold standard for evaluating speech recognition quality, despite its limitations. Understanding this metric helps you compare models and services objectively, read vendor benchmarks with the right skepticism, and set realistic expectations for your own recordings.
Remember: 5% WER does not mean the text is perfect — it means roughly one in every 20 words will contain an error. For a short recording, that may be invisible. For an hour-long lecture, that is dozens of mistakes. Context, audio quality, and choosing the right tool make all the difference.

FAQ

What is a good WER for speech recognition?

WER below 5% is excellent quality — the text can be used without editing. 5–10% is good with minimal corrections needed. 10–20% is acceptable, with the core meaning clear. Above 20% is poor quality that requires re-listening.

How is WER calculated?

WER = (S + D + I) / N x 100%, where S is substitutions (incorrectly recognized words), D is deletions (missed words), I is insertions (extra words added), and N is the total number of words in the reference text.

What is the difference between WER and CER?

WER counts errors at the word level, while CER (Character Error Rate) counts errors at the individual character level. CER is more useful for evaluating morphological errors: changing 'book' to 'books' is a 100% error in WER but only about 20% in CER.

Why can WER exceed 100%?

WER can exceed 100% because the formula's numerator includes insertions — words the system added that were not in the original. If there are many insertions, the numerator becomes larger than the denominator. In practice, this is rare.

What WER do modern models achieve for major languages?

Whisper large-v3 achieves 3–4% WER on clean English audio and 5–7% for Russian. On real-world recordings (meetings, phone calls), expect 12–18% due to noise, accents, and overlapping speech.