How to Improve Audio Quality for Transcription: A Complete Guide
Audio quality is the single biggest factor that determines transcription accuracy. Even the most advanced speech recognition models, including OpenAI Whisper, produce significantly worse results on noisy, quiet, or distorted recordings. This guide covers concrete steps to record clean audio and prepare your files for transcription.
Why Audio Quality Matters
The relationship between recording quality and transcription accuracy is direct and measurable. The industry standard metric is WER (Word Error Rate) — the percentage of incorrectly recognized words.
Typical WER benchmarks:
- Clean studio recording: 3-5% errors — near-perfect transcription
- Good recording in a quiet room: 5-8% — minimal editing needed
- Recording with background noise: 15-25% — roughly every fourth to seventh word is wrong
- Poor recording (noise, echo, quiet voice): 25-40% — the text requires heavy editing
The difference between 5% and 25% WER is the difference between "copy and use" and "spend an hour on manual corrections." Investing 10 minutes in recording preparation saves you hours of editing.
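The WER figures above come from a standard word-level edit distance: the minimum number of substitutions, insertions, and deletions needed to turn the transcript into the reference, divided by the number of reference words. A minimal sketch in plain Python:

```python
# Word Error Rate (WER): word-level edit distance between a reference
# transcript and a hypothesis, divided by the reference word count.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance; substitutions,
    # insertions, and deletions each cost 1.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Two substitutions out of five reference words: WER = 0.4 (40%)
print(wer("close the windows and doors", "close a windows and door"))  # 0.4
```

Real evaluation tools also normalize casing and punctuation before scoring, so published WER numbers are usually computed on normalized text.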
How to Record Clean Audio
Choosing a Microphone
Your laptop's built-in microphone is the worst option for transcription. It picks up every room sound: keyboard clicks, fan noise, street sounds. Even a budget external microphone will produce dramatically better results.
USB microphones (for desk recording):
- Fifine K669 (~$25) — a budget condenser USB mic. Excellent quality for the price, plugs directly into your computer. Great for getting started.
- Samson Q2U (~$70) — dual USB/XLR mic, meaning it grows with you. Clean sound, built-in headphone jack for monitoring. A favorite among podcasters on a budget.
- Blue Yeti (~$100) — the classic USB microphone. Four polar patterns, excellent quality. If your budget allows, this is the go-to choice.
Lavalier microphones (for interviews and conversations):
- Boya BY-M1 (~$20) — a wired lavalier with excellent price-to-quality ratio. Connects via 3.5mm jack.
- Rode Wireless GO II (~$250) — wireless lavalier system with two transmitters. Perfect for two-person interviews with independent channels.
- Clip the lavalier 15-20 cm from the mouth — this keeps the voice clear with minimal background noise
For meetings and group recordings:
- Jabra Speak 510 (~$100) — a speakerphone with omnidirectional microphone. Captures voices from all around the table.
- Anker PowerConf S3 (~$70) — a budget conference speakerphone with 6 built-in microphones and 360-degree pickup.
- For group recordings, microphone placement matters more than price — one good mic in the center of the table beats an expensive one on the edge.
Recording Best Practices
Even with a great microphone, you can get a bad recording if you ignore basic rules.
Room selection:
- Close windows and doors
- Turn off air conditioning, fans, humidifiers — any sources of constant noise
- Soft furniture, curtains, carpets are your allies — they absorb echo
- Avoid empty rooms with bare walls — they produce strong reverb
Distance to microphone:
- Optimal: 15-30 cm (6-12 inches) from mouth to microphone
- Too close (<10 cm): plosive consonants (p, b, t) cause "pops" — clicks in the recording
- Too far (>50 cm): your voice drowns in room ambience
- Use a pop filter for desktop microphones — an inexpensive mesh screen that eliminates breath pops
Volume levels:
- Check levels in your recording app before starting
- Ideal range: -12 to -6 dBFS (peak level)
- If the meter hits the red zone, you are overloading the microphone and the audio will be distorted
- It is better to record slightly quieter — you can boost volume in post-processing, but you cannot remove distortion
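The -12 to -6 dB guideline refers to peak level in dBFS, where 0 dBFS is the loudest value the format can store and anything beyond it clips. A small sketch of how a level meter derives that reading from 16-bit samples:

```python
import math

def peak_dbfs(samples, full_scale=32768):
    """Peak level in dBFS for 16-bit integer samples.
    0 dBFS is full scale; anything that would exceed it clips."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return float("-inf")  # silence
    return 20 * math.log10(peak / full_scale)

# A 440 Hz tone peaking at half of full scale reads about -6 dBFS,
# the very top of the recommended -12..-6 dB window.
tone = [int(16384 * math.sin(2 * math.pi * 440 * n / 16000)) for n in range(1600)]
print(round(peak_dbfs(tone), 1))  # -6.0
```

Halving the amplitude costs about 6 dB, which is why "slightly quieter" is cheap insurance: a -18 dBFS recording boosts cleanly, while a clipped one is permanently distorted.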
Recording format:
- WAV or FLAC — for maximum quality (lossless)
- MP3 320 kbps — an acceptable compromise when file size matters
- MP3 128 kbps and below — noticeable quality loss, avoid for important recordings
- Most recording apps let you choose the format — choose WAV
Recording Meetings and Calls
In-person meetings:
- Place the microphone in the center of the table
- For more than 6 participants, use multiple microphones or a conference speakerphone
- Ask participants not to talk over each other — even the best diarization algorithm cannot separate simultaneous speech
Recording Zoom/Teams/Google Meet:
- Use the platform's built-in recording feature — it captures audio directly, without going through speakers and microphone
- In Zoom: Settings → Recording → "Record a separate audio file for each participant" — this is ideal for transcription with diarization
- Alternative: OBS Studio (free) can record system audio from any source
Recording phone calls:
- On iPhone: iOS 18.1 and later offer built-in call recording from the call screen; on older versions, use TapeACall or Rev Call Recorder
- On Android: ACR (Another Call Recorder) or Cube ACR
- Call recording quality is always lower — phone networks use compressed codecs. This is normal; Whisper handles this quality level well
Audio Preprocessing Before Transcription
If the recording is already done and the quality is not ideal, all is not lost. Basic processing can significantly improve transcription results.
Noise Reduction
Audacity (free, Windows/Mac/Linux):
Audacity is the most popular free audio editor. Here is a step-by-step noise reduction guide:
- Open your file in Audacity
- Find a section where nobody is speaking but background noise is audible (at least 1-2 seconds)
- Select that section with your mouse
- Menu: Effects → Noise Reduction → "Get Noise Profile"
- Select the entire recording (Ctrl+A / Cmd+A)
- Menu: Effect → Noise Reduction → adjust parameters:
- Noise reduction: 12-18 dB (start at 12, increase if noise persists)
- Sensitivity: 6-8
- Frequency smoothing: 3-6
- Click "Preview" to check, then "OK"
Adobe Podcast Enhance (free online tool):
Adobe offers a free speech enhancement tool at podcast.adobe.com/enhance. Upload your file — the AI automatically removes noise, adds voice clarity, and normalizes volume. Limit: files up to 1 hour. The results are impressive — often better than manual processing.
FFmpeg (command line):
For those who prefer automation, FFmpeg offers powerful filters. The afftdn filter provides adaptive noise reduction based on FFT. For more aggressive noise removal, increase the noise reduction parameter to 30-40. The silenceremove filter helps trim long pauses, which is useful for saving processing time.
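As a sketch of that automation, the afftdn pass can be scripted from Python; the `denoise_cmd` helper below is illustrative, and it assumes ffmpeg is installed on your PATH:

```python
import subprocess

def denoise_cmd(src, dst, nr_db=12):
    """Build an ffmpeg command applying the afftdn adaptive FFT
    noise-reduction filter. nr is the reduction amount in dB:
    12 is a gentle default; 30-40 is aggressive cleanup."""
    return ["ffmpeg", "-y", "-i", src, "-af", f"afftdn=nr={nr_db}", dst]

cmd = denoise_cmd("meeting.wav", "meeting_clean.wav")
print(" ".join(cmd))
# To actually run it (requires ffmpeg on PATH):
# subprocess.run(cmd, check=True)
```

The silenceremove filter can be chained after afftdn inside the same -af argument if you also want long pauses trimmed in one pass.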
Volume Normalization
Normalization evens out the recording volume — quiet speech gets louder, peaks get smoothed.
Why it matters:
- Whisper and other models work better with properly leveled audio
- If a recording has multiple speakers at different volumes, normalization balances them
- Quiet sections are often transcribed with errors
How to do it in Audacity:
- Open your file
- Select the entire recording (Ctrl+A / Cmd+A)
- Menu: Effect → Normalize
- Set peak amplitude to: -1.0 dB
- Click "OK"
For more advanced leveling, use the Compressor (Effect → Compressor) — it evens out the difference between quiet and loud sections without clipping peaks.
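Peak normalization itself is a simple gain calculation: find the loudest sample, then scale everything so that sample lands at the target level. A sketch of what the Normalize step computes, for 16-bit samples:

```python
def normalize(samples, target_dbfs=-1.0, full_scale=32768):
    """Scale 16-bit samples so the loudest peak sits at target_dbfs
    (the same operation as Audacity's Normalize at -1.0 dB)."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # pure silence: nothing to scale
    target = full_scale * 10 ** (target_dbfs / 20)
    gain = target / peak
    return [int(s * gain) for s in samples]

quiet = [100, -400, 250, 50]        # peaks at roughly -38 dBFS
loud = normalize(quiet)
print(max(abs(s) for s in loud))    # ~29204, i.e. -1 dBFS
```

Note that normalization applies one gain to the whole recording, which is why a compressor is still needed when different speakers sit at very different levels within the same file.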
Format Conversion
There is an optimal audio format for transcription. Diktovka automatically converts uploaded files, but if you are processing manually, here are the ideal parameters:
Optimal parameters for transcription:
- Channels: Mono (1 channel)
- Sample rate: 16,000 Hz (16 kHz)
- Bit depth: 16-bit
- Format: WAV or Opus
Why mono is better than stereo:
- Speech recognition models work with mono signals
- A stereo file gets converted to mono before processing — that is an unnecessary step
- In mono, the voice is stronger relative to background noise
- The file is half the size
In Audacity: Tracks → Mix → Mix Stereo Down to Mono. Then set the project sample rate to 16000 Hz (the Project Rate dropdown in the bottom-left corner in older versions; Audio Setup → Audio Settings in Audacity 3.3+). Export: File → Export Audio, choosing WAV with 16-bit PCM encoding.
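The stereo-to-mono step can also be done in a few lines with Python's standard wave module (sample-rate conversion is trickier and is best left to Audacity or FFmpeg). The function name and file paths are illustrative:

```python
import struct
import wave

def stereo_to_mono(src_path, dst_path):
    """Downmix a 16-bit stereo WAV to mono by averaging the channels,
    the same operation as Audacity's Mix Stereo Down to Mono."""
    with wave.open(src_path, "rb") as src:
        assert src.getnchannels() == 2 and src.getsampwidth() == 2
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())
    samples = struct.unpack(f"<{len(frames) // 2}h", frames)
    # Samples are interleaved L/R pairs: average each pair into one mono sample.
    mono = [(samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples), 2)]
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(rate)
        dst.writeframes(struct.pack(f"<{len(mono)}h", *mono))

# stereo_to_mono("interview_stereo.wav", "interview_mono.wav")
```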
Common Problems and Solutions
| Problem | Cause | Solution |
|---|---|---|
| Background noise (hum, hiss) | HVAC, electronics, traffic | Noise reduction in Audacity or Adobe Enhance |
| Echo and reverb | Empty room, bare walls | De-reverb filter; for future recordings, use a room with soft furnishings |
| Quiet voice | Too far from microphone | Normalization; when recording, move closer to the mic |
| Overlapping speakers | People talking simultaneously | Cannot be fully fixed, but diarization in Diktovka helps separate speakers |
| Background music | Radio, ambient music | Vocal isolation tools (UVR5, Demucs); best solution: turn off music during recording |
| Pops and clicks | Too close to mic, no pop filter | De-click filter in Audacity; use a pop filter or angle the mic 45 degrees |
| Distortion (clipping) | Microphone overload | Cannot be fixed after the fact; lower the input level before recording |
| Phone quality | Compressed voice codec | Normalization + light noise reduction; use VoIP when possible for better quality |
Diktovka Automatically Optimizes Your Audio
The Diktovka platform automatically performs key preparation steps when you upload a file:
- Conversion to the optimal format (mono, 16 kHz, Opus 32 kbps)
- FFmpeg processing — basic normalization and signal preparation
- Speaker diarization — automatic detection of who is speaking
- AI summarization — a brief summary of the recording
The platform handles even imperfect recordings — phone calls, noisy meeting recordings, voice messages. But the better the source quality, the more accurate the result. Investing 10 minutes in preparation yields a significantly more accurate transcription.
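For reference, a conversion to that target format (mono, 16 kHz, Opus at 32 kbps) can be reproduced locally with FFmpeg. The command builder below is a sketch, not the platform's actual pipeline; it assumes an ffmpeg build with libopus support on your PATH:

```python
def to_transcription_format(src, dst):
    """ffmpeg arguments for mono, 16 kHz, Opus at 32 kbps
    (the target format described above; paths are illustrative)."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-ac", "1",            # downmix to mono
        "-ar", "16000",        # resample to 16 kHz
        "-c:a", "libopus",     # Opus codec
        "-b:a", "32k",         # 32 kbps bitrate
        dst,
    ]

print(" ".join(to_transcription_format("meeting.mp3", "meeting.opus")))
```

The same -ac 1 and -ar 16000 flags apply if you prefer exporting 16-bit WAV instead of Opus.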
Pre-Recording Checklist
Print this out or save it — check before every important recording:
- Microphone is connected and selected as the input device in your system settings
- Test recording done — listen to 10 seconds, verify the audio is clean
- Room is quiet — windows closed, noisy devices off
- Distance to microphone — 15-30 cm (or lavalier clipped 15-20 cm from mouth)
- Recording level — peaks between -12 and -6 dB, not hitting the red zone
- Recording format — WAV or FLAC (not MP3 128 kbps)
- Sufficient disk space — WAV uses ~10 MB/min
- Ask participants not to interrupt and to speak clearly
- Pop filter in place (for desktop microphones)
- Recording is running — sounds obvious, but it gets forgotten more often than you think
Conclusion
Improving audio quality for transcription is not rocket science. A decent microphone for $25-100, a quiet room, and proper recording settings deliver 80% of the result. The remaining 20% is post-processing in Audacity or Adobe Enhance.
Upload your prepared audio to Diktovka — and get a transcription that barely needs editing.
FAQ
What microphone is best for transcription?
For desk recording, a USB microphone works best: the budget Fifine K669 (~$25) or Blue Yeti (~$100) for top quality. For interviews, a lavalier like Boya BY-M1 (~$20). For meetings, a speakerphone like Jabra Speak 510. Even a budget external microphone is dramatically better than a laptop's built-in mic.
How do I remove noise from an audio recording before transcription?
In free Audacity: find a silent section with background noise, select it, apply 'Get Noise Profile', then select the entire recording and run 'Noise Reduction' (12-18 dB). An easier option is Adobe Podcast Enhance (free online tool), which automatically cleans audio using AI.
What is the minimum audio quality needed for good transcription?
For 5-8% WER (minimal editing needed), record in a quiet room with an external microphone 15-30 cm away. Use WAV or MP3 320 kbps format. With noisy recordings, WER rises to 15-25%, and with poor quality (echo, quiet voice) to 25-40%, requiring extensive manual editing.
What audio format is best for transcription?
Optimal settings: mono, 16 kHz, 16-bit WAV. Mono is better than stereo — speech recognition models work with single-channel signal, voice is stronger relative to background noise, and the file is half the size. Avoid MP3 128 kbps and below due to noticeable quality loss.
How can I improve a recording using FFmpeg?
FFmpeg offers the afftdn filter for adaptive noise reduction based on FFT. For aggressive noise reduction, increase the noise reduction parameter to 30-40. The silenceremove filter removes long pauses, saving processing time. For optimal format conversion: mono, 16 kHz, 16-bit.