How to Improve Audio Quality for Transcription: A Complete Guide
Audio quality is the single biggest factor that determines transcription accuracy. Even the most advanced speech recognition models, including OpenAI Whisper, produce significantly worse results on noisy, quiet, or distorted recordings. This guide covers concrete steps to record clean audio and prepare your files for transcription.
Why Audio Quality Matters
The relationship between recording quality and transcription accuracy is direct and measurable. The industry standard metric is WER (Word Error Rate) — the percentage of incorrectly recognized words.
Typical WER benchmarks:
- Clean studio recording: 3-5% errors — near-perfect transcription
- Good recording in a quiet room: 5-8% — minimal editing needed
- Recording with background noise: 15-25% — roughly every fourth to seventh word is wrong
- Poor recording (noise, echo, quiet voice): 25-40% — the text requires heavy editing
The difference between 5% and 25% WER is the difference between "copy and use" and "spend an hour on manual corrections." Investing 10 minutes in recording preparation saves you hours of editing.
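The WER figures above come from a standard word-level edit distance: the minimum number of substitutions, insertions, and deletions needed to turn the transcript into the reference, divided by the number of reference words. A minimal sketch in plain Python:

```python
# Word Error Rate (WER): word-level edit distance between a reference
# transcript and a hypothesis, divided by the reference word count.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance; substitutions,
    # insertions, and deletions each cost 1.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Two substitutions out of five reference words: WER = 0.4 (40%)
print(wer("close the windows and doors", "close a windows and door"))  # 0.4
```

Real evaluation tools also normalize casing and punctuation before scoring, so published WER numbers are usually computed on normalized text.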
How to Record Clean Audio
Choosing a Microphone
Your laptop's built-in microphone is the worst option for transcription. It picks up every room sound: keyboard clicks, fan noise, street sounds. Even a budget external microphone will produce dramatically better results.
USB microphones (for desk recording):
- Fifine K669 (~$25) — a budget condenser USB mic. Excellent quality for the price, plugs directly into your computer. Great for getting started.
- Samson Q2U (~$70) — dual USB/XLR mic, meaning it grows with you. Clean sound, built-in headphone jack for monitoring. A favorite among podcasters on a budget.
- Blue Yeti (~$100) — the classic USB microphone. Four polar patterns, excellent quality. If your budget allows, this is the go-to choice.
Lavalier microphones (for interviews and conversations):
- Boya BY-M1 (~$20) — a wired lavalier with excellent price-to-quality ratio. Connects via 3.5mm jack.
- Rode Wireless GO II (~$250) — wireless lavalier system with two transmitters. Perfect for two-person interviews with independent channels.
- Clip the lavalier 15-20 cm from the mouth — this keeps the voice clear with minimal background noise
For meetings and group recordings:
- Jabra Speak 510 (~$100) — a speakerphone with omnidirectional microphone. Captures voices from all around the table.
- Anker PowerConf S3 (~$70) — a budget conference speakerphone with 6 built-in microphones and 360-degree pickup.
- For group recordings, microphone placement matters more than price — one good mic in the center of the table beats an expensive one on the edge.
Recording Best Practices
Even with a great microphone, you can get a bad recording if you ignore basic rules.
Room selection:
- Close windows and doors
- Turn off air conditioning, fans, humidifiers — any sources of constant noise
- Soft furniture, curtains, carpets are your allies — they absorb echo
- Avoid empty rooms with bare walls — they produce strong reverb
Distance to microphone:
- Optimal: 15-30 cm (6-12 inches) from mouth to microphone
- Too close (<10 cm): plosive consonants (p, b, t) cause "pops" — clicks in the recording
- Too far (>50 cm): your voice drowns in room ambience
- Use a pop filter for desktop microphones — an inexpensive mesh screen that eliminates breath pops
Volume levels:
- Check levels in your recording app before starting
- Ideal range: -12 to -6 dBFS (peak level)
- If the meter hits the red zone, you are overloading the microphone and the audio will be distorted
- It is better to record slightly quieter — you can boost volume in post-processing, but you cannot remove distortion
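The -12 to -6 dB guideline refers to peak level in dBFS, where 0 dBFS is the loudest value the format can store and anything beyond it clips. A small sketch of how a level meter derives that reading from 16-bit samples:

```python
import math

def peak_dbfs(samples, full_scale=32768):
    """Peak level in dBFS for 16-bit integer samples.
    0 dBFS is full scale; anything that would exceed it clips."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return float("-inf")  # silence
    return 20 * math.log10(peak / full_scale)

# A 440 Hz tone peaking at half of full scale reads about -6 dBFS,
# the very top of the recommended -12..-6 dB window.
tone = [int(16384 * math.sin(2 * math.pi * 440 * n / 16000)) for n in range(1600)]
print(round(peak_dbfs(tone), 1))  # -6.0
```

Halving the amplitude costs about 6 dB, which is why "slightly quieter" is cheap insurance: a -18 dBFS recording boosts cleanly, while a clipped one is permanently distorted.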
Recording format:
- WAV or FLAC — for maximum quality (lossless)
- MP3 320 kbps — an acceptable compromise when file size matters
- MP3 128 kbps and below — noticeable quality loss, avoid for important recordings
- Most recording apps let you choose the format — choose WAV
Recording Meetings and Calls
In-person meetings:
- Place the microphone in the center of the table
- For more than 6 participants, use multiple microphones or a conference speakerphone
- Ask participants not to talk over each other — even the best diarization algorithm cannot separate simultaneous speech
Recording Zoom/Teams/Google Meet:
- Use the platform's built-in recording feature — it captures audio directly, without going through speakers and microphone
- In Zoom: Settings → Recording → "Record a separate audio file for each participant" — this is ideal for transcription with diarization
- Alternative: OBS Studio (free) can record system audio from any source
Recording phone calls:
- On iPhone: iOS 18.1 and later offer built-in call recording from the call screen; on older versions, use TapeACall or Rev Call Recorder
- On Android: ACR (Another Call Recorder) or Cube ACR
- Call recording quality is always lower — phone networks use compressed codecs. This is normal; Whisper handles this quality level well
Audio Preprocessing Before Transcription
If the recording is already done and the quality is not ideal, all is not lost. Basic processing can significantly improve transcription results.
Noise Reduction
Audacity (free, Windows/Mac/Linux):
Audacity is the most popular free audio editor. Here is a step-by-step noise reduction guide:
- Open your file in Audacity
- Find a section where nobody is speaking but background noise is audible (at least 1-2 seconds)
- Select that section with your mouse
- Menu: Effects → Noise Reduction → "Get Noise Profile"
- Select the entire recording (Ctrl+A / Cmd+A)
- Menu: Effect → Noise Reduction → adjust parameters:
- Noise reduction: 12-18 dB (start at 12, increase if noise persists)
- Sensitivity: 6-8
- Frequency smoothing: 3-6
- Click "Preview" to check, then "OK"
Adobe Podcast Enhance (free online tool):
Adobe offers a free speech enhancement tool at podcast.adobe.com/enhance. Upload your file — the AI automatically removes noise, adds voice clarity, and normalizes volume. Limit: files up to 1 hour. The results are impressive — often better than manual processing.
FFmpeg (command line):
For those who prefer automation, FFmpeg offers powerful filters. The afftdn filter provides adaptive noise reduction based on FFT. For more aggressive noise removal, increase the noise reduction parameter to 30-40. The silenceremove filter helps trim long pauses, which is useful for saving processing time.
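As a sketch of that automation, the afftdn pass can be scripted from Python; the `denoise_cmd` helper below is illustrative, and it assumes ffmpeg is installed on your PATH:

```python
import subprocess

def denoise_cmd(src, dst, nr_db=12):
    """Build an ffmpeg command applying the afftdn adaptive FFT
    noise-reduction filter. nr is the reduction amount in dB:
    12 is a gentle default; 30-40 is aggressive cleanup."""
    return ["ffmpeg", "-y", "-i", src, "-af", f"afftdn=nr={nr_db}", dst]

cmd = denoise_cmd("meeting.wav", "meeting_clean.wav")
print(" ".join(cmd))
# To actually run it (requires ffmpeg on PATH):
# subprocess.run(cmd, check=True)
```

The silenceremove filter can be chained after afftdn inside the same -af argument if you also want long pauses trimmed in one pass.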
Volume Normalization
Normalization evens out the recording volume — quiet speech gets louder, peaks get smoothed.
Why it matters:
- Whisper and other models work better with properly leveled audio
- If a recording has multiple speakers at different volumes, normalization balances them
- Quiet sections are often transcribed with errors
How to do it in Audacity:
- Open your file
- Select the entire recording (Ctrl+A / Cmd+A)
- Menu: Effect → Normalize
- Set peak amplitude to: -1.0 dB
- Click "OK"
For more advanced leveling, use the Compressor (Effect → Compressor) — it evens out the difference between quiet and loud sections without clipping peaks.
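Peak normalization itself is a simple gain calculation: find the loudest sample, then scale everything so that sample lands at the target level. A sketch of what the Normalize step computes, for 16-bit samples:

```python
def normalize(samples, target_dbfs=-1.0, full_scale=32768):
    """Scale 16-bit samples so the loudest peak sits at target_dbfs
    (the same operation as Audacity's Normalize at -1.0 dB)."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # pure silence: nothing to scale
    target = full_scale * 10 ** (target_dbfs / 20)
    gain = target / peak
    return [int(s * gain) for s in samples]

quiet = [100, -400, 250, 50]        # peaks at roughly -38 dBFS
loud = normalize(quiet)
print(max(abs(s) for s in loud))    # ~29204, i.e. -1 dBFS
```

Note that normalization applies one gain to the whole recording, which is why a compressor is still needed when different speakers sit at very different levels within the same file.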
Format Conversion
There is an optimal audio format for transcription. Diktovka automatically converts uploaded files, but if you are processing manually, here are the ideal parameters:
Optimal parameters for transcription:
- Channels: Mono (1 channel)
- Sample rate: 16,000 Hz (16 kHz)
- Bit depth: 16-bit
- Format: WAV or Opus
Why mono is better than stereo:
- Speech recognition models work with mono signals
- A stereo file gets converted to mono before processing — that is an unnecessary step
- In mono, the voice is stronger relative to background noise
- The file is half the size
In Audacity: Tracks → Mix → Mix Stereo Down to Mono. Then set the project sample rate to 16000 Hz (the Project Rate dropdown in the bottom-left corner in older versions; Audio Setup → Audio Settings in Audacity 3.3+). Export: File → Export Audio, choosing WAV with 16-bit PCM encoding.
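The stereo-to-mono step can also be done in a few lines with Python's standard wave module (sample-rate conversion is trickier and is best left to Audacity or FFmpeg). The function name and file paths are illustrative:

```python
import struct
import wave

def stereo_to_mono(src_path, dst_path):
    """Downmix a 16-bit stereo WAV to mono by averaging the channels,
    the same operation as Audacity's Mix Stereo Down to Mono."""
    with wave.open(src_path, "rb") as src:
        assert src.getnchannels() == 2 and src.getsampwidth() == 2
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())
    samples = struct.unpack(f"<{len(frames) // 2}h", frames)
    # Samples are interleaved L/R pairs: average each pair into one mono sample.
    mono = [(samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples), 2)]
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(rate)
        dst.writeframes(struct.pack(f"<{len(mono)}h", *mono))

# stereo_to_mono("interview_stereo.wav", "interview_mono.wav")
```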
Common Problems and Solutions
| Problem | Cause | Solution |
|---|---|---|
| Background noise (hum, hiss) | HVAC, electronics, traffic | Noise reduction in Audacity or Adobe Enhance |
| Echo and reverb | Empty room, bare walls | De-reverb filter; for future recordings, use a room with soft furnishings |
| Quiet voice | Too far from microphone | Normalization; when recording, move closer to the mic |
| Overlapping speakers | People talking simultaneously | Cannot be fully fixed, but diarization in Diktovka helps separate speakers |
| Background music | Radio, ambient music | Vocal isolation tools (UVR5, Demucs); best solution: turn off music during recording |
| Pops and clicks | Too close to mic, no pop filter | De-click filter in Audacity; use a pop filter or angle the mic 45 degrees |
| Distortion (clipping) | Microphone overload | Cannot be fixed after the fact; lower the input level before recording |
| Phone quality | Compressed voice codec | Normalization + light noise reduction; use VoIP when possible for better quality |
Diktovka Automatically Optimizes Your Audio
The Diktovka platform automatically performs key preparation steps when you upload a file:
- Conversion to the optimal format (mono, 16 kHz, Opus 32 kbps)
- FFmpeg processing — basic normalization and signal preparation
- Speaker diarization — automatic detection of who is speaking
- AI summarization — a brief summary of the recording
The platform handles even imperfect recordings — phone calls, noisy meeting recordings, voice messages. But the better the source quality, the more accurate the result. Investing 10 minutes in preparation yields a significantly more accurate transcription.
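For reference, a conversion to that target format (mono, 16 kHz, Opus at 32 kbps) can be reproduced locally with FFmpeg. The command builder below is a sketch, not the platform's actual pipeline; it assumes an ffmpeg build with libopus support on your PATH:

```python
def to_transcription_format(src, dst):
    """ffmpeg arguments for mono, 16 kHz, Opus at 32 kbps
    (the target format described above; paths are illustrative)."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-ac", "1",            # downmix to mono
        "-ar", "16000",        # resample to 16 kHz
        "-c:a", "libopus",     # Opus codec
        "-b:a", "32k",         # 32 kbps bitrate
        dst,
    ]

print(" ".join(to_transcription_format("meeting.mp3", "meeting.opus")))
```

The same -ac 1 and -ar 16000 flags apply if you prefer exporting 16-bit WAV instead of Opus.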
Pre-Recording Checklist
Print this out or save it — check before every important recording:
- Microphone is connected and selected as the input device in your system settings
- Test recording done — listen to 10 seconds, verify the audio is clean
- Room is quiet — windows closed, noisy devices off
- Distance to microphone — 15-30 cm (or lavalier clipped 15-20 cm from mouth)
- Recording level — peaks between -12 and -6 dB, not hitting the red zone
- Recording format — WAV or FLAC (not MP3 128 kbps)
- Sufficient disk space — WAV uses ~10 MB/min
- Ask participants not to interrupt and to speak clearly
- Pop filter in place (for desktop microphones)
- Recording is running — sounds obvious, but it gets forgotten more often than you think
Conclusion
Improving audio quality for transcription is not rocket science. A decent microphone for $25-100, a quiet room, and proper recording settings deliver 80% of the result. The remaining 20% is post-processing in Audacity or Adobe Enhance.
Upload your prepared audio to Diktovka — and get a transcription that barely needs editing.
FAQ
What microphone is best for transcription?
For desk recording, a USB microphone works best: the budget Fifine K669 (~$25) or Blue Yeti (~$100) for top quality. For interviews, a lavalier like Boya BY-M1 (~$20). For meetings, a speakerphone like Jabra Speak 510. Even a budget external microphone is dramatically better than a laptop's built-in mic.
How do I remove noise from an audio recording before transcription?
In free Audacity: find a silent section with background noise, select it, apply 'Get Noise Profile', then select the entire recording and run 'Noise Reduction' (12-18 dB). An easier option is Adobe Podcast Enhance (free online tool), which automatically cleans audio using AI.
What is the minimum audio quality needed for good transcription?
For 5-8% WER (minimal editing needed), record in a quiet room with an external microphone 15-30 cm away. Use WAV or MP3 320 kbps format. With noisy recordings, WER rises to 15-25%, and with poor quality (echo, quiet voice) to 25-40%, requiring extensive manual editing.
What audio format is best for transcription?
Optimal settings: mono, 16 kHz, 16-bit WAV. Mono is better than stereo — speech recognition models work with single-channel signal, voice is stronger relative to background noise, and the file is half the size. Avoid MP3 128 kbps and below due to noticeable quality loss.
How can I improve a recording using FFmpeg?
FFmpeg offers the afftdn filter for adaptive noise reduction based on FFT. For aggressive noise reduction, increase the noise reduction parameter to 30-40. The silenceremove filter removes long pauses, saving processing time. For optimal format conversion: mono, 16 kHz, 16-bit.