I’m working with ASR (Azure Speech) and running into a consistent issue where mispronunciations get normalised to the intended word.
Example: a speaker says “tanks” (/t/), but the system confidently outputs “thanks” (/θ/).
This makes pronunciation evaluation difficult because:
- the transcript appears correct
- phoneme-level data is often incomplete or unreliable
- confidence scores don't reflect the actual substitution
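
For reference, this is roughly the kind of phoneme-level output I mean, via the Azure Speech Python SDK's pronunciation assessment. A minimal sketch, not my exact code; the key/region/filename values are placeholders:

```python
# Minimal sketch: per-phoneme accuracy scores from Azure pronunciation assessment.
# "KEY", "REGION" and "utterance.wav" are placeholders.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="KEY", region="REGION")
audio_config = speechsdk.audio.AudioConfig(filename="utterance.wav")
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                        audio_config=audio_config)

# Reference text is the prompted target word; phoneme granularity adds
# per-phoneme accuracy scores alongside the word-level result.
pa_config = speechsdk.PronunciationAssessmentConfig(
    reference_text="thanks",
    grading_system=speechsdk.PronunciationAssessmentGradingSystem.HundredPoint,
    granularity=speechsdk.PronunciationAssessmentGranularity.Phoneme,
)
pa_config.apply_to(recognizer)

result = recognizer.recognize_once()
pa_result = speechsdk.PronunciationAssessmentResult(result)

for word in pa_result.words:
    print(word.word, word.error_type, word.accuracy_score)
    for ph in word.phonemes:
        # The transcript still reads "thanks"; the hope is that the initial
        # phoneme's accuracy score drops when /θ/ is realised as /t/,
        # but in practice this is exactly the part that is often unreliable.
        print("  ", ph.phoneme, ph.accuracy_score)
```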
I’m aware this is partly due to the language model biasing toward likely words, but I’m trying to understand how people handle this in practice.
Questions:
Is there any reliable way to detect contrast errors like /θ/ → /t/ without fully trusting phoneme output?
Do people use constrained decoding / forced alignment / alternative models for this? (Something like the candidate-scoring sketch after the context note below is roughly what I have in mind.)
Or is this fundamentally a limitation of current ASR systems?
Context: this is for a controlled setup (fixed prompts, repeated target words), not open-ended speech.
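
To make the constrained-decoding question concrete: since the target words are fixed, one idea is to score each minimal-pair candidate directly against the audio with an LM-free acoustic model and compare the fits, sidestepping the language model entirely. A rough sketch under stated assumptions: it uses torchaudio's character-level Wav2Vec2 CTC model (so the /θ/ vs /t/ contrast only shows up as "TH" vs "T" spelling, not true phonemes), and "utterance.wav" is a placeholder path:

```python
# Sketch: score fixed candidate words against the audio with a CTC acoustic
# model and compare negative log-likelihoods (lower = better acoustic fit).
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model().eval()
labels = bundle.get_labels()                     # ('-', '|', 'E', 'T', ...), blank at index 0
char_to_idx = {c: i for i, c in enumerate(labels)}

def ctc_score(waveform: torch.Tensor, word: str) -> float:
    """CTC negative log-likelihood of `word` given the audio (lower = better fit)."""
    with torch.inference_mode():
        emissions, _ = model(waveform)                                # (1, frames, vocab)
    log_probs = torch.log_softmax(emissions, dim=-1).transpose(0, 1)  # (frames, 1, vocab)
    targets = torch.tensor([[char_to_idx[c] for c in word.upper()]])
    loss = torch.nn.functional.ctc_loss(
        log_probs, targets,
        input_lengths=torch.tensor([log_probs.shape[0]]),
        target_lengths=torch.tensor([targets.shape[1]]),
        blank=0, reduction="sum",
    )
    return loss.item()

waveform, sr = torchaudio.load("utterance.wav")   # placeholder; assumes mono audio
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

for candidate in ("THANKS", "TANKS"):
    print(candidate, ctc_score(waveform, candidate))
```

No idea yet whether the score gap is large enough to be a reliable /θ/ vs /t/ detector in practice, which is partly why I'm asking.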
Would appreciate any practical approaches or confirmation that this is a known limitation.