We’ve built what is essentially a full real-time telephony conversational operating system, not just a chatbot, and we’re trying to diagnose where our biggest failures actually are.
What we built:
A live voice pipeline for outbound/inbound calls:
Telephony (8kHz µ-law) → PCM decode → VAD → Silence thresholds → Echo suppression / AEC → STT (Deepgram/Groq/Sarvam) → Validation / hallucination filters → State machine → LLM (Groq LLaMA) → TTS (Grok) → Playback
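To make the front of that chain concrete, here is a minimal sketch of the first two stages (µ-law decode plus a pre-VAD energy gate). The frame size and RMS threshold are illustrative only, not our production values:

```python
import audioop  # stdlib; deprecated and removed in Python 3.13, used here only for illustration

SAMPLE_RATE = 8000                                       # telephony narrowband
FRAME_MS = 20
ULAW_BYTES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000    # 160 µ-law bytes per 20 ms frame
RMS_GATE = 300                                           # illustrative 16-bit RMS floor, not a tuned value

def ulaw_frame_to_pcm(ulaw_bytes: bytes) -> bytes:
    """Decode one G.711 µ-law frame to 16-bit linear PCM."""
    return audioop.ulaw2lin(ulaw_bytes, 2)

def passes_energy_gate(pcm_bytes: bytes) -> bool:
    """Cheap pre-VAD silence gate on frame RMS; frames failing this never reach STT."""
    return audioop.rms(pcm_bytes, 2) >= RMS_GATE
```

Every later stage only ever sees frames that survive this gate, which is exactly where we suspect quiet “haan” / “ji” replies start dying.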
Current capabilities:
- Real-time Hindi + Hinglish support
- Sales / lead-gen / support agents
- Silero VAD
- Deepgram Nova-3 primary STT
- Groq LLaMA 3.x
- Grok TTS
- Barge-in
- Sentence streaming
- TTS cache
- Carrier suppression
- Hallucination filtering
- Hindi grammar / transliteration optimization
- Pipecat-style orchestration
- FAISS RAG (rough sketch below)
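A minimal sketch of what a FAISS RAG lookup like ours boils down to (the embedder, dimension, and documents here are placeholders, not the real ones):

```python
import faiss           # pip install faiss-cpu
import numpy as np

DIM = 384                                      # placeholder embedding size

def embed(texts):
    """Stand-in embedder; replace with the actual embedding model."""
    rng = np.random.default_rng(0)
    vecs = rng.standard_normal((len(texts), DIM)).astype("float32")
    faiss.normalize_L2(vecs)                   # normalize so inner product = cosine
    return vecs

docs = ["plan A pricing", "plan B pricing", "refund policy"]   # placeholder snippets
index = faiss.IndexFlatIP(DIM)                 # exact inner-product search
index.add(embed(docs))

def retrieve(query: str, k: int = 2):
    """Return the top-k snippets with their similarity scores."""
    scores, ids = index.search(embed([query]), k)
    return [(docs[i], float(s)) for i, s in zip(ids[0], scores[0])]
```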
The problem:
Users often feel like:
- “The AI forgot what I said”
- “It stopped responding”
- “It heard me but replied weirdly”
But from logs, the LLM itself is often fine.
What we’re seeing:
STT:
- Hindi strong
- Hinglish moderate
- Brand/model names weak
- Short acknowledgements (“haan”, “ji”) vulnerable
- Some blank transcripts / segmentation misses
TTS:
- Biggest bottleneck
- 1.1–2.4s latency
- “Response ended prematurely”
- Long Hindi promotional lines degrade badly
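For concreteness, a sketch of what sentence streaming plus a TTS cache can look like (heavily simplified; the splitter regex and `synthesize_tts` are placeholders, not our actual code). The point is to start playback on the first short sentence instead of waiting on a long promotional paragraph:

```python
import re
from functools import lru_cache

# Split on the Hindi danda plus Latin sentence punctuation (placeholder heuristic).
SENTENCE_SPLIT = re.compile(r"(?<=[।.!?])\s+")

def synthesize_tts(sentence: str) -> bytes:
    """Stand-in for the real TTS request; returns raw audio bytes."""
    return b""  # replace with the actual provider call

@lru_cache(maxsize=512)
def cached_tts(sentence: str) -> bytes:
    # Cache on the exact sentence text so repeated promo lines are synthesized once.
    return synthesize_tts(sentence)

def stream_reply(reply_text: str):
    """Yield audio per sentence so playback starts early and barge-in loses at most one chunk."""
    for sentence in SENTENCE_SPLIT.split(reply_text.strip()):
        if sentence:
            yield cached_tts(sentence)
```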
Pipeline suspicion:
We may have over-engineered the thresholds and gates:
- VAD
- RMS gates
- Silence windows
- Echo suppression
- Carrier suppression
- Hallucination filtering
- Confidence thresholds
Our current hypothesis:
This may not be a memory problem.
It may be a pipeline integrity problem where user intent is getting:
- Clipped before STT
- Mis-segmented
- Filtered out
- Suppressed during state transitions
- Corrupted before conversational memory ever forms
Example:
A caller gives a short Hindi response during a suppression or barge-in window → the speech never becomes a canonical transcript → the LLM never truly receives it → the AI appears forgetful.
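To make that failure path concrete, here is a toy trace of how the stacked gates can silently eat a short “haan” (every threshold below is made up for illustration, not our real config):

```python
# Toy model of the pre-LLM gate stack; all numbers are illustrative.
def gates_pass(utterance: dict):
    checks = [
        ("rms_gate",          utterance["rms"] >= 300),
        ("vad",               utterance["vad_prob"] >= 0.6),
        ("min_speech_ms",     utterance["speech_ms"] >= 250),
        ("echo_suppression",  not utterance["during_tts_playback"]),
        ("stt_confidence",    utterance["stt_conf"] >= 0.75),
    ]
    for name, ok in checks:
        if not ok:
            return False, name        # which gate dropped it
    return True, None

# A quick “haan” spoken while the bot is still finishing its own sentence:
haan = {"rms": 850, "vad_prob": 0.8, "speech_ms": 180,
        "during_tts_playback": True, "stt_conf": 0.55}
print(gates_pass(haan))   # (False, 'min_speech_ms') -> the LLM never sees it
```

Each gate is individually reasonable, but the caller only has to fail one of them for the turn to vanish before it ever becomes memory.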
Questions for people who’ve built production voice stacks:
- Where do advanced telephony systems most commonly lose conversational fidelity?
  - VAD?
  - Endpointing?
  - Suppression windows?
  - STT confidence gates?
  - State machine transitions?
- For Hindi/Hinglish specifically, how are people handling:
  - Short acknowledgements
  - Code-switching
  - Brand names
  - Telecom narrowband degradation?
- Would you simplify the stack? Are we harming reliability by stacking too many protections before STT?
- TTS: would you prioritize faster, lower-quality speech, smaller sentence chunks, and interruptibility over polished voice quality?
- Architecture: at what point does “production safety” become “signal destruction”?
Brutal honesty welcome:
If this architecture sounds overbuilt, fragile, or fundamentally mis-prioritized, I’d genuinely love to hear it.
We’re trying to move from “smart AI on a fragile phone line” to a “reliable conversational telecom system.”
Right now it feels like our AI may actually be smarter than the user experience — but too much user intent dies before intelligence can act.
Would really appreciate insights from:
- Voice AI engineers
- Contact center architects
- Telecom DSP people
- Deepgram / Whisper / Pipecat builders
- Hindi ASR/TTS teams
Thanks — looking for architecture-level criticism, not just model suggestions.