
Tested Gemini 3.1 Flash Live for production voice calls: the feel is noticeably better, but the latency claims need context
I've been building voice agents for a while now, and I integrated Gemini 3.1 Flash Live into our open-source stack as soon as the API went live. Wanted to share some honest observations.
The good stuff first. The voice cadence and overall feel of calls is genuinely better than what you get from the classic STT + LLM + TTS pipeline. Turn-taking feels more natural, and interruptions are handled far more gracefully. The model just "gets" conversational rhythm in a way that stitching three separate models together never really achieved. Cost also looks very competitive, which matters a lot for speech-to-speech (S2S).
Now the stuff nobody seems to be talking about. We averaged around 922 ms end-to-end latency in our testing. That's not bad, but it's nowhere near the sub-300 ms numbers I've seen some people throw around. We were testing from Asia, so region probably plays a role here. Would love to know what others are seeing from the US/EU.
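For context on how we define "end-to-end" here, this is roughly how we compute it: the gap between the end of the user's speech and the first audio byte back from the model, averaged over the turns in a session. The event names and structure below are illustrative, not from any real API:

```python
from statistics import mean

def turn_latencies_ms(events):
    """events: list of (timestamp_seconds, kind) tuples, where kind is
    'user_speech_end' or 'first_audio_out'. Returns per-turn latency in ms."""
    latencies = []
    pending = None  # timestamp of the last unanswered user_speech_end
    for ts, kind in events:
        if kind == "user_speech_end":
            pending = ts
        elif kind == "first_audio_out" and pending is not None:
            latencies.append((ts - pending) * 1000.0)
            pending = None
    return latencies

# Toy session: two turns, ~950 ms and ~900 ms
events = [
    (0.00, "user_speech_end"),
    (0.95, "first_audio_out"),
    (5.00, "user_speech_end"),
    (5.90, "first_audio_out"),
]
lats = turn_latencies_ms(events)
print(round(mean(lats)))  # session average in ms
```

The important part is measuring from end-of-user-speech, not from when you send the request; the latter is how sub-300 ms numbers can look plausible while feeling much slower on a real call.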
The other thing that caught us off guard is transcripts: you can't access them live during the call, only after it ends. If you're doing any kind of context stitching or real-time context engineering mid-conversation, this makes things harder.
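One workaround we've been considering (a sketch, not anything the API provides): tee the caller's audio into a separate streaming STT on the side and maintain a rolling transcript yourself, so context-stitching logic has something to read mid-call. The `LiveTranscript` class and the partial/final callbacks below are hypothetical stand-ins for whatever side-channel STT you run:

```python
class LiveTranscript:
    """Accumulates hypotheses from a side-channel streaming STT so the
    conversation context is available during the call, not just after it."""

    def __init__(self):
        self.finalized = []  # committed segments
        self.partial = ""    # current in-flight hypothesis

    def on_partial(self, text):
        # Partial hypotheses overwrite each other until the STT commits one
        self.partial = text

    def on_final(self, text):
        self.finalized.append(text)
        self.partial = ""

    def current_context(self):
        # What you'd feed into context-stitching logic mid-conversation
        return " ".join(self.finalized + ([self.partial] if self.partial else []))

lt = LiveTranscript()
lt.on_partial("book a")
lt.on_partial("book a table")
lt.on_final("book a table for two")
lt.on_partial("tomorrow at")
print(lt.current_context())  # "book a table for two tomorrow at"
```

The obvious downsides are paying for a second STT stream and the risk of its transcript drifting from what the S2S model actually heard, so treat this as best-effort context rather than ground truth.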
Honestly, though, I don't think we're going back to the old pipeline. The quality gap in how the conversation actually feels is too big.
We integrated this into Dograh, our open-source voice agent platform (very much like Vapi), if anyone wants to try it themselves: https://github.com/dograh-hq/dograh
What latency numbers are others getting? And has anyone found a clean workaround for the live transcript limitation?