u/Batman_255

What is the best open-source TTS that can be used in production to handle multiple users for a real-time customer service web AI agent?

We need it to support:
- Real-time streaming
- Chunked audio generation
- Multiple concurrent users
- Low latency
- Production deployment

The goal is to use it inside a web-based AI agent for live customer support conversations.

What are the best options people are using right now?

reddit.com
u/Batman_255 — 8 days ago


Hey everyone,

I’m trying to build and deploy a real-time streaming TTS system and wanted to ask for advice from people who’ve already done this in production.

My goals are:
- Low-latency streaming voice generation
- Handle multiple concurrent requests/users
- Production-ready deployment
- Scalable architecture

Right now I’m researching different approaches, and I’ve noticed a lot of people recommend newer LLM-based speech-language models over traditional TTS pipelines. I’ve also been looking into models/frameworks like Orpheus and similar modern streaming TTS systems.

I have a few questions:

  1. What’s currently considered the best stack for real-time streaming TTS?
    - vLLM?
    - Triton?
    - Custom FastAPI/WebSocket setup?
    - Something else?

  2. For concurrent streaming requests:
    - How do you usually handle batching and latency?
    - Do you keep one persistent model in memory?
    - How many users can realistically be handled per GPU?

  3. GPU requirements:
    - What GPU would you recommend for small/medium-scale production?
    - Is something like an RTX 4090 enough?
    - Or do I need A100/H100/L40S level GPUs?

  4. Cost expectations:
    - Rough monthly cost for deployment?
    - Cloud vs self-hosted?
    - Any recommendations for cheaper GPU providers?

  5. For people using LLM-based speech models:
    - Are they actually better for naturalness and streaming?
    - What are the tradeoffs compared to traditional TTS models?

  6. Infrastructure questions:
    - Best way to stream audio chunks to clients?
    - WebSockets vs gRPC?
    - Any good architecture patterns for scaling?
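For context on questions 2 and 6, here's the rough shape I have in mind: one persistent model in memory, a semaphore capping concurrent synthesis jobs per GPU, and fixed-size PCM chunks fanned out to each client over a WebSocket. This is just a stdlib asyncio sketch — `TTSGateway`, `fake_tts`, and all the sizing constants are placeholders I made up, not from Orpheus or any real framework:

```python
import asyncio

# Chunk sizing: illustrative assumptions, not any model's real output format.
CHUNK_MS = 40            # chunk duration: latency vs. per-message overhead tradeoff
SAMPLE_RATE = 24_000     # assumed output rate, 16-bit mono PCM
BYTES_PER_SAMPLE = 2
CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000  # 1920 bytes

def chunk_pcm(pcm: bytes, chunk_bytes: int = CHUNK_BYTES):
    """Split a synthesized PCM buffer into fixed-size chunks for streaming."""
    for i in range(0, len(pcm), chunk_bytes):
        yield pcm[i:i + chunk_bytes]

class TTSGateway:
    """Keeps one model resident and caps concurrent synthesis jobs per GPU."""

    def __init__(self, synthesize, max_concurrent: int = 4):
        self.synthesize = synthesize                   # async fn: text -> PCM bytes
        self._sem = asyncio.Semaphore(max_concurrent)  # backpressure for the GPU

    async def stream(self, text: str):
        # Hold the semaphore only while the GPU is busy; chunk fan-out is cheap.
        async with self._sem:
            pcm = await self.synthesize(text)
        for chunk in chunk_pcm(pcm):
            yield chunk  # in a real server: await websocket.send_bytes(chunk)

async def fake_tts(text: str) -> bytes:
    """Stand-in for a real model; returns 100 ms of dummy 16-bit PCM."""
    await asyncio.sleep(0)  # yield to the event loop, as a real model call would
    return b"\x00\x01" * (SAMPLE_RATE // 10)

async def demo():
    gateway = TTSGateway(fake_tts, max_concurrent=2)
    return [c async for c in gateway.stream("Hello, how can I help?")]

chunks = asyncio.run(demo())
print(len(chunks), sum(len(c) for c in chunks))  # 3 chunks, 4800 bytes total
```

In a real deployment the `yield chunk` would feed a per-connection WebSocket send, and with a streaming-capable model `synthesize` would itself yield chunks incrementally instead of returning one buffer — that's the part I'm unsure how people structure.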

Would really appreciate any production insights, benchmarks, lessons learned, or open-source repos you'd recommend.

Thanks!

reddit.com
u/Batman_255 — 11 days ago