u/Batman_255

What is the best open-source TTS that can be used in production to handle multiple users for a real-time customer service web AI agent?

We need it to support:
- Real-time streaming
- Chunked audio generation
- Multiple concurrent users
- Low latency
- Production deployment

The goal is to use it inside a web-based AI agent for live customer support conversations.

What are the best options people are using right now?

reddit.com
u/Batman_255 — 8 days ago


Hey everyone,

I’m trying to build and deploy a real-time streaming TTS system and wanted to ask for advice from people who’ve already done this in production.

My goals are:
- Low-latency streaming voice generation
- Handle multiple concurrent requests/users
- Production-ready deployment
- Scalable architecture

Right now I’m researching different approaches, and I’ve noticed a lot of people recommend newer LLM-based speech-language models over traditional TTS pipelines. I’ve also been looking into models/frameworks like Orpheus and similar modern streaming TTS systems.

I have a few questions:

  1. What’s currently considered the best stack for real-time streaming TTS?
    - vLLM?
    - Triton?
    - Custom FastAPI/WebSocket setup?
    - Something else?

  2. For concurrent streaming requests:
    - How do you usually handle batching and latency?
    - Do you keep one persistent model in memory?
    - How many users can realistically be handled per GPU?

  3. GPU requirements:
    - What GPU would you recommend for small/medium-scale production?
    - Is something like an RTX 4090 enough?
    - Or do I need A100/H100/L40S level GPUs?

  4. Cost expectations:
    - Rough monthly cost for deployment?
    - Cloud vs self-hosted?
    - Any recommendations for cheaper GPU providers?

  5. For people using LLM-based speech models:
    - Are they actually better for naturalness and streaming?
    - What are the tradeoffs compared to traditional TTS models?

  6. Infrastructure questions:
    - Best way to stream audio chunks to clients?
    - WebSockets vs gRPC?
    - Any good architecture patterns for scaling?
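For context on questions 2 and 6, here's the rough shape I have in mind: one persistent model in memory, a semaphore capping concurrent synthesis jobs per GPU, and fixed-size PCM chunks fanned out to each client over a WebSocket. This is just a stdlib asyncio sketch — `TTSGateway`, `fake_tts`, and all the sizing constants are placeholders I made up, not from Orpheus or any real framework:

```python
import asyncio

# Chunk sizing: illustrative assumptions, not any model's real output format.
CHUNK_MS = 40            # chunk duration: latency vs. per-message overhead tradeoff
SAMPLE_RATE = 24_000     # assumed output rate, 16-bit mono PCM
BYTES_PER_SAMPLE = 2
CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000  # 1920 bytes

def chunk_pcm(pcm: bytes, chunk_bytes: int = CHUNK_BYTES):
    """Split a synthesized PCM buffer into fixed-size chunks for streaming."""
    for i in range(0, len(pcm), chunk_bytes):
        yield pcm[i:i + chunk_bytes]

class TTSGateway:
    """Keeps one model resident and caps concurrent synthesis jobs per GPU."""

    def __init__(self, synthesize, max_concurrent: int = 4):
        self.synthesize = synthesize                   # async fn: text -> PCM bytes
        self._sem = asyncio.Semaphore(max_concurrent)  # backpressure for the GPU

    async def stream(self, text: str):
        # Hold the semaphore only while the GPU is busy; chunk fan-out is cheap.
        async with self._sem:
            pcm = await self.synthesize(text)
        for chunk in chunk_pcm(pcm):
            yield chunk  # in a real server: await websocket.send_bytes(chunk)

async def fake_tts(text: str) -> bytes:
    """Stand-in for a real model; returns 100 ms of dummy 16-bit PCM."""
    await asyncio.sleep(0)  # yield to the event loop, as a real model call would
    return b"\x00\x01" * (SAMPLE_RATE // 10)

async def demo():
    gateway = TTSGateway(fake_tts, max_concurrent=2)
    return [c async for c in gateway.stream("Hello, how can I help?")]

chunks = asyncio.run(demo())
print(len(chunks), sum(len(c) for c in chunks))  # 3 chunks, 4800 bytes total
```

In a real deployment the `yield chunk` would feed a per-connection WebSocket send, and with a streaming-capable model `synthesize` would itself yield chunks incrementally instead of returning one buffer — that's the part I'm unsure how people structure.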

Would really appreciate any production insights, benchmarks, lessons learned, or open-source repos you'd recommend.

Thanks!

reddit.com
u/Batman_255 — 11 days ago