u/Useful-Thing-1400

Need help with a call-based agentic AI project

I'm trying to build an agentic AI system that handles service bookings and suggestions for a car dealership and service centers.
Tech stack:

  • STT - Whisper model
  • TTS - gTTS
  • LLM - Llama 70B versatile
  • Backend - Python
  • DB - Postgres

I have already built the backend, but I'm facing some latency issues.
I also have to implement this as a calling system.

Current call flow:
User speech → STT → text → LLM → response text → TTS → audio output

Latency:

  • STT: 300–700 ms
  • LLM: 1.5–3 s (depending on response length)
  • TTS: another 500 ms–1 s, especially for longer replies
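
As a quick sanity check on these numbers, here is a minimal sketch that adds up the per-stage ranges, assuming the three stages run strictly one after another with no overlap. The names `STAGE_LATENCY_MS` and `total_latency_ms` are made up for illustration, and network or audio playback overhead is not included.

```python
# Per-stage latency ranges in milliseconds, copied from the list above.
STAGE_LATENCY_MS = {
    "stt": (300, 700),
    "llm": (1500, 3000),
    "tts": (500, 1000),
}


def total_latency_ms(stages):
    """Sum best-case and worst-case latency for stages that run back to back."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


best, worst = total_latency_ms(STAGE_LATENCY_MS)
print(f"End-to-end: {best / 1000:.1f}-{worst / 1000:.1f} s per turn")
# prints "End-to-end: 2.3-4.7 s per turn"
```

So the caller waits roughly 2.3 to 4.7 seconds before hearing anything, and the LLM stage accounts for most of that.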

Architecture:

  1. Capture audio input
  2. Send to STT
  3. Pass transcript to LLM (API-based)
  4. Generate response
  5. Convert response to speech via TTS
  6. Stream/play audio back
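
The steps above can be sketched roughly as follows. This is only an illustration of the sequential shape, not the actual backend: `run_turn` and the three `fake_*` functions are hypothetical placeholders standing in for the real Whisper, LLM API, and gTTS calls. The per-stage timers are there to show where the time goes in one turn.

```python
import time
from typing import Callable, Dict, Tuple


def run_turn(
    audio: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> Tuple[bytes, Dict[str, float]]:
    """Run one call turn sequentially and record seconds spent in each stage."""
    timings: Dict[str, float] = {}

    start = time.perf_counter()
    transcript = stt(audio)            # blocks until the full transcript is ready
    timings["stt"] = time.perf_counter() - start

    start = time.perf_counter()
    reply_text = llm(transcript)       # blocks until the full reply is generated
    timings["llm"] = time.perf_counter() - start

    start = time.perf_counter()
    reply_audio = tts(reply_text)      # blocks until the full audio is synthesized
    timings["tts"] = time.perf_counter() - start

    return reply_audio, timings


# Placeholder stages (assumptions) so the sketch runs without any model or API.
def fake_stt(audio: bytes) -> str:
    return "book a service for tomorrow"


def fake_llm(text: str) -> str:
    return f"Sure, I can help with: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


audio_out, timings = run_turn(b"", fake_stt, fake_llm, fake_tts)
print(timings)
```

Because each stage waits for the previous one to finish completely, the total delay is the sum of all three stages.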

Right now, the system is not streaming end-to-end; it's more of a sequential pipeline.
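
One way the sequential pipeline could move toward streaming is to cut the LLM output into sentences as tokens arrive and hand each sentence to TTS immediately, instead of waiting for the whole reply. The sketch below shows only the sentence-splitting part. It assumes the LLM API can stream tokens (not confirmed for this setup), and `sentences_from_tokens` is a hypothetical helper name. The regex split is naive and will break on abbreviations such as "a.m.".

```python
import re
from typing import Iterable, Iterator

# Split on whitespace that follows sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each complete sentence as soon as the token stream has produced it."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    # Flush whatever is left when the stream ends.
    if buffer.strip():
        yield buffer.strip()


# Simulated token stream (assumption) in place of a real streaming LLM response.
tokens = ["Sure", ".", " Your", " slot", " is", " 10", " am", ".", " Anything", " else", "?"]
for sentence in sentences_from_tokens(tokens):
    print(sentence)  # each sentence could be sent to TTS here
```

With this shape, TTS can start on the first sentence while the LLM is still generating the rest, so the delay before the first audio depends on the first sentence rather than the full reply.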

[This is just a college project so free tools are much appreciated :)]
I also don't have much experience with these kinds of projects, so I'm just vibe coding this right now :|

u/Useful-Thing-1400 — 3 days ago