u/Heavy_Fisherman_3947

The goal was to create a low-latency conversational avatar that could handle the full real-time flow: ASR → LLM → TTS → lip-sync rendering → WebRTC streaming. The biggest challenge was orchestration and response speed, since even a few seconds of delay makes the interaction feel unnatural.
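For anyone chasing the same latency problem, it helps to know which stage eats the time before tuning anything. Below is a minimal sketch of a per-stage latency tracker. The stage names come from the pipeline above; the `LatencyTracker` class, its methods, and the injectable clock are hypothetical helpers for illustration, not code from the repo.

```typescript
// Stage names follow the pipeline in the post; everything else is illustrative.
type Stage = "asr" | "llm" | "tts" | "render";

interface LatencyReport {
  perStage: Partial<Record<Stage, number>>; // ms spent before each stage's first output
  total: number;                            // ms from turn start to the last mark
  slowest: Stage | null;                    // stage with the largest share, if any
}

class LatencyTracker {
  private readonly now: () => number;
  private readonly start: number;
  private readonly marks: { stage: Stage; at: number }[] = [];

  // The clock is injectable so the tracker can be driven by a fake clock in tests.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
    this.start = this.now();
  }

  // Call when a stage produces its first usable output
  // (first transcript, first LLM token, first audio chunk, first frame).
  mark(stage: Stage): void {
    this.marks.push({ stage, at: this.now() });
  }

  // Converts the recorded marks into time spent per stage, in mark order.
  report(): LatencyReport {
    const perStage: Partial<Record<Stage, number>> = {};
    let previous = this.start;
    let slowest: Stage | null = null;
    let slowestMs = -1;
    for (const { stage, at } of this.marks) {
      const elapsed = at - previous;
      perStage[stage] = elapsed;
      if (elapsed > slowestMs) {
        slowest = stage;
        slowestMs = elapsed;
      }
      previous = at;
    }
    return { perStage, total: previous - this.start, slowest };
  }
}
```

Create one tracker when the user stops speaking, call `mark` as each stage emits its first output, and log `report()` per turn. Measuring time to first output rather than time to completion matters when stages stream into each other.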

The system uses a React + Vite frontend for UI and streaming, Next.js API routes for authentication and instance management, and a WebRTC-based AI avatar pipeline for real-time voice and video delivery. The browser publishes audio, the AI agent generates a response, and the avatar renders voice and lip-sync back to the browser in real time.
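The turn-taking loop described above (browser publishes audio, agent responds, avatar streams back) can be modeled on the frontend as a small state machine. This is only a sketch under assumptions: the state names, event names, and `nextTurnState` function are hypothetical and not taken from the repo, and a real client would derive the events from voice-activity detection and RTC track callbacks.

```typescript
// Hypothetical UI states for one conversational turn.
type TurnState = "idle" | "listening" | "thinking" | "speaking";

// Hypothetical events the frontend might derive from the mic and the remote avatar stream.
type TurnEvent =
  | "user_speech_start"
  | "user_speech_end"
  | "avatar_media_start"
  | "avatar_media_end";

// Pure transition function: easy to unit test and to plug into React's useReducer.
function nextTurnState(state: TurnState, event: TurnEvent): TurnState {
  switch (event) {
    case "user_speech_start":
      // Also covers barge-in: the user starts talking over the avatar.
      return "listening";
    case "user_speech_end":
      // The pipeline (ASR, LLM, TTS) is now working on a reply.
      return state === "listening" ? "thinking" : state;
    case "avatar_media_start":
      // First avatar audio/video arrives from the remote stream.
      return state === "thinking" ? "speaking" : state;
    case "avatar_media_end":
      return state === "speaking" ? "idle" : state;
    default:
      return state;
  }
}
```

Keeping the transitions in one pure function makes out-of-order events harmless: an `avatar_media_end` that arrives after a barge-in leaves the state at `listening` instead of resetting the UI.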

I also included token generation, agent registration, instance creation, frontend streaming logic, and cleanup handling for both RTC and server-side resources.
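On the cleanup side, the main pitfalls are running teardown twice (React unmount plus a page-close handler) and letting one failure skip the remaining steps. Here is a rough sketch of an idempotent cleanup, assuming a session that holds local media tracks, a peer connection, and some way to release the server-side instance. `createCleanup`, `Session`, and `releaseInstance` are hypothetical names; the interfaces are kept minimal so a browser `MediaStreamTrack` (`stop()`) and `RTCPeerConnection` (`close()`) satisfy them structurally.

```typescript
// Minimal shapes; MediaStreamTrack and RTCPeerConnection both fit these.
interface TrackLike { stop(): void; }
interface PeerLike { close(): void; }

interface Session {
  tracks: TrackLike[];          // local mic (and camera) tracks
  peer: PeerLike;               // the RTC connection to the avatar
  releaseInstance: () => void;  // hypothetical: asks the backend to delete the avatar instance
}

// Returns a cleanup function that is safe to call more than once.
function createCleanup(session: Session): () => void {
  let done = false;

  // Run a step and swallow its error so later steps still execute.
  const attempt = (step: () => void): void => {
    try {
      step();
    } catch (err) {
      console.warn("cleanup step failed", err);
    }
  };

  return () => {
    if (done) return;
    done = true;
    // Stop capture first so the mic indicator turns off promptly.
    for (const track of session.tracks) attempt(() => track.stop());
    attempt(() => session.peer.close());
    // Release the server-side resource last.
    attempt(() => session.releaseInstance());
  };
}
```

In a React component the returned function can serve as the `useEffect` cleanup and also be registered on `pagehide`, since the `done` flag keeps the second call from touching anything.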

GitHub repo: GitHub Project

Curious how others are handling latency and orchestration for real-time AI avatar systems.

I Built a WebRTC-based real-time AI Avatar