I built an offline multi-modal AI assistant (Voice + Vision) that runs locally on my laptop
Hey guys,
I wanted to share a side project I've been building on my laptop for the past few weeks. It's called HERO ZAN, and it's basically a fully offline, private AI assistant that can speak, listen, and see through the webcam without using any external APIs or cloud services.
I wanted something that supports Arabic natively, has a low latency, and doesn't melt my system resources. Here is the stack I ended up using to make it work:
Ollama as the backend for the LLM (I'm using qwen2.5-coder:7b since it handles Arabic really well and gives solid reasoning).
Faster-Whisper (medium model) for speech-to-text. It's surprisingly fast on local hardware.
Piper TTS for the voice output. Finding a good, natural-sounding local Arabic TTS was a pain, but Piper ONNX models did the trick.
Moondream (via Ollama) for the vision part. If you ask it "شايف إيه؟" (What do you see?), it grabs a frame from the webcam and describes it.
CustomTkinter for a simple GUI, featuring a small animated cartoon face that changes its expression depending on what the assistant is doing (thinking, listening, talking, etc.).
Everything runs locally on my machine (I'm currently testing it on a standard AMD Ryzen 5 Pro setup with 8GB RAM, and it runs smoothly without choking the system). It also has local chat history and an optional local web search via DuckDuckGo if needed.
The main reason I built this was to prove to myself that we don't need massive server farms or expensive API subscriptions to have a functional, multi-modal assistant that respects privacy 100%.
The code is fully open-source. If you want to check it out, run it locally, or contribute, here is the repo:
https://github.com/MHR-X/hero-zan
Let me know if you have any questions about the setup, the Piper TTS integration, or the performance!