Giving Gemini Live a tiny body with StackChan
I’ve been working on custom firmware for StackChan.
The idea is to give an LLM a small, funny physical body.
I don’t want to move away from “cute face connected to an LLM”. That part is exactly what I like. I just want the cute face to have memory, skills, tools, smart home control, and enough context to feel less like a fresh chatbot session every time.
The main conversation runs through Gemini Live. StackChan handles the body: face, speaker, microphone, servos, camera, touch/tap behavior, and local timing. Around that, I added memory, skills, tools, Home Assistant control, and local scheduled tasks that the model can use during the conversation.
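Concretely, the tool side is basically a small registry on the gateway: the model asks for a tool by name with arguments, the gateway routes the call to the robot or to a local service, and the result goes back into the conversation. Below is only a sketch of that shape in Python; the class, the tool names, and the bodies are invented for illustration and are not the actual firmware/gateway code.

```python
# Minimal sketch of a gateway-side tool registry (hypothetical names).
# The model emits a tool call with JSON arguments; the gateway looks it
# up, runs it, and returns the result to the conversation.
from typing import Any, Callable, Dict


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str):
        def wrap(fn: Callable[..., Any]):
            self._tools[name] = fn
            return fn
        return wrap

    def dispatch(self, name: str, args: Dict[str, Any]) -> Any:
        if name not in self._tools:
            return {"error": f"unknown tool: {name}"}
        return self._tools[name](**args)


registry = ToolRegistry()


@registry.register("robot.gesture")
def robot_gesture(name: str) -> dict:
    # In the real setup this would go over the network to the ESP32.
    return {"ok": True, "gesture": name}


@registry.register("memory.search")
def memory_search(query: str, limit: int = 3) -> dict:
    # Placeholder for the recall described further down in this post.
    return {"ok": True, "results": []}


# Example: roughly what happens when the model calls a tool mid-conversation.
print(registry.dispatch("robot.gesture", {"name": "nod"}))
```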
Some of the parts I’m most happy with:
- Persistent memory and recent context. StackChan can remember useful information across conversations and keep recent dialogue context, so I do not have to remind it what we talked about an hour ago.
- Compact memory summaries. Longer interactions can be summarized into a more compact form instead of dumping the entire chat history into every prompt.
- Memory search / lightweight RAG-like recall. The robot can search its own memory when it needs something, rather than putting every stored fact into the model context all the time. There is a small sketch of this idea after the list.
- Separate sensitive storage. Some information can live in a separate storage/tool layer and only be opened when needed, instead of being constantly sent to the model.
- Learnable skills. I can teach StackChan new routines as instruction sets. It can store them, search for them later, and reuse them when relevant. Skills can also include tool usage, including camera actions (also sketched below the list).
- Tool use during live conversation. The LLM can call tools exposed through the firmware/gateway: memory, skills, camera, gestures, Home Assistant, search, and other robot-side actions.
- Home Assistant control. StackChan can see what devices are available in Home Assistant and control them by voice: turn lights or switches on/off, check device state, and use smart home actions as part of a conversation. A simplified example follows the list.
- Local timers and scheduled tasks. I added a scheduling system where the robot can create reminders/timers/tasks during conversation. These are stored and checked on the robot side, so it can trigger something later instead of just answering once and forgetting. The check loop is sketched below as well.
- Camera and look-around behavior. StackChan can use its camera with different look directions instead of just taking one static picture. This makes camera-based skills feel more like the robot is actually looking around.
- Optional local Hermes integration. It can talk to my local Hermes assistant on the same network, but that is optional. I do not want the robot to depend on a bigger assistant for everything.
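As a rough illustration of the recall idea: the model asks with a query and only gets back a few scored entries, instead of the whole store. Everything here (the keyword scoring, the class and function names) is a simplified stand-in; the real version could just as well use embeddings, but the shape is the same.

```python
# Toy memory store with keyword-overlap recall (illustrative only).
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class MemoryEntry:
    text: str
    created: datetime = field(default_factory=datetime.now)


class MemoryStore:
    def __init__(self) -> None:
        self.entries: List[MemoryEntry] = []

    def remember(self, text: str) -> None:
        self.entries.append(MemoryEntry(text))

    def search(self, query: str, limit: int = 3) -> List[str]:
        # Score each entry by word overlap with the query, keep the top few.
        q = set(query.lower().split())
        scored = [(len(q & set(e.text.lower().split())), e) for e in self.entries]
        scored = [(s, e) for s, e in scored if s > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [e.text for _, e in scored[:limit]]


store = MemoryStore()
store.remember("The watering can lives on the balcony.")
store.remember("Favorite tea: genmaicha, one spoon, 80 degrees.")
print(store.search("where is the watering can"))
```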
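Skills follow the same store-and-search pattern, except an entry is a named instruction set the robot can replay, and a step can be a tool call (camera, gestures, and so on). Again a hedged sketch with invented names and format, not the actual skill storage:

```python
# Illustrative skill format: a named list of steps, where each step is
# either something to say or a tool call. Names and structure are hypothetical.
from typing import Any, Dict, List

Skill = Dict[str, Any]
skills: Dict[str, Skill] = {}


def teach_skill(name: str, description: str, steps: List[Dict[str, Any]]) -> None:
    skills[name] = {"description": description, "steps": steps}


def run_skill(name: str, dispatch) -> None:
    # 'dispatch' is the same kind of tool dispatcher sketched earlier in the post.
    for step in skills[name]["steps"]:
        if step["type"] == "say":
            print("say:", step["text"])
        elif step["type"] == "tool":
            dispatch(step["tool"], step.get("args", {}))


teach_skill(
    "morning_check",
    "Look around the room and report what is there",
    [
        {"type": "say", "text": "Good morning, let me take a look."},
        {"type": "tool", "tool": "camera.capture", "args": {"direction": "left"}},
        {"type": "tool", "tool": "camera.capture", "args": {"direction": "right"}},
    ],
)
```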
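For the smart home part, a voice intent ends up as a call to the normal Home Assistant REST API. The endpoints below (`/api/states` and `/api/services/<domain>/<service>`) are the real ones; the URL, token handling, and wrapper functions are my own simplification rather than the exact gateway code.

```python
# Minimal Home Assistant REST calls (simplified wrapper, not the real gateway code).
import requests

HA_URL = "http://homeassistant.local:8123"   # assumption: default HA host/port
HA_TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"    # long-lived token created in HA

HEADERS = {"Authorization": f"Bearer {HA_TOKEN}", "Content-Type": "application/json"}


def list_entities(domain: str = "light") -> list:
    # /api/states returns the state of every entity; filter by domain prefix.
    states = requests.get(f"{HA_URL}/api/states", headers=HEADERS, timeout=5).json()
    return [s["entity_id"] for s in states if s["entity_id"].startswith(domain + ".")]


def turn_on(entity_id: str) -> None:
    # Service calls are POST /api/services/<domain>/<service>.
    domain = entity_id.split(".")[0]
    requests.post(
        f"{HA_URL}/api/services/{domain}/turn_on",
        headers=HEADERS,
        json={"entity_id": entity_id},
        timeout=5,
    )


# Example: "turn on the desk lamp" ends up roughly here.
# turn_on("light.desk_lamp")
```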
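And the local timers are just a small list of pending tasks with a due time, checked on a loop on the robot side, so a reminder can fire long after the answer that created it. A sketch of that loop, with made-up structure; the real firmware does this as a periodic check on the device:

```python
# Toy local scheduler: tasks are stored with a due time and checked in a loop.
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScheduledTask:
    due: float              # unix timestamp
    message: str
    fired: bool = False


tasks: List[ScheduledTask] = []


def add_reminder(seconds_from_now: float, message: str) -> None:
    tasks.append(ScheduledTask(due=time.time() + seconds_from_now, message=message))


def check_tasks(announce: Callable[[str], None]) -> None:
    now = time.time()
    for task in tasks:
        if not task.fired and task.due <= now:
            task.fired = True
            announce(task.message)


add_reminder(2, "The tea should be ready.")
for _ in range(3):
    check_tasks(lambda msg: print("StackChan says:", msg))
    time.sleep(1.5)
```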
A lot of the work has been less about “can I connect an LLM to a robot?” and more about making the loop feel usable on a tiny physical device.
Some problems I ran into:
- audio feedback, where the robot hears itself speaking
- microphone gain and noise tuning
- keeping voice, sound effects, and listening state from stepping on each other (there is a small state sketch after this list)
- slow tool calls during a live conversation
- limited RAM and timing issues on the ESP32 side
- camera initialization/release stability
- deciding what should run on the robot, what should run through the gateway, and what should stay local/private
- making memory useful without turning every prompt into a giant context dump
- making scheduled tasks reliable without relying on an external server for every reminder
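Most of the feedback and "stepping on each other" issues came down to treating audio as half-duplex: ignore the microphone while the robot is speaking or playing a sound effect, and only trust it again after a short guard delay. The sketch below is just that idea, not the actual ESP32 code; the state names and the guard value are placeholders.

```python
# Idea sketch of the half-duplex audio state: ignore the mic while speaking,
# then wait a short guard time before listening again so the tail of the
# robot's own speech is not picked up. Values and names are placeholders.
import time
from enum import Enum, auto


class AudioState(Enum):
    LISTENING = auto()
    SPEAKING = auto()
    GUARD = auto()      # short pause after speaking before the mic is trusted


GUARD_SECONDS = 0.4

state = AudioState.LISTENING
guard_until = 0.0


def on_speech_started() -> None:
    global state
    state = AudioState.SPEAKING


def on_speech_finished() -> None:
    global state, guard_until
    state = AudioState.GUARD
    guard_until = time.time() + GUARD_SECONDS


def should_process_mic() -> bool:
    global state
    if state is AudioState.GUARD and time.time() >= guard_until:
        state = AudioState.LISTENING
    return state is AudioState.LISTENING
```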
It is still experimental, but it is already fun to use. I can talk to StackChan naturally, teach it little routines, have it remember things, control smart home devices, create timers, ask it to use tools, make it look around with the camera, and let it pull relevant context back into the conversation when needed.