u/Sevealin_

Gemma4:26b with HA conversation slow TTFT

I really like Gemma4:26b because I can easily push 262k context on a single 3090, but Home Assistant tool calls are like a 40 second time to first token for me. I'm using the newest Ollama docker tag version. Non-tool responses through HA are almost instant, but when I have to pull in the tool definitions it's about a 25k token prompt and hits a bottleneck somewhere, then takes about 40 seconds to respond. No issues with qwen3.5:27b, responds only after a few seconds. I was hoping to get a serious tokens per second boost with MoE.

What is everyone else's experience with Gemma 4 on Home Assistant? Any other models you recommend?

reddit.com
u/Sevealin_ — 15 hours ago