Hybrid cloud + local LLM stack for a real-time game coaching app, what I learned
Lead dev at a small indie studio. We just shipped fine-tuned personas for a CS2 coaching tool, and the hybrid architecture involved enough interesting design tradeoffs that I wanted to write it up.
Stack:
- Primary inference: Groq cloud, Llama 3.3 70B for the text coach, Llama 4 Scout 17B for vision, with 8B fallback on rate limits
- Local fallback: Llama 3.1 8B base with 4 LoRA adapters fine-tuned per persona (harsh, analytical, patient, pattern-observer), served via Ollama + llama.cpp
- Routing: cloud first when the user has tokens; local when cloud is unavailable or the user is on the free tier
The reason for the hybrid: cloud gives you the quality ceiling, local gives you the privacy/cost floor. Free-tier users and offline play hit Ollama. Paid users hit Groq for the better reasoning. Same persona prompts across both paths, just different backends.
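To make the routing concrete, here's roughly the shape of it. This is a stripped-down sketch, not our production code: `user`, the persona model name, and the policy details are stand-ins, and I'm assuming the groq Python SDK's OpenAI-style client and error classes.

```python
import requests
from groq import Groq, RateLimitError, APIConnectionError  # pip install groq

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint
groq_client = Groq()  # reads GROQ_API_KEY from the environment

def query_groq(prompt: str) -> str:
    chat = groq_client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
    )
    return chat.choices[0].message.content

def query_ollama(prompt: str) -> str:
    # Local path: one of the 4 LoRA persona models served via Ollama/llama.cpp.
    resp = requests.post(OLLAMA_URL, json={
        "model": "coach-harsh",  # hypothetical persona model name
        "prompt": prompt,
        "stream": False,
    }, timeout=30)
    resp.raise_for_status()
    return resp.json()["response"]

def coach_reply(prompt: str, user) -> tuple[str, str]:
    """Returns (reply, backend). Cloud first for paid users with tokens left."""
    if user.is_paid and user.tokens_remaining > 0:
        try:
            return query_groq(prompt), "groq"
        except (RateLimitError, APIConnectionError):
            pass  # cloud rate-limited or unreachable -> fall through to local
    return query_ollama(prompt), "ollama"
```

The one property worth keeping is that the caller gets back *which* backend answered; that feeds directly into the observability problem at the end of this post.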
What I learned on the local fine-tuning side (the part most people in this sub care about):
What worked:
- Hand-authored training data beat synthetic at small scale. 200 hand-written examples per persona outperformed 2000 generated ones (example record after this list). Synthetic sounded right but was structurally wrong: too verbose and hedge-y.
- Voice spec documents before training data. 2-3 page spec per persona (what words they use, pacing, failure modes), then training data written against the spec. Without the spec, training data drifts.
- Personas with focused scenario coverage beat personas trying to be good at everything.
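To make "hand-written against the spec" concrete, one training record looks roughly like this. Field names and the coaching content are invented for illustration, not pulled from our actual data:

```python
import json

# One hand-written record, authored against the "harsh" voice spec.
example = {
    "persona": "harsh",
    "messages": [
        {"role": "system", "content": "You are the harsh coach. Short sentences. No praise padding."},
        {"role": "user", "content": "I died holding B alone on an eco, what went wrong?"},
        {"role": "assistant", "content": "You anchored a site solo on an eco. That's a donated kill. Stack or save."},
    ],
}

with open("harsh.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```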
What failed:
- LoRA dropout above 0.05 with rank 8 on a 500-example dataset overfit hard: loss dropped to 0.05 in 2 epochs and the model memorized training data verbatim, including meta-instructions like "respond in the voice of...". Retrained with dropout=0, loss landed at 1.2, usable (config sketch after this list).
- Pattern-recognition persona was the hardest by far. Multi-round implicit-state reasoning is genuinely hard at 8B. Closed-form math (round equity, buy decisions) was trivial in comparison.
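For anyone who wants the working config in PEFT terms: the rank and dropout are the real numbers from above, but `lora_alpha` and `target_modules` here are typical placeholders I'd want to double-check before anyone copies this.

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,          # placeholder, not a number I'm vouching for
    lora_dropout=0.0,       # >0.05 memorized verbatim on ~500 examples; 0 landed at loss ~1.2
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections, placeholder
    task_type="CAUSAL_LM",
)
```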
Infrastructure stuff:
- GGUF export is fragile. Version mismatches between llama.cpp and conversion tooling cost me 2 days. Lock the conversion env, version-pin everything.
- Eval harness is its own problem. Loss numbers don't tell you if a persona feels right. I run the same scenario through all 4 personas and read outputs side by side. That subjective check caught more issues than any automated metric.
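The whole "harness" is barely code. Something like this, assuming the four persona models are registered in Ollama under these hypothetical names and the scenario text is invented:

```python
import requests

PERSONAS = ["coach-harsh", "coach-analytical", "coach-patient", "coach-pattern"]
SCENARIO = "We lost a 5v3 post-plant on A site. Walk me through it."  # invented example

def generate(model: str, prompt: str) -> str:
    resp = requests.post("http://localhost:11434/api/generate",
                         json={"model": model, "prompt": prompt, "stream": False},
                         timeout=120)
    resp.raise_for_status()
    return resp.json()["response"]

# Same scenario through all 4 personas, read side by side.
for model in PERSONAS:
    print(f"=== {model} ===\n{generate(model, SCENARIO)}\n")
```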
What I'm still figuring out:
- Hybrid routing observability. When cloud falls through to local, the user experience differs subtly. Capturing where the handoff happened and how output quality compares is something I haven't solved cleanly (first sketch below).
- Post-deployment feedback loop. User thumbs up/down becomes the next training set, but quality-gating is hard: a novice flagging an expert call as wrong is anti-signal. Working on a skill-weighted feedback system, but it's not done (second sketch below).
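For the observability gap, the minimum viable version I keep coming back to is a structured log line per request recording which backend answered and whether a fallback happened. Sketch only; field names are invented:

```python
import json, time

def log_inference(user_id: str, backend: str, fell_back: bool,
                  latency_s: float, reply: str) -> None:
    # One structured line per request. "fell_back" marks cloud->local handoffs
    # so those replies can be pulled later and compared against cloud-served ones.
    print(json.dumps({
        "ts": time.time(),
        "user": user_id,
        "backend": backend,       # "groq" or "ollama"
        "fell_back": fell_back,   # True when cloud was tried first and failed
        "latency_s": round(latency_s, 3),
        "reply_len": len(reply),  # crude proxy; real quality goes through the eval harness
    }))  # stdout -> whatever collector you already ship logs to
```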
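And for skill-weighted feedback, the rough shape is "each thumb counts proportional to the rater's demonstrated skill." The curve below is a placeholder to show the mechanics, not a tuned function, and how you bucket skill (here, old CS:GO-style ranks 1-18) is itself an open question:

```python
def feedback_weight(rater_rank: int, max_rank: int = 18) -> float:
    """Placeholder curve: higher-ranked raters count more, novices count near zero."""
    return (rater_rank / max_rank) ** 2  # quadratic is arbitrary, not tuned

def weighted_verdict(votes: list[tuple[int, int]]) -> float:
    """votes = [(rater_rank, +1 or -1), ...] -> weighted score in [-1, 1]."""
    total = sum(feedback_weight(rank) for rank, _ in votes)
    if total == 0:
        return 0.0
    return sum(feedback_weight(rank) * vote for rank, vote in votes) / total

# e.g. one Global Elite thumbs-down outweighs several Silver thumbs-ups:
print(weighted_verdict([(18, -1), (3, +1), (4, +1), (5, +1)]))  # negative
```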
Happy to answer questions on hyperparameters, hybrid routing decisions, GGUF wrangling, persona design, eval harness, whatever. The hybrid architecture stuff in particular doesn't get talked about much in this space, mostly because everyone's either pure cloud or pure local. There's a real middle ground.
Discord if you want to follow along: https://discord.gg/tTE5aFeq
Steam page: https://store.steampowered.com/app/4659510/Game_Demon