I built an agent that controls the Unity Editor over WebSocket instead of just generating code (architecture writeup)
most "AI for game dev" tools either generate C# and hand it to you, or live inside the Editor as a chat plugin. both have the same problem: they can't see runtime state, so they can't tell you whether what they intended actually happened. you paste in a script, something's off, and you don't know if it's the agent's fault, your project, or a serialized reference that didn't update.
i've been building a different approach: a desktop app that holds a live websocket connection to the Editor. the agent reads console output, inspects actual component values, and verifies that the operations it executed produced the expected state. static project context, augmented by live runtime state.
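the core loop is request/response over the socket: each operation carries a correlation id, and the Editor's reply carries the post-op state so the agent can check it. a minimal sketch of what that looks like on the client side (the message shapes and names here are illustrative, not the actual bridge protocol):

```typescript
// Hypothetical message shapes for the Editor bridge -- not the real protocol.
interface OpRequest {
  id: string;                       // correlation id, so replies match out-of-order
  op: string;                       // e.g. "addComponent"
  args: Record<string, unknown>;
}

interface OpResponse {
  id: string;
  ok: boolean;
  state?: Record<string, unknown>;  // post-op state read back from the Editor
  error?: string;
}

// Pending requests keyed by correlation id; resolved when the Editor replies.
const pending = new Map<string, (r: OpResponse) => void>();

// Wire this to the websocket's message handler.
function handleMessage(raw: string): void {
  const res = JSON.parse(raw) as OpResponse;
  const resolve = pending.get(res.id);
  if (resolve) {
    pending.delete(res.id);
    resolve(res);
  }
}

// `send` is the websocket's send function, injected for testability.
function sendOp(
  send: (s: string) => void,
  op: string,
  args: Record<string, unknown>
): Promise<OpResponse> {
  const id = Math.random().toString(36).slice(2);
  const promise = new Promise<OpResponse>((resolve) => pending.set(id, resolve));
  const req: OpRequest = { id, op, args };
  send(JSON.stringify(req));
  return promise;
}
```

the important part is that the reply isn't just "ok" -- it includes the state the Editor actually ended up in, which is what makes verification possible at all.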
stack:
- Electron + React on the client, LangGraph.js agent in the main process
- C# bridge package inside Unity that listens on a websocket and executes operations via Editor APIs
- Next.js control plane proxying LLM calls (Anthropic direct + OpenRouter)
- Qdrant for RAG, retrieved via tool call rather than system-prompt injection (system-prompt injection kills cache hits)
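the "RAG via tool call, not system prompt" point comes down to how prefix caching works: the provider caches a byte-stable prefix of the conversation, so anything that changes per turn has to come *after* the stable part. a toy model of that, assuming prefix-based caching (as with Anthropic's cache breakpoints):

```typescript
// Toy model of prefix-based prompt caching: only the shared leading run of
// identical messages between turns is reusable from cache.
type Msg = { role: string; content: string };

function cachedPrefixLength(prev: Msg[], next: Msg[]): number {
  let n = 0;
  while (
    n < prev.length && n < next.length &&
    prev[n].role === next[n].role &&
    prev[n].content === next[n].content
  ) n++;
  return n;
}

// Retrieval stuffed into the system prompt: the very first message changes
// every turn, so the cacheable prefix is empty -- cold start each time.
const systemInjected = (chunks: string): Msg[] => [
  { role: "system", content: `You are a Unity agent.\nContext:\n${chunks}` },
  { role: "user", content: "add a Rigidbody to Player" },
];
console.log(cachedPrefixLength(
  systemInjected("docs about Rigidbody..."),
  systemInjected("docs about colliders...")
)); // → 0

// Retrieval as a tool call: the system prompt stays byte-identical and the
// retrieved chunks arrive later, after the cached prefix.
const toolBased = (toolResult: string): Msg[] => [
  { role: "system", content: "You are a Unity agent." },
  { role: "user", content: "add a Rigidbody to Player" },
  { role: "tool", content: toolResult },
];
console.log(cachedPrefixLength(toolBased("docs A"), toolBased("docs B"))); // → 2
```

same retrieved content either way; the difference is purely where it lands relative to the cache boundary.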
stuff that worked:
- consolidating 18 granular tool wrappers down to 5-7 workflow tools. way better tool-selection accuracy, fewer compounding errors across steps.
- two-tier model setup with prompt caching wired end-to-end. Haiku for the fast stuff, Sonnet for harder multi-step tasks. warm sessions are way cheaper than running this without caching.
- verifying every operation by reading back state. catches a lot of silent failures (component added to wrong object, ref not propagated, etc).
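verification in practice is just: execute, read the relevant values back, diff against intent. a minimal sketch of the check (shape only, not the actual bridge API -- the example object names are made up):

```typescript
// Hypothetical verification step: compare what the agent intended against
// what the Editor actually reports after the operation.
type State = Record<string, unknown>;

interface Mismatch { key: string; expected: unknown; actual: unknown }

function verifyState(expected: State, actual: State): Mismatch[] {
  const mismatches: Mismatch[] = [];
  for (const [key, want] of Object.entries(expected)) {
    const got = actual[key];
    // Structural comparison; fine for the flat, serializable values involved.
    if (JSON.stringify(got) !== JSON.stringify(want)) {
      mismatches.push({ key, expected: want, actual: got });
    }
  }
  return mismatches;
}

// e.g. the agent asked for a Rigidbody on "Player" but the component landed
// on a different object -- invisible without read-back, caught by the diff.
const intent   = { gameObject: "Player",      component: "Rigidbody", mass: 2 };
const readBack = { gameObject: "PlayerClone", component: "Rigidbody", mass: 2 };
console.log(verifyState(intent, readBack));
// → [{ key: "gameObject", expected: "Player", actual: "PlayerClone" }]
```

a non-empty diff is what turns a silent failure into something the agent can react to (retry, re-target, or surface to the user).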
stuff that didn't:
- spatial reasoning is a model problem, not a tooling problem. perfect runtime visibility doesn't help the agent figure out why the camera clips through a wall.
- early attempts at giving the model lots of granular tools. more options just made it worse at picking the right one.
- trying to jam RAG into the system prompt for "always-on context." it killed caching, and cold-start cost dominated.
next up is play-mode integration so the agent can actually run the game, watch what happens, and iterate. right now runtime visibility is read-only.
curious what other people building editor-style agents (for any tool, not just Unity) are running into. the runtime-state-vs-static-context tradeoff feels general.