Why did our multi-DB agent answer "how many customers" right and "count customers" wrong?
Been building a PostgreSQL + MongoDB analytics agent and kept hitting one weird failure: the agent would answer a question correctly, then answer the same question worded slightly differently and get it wrong. Same LLM, same run, same DBs.
Took us an embarrassing amount of time to trace, because the traces looked clean: the LLM was being called, just not for the part we thought. It was only used to pick which database to hit. The actual SQL came from a template bank keyed on exact question strings, and anything the bank didn't match fell through to a heuristic that emitted things like SELECT * FROM orders LIMIT 100.
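For anyone who hasn't hit this: the failure mode boils down to exact-string dispatch with a silent fallback. This is an illustrative sketch, not our actual code (names and queries are made up):

```python
# Exact-string template bank: any paraphrase misses and hits the fallback.
TEMPLATE_BANK = {
    "how many customers": "SELECT COUNT(*) FROM customers;",
}

def to_sql(question: str) -> str:
    sql = TEMPLATE_BANK.get(question.strip().lower())
    if sql is None:
        # Unmatched questions silently fell through to a generic heuristic.
        sql = "SELECT * FROM orders LIMIT 100"
    return sql

print(to_sql("how many customers"))  # exact match, correct SQL
print(to_sql("count customers"))     # paraphrase, silently wrong fallback
```

The nasty part is that the fallback still returns *a* result, so nothing errors and the traces look healthy.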
Two things we're in the middle of changing:
Templates as few-shot examples in the prompt, not a dispatch table. The LLM sees 3-5 curated (question, SQL) pairs and generates SQL for the live question. Embedding similarity to pick the nearest few-shot examples beats regex/string matching.
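The retrieval step is the only new moving part. Minimal sketch of nearest-example selection; the bag-of-words `embed` below is a stand-in for a real embedding model, and the example pairs are hypothetical:

```python
import math
from collections import Counter

# Curated (question, SQL) pairs; in practice these live in a small store.
EXAMPLES = [
    ("how many customers do we have", "SELECT COUNT(*) FROM customers;"),
    ("total revenue by month", "SELECT date_trunc('month', created_at), SUM(amount) FROM orders GROUP BY 1;"),
]

def embed(text: str) -> Counter:
    # Stand-in: token counts. Swap for a real embedding model's vectors.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def nearest_examples(question: str, k: int = 2):
    q = embed(question)
    return sorted(EXAMPLES, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)[:k]
```

With real embeddings, "count customers" lands next to "how many customers do we have" even with zero token overlap in the SQL keywords, which is exactly what the exact-string bank couldn't do.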
Routing informed by schema, not just keywords. Our keyword map sent the token "ticket" to MongoDB, while the authoritative table was support_tickets in Postgres. Column-name introspection plus a short intent classifier would have caught that.
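A rough sketch of what schema-informed routing could look like. Assumes you introspect both schemas up front (e.g. Postgres information_schema.tables, MongoDB list_collection_names()) into a dict; the table names and the crude singularization are illustrative only:

```python
# Pre-introspected table/collection names per backend (hypothetical schemas).
SCHEMAS = {
    "postgres": {"support_tickets", "customers", "orders"},
    "mongodb": {"events", "sessions"},
}

def route(question: str) -> str:
    # Crude singularization via rstrip("s"); a real version would use the
    # intent classifier on top of this signal.
    tokens = {t.rstrip("s") for t in question.lower().replace("?", "").split()}
    best_db, best_hits = None, 0
    for db, tables in SCHEMAS.items():
        hits = sum(
            1
            for table in tables
            for part in table.split("_")
            if part.rstrip("s") in tokens
        )
        if hits > best_hits:
            best_db, best_hits = db, hits
    return best_db or "postgres"  # default backend when nothing matches
```

Here "how many open tickets" matches the support_tickets table and routes to Postgres, which is the case the bare keyword map got wrong.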
If you've shipped this architecture: how did you draw the boundary between the "router LLM" and the "writer LLM"? Did you keep templates at all, or skip them and rely on schema + retrieved examples?