Setup: mid-size SaaS, ~3,000 tickets/month, 6 human support agents drowning. 70% of volume was tier-1 (passwords, billing, where's-my-feature).
Architecture (kept boring on purpose)
- Trigger: new ticket in Zendesk
- Reasoning: Claude Sonnet. Cheap classification: GPT-4o-mini
- Tools: Zendesk read, product DB read-only, Stripe read-only, RAG over 400 knowledge base (KB) articles, email API (gated)
- Memory: short-term (current ticket) + long-term (last 30 days of customer history)
- Human checkpoint: confidence < 0.85, refunds, cancellations, enterprise tier
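The human checkpoint rule above is simple enough to sketch as one routing function. This is a minimal sketch, not the production code: the `Ticket` fields, the intent labels, and the tier names are hypothetical, and only the 0.85 threshold and the three always-review categories come from the list above.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.85  # from the checkpoint rule above
ALWAYS_REVIEW_INTENTS = {"refund", "cancellation"}  # hypothetical intent labels


@dataclass
class Ticket:
    intent: str         # label from the cheap classifier (hypothetical field)
    customer_tier: str  # e.g. "free", "pro", "enterprise" (hypothetical values)
    confidence: float   # confidence score for the drafted reply, 0.0 to 1.0


def needs_human(ticket: Ticket) -> bool:
    """Return True if a human must review before anything is sent."""
    if ticket.intent in ALWAYS_REVIEW_INTENTS:
        return True
    if ticket.customer_tier == "enterprise":
        return True
    return ticket.confidence < CONFIDENCE_THRESHOLD
```

The hard rules (refunds, cancellations, enterprise) are checked before the score, so a high confidence value can never bypass them.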
What worked
Started with passwords + billing only (~30% of volume). Got to 80% deflection on those before adding anything else.
Verifiable answers only. Agent could only respond if it could cite a KB article or pull a fact from the DB.
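One way to enforce the "cite it or don't send it" rule is a check that runs on every draft before it can leave the system. A rough sketch, assuming the draft carries the KB article IDs it cites and any facts it read from the DB; `DraftReply` and its fields are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DraftReply:
    text: str
    kb_article_ids: list[str] = field(default_factory=list)  # KB articles the draft cites
    db_facts: dict[str, str] = field(default_factory=dict)   # facts pulled from the DB


def is_verifiable(draft: DraftReply, known_article_ids: set[str]) -> bool:
    """A draft is sendable only if a real KB article or a DB fact backs it."""
    cites_real_article = any(a in known_article_ids for a in draft.kb_article_ids)
    has_db_fact = bool(draft.db_facts)
    return cites_real_article or has_db_fact
```

Checking cited IDs against the real article set also catches a draft that cites an article that does not exist.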
Real human checkpoint. Human agents reviewed 100% of responses for the first 30 days. Caught real problems.
Confidence classifier. Trained on "would this response have been edited by a human." Used as the gate.
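The training label ("would a human have edited this?") can be derived from review logs by comparing the agent's draft with what was actually sent. A minimal sketch of the labeling step only, assuming you logged (draft, sent reply) pairs during the review period; the 0.95 similarity cutoff and the function names are assumptions, and the classifier itself is out of scope here.

```python
from difflib import SequenceMatcher

EDIT_THRESHOLD = 0.95  # assumption: similarity below this counts as "edited"


def was_edited(agent_draft: str, sent_reply: str, threshold: float = EDIT_THRESHOLD) -> bool:
    """Treat a reply as edited if the sent text differs meaningfully from the draft."""
    ratio = SequenceMatcher(None, agent_draft.strip(), sent_reply.strip()).ratio()
    return ratio < threshold


def build_labels(review_log):
    """review_log: iterable of (agent_draft, sent_reply) pairs.

    Returns (draft, label) pairs where label 1 means a human edited the draft.
    """
    return [(draft, int(was_edited(draft, sent))) for draft, sent in review_log]
```

A similarity cutoff rather than exact string equality keeps trivial whitespace or punctuation touch-ups from counting as edits. Where to set it is a judgment call.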
What blew up
First version had no human checkpoint. Hallucinated a feature that didn't exist. Customer was furious. 2 weeks of internal trust gone. Don't skip this.
Tried refunds in v1. Bad idea. Refunds are 80% emotional, 20% process. Agent gave correct-but-cold responses. Pulled it out.
Long-term memory got creepy. Agent surfaced a 6-month-old complaint that wasn't relevant. Tightened scope.
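The post doesn't spell out how scope was tightened. One simple reading is a hard time window on what the agent is allowed to see, matching the 30-day long-term memory listed under Architecture. A sketch under that assumption, with a hypothetical `recent_history` helper:

```python
from datetime import datetime, timedelta, timezone

MEMORY_WINDOW = timedelta(days=30)  # matches the 30-day long-term memory in the architecture


def recent_history(events, now=None, window=MEMORY_WINDOW):
    """Keep only customer history inside the memory window.

    events: list of (timestamp, text) tuples with timezone-aware timestamps.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - window
    return [(ts, text) for ts, text in events if ts >= cutoff]
```

Filtering before the history reaches the prompt means a 6-month-old complaint never enters the context in the first place.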
Tone matching took 3 iterations. Default LLM tone was too formal. Fine-tuned with 50 example responses from our best human agent.
Cost spiked early. v1 made 5 LLM calls per ticket. Got it to 2. Cost dropped 60%.
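Quick sanity check on those numbers: going from 5 calls to 2 is a 60% cut in call count, which lines up with the 60% cost drop if each call costs roughly the same. That equal-cost-per-call part is an assumption, not something measured here.

```python
CALLS_V1 = 5   # LLM calls per ticket in v1
CALLS_NOW = 2  # LLM calls per ticket after optimization

# Fraction of calls removed; equals the cost reduction only if calls cost about the same.
call_reduction = (CALLS_V1 - CALLS_NOW) / CALLS_V1
print(f"{call_reduction:.0%}")  # prints "60%"
```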
Numbers at 90 days
- 47% fully deflected (no human touched them)
- 22% drafted by the AI agent, then reviewed and sent by a human in <30 sec
- CSAT 4.6/5 (was 4.5)
- $0.18 per ticket in LLM + infra (was ~$3.50 in human cost)
- Support team did NOT shrink. They handle the hard tickets that used to wait in queue.
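To put the percentages in monthly terms, here is the arithmetic. It assumes volume stayed at ~3,000 tickets/month and that the percentages are shares of all tickets; the "human only" bucket is simply the remainder and isn't a reported number.

```python
TICKETS_PER_MONTH = 3000  # from the setup
FULLY_DEFLECTED = 0.47    # no human touched them
AGENT_DRAFTED = 0.22      # drafted by the agent, sent by a human
COST_AGENT = 0.18         # USD per ticket, LLM + infra
COST_HUMAN = 3.50         # USD per ticket, approximate prior human cost

deflected = round(TICKETS_PER_MONTH * FULLY_DEFLECTED)  # 1410 tickets
drafted = round(TICKETS_PER_MONTH * AGENT_DRAFTED)      # 660 tickets
human_only = TICKETS_PER_MONTH - deflected - drafted    # 930 tickets (remainder, assumed)

# Agent cost as a fraction of the old per-ticket human cost, about 0.051
cost_ratio = COST_AGENT / COST_HUMAN
```

Deflected plus drafted comes to 69% of volume, close to the 70% tier-1 share from the setup.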
Lessons
- Pick a workflow that's repetitive AND verifiable
- Human in the loop is not optional in v1
- Confidence scoring is what makes it production-safe
- Optimize prompts, not models, first
- Boring architecture beats clever architecture