u/Mental-Address122 — reddlx

Setup: mid-size SaaS, ~3,000 tickets/month, 6 agents drowning. 70% of volume was tier-1 (passwords, billing, where's-my-feature).

Architecture (kept boring on purpose)

- Trigger: new ticket in Zendesk

- Models: Claude Sonnet for reasoning, GPT-4o-mini for cheap classification

- Tools: Zendesk read, product DB read-only, Stripe read-only, RAG over 400 knowledge-base (KB) articles, email API (gated)

- Memory: short-term (current ticket) + long-term (last 30 days of customer history)

- Human checkpoint: required when confidence < 0.85, and always for refunds, cancellations, and enterprise-tier customers
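
A minimal sketch of how that checkpoint rule could look in Python. The 0.85 threshold and the always-escalate cases come from the list above; the `Ticket` fields, intent labels, and function name are hypothetical, not the actual implementation.

```python
from dataclasses import dataclass

# Threshold and escalation cases are taken from the checkpoint rule above.
# All names here are illustrative assumptions.
CONFIDENCE_THRESHOLD = 0.85
ALWAYS_ESCALATE_INTENTS = {"refund", "cancellation"}


@dataclass
class Ticket:
    intent: str          # e.g. "password_reset", "billing", "refund"
    customer_tier: str   # e.g. "standard", "enterprise"
    confidence: float    # classifier score between 0.0 and 1.0


def needs_human(ticket: Ticket) -> bool:
    """Return True if any checkpoint condition applies to this ticket."""
    if ticket.intent in ALWAYS_ESCALATE_INTENTS:
        return True
    if ticket.customer_tier == "enterprise":
        return True
    return ticket.confidence < CONFIDENCE_THRESHOLD
```

Putting the hard rules (refunds, cancellations, enterprise) before the score check means a high confidence score can never override them.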

What worked

- Started with passwords + billing only (~30% of volume). Got to 80% deflection on those before adding anything else.

- Verifiable answers only. The agent could only respond if it could cite a KB article or pull a fact from the DB.

- Real human checkpoint. Human agents reviewed 100% of responses for the first 30 days. Caught real problems.

- Confidence classifier. Trained to predict "would a human have edited this response?" and used as the gate.
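
The "verifiable answers only" rule can be sketched as a simple gate: a draft is sent only if it carries at least one citation. This is an illustration under assumptions; the `Draft` structure and its field names are hypothetical, and a real check would also need to confirm the cited source actually supports the claim.

```python
from dataclasses import dataclass, field


@dataclass
class Draft:
    text: str
    # IDs of KB articles the draft cites (hypothetical field).
    kb_article_ids: list[str] = field(default_factory=list)
    # Facts pulled from the product DB or Stripe (hypothetical field).
    db_facts: dict[str, str] = field(default_factory=dict)


def is_verifiable(draft: Draft) -> bool:
    """A draft counts as verifiable if it cites at least one source."""
    return bool(draft.kb_article_ids) or bool(draft.db_facts)


def respond_or_escalate(draft: Draft) -> str:
    """Send only grounded drafts; everything else goes to a human."""
    return "send" if is_verifiable(draft) else "escalate"
```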

What blew up

- First version had no human checkpoint. It hallucinated a feature that didn't exist. Customer was furious. 2 weeks of internal trust gone. Don't skip this.

- Tried refunds in v1. Bad idea. Refunds are 80% emotional, 20% process. The agent gave correct-but-cold responses. Pulled it out.

- Long-term memory got creepy. The agent surfaced a 6-month-old complaint that wasn't relevant. Tightened scope.

- Tone matching took 3 iterations. Default LLM tone is too formal. Fine-tuned with 50 example responses from our best agent.

- Cost spiked early. v1 made 5 LLM calls per ticket. Got it to 2. Cost dropped 60%.
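
One way the memory scope could be tightened, as a sketch. The 30-day window comes from the architecture list; the topic filter is an assumption about what "tightened scope" might mean, and `HistoryItem` and `relevant_history` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

MEMORY_WINDOW = timedelta(days=30)  # "last 30 days of customer history"


@dataclass
class HistoryItem:
    created_at: datetime
    topic: str     # e.g. "billing", "login" (hypothetical labels)
    summary: str


def relevant_history(items, current_topic, now):
    """Keep only recent history that matches the current ticket's topic."""
    cutoff = now - MEMORY_WINDOW
    return [
        item for item in items
        if item.created_at >= cutoff and item.topic == current_topic
    ]
```

With a filter like this, a 6-month-old complaint never reaches the prompt, whatever its topic.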

Numbers at 90 days

- 47% fully deflected (no human touched them)

- 22% drafted by the agent, sent by a human in under 30 seconds

- CSAT 4.6/5 (was 4.5)

- $0.18 per ticket in LLM + infra (was ~$3.50 in human cost)

- Support team did NOT shrink. They handle the hard tickets that used to wait in queue.
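
For scale, here is the back-of-the-envelope arithmetic behind those percentages, using the ~3,000 tickets/month from the setup. It assumes the $0.18 applies to every ticket, which the post does not state, so treat the spend figure as illustrative. Costs are kept in integer cents to avoid float rounding.

```python
TICKETS_PER_MONTH = 3000     # "~3,000 tickets/month" from the setup
FULLY_DEFLECTED_PCT = 47     # no human touched them
AGENT_DRAFTED_PCT = 22       # drafted by the agent, sent by a human
LLM_COST_CENTS = 18          # $0.18 per ticket, LLM + infra


def monthly_summary(tickets: int = TICKETS_PER_MONTH) -> dict:
    """Turn the 90-day percentages into rough monthly counts."""
    return {
        "fully_deflected": tickets * FULLY_DEFLECTED_PCT // 100,
        "agent_drafted": tickets * AGENT_DRAFTED_PCT // 100,
        "agent_touched_pct": FULLY_DEFLECTED_PCT + AGENT_DRAFTED_PCT,
        # Assumption: the per-ticket cost is paid on all tickets.
        "llm_spend_dollars": tickets * LLM_COST_CENTS / 100,
    }
```

On those assumptions that is roughly 1,410 fully deflected and 660 agent-drafted tickets a month, so the agent touches 69% of volume, for about $540 in LLM and infra spend.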

Lessons

- Pick a workflow that's repetitive AND verifiable

- Human in the loop is not optional in v1

- Confidence scoring is what makes it production-safe

- Optimize prompts, not models, first

- Boring architecture beats clever architecture

The difference between MVPs that worked and the ones that died wasn't tech stack. It was 5 founder behaviors. Here they are.

Context: I've been adjacent to 100+ MVP builds across SaaS, fintech, marketplaces, and AI tools. Watched some become 8-figure companies. Watched more die. The patterns are surprisingly consistent.

1. The founder can describe the user in one sentence.

Winners: "This is for procurement managers at industrial equipment companies doing >$10M/year in PO volume."

Losers: "It's for businesses that want to be more efficient."

Specificity at the user level is the single strongest predictor of which MVPs survive. By a lot.

2. The first version says no to more features than it says yes to.

Successful MVPs ship 1 thing well. Failed ones ship 5 things poorly. Every founder who insisted on "just one more feature" before launch shipped late and ran out of money 6 months earlier than they should have.

3. The founder is in the build, not orchestrating it.

Founders who win are in design reviews, use the alpha, write bug reports themselves, argue about copy. The ones who hand a spec to a team and check in monthly almost always rebuild because they didn't know what they wanted until they used it.

4. They put the MVP in front of paying users before launch.

Not beta users. Not survey respondents. Paying users. Founders who pre-sold to 5–10 customers shipped products that found product-market fit (PMF) in months. The build-first crowd usually rebuilt the whole thing.

5. They have a clear answer to 'what would have to be true to kill this in 6 months.'

Winners can name the riskiest assumption and the test that proves it. Losers keep moving the goalposts.

What didn't matter:

- Tech stack

- In-house vs outsourced

- Whether the founder was technical

- How polished the MVP looked

- Whether they raised money

Five years from now, the founders following 1–5 will still be building. The rest will be on their next thing.