How to A/B test system prompts in production?
I have noticed that everyone talks about prompt engineering as if it's just tweaking prompts against some metric. But in reality most agent failures are nearly impossible to debug because multiple things changed at once: the system prompt, the model version, the retrieval logic, or the underlying data.
This is what has worked for me so far, and I want to check whether anyone in the community takes a similar approach:

- Build a baseline first: run the current setup for 1–2 weeks with proper logging before touching anything.
- Change exactly one variable at a time.
- Do percentage rollouts: route ~10% of production traffic to the new variant first and let it run for at least 48 hours (see the bucketing sketch after this list).
- Wait for enough volume: a lot of teams draw conclusions from a handful of conversations, but you usually need a few hundred interactions per variant before the results mean anything.
- Define rollback criteria up front: what counts as failure should be decided before deployment, not after.
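For the percentage rollout, here is a minimal sketch of what I mean, assuming deterministic hash-based bucketing so the same user always sees the same variant. The prompt texts, variant names, and the 10% split are placeholders, not any particular tool's API:

```python
import hashlib

# Placeholder prompts and variant names; the 10% split matches the
# rollout percentage mentioned above.
PROMPTS = {
    "control": "You are a helpful support agent.",
    "candidate": "You are a helpful support agent. Answer in at most 3 sentences.",
}
ROLLOUT_PCT = 10

def assign_variant(user_id: str) -> str:
    """Deterministically bucket a user so they see the same variant every time."""
    # Hash the user id into [0, 100); stable across requests and deploys,
    # unlike random.random(), which would reshuffle users on every request.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "candidate" if bucket < ROLLOUT_PCT else "control"

def build_messages(user_id: str, user_message: str) -> tuple[str, list[dict]]:
    variant = assign_variant(user_id)
    messages = [
        {"role": "system", "content": PROMPTS[variant]},
        {"role": "user", "content": user_message},
    ]
    # Log the variant with every conversation so results can be segmented later.
    return variant, messages
```

The important part is persisting the variant tag alongside every conversation log; without it you can't attribute a win or a regression to the prompt change.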
The bigger issue is that most teams don't actually have the infrastructure for systematic prompt evals or rollouts. A lot of "LLMOps" in practice still ends up being logging plus manual review.
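To make the "enough volume" point concrete: if you log a binary outcome per conversation (resolved vs. escalated, thumbs up/down, whatever your success metric is), a two-proportion z-test is about the cheapest systematic eval you can bolt onto existing logs. The counts below are made up purely for illustration:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(wins_a: int, n_a: int, wins_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in success rates between two variants."""
    p_a, p_b = wins_a / n_a, wins_b / n_b
    p_pool = (wins_a + wins_b) / (n_a + n_b)  # pooled rate under the null
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same 70% vs. 80% success rates in both cases, different sample sizes:
print(two_proportion_z_test(28, 40, 32, 40))      # ~0.30 -> noise at 40/arm
print(two_proportion_z_test(280, 400, 320, 400))  # ~0.001 -> significant at 400/arm
```

A 10-point improvement that looks obvious across 40 conversations is statistical noise; at a few hundred per arm it actually holds up, which is why I don't trust conclusions drawn from a day of traffic.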
Curious what people here are actually using for this in production.
- Any existing feature flag tools?
- Custom infra?
- Langfuse / Helicone / Braintrust?
- Fully internal platforms?