How to A/B test system prompts in production?
I have noticed that everyone talks about prompt engineering as if it's just tweaking prompts against some metric. But in reality most agent failures are nearly impossible to debug because multiple things changed at once: the system prompt, the model version, the retrieval logic, or the underlying data.
This is what has worked for me so far, and I want to check whether anyone in the community takes a similar approach:

- Build a baseline first: run the current setup for 1–2 weeks with proper logging before touching anything.
- Change exactly one variable at a time.
- Do percentage rollouts: route ~10% of production traffic to the new variant first and let it run for at least 48 hours (see the bucketing sketch after this list).
- Wait for enough volume: a lot of teams draw conclusions from a handful of conversations, but you usually need a few hundred interactions per variant before the results mean anything.
- Define rollback criteria up front: what counts as failure should be decided before deployment, not after.
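For the percentage rollout, here is a minimal sketch of what I mean, assuming deterministic hash-based bucketing so the same user always sees the same variant. The prompt texts, variant names, and the 10% split are placeholders, not any particular tool's API:

```python
import hashlib

# Placeholder prompts and variant names; the 10% split matches the
# rollout percentage mentioned above.
PROMPTS = {
    "control": "You are a helpful support agent.",
    "candidate": "You are a helpful support agent. Answer in at most 3 sentences.",
}
ROLLOUT_PCT = 10

def assign_variant(user_id: str) -> str:
    """Deterministically bucket a user so they see the same variant every time."""
    # Hash the user id into [0, 100); stable across requests and deploys,
    # unlike random.random(), which would reshuffle users on every request.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "candidate" if bucket < ROLLOUT_PCT else "control"

def build_messages(user_id: str, user_message: str) -> tuple[str, list[dict]]:
    variant = assign_variant(user_id)
    messages = [
        {"role": "system", "content": PROMPTS[variant]},
        {"role": "user", "content": user_message},
    ]
    # Log the variant with every conversation so results can be segmented later.
    return variant, messages
```

The important part is persisting the variant tag alongside every conversation log; without it you can't attribute a win or a regression to the prompt change.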
The bigger issue is that most teams don't actually have the infrastructure for systematic prompt evals or rollouts. A lot of "LLMOps" in practice still ends up being logging plus manual review.
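To make the "enough volume" point concrete: if you log a binary outcome per conversation (resolved vs. escalated, thumbs up/down, whatever your success metric is), a two-proportion z-test is about the cheapest systematic eval you can bolt onto existing logs. The counts below are made up purely for illustration:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(wins_a: int, n_a: int, wins_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in success rates between two variants."""
    p_a, p_b = wins_a / n_a, wins_b / n_b
    p_pool = (wins_a + wins_b) / (n_a + n_b)  # pooled rate under the null
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same 70% vs. 80% success rates in both cases, different sample sizes:
print(two_proportion_z_test(28, 40, 32, 40))      # ~0.30 -> noise at 40/arm
print(two_proportion_z_test(280, 400, 320, 400))  # ~0.001 -> significant at 400/arm
```

A 10-point improvement that looks obvious across 40 conversations is statistical noise; at a few hundred per arm it actually holds up, which is why I don't trust conclusions drawn from a day of traffic.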
Curious what people here are actually using for this in production.
- Any existing feature flag tools?
- Custom infra?
- Langfuse / Helicone / Braintrust?
- Fully internal platforms?