u/rohynal

▲ 4 r/learnmachinelearning +1 crossposts

Does AI behavior reset too easily across runtimes?

One pattern I keep seeing with AI agents:

You finally get an agent's behavior dialed in:

  • boundaries
  • approvals
  • dos/don'ts
  • escalation behavior

Then the context or runtime changes and you end up re-teaching everything from scratch.

Not just annoying. Potentially risky once agents start touching real systems and taking irreversible actions.

Feels like there's a missing portability layer for behavioral expectations across tools/runtimes.

Curious whether people think this eventually gets solved through:

  • prompts
  • runtime semantics
  • MCP-style layers
  • policy artifacts
  • something else entirely

Or whether this is just the cost of building with agents right now.

reddit.com
u/rohynal — 8 hours ago
▲ 9 r/ReplitBuilders +1 crossposts

Anyone else constantly re-teaching AI agents the same behavior?

You spend hours shaping an agent:

  • what tools it can touch
  • what it should ask before acting
  • what counts as risky
  • when it should stop and clarify

Eventually it mostly behaves.

Then the surface changes: new runtime, new coding tool, new MCP server, new workflow…

…and suddenly you're re-explaining the same expectations all over again.

Feels like a lot of this stuff currently lives in prompts, habits, and the operator's head instead of surviving across surfaces.

Curious how others are handling this.

Prompts? Policy files? Wrappers/hooks? MCP? Just accepting the drift?

reddit.com
u/rohynal — 10 hours ago

We started measuring "undeclared-intent spend" in agent workflows

Was extending some internal tooling this week and ended up building a metric I didn't expect to care about this much: undeclared-intent spend.

The idea is simple. If an agent session declares it's trying to do A, but reasoning turns later touch systems or execution paths outside that declared intent, how much compute went toward that work?

Example output from one session:

Total compute     5,137 tokens
Undeclared        1,173 tokens   (22.8%)
Declared          3,964 tokens   (77.2%)
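
For anyone curious what this looks like mechanically, here's a rough sketch of how a breakdown like the one above could be computed. It assumes each reasoning turn carries its token count, the resources it touched, and the scope implied by the declared intent; `Turn`, `touched`, and `declared` are illustrative names, not our actual schema.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    tokens: int        # tokens spent on this reasoning turn
    touched: set[str]  # systems/paths the turn actually touched
    declared: set[str] # scope covered by the session's declared intent

def intent_spend(turns: list[Turn]) -> dict:
    """Split total token spend into declared vs. undeclared buckets."""
    total = sum(t.tokens for t in turns)
    undeclared = sum(t.tokens for t in turns if not t.touched <= t.declared)
    return {
        "total": total,
        "declared": total - undeclared,
        "undeclared": undeclared,
        "undeclared_pct": round(100 * undeclared / total, 1) if total else 0.0,
    }
```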

What's interesting about this isn't governance language or policy enforcement. It's that unintended execution now has a measurable operational cost.

Retries cost money.
Loops cost money.
Reasoning drift costs money.
Off-task execution costs money.

The more time I spend tracing agent systems, the more it feels like cost is becoming a behavioral signal, not just billing telemetry.

One subtle thing we ran into while building this: sometimes "undeclared" genuinely reflects drift, where the agent wandered into systems it wasn't supposed to touch. Sometimes the runtime surface itself doesn't expose enough information to determine intent cleanly, and "undeclared" is really "indeterminable from here."

That distinction ended up mattering a lot more than I expected, because the two failure modes deserve very different responses.
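
Roughly, the split we ended up encoding looks like this (a sketch with made-up names, not the real implementation): a turn only counts as drift when the runtime actually reported what it touched.

```python
from enum import Enum
from typing import Optional

class Undeclared(Enum):
    DRIFT = "drift"                  # touched systems outside the declared intent
    INDETERMINATE = "indeterminate"  # runtime didn't expose enough to tell

def classify(touched: Optional[set[str]], declared: set[str]) -> Optional[Undeclared]:
    if touched is None:
        # The surface gave us no visibility, so we can't call it drift.
        return Undeclared.INDETERMINATE
    if touched <= declared:
        return None  # fully within the declared intent
    return Undeclared.DRIFT
```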

Curious whether others running agents in production are thinking about off-task compute this way yet, or if most teams are still treating token spend purely as a billing and optimization problem.

Specifically interested in whether anyone has tried to put a number on drift that wasn't just "the bill went up."

reddit.com
u/rohynal — 2 days ago
▲ 3 r/ReplitBuilders +1 crossposts

We found a 3x token attribution distortion in a single agent workflow

Was wiring token tracking into our Governor and ran into something that's been bothering me.

If one LLM reasoning step produces three tool calls, and your observability stack attributes the same token spend to all three events, your downstream analytics are mathematically wrong. Not slightly wrong. Structurally wrong.

Concrete example from a single agent session I ran:

  • Naive event-level aggregation: 14,436 prompt tokens
  • Attributed correctly at the reasoning-step level: 4,812 prompt tokens
  • A 3x overstatement, silently, on one workflow

The fix is straightforward: every reasoning step needs an identity (we use llm_turn_id), and token spend attaches to the step, not to each downstream tool call. Aggregation becomes dedupe-safe by construction.
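
A minimal sketch of what dedupe-safe aggregation looks like, assuming each trace event carries the llm_turn_id of the reasoning step that produced it along with that step's prompt token count (the event shape here is illustrative):

```python
def prompt_tokens(events: list[dict]) -> int:
    """Attribute prompt tokens to reasoning steps, not to individual tool-call events."""
    per_turn: dict[str, int] = {}
    for e in events:
        # Three tool-call events from one reasoning step share one llm_turn_id,
        # so the step's token count is recorded exactly once.
        per_turn[e["llm_turn_id"]] = e["prompt_tokens"]
    return sum(per_turn.values())

# The naive version that produced the 3x overstatement:
#   sum(e["prompt_tokens"] for e in events)
```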

What's been bothering me more is the second-order implication.

In non-deterministic agent systems, the normal ways we think about correctness start breaking down. One of the things that starts replacing it is cost. Retries cost money. Loops cost money. Reasoning drift costs money. Every operational pathology shows up, eventually, in tokens.

Which means cost stops being just billing telemetry and becomes one of the few accountability surfaces that survives non-determinism. But only if the attribution is structurally correct. Otherwise you're not measuring agent behavior. You're measuring an artifact of how your trace events were aggregated.

Curious whether others are also starting to read cost as a behavioral signal rather than just billing, or if I'm reading too much into a single workflow.

reddit.com
u/rohynal — 6 days ago

Been spending a lot of time in r/AI_Agents and r/ArtificialInteligence since launching our Governor module, and I keep noticing the same thing:

Different teams describe the same operational pain using completely different vocabularies.

Some call it observability.
Some call it drift.
Some call it logging.
Some call it debugging.
Some call it performance.

But underneath all of them is the same gap:

The agent did something different from what the operator believed, expected, or intended.

What’s becoming clearer to me is that a lot of the industry is trying to force deterministic behavior onto fundamentally non-deterministic systems.

That feels like the wrong target.

You probably can’t make execution deterministic.
You probably can deterministically understand intent.

Curious if others building/running agents are seeing the same pattern.

reddit.com
u/rohynal — 7 days ago

▲ 2 r/ReplitBuilders +1 crossposts

I’m starting to think most “agent bugs” aren’t bugs. They’re mismatches between what we think we asked and what the agent thinks we asked.

That got me thinking about how we frame agent observability.

Most of the conversation treats the gap between what an agent claims it’s doing and what it actually does as a governance problem. Catch bad actions. Stop the agent before it deletes the wrong database.

That’s real. But I’m seeing something else.

A lot of developers are using the same idea for a completely different purpose: debugging their own assumptions about the model.

Examples I keep hearing:

  • Someone spent weeks debugging ranking issues, only to realize the prompt wasn’t being interpreted the way they thought.
  • Output drift that wasn’t a bug. The agent was doing exactly what it believed it was asked to do.
  • Instruction-following gaps where the agent technically followed instructions, just not in the way the operator expected.

In all these cases, the developer wasn’t catching the agent. They were catching themselves.

The most useful signal wasn’t the output. It was reconstructing:
what did I think I asked vs what did the agent think I was asking?

That makes me wonder if the “failure/incident” framing for observability is too narrow.

“Intent vs execution” might not just be for governance. It might be one of the most useful debugging primitives for everyday agent work.

Curious how others are handling this:

  • Are you debugging prompt interpretation / output drift by reconstructing the agent’s understanding?
  • What does that look like in practice? Logs, eval traces, reruns, something else?
  • Does “claim vs action” resonate here, or does it feel like the wrong vocabulary outside governance?

(For context, I’ve been exploring this space and built a small open-source tool around it. Happy to share if relevant, but mostly interested in whether this pattern resonates.)

reddit.com
u/rohynal — 8 days ago

I’ve been running coding and workflow agents in my own setup for the past couple of months and kept running into the same issue:

When something went wrong, I couldn’t reconstruct what the agent thought it was doing versus what it actually did.

Tool-call logs showed operations, but not the reasoning behind them.

So I added a simple trace layer around my own sessions.

On one recent Claude Code run:

  • 2,830 events
  • 3,256 rule violations (multiple flags can fire per event)

The patterns were consistent:

  • no declared intent
  • scope expanding across tool calls
  • memory writes happening without classification

Most of this never showed up in the logs I was reading.

The biggest shift for me was how it changes the way you debug. Instead of reading tool calls, you start asking:

  • what was this agent supposed to be doing?
  • where did it stop doing that?

I turned this into a small local tool so I could keep running it across sessions.

It’s basically:

  • a wrapper around tool calls
  • a fixed event schema (intent, scope, context, memory)
  • a CLI that summarizes where behavior diverges

No cloud, no accounts, no enforcement. Just visibility.
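
In case it helps to see the shape of it, here's roughly what an event in that schema looks like, plus the kind of rule check that produced the flags above. Field and function names are illustrative, not the tool's actual API.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TraceEvent:
    tool: str                           # the wrapped tool call
    declared_intent: Optional[str]      # what the agent said it was doing, if anything
    scope: list[str] = field(default_factory=list)  # resources this call touches
    context: dict = field(default_factory=dict)     # session/runtime context
    memory_write: bool = False          # did this call write to memory?
    memory_class: Optional[str] = None  # classification of that write, if provided

def flags(event: TraceEvent, session_scope: set[str]) -> list[str]:
    """Rule checks; multiple flags can fire per event, so violations can outnumber events."""
    out = []
    if not event.declared_intent:
        out.append("no-declared-intent")
    if any(r not in session_scope for r in event.scope):
        out.append("scope-expansion")
    if event.memory_write and event.memory_class is None:
        out.append("unclassified-memory-write")
    return out
```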

Appreciate any feedback the community can offer.

reddit.com
u/rohynal — 9 days ago