u/ZealousidealCorgi472

[D] The agent memory ordering problem: loading past context before current evidence creates anchoring bias

Ran into something subtle while building a diagnostic agent for LLM quality monitoring that I haven't seen written about much. Posting because it might be useful for others building similar systems.

The agent investigates why LLM quality dropped. It has access to past investigation episodes stored in a database — what the agent found last time quality dropped, what the fix was.

My first implementation loaded these past episodes into the system prompt before the agent ran. The idea was to give the agent context about what it had seen before.

The problem: the agent would read "we saw this pattern 3 weeks ago, root cause was prompt structure" before looking at any current evidence. Then it would run fetch_recent_traces, see the current failing cases, and anchor its analysis on the past pattern even when the current regression was a completely different bug class. It was essentially "we've seen this before" before it had looked at "what are we actually seeing now."

This is the same anchoring bias humans exhibit — first information you receive disproportionately influences interpretation of subsequent information. I had accidentally baked it into the agent's context loading order.

The fix was simple once I understood the problem: inject episodic memory into context AFTER the first tool call completes, not before. The agent collects fresh evidence first, then has access to historical patterns for comparison. The ordering changed from:

[past context] → [current query] → investigate

To:

[current query] → investigate → [first tool result + past context] → continue investigation
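In code, the change is roughly this (a minimal sketch; helper names like llm.next_action and memory_store.similar_episodes are illustrative, not my actual implementation):

```python
# Sketch of the revised ordering: episodic memory is appended to the context
# only after the first tool result has been observed.
def investigate(query, llm, tools, memory_store, max_steps=10):
    # Current query goes in first; no historical context yet.
    context = [
        {"role": "system", "content": "You are a diagnostic agent for LLM quality."},
        {"role": "user", "content": query},
    ]
    episodes_injected = False

    for _ in range(max_steps):
        action = llm.next_action(context)                # THINK -> ACT
        if action.name == "finish":
            return action.args["answer"]

        observation = tools[action.name](**action.args)  # OBSERVE
        context.append({"role": "tool", "content": str(observation)})

        if not episodes_injected:
            # Historical context enters only AFTER fresh evidence exists,
            # so it reads as a reference rather than an anchor.
            episodes = memory_store.similar_episodes(query, k=5)
            context.append({
                "role": "system",
                "content": "Past investigations (for comparison only):\n" + "\n".join(episodes),
            })
            episodes_injected = True

    return "max steps reached without a diagnosis"
```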

After this change the agent stopped misidentifying new failure modes as previously-seen patterns. Diagnoses became noticeably more accurate on cases where the current regression was superficially similar to a past one but had a different root cause.

The broader principle: for agents that use episodic memory, the insertion point of historical context into the reasoning chain matters as much as whether you include it at all. Historical context is most useful as a reference AFTER gathering current evidence, not as a frame BEFORE examining current evidence.

Curious whether others have run into this. Is there a principled way to decide when to inject different memory types? I've been thinking about it as: in-context and project context at the start (defines the task and scope), semantic search results and episodic memory after first tool call (reference after fresh observation), never in the system prompt for anything time-sensitive.
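Written down as a config, that heuristic looks roughly like this (the type names are just mine, not a standard):

```python
# The heuristic above as a mapping from memory type to injection point
# (illustrative names, not from any particular framework).
MEMORY_INJECTION_POINTS = {
    "working_memory":  "start",             # in-context scratchpad for the current run
    "project_context": "start",             # defines task and scope, not time-sensitive
    "semantic_memory": "after_first_tool",  # similarity-search results
    "episodic_memory": "after_first_tool",  # past investigation episodes
}
# Rule of thumb: nothing time-sensitive goes in the system prompt.
```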

Does that hold up? Or are there cases where historical context should come first?


TraceMind – open source LLM quality monitoring with a ReAct agent that investigates why your AI started giving wrong answers

Background: I was building a multi-agent system. Changed one line in a system prompt. Quality dropped from 84% to 52% pass rate.

HTTP 200 the whole time. Found out 11 days later from a user. That incident made me realize LLM apps have a monitoring gap that doesn't exist in traditional software. When a database query returns the wrong rows, you usually find out fast. When an AI response is factually wrong, everything still looks healthy — correct status codes, normal latency, zero errors. The failure is completely invisible to standard tooling.

I spent a few months building TraceMind to solve this. Here's what it actually does:

**Automatic background scoring**

Every LLM call that goes through the SDK gets scored automatically within 10 seconds. The judge returns a number AND a one-sentence explanation — "Response contradicted the refund policy stated in context." A score of 4.2 with no explanation isn't actionable. 4.2 with a reason is.

The scoring is decoupled from ingestion. The HTTP endpoint returns 202 in under 10ms regardless of what the judge is doing. Your app never waits for TraceMind.
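The shape of it is roughly this (a simplified sketch, not the actual TraceMind code; the endpoint path and run_judge are illustrative placeholders):

```python
# Simplified sketch of the decoupled ingestion path. The endpoint only persists
# the trace and returns 202; the judge runs out of the request path as a
# background task.
import uuid
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()
TRACES: dict[str, dict] = {}   # stand-in for the real trace store

@app.post("/v1/traces", status_code=202)
async def ingest_trace(trace: dict, background: BackgroundTasks) -> dict:
    trace_id = str(uuid.uuid4())
    TRACES[trace_id] = trace                      # fast write, no LLM call here
    background.add_task(score_trace, trace_id)    # judge runs after the response is sent
    return {"id": trace_id, "status": "queued"}

def score_trace(trace_id: str) -> None:
    trace = TRACES[trace_id]
    # run_judge is a placeholder for the actual judge call; it returns a score
    # plus a one-sentence reason like "contradicted the refund policy in context".
    score, reason = run_judge(trace["prompt"], trace["response"])
    TRACES[trace_id]["score"], TRACES[trace_id]["reason"] = score, reason
```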

**The part I'm most interested in — root cause investigation**

When quality drops, most tools show you a chart. You still have to figure out why.

I built an EvalAgent, a ReAct loop with 6 tools: fetch recent failing traces, search past failures by semantic similarity (ChromaDB + local sentence-transformers), run targeted evals, analyze failure patterns using a 70B model, generate new test cases for the identified failure mode, and send alerts.

You ask it in plain English. It runs a loop:

THINK → what do I need to understand this?

ACT → call a tool to get that information

OBSERVE → what did the tool reveal?

REPEAT

Average 4-5 tool calls. About 45 seconds. Returns a specific root cause and specific fix — not a dashboard to interpret.

**Some architectural decisions that might be interesting:**

Text-based ReAct instead of native tool calling. I'm running on Groq's free tier with smaller open models. Native tool calling on 8B-70B models is unreliable — they hallucinate tool names and produce malformed schemas. Text-based ReAct is more forgiving. Parse failures are recoverable. Malformed native tool schemas often aren't.
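Concretely, the forgiving part looks something like this (sketch, not the real parser):

```python
# Sketch of why text-based ReAct is forgiving. The model emits lines like:
#   THOUGHT: quality dropped on ambiguous questions
#   ACTION: fetch_recent_traces {"hours": 24}
import json
import re

ACTION_RE = re.compile(r"ACTION:\s*(\w+)\s*(\{.*\})?")

def parse_action(model_output: str):
    match = ACTION_RE.search(model_output)
    if match is None:
        return None                  # recoverable: re-prompt the model
    name = match.group(1)
    try:
        args = json.loads(match.group(2) or "{}")
    except json.JSONDecodeError:
        args = {}                    # recoverable: retry with an error message
    return name, args

# On a parse failure you just append "Could not parse your ACTION line, try again"
# to the conversation and loop. Nothing is lost. A malformed native tool-call
# payload, by contrast, often fails inside the provider's API.
```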

Four memory types in the agent: in-context working memory, project context, episodic memory from past runs (last 5 stored in Postgres), and semantic memory in ChromaDB. The ordering matters — past episodes load AFTER the first tool call, not before. Loading them first creates anchoring bias where the agent reads "we saw this pattern" before looking at current evidence and misdiagnoses new bugs as known patterns.

Hallucination detection in 3 stages with json_mode=False. Groq's JSON mode forces object format and breaks array extraction. Took me an embarrassingly long time to debug that one.
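The workaround is mundane: skip JSON mode and pull the array out of plain text yourself, roughly:

```python
# Sketch of extracting a JSON array from free-text model output instead of
# relying on JSON mode (which forces a top-level object).
import json
import re

def extract_claims(model_output: str) -> list:
    match = re.search(r"\[.*\]", model_output, re.DOTALL)  # first [...] block
    if match is None:
        return []
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return []

print(extract_claims('Here are the claims:\n["refund window is 60 days", "policy says 30 days"]'))
```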

Multi-sample judge runs twice, takes the median. Single-sample LLM judges vary by ±0.7 on identical inputs. That variance is enough to flip a case from passing to failing between eval runs.
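The aggregation itself is trivial (run_judge here is a stand-in for the actual judge call):

```python
# Sketch of the multi-sample judge: score the same trace n times and keep the
# median so a single noisy run can't flip a case from pass to fail.
from statistics import median

def stable_score(prompt: str, response: str, n: int = 2) -> float:
    scores = [run_judge(prompt, response) for _ in range(n)]  # run_judge: placeholder
    return median(scores)
```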

**What it doesn't do well (honest)**

DeepEval has better task-specific metrics for RAG — faithfulness, answer relevance, contextual precision. These are more credible than a general LLM judge for RAG-specific evaluation. If you're primarily evaluating RAG pipelines, DeepEval's metrics are probably more useful.

The multi-tenancy is application-layer isolation, not row-level security. Fine for a team of one or a small company, not right for serving hundreds of organizations.

**Stack:** FastAPI + Python 3.11, React 18 + TypeScript, PostgreSQL + ChromaDB, Groq (Llama 3.1 8B / 3.3 70B), sentence-transformers local, Alembic, slowapi.

76 unit tests. 44/44 end-to-end verification checks against the live server. Runs entirely on Groq's free tier — $0.

GitHub: github.com/Aayush-engineer/tracemind

Would genuinely value feedback from people doing LLM evals in production — especially whether the agent investigation is useful in practice or just interesting in theory.


Running evals locally without paying for OpenAI — what's your setup?

Tried to set up LLM-as-judge eval for a local project. First instinct was GPT-4o as the judge. Then I saw the bill estimate for running 500 eval cases daily and decided against it.

Switched to running the judge locally. Tried a few things:

Llama 3.1 8B: fast, cheap, inconsistent on nuanced rubrics

Llama 3.3 70B via Groq free tier: much better consistency, still free for moderate volume

Mixtral 8x7B: decent middle ground

The interesting finding: for binary pass/fail judgments, 8B is fine. For nuanced 1-10 scoring with detailed criteria, you really want 70B. The smaller models inflate grades and miss subtle failures.

Also found that prompt length matters more with smaller models: they struggle to follow long rubrics consistently. Shorter, explicit criteria outperform detailed rubric paragraphs.
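For the binary case, the kind of short rubric that holds up on 8B looks roughly like this (illustrative wording, adapt to your task):

```python
# Example of the short, explicit criteria that small judge models follow better
# than long rubric paragraphs.
JUDGE_PROMPT = """You are grading a support-bot answer.

Question: {question}
Reference policy: {reference}
Answer: {answer}

Answer PASS or FAIL.
FAIL if the answer contradicts the reference policy.
FAIL if the answer invents numbers, dates, or policy terms not in the reference.
Otherwise PASS. Reply with a single word."""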

Anyone running eval pipelines on local models? What model/setup are you using for the judge?

u/ZealousidealCorgi472 — 6 days ago

My agent returns HTTP 200 but gives factually wrong answers. How are you catching this?

Working on a support agent and hit a gap I hadn't thought about.

Agent completes successfully. No exceptions. Normal latency. But the answer is wrong: it tells the user the return window is 60 days when the actual policy is 30. Nothing in my logs shows anything unusual.

With normal backend services, failures are obvious. With LLM agents, the service can be completely healthy while giving wrong answers to every user.

Things I've tried so far:

- Running evals on test cases before each deploy

- Scoring a sample of live responses in the background

- Checking responses against retrieved context for RAG flows
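For the third item, the check is roughly this (judge_llm stands in for whatever grader model you use):

```python
# Sketch of checking a RAG answer against its retrieved context: ask a grader
# model whether every factual claim in the answer is supported by the context.
def grounded_in_context(answer: str, retrieved_chunks: list[str], judge_llm) -> bool:
    prompt = (
        "Context:\n" + "\n---\n".join(retrieved_chunks) + "\n\n"
        "Answer:\n" + answer + "\n\n"
        "Does the answer contain any factual claim (numbers, dates, policy terms) "
        "that is not supported by the context? Reply YES or NO."
    )
    return judge_llm(prompt).strip().upper().startswith("NO")
```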

The part I'm still stuck on isn't detection, it's root cause. Was it a prompt change? Did the model start behaving differently on certain inputs? Did the distribution of user questions shift?

What does your setup look like for catching wrong answers, not just failed requests?

u/ZealousidealCorgi472 — 6 days ago

A production AI agent I was testing started giving confidently wrong answers after a small prompt change.

Nothing looked broken:

  • requests were succeeding
  • latency stayed normal
  • logs looked clean

But the agent’s actual behavior got noticeably worse and I only realized it much later from user feedback.

That made me realize most of our monitoring for AI systems is still infrastructure-level:

  • uptime
  • latency
  • token usage
  • retries
  • cost

…but almost nothing checks whether the agent is still behaving correctly over time.

Especially with multi-step agents, failures can be subtle:

  • wrong tool selection
  • hallucinated reasoning
  • retrieval drift
  • incorrect memory usage
  • prompt regressions
  • overconfident answers

The scary part is the system can look perfectly healthy while quality quietly degrades in production.

After running into this a few times I started building internal tooling around semantic monitoring for agents:

  • sampled response evals
  • hallucination checks
  • tracing agent decisions/tool calls
  • comparing prompt versions
  • retry/flagging for suspicious outputs
  • clustering recurring failures
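The clustering one is the simplest of those; roughly (illustrative setup, not my exact internal tooling):

```python
# Sketch of failure clustering: embed low-scoring responses and group them so
# recurring failure modes surface as clusters.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def cluster_failures(failing_responses: list[str], n_clusters: int = 5):
    model = SentenceTransformer("all-MiniLM-L6-v2")   # small local embedder
    embeddings = model.encode(failing_responses)
    labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(embeddings)
    clusters: dict[int, list[str]] = {}
    for text, label in zip(failing_responses, labels):
        clusters.setdefault(int(label), []).append(text)
    return clusters                                    # inspect the biggest clusters first
```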

One thing I didn’t expect:
LLM-as-judge evals were surprisingly noisy. Same input, same output, different scores across runs. Ended up needing aggregation to get stable signals.

Curious how people here are handling this for real-world agents.

Are you mostly relying on:

  • offline eval suites?
  • human review?
  • shadow deployments?
  • custom observability pipelines?
  • user feedback loops?

Feels like a lot of AI agent monitoring is still pretty early.

u/ZealousidealCorgi472 — 7 days ago

A few weeks ago I changed a single line in a system prompt during a deploy.

Nothing looked wrong:

  • error rate stayed normal
  • latency looked fine
  • requests were returning 200s

But response quality got noticeably worse, and I only found out 11 days later because a user complained.

That honestly felt weird coming from normal backend engineering, where failures are usually obvious pretty quickly.

With LLM apps it feels like you can have a system that's technically healthy while giving bad answers the entire time.

Example:
support bot starts confidently saying refunds are valid for 60 days instead of 30.

No exception gets thrown.
No alert fires.
Everything looks green.

After that incident I started building some internal tooling to monitor semantic quality instead of just infra metrics.

Main things that ended up being useful:

  • running background evals on sampled responses
  • checking hallucinations against retrieval context
  • comparing prompt versions statistically instead of eyeballing outputs (sketch after this list)
  • retry/flagging when responses look suspicious
  • clustering failures to spot recurring patterns
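The statistical comparison is nothing fancy, just a Mann-Whitney U test on judge scores from the two prompt versions (sketch using scipy):

```python
# Sketch of comparing prompt versions statistically: collect judge scores for
# responses from each version and test whether the new distribution is shifted
# lower, instead of eyeballing a handful of outputs.
from scipy.stats import mannwhitneyu

def prompt_version_regressed(scores_old: list[float], scores_new: list[float],
                             alpha: float = 0.05) -> bool:
    # One-sided test: is the new version's score distribution stochastically lower?
    stat, p_value = mannwhitneyu(scores_new, scores_old, alternative="less")
    return p_value < alpha
```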

One thing that surprised me:
LLM-as-judge scoring was way noisier than I expected. Running the same judge multiple times on identical inputs gave pretty different scores sometimes, so I started aggregating runs instead of trusting single outputs.

Curious what other people are doing for this in production.

Are most teams just running evals before deploys?
Human review?
Shadow traffic?
Custom judge pipelines?

Feels like "we found out from a user complaint" is still the default monitoring strategy for a lot of LLM apps.

u/ZealousidealCorgi472 — 7 days ago

I changed a system prompt. Quality dropped 84% → 52%. HTTP 200. No errors. Found out 11 days later from a user complaint.

Built TraceMind to solve this. It's free, self-hosted, runs on Groq free tier.

What it does:

- Auto-scores every LLM response in background

- Per-claim hallucination detection (4 types)

- ReAct eval agent that diagnoses WHY quality dropped

- Statistical A/B prompt testing (Mann-Whitney U)

- Python SDK — one decorator, nothing else changes

The agent investigation looks like this:

Step 1: search_similar_failures

→ Found 3 similar past failures (82% match)

Step 2: fetch_recent_traces

→ 14 low-quality traces in last 24h. Lowest score: 3.2

Step 3: analyze_failure_pattern

→ Root cause: prompt has no fallback for ambiguous questions

→ Fix: add explicit fallback instruction

45 seconds. Specific root cause. Specific fix.

Self-hosted, MIT license, no vendor lock-in.

Happy to answer any questions about the architecture.
