u/MindPsychological140

How do you stop ConversationBufferMemory from re-injecting full tool outputs every turn?
▲ 2 r/LangChain+1 crossposts

Hey r/LangChain,

(Disclosure: I'm not a native English speaker and have dyslexia, so I used an LLM to clean up the wording. Code, benchmarks and live API receipts are mine.)

I have a coding agent that re-feeds yarn.lock / pnpm-lock.yaml output into the prompt every turn. With stock `ConversationBufferMemory` I hit Gemini's `400 INVALID_ARGUMENT "exceeds 1048576"` after just 2 turns, because every previous tool output gets re-injected verbatim. To prove this isn't a synthetic strawman, I ran a 6-turn agent on a payload built from two real public lock files, `facebook/react/yarn.lock` (823 KB) and `vercel/next.js/pnpm-lock.yaml` (1.31 MB), about 2 MB / 1M cl100k tokens per turn, and pointed it at Gemini 3.1 Flash-Lite. SHA-256 of both files + raw Gemini response bodies (HTTP 400 on the vanilla side, HTTP 200 on the deduped side) are in the PDF here:

https://github.com/corbenicai/merlin-community/blob/main/docs/benchmarks/langchain_2026-05-14.pdf

Curious how others handle this:

- Custom `BaseMemory` subclass that dedupes the rendered string?

- Switch to `ConversationSummaryMemory` and accept the LLM-as-summarizer cost / latency?

- Manual `keep_last_n_messages` window (loses earlier context)?

- Move to a checkpointed agent (LangGraph) and skip ConversationChain altogether?

- Something else I'm missing?

What I ended up doing is a small `BaseMemory` subclass that strips byte-identical duplicate lines from the rendered history string before each LLM call (no summarization, no semantic compression, just exact-line dedup, so it's deterministic). It inherits from `langchain_classic.base_memory.BaseMemory` so Pydantic validation in `Chain.memory` slots accepts it. When the underlying engine isn't available, it transparently falls back to vanilla LangChain behavior with a one-line warning.
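For anyone who wants the idea without opening the repo, here is a minimal sketch of exact-line dedup on a rendered history string. It is not the code from the repo: `dedupe_lines` is a hypothetical helper name, and the sketch assumes the history is a plain string where the first occurrence of each line should be kept in its original position.

```python
def dedupe_lines(history: str) -> str:
    """Drop every line that is byte-identical to an earlier line.

    Hypothetical helper for illustration; keeps the first occurrence
    and preserves the original order, so the output is deterministic.
    """
    seen: set[bytes] = set()
    kept: list[str] = []
    for line in history.splitlines(keepends=True):
        key = line.encode("utf-8")  # compare raw bytes, not normalized text
        if key in seen:
            continue
        seen.add(key)
        kept.append(line)
    return "".join(kept)


if __name__ == "__main__":
    turn = "Tool output:\nlodash@4.17.21\nreact@18.2.0\n"
    history = turn + "Human: next step?\n" + turn
    print(dedupe_lines(history))
```

In a memory class, something like this would presumably run on the string produced by `load_memory_variables` just before it goes into the prompt.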

Result on the same 6-turn run: vanilla crashes on turn 2, mine survives all 6. Same Gemini call returns 200. Code (MIT) + reproducible benchmark script:

https://github.com/corbenicai/merlin-community/tree/main/integrations/langchain

Genuinely curious about other patterns people are using, especially for very long-running agents where my 1-hour fallback retry might be too coarse.

u/MindPsychological140 — 6 days ago
▲ 0 r/devops

Open-source byte-exact dedup tool — EOSE Labs picked it up as a Terraform state hygiene layer for their sovereign language fleet

Disclosure: I'm the author. Posting because the use case isn't one I would've pitched to r/devops on my own, but a production team picked it up for exactly that and figured this sub might find it useful.

The case study

EOSE Labs (pemos.ca/pemgraphs) built a sovereign infrastructure architecture they call PEMGRAPHS — eight language nodes (Python, Rust, Go, TypeScript, Bash, Perl, Lean 4 for formal verification, HCL/Terraform), AKS + Azure Key Vault, multi-tier ADA vault secret rotation, 3,051-theorem Lean 4 proof engine. They're using my dedup tool (Merlin) at the Terraform state node and the Python v4 lineage tier as a hygiene primitive.

Their public shoutout: "Corbenic didn't know they were unlocking sovereign fleet context optimization. They were just building really good dedup." pemos.ca/community-shoutout

What the tool is

  • Byte-exact line-level dedup using xxHash3-64. Deterministic, lossless.
  • Single-threaded, ~250 KB Windows x64 binary, statically linked (no msys2 / runtime deps).
  • 1.10 µs median latency per chunk (their measurement, not mine).
  • MIT-licensed integrations: CLI, MCP server, HTTP proxy.
  • Hard-enforced caps in community tier: 50 MB/run · 200 MB/day · 2 GB/month. Refuses oversized work cleanly.
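
A rough sketch of hash-keyed line dedup, for illustration only. It is not Merlin's engine. xxHash3-64 is not in the Python standard library, so this stand-in uses an 8-byte `hashlib.blake2b` digest to get a 64-bit key, and it trusts the digest without re-comparing the bytes; `dedupe_stream` is a hypothetical name.

```python
import hashlib
import io
from typing import BinaryIO, Iterator


def dedupe_stream(src: BinaryIO) -> Iterator[bytes]:
    """Yield each distinct line once, in first-seen order.

    Stand-in sketch: keys are 64-bit BLAKE2b digests instead of
    xxHash3-64, and the digest is trusted without re-comparing bytes.
    """
    seen: set[bytes] = set()
    for line in src:  # binary file objects iterate line by line
        digest = hashlib.blake2b(line, digest_size=8).digest()
        if digest not in seen:
            seen.add(digest)
            yield line


if __name__ == "__main__":
    log = io.BytesIO(b"GET /health 200\nGET /health 200\nPOST /login 401\n")
    for line in dedupe_stream(log):
        print(line.decode().rstrip())
```

Storing fixed-size digests instead of whole lines keeps memory proportional to the number of distinct lines rather than their total length.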

Why DevOps might care (beyond AI workloads)

  • Terraform state files balloon fast in large infra. EOSE uses it as a hygiene layer at the HCL/Terraform node.
  • CI/CD log pipelines accumulate duplicate noise. ./merlin-lite input.log --output-dedup=clean.log is a standalone primitive.
  • Anything where you process append-heavy streams and want a fast pre-archival pass.

Install

curl -LO https://github.com/corbenicai/merlin-community/releases/latest/download/merlin-community.zip
unzip merlin-community.zip && cd merlin-community
python shared/install_helpers.py <integration> enable

Or just use the binary standalone: positional input file + --output-dedup=PATH.

Honest caveats

  • Line-level exact-match. Not semantic, not fuzzy. If you need near-duplicate detection on rephrased content, this isn't it.
  • Open-core: there's a closed-source Pro engine for high-throughput servers. What's in the public repo is what runs in the community edition: same MCP interface, same caps.
  • Windows x64 binary in v0.2.1 release. Linux + macOS coming once cross-platform CI is up.

Repo: github.com/corbenicai/merlin-community

Zero telemetry. Curious if anyone here has Terraform state hygiene pain that this might fit, or if EOSE's pattern (using a dedup primitive as a building block in a bigger system) maps to something you're already doing.

u/MindPsychological140 — 7 days ago

Following #5563: measured 22-71% chunk duplication across 22M passages (2 arXiv papers). MCP dedup tool, MIT-licensed. Doesn't pretend to fix the cross-session replay

Disclosure: I'm the author of Merlin (MIT-licensed MCP dedup tool). Posting here because the token-burn conversations on this sub and in #5563 / #10585 overlap with what I've been measuring.

Scope first, so this doesn't read as a vendor pitch:

The 89% conversation-level waste in #5563 is cross-session full-history replay. That's Hermes-internal architecture: only the orchestration layer (Entroly's #10585 proposal, or persistent prompt caching across session boundaries) can fix it. An external transport-layer tool cannot solve that.

What an external tool can catch is the WITHIN-call duplication. Each individual session payload (the 170KB → 728KB ones in the field report) still contains repeated chunks internally: system prompts re-stated, file contents quoted multiple times, tool definitions listed twice, etc.

Measured across 22 million LLM context passages from real agent sessions and RAG pipelines:

  • ~22% duplicate on typical agent payloads
  • Up to 71% on RAG-heavy ones
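
To make the metric concrete, a minimal sketch of how a within-call duplicate ratio can be computed. The definition here (repeated chunks divided by total chunks, one chunk per line) is an assumption for illustration, not the methodology from the papers, and `duplicate_ratio` is a hypothetical name.

```python
def duplicate_ratio(payload: str) -> float:
    """Fraction of chunks that repeat an earlier chunk in the same payload.

    Assumes one chunk per line; returns 0.0 for an empty payload.
    """
    chunks = payload.splitlines()
    if not chunks:
        return 0.0
    duplicates = len(chunks) - len(set(chunks))
    return duplicates / len(chunks)


if __name__ == "__main__":
    payload = "system: be concise\ntool: read_file\nsystem: be concise\nuser: hi\n"
    print(f"{duplicate_ratio(payload):.0%}")
```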

Papers:

Merlin is a small MCP server (stdio JSON-RPC) with three tools: merlin_dedupe, merlin_dedupe_file, merlin_savings_summary. It speaks the same MCP protocol Hermes already uses for hermes mcp serve (#10835), so it slots in alongside the existing tool surface.
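
For readers who have not looked at MCP on the wire, a sketch of what a call to one of these tools looks like as a JSON-RPC message. The `tools/call` method and the `name` / `arguments` fields come from the MCP spec; the `text` argument is a guess at the tool's schema for illustration and is not taken from the repo.

```python
import json


def build_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as one newline-terminated JSON line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # The stdio transport is newline-delimited, so keep the JSON on one line.
    return json.dumps(message) + "\n"


if __name__ == "__main__":
    # "text" is a hypothetical argument name for illustration.
    print(build_tool_call(1, "merlin_dedupe", {"text": "a\nb\na\n"}), end="")
```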

Caps are hard-enforced; community edition is the only thing in this post.

Repo: github.com/corbenicai/merlin-community — MIT, ~250 KB Windows x64 binary in the latest release. Linux + macOS in flight.

We collect zero telemetry, so GitHub stars are the only adoption signal we get.

u/MindPsychological140 — 7 days ago
▲ 0 r/cursor

22-71% of your AI coding input tokens are duplicates, we measured it across 22M passages (2 arXiv papers). Just shipped MCP support for Cursor

Disclosure first: I'm the author. MIT, runs locally, zero telemetry.

The finding:

We measured chunk-level redundancy across 22 million LLM context passages from real agent sessions and RAG pipelines:

~22% duplicate on typical agent loops — file contents re-quoted, system prompts re-sent, tool results restated across turns

Up to 71% on RAG-heavy workflows — vector-retrieved chunks overlap massively across consecutive queries

Papers + corpus:

arXiv:2605.09611 — architecture

arXiv:2605.09990 — empirical analysis (the 22M-passage measurement)

Zenodo DOI: 10.5281/zenodo.20090991

Why your Cursor bill grows faster than your codebase:

Every duplicate chunk is input tokens you're billed for. Anthropic's prompt caching helps when the prefix is stable, but the moment your context tail mutates (which it does — tool results, agent reasoning, file diffs), the repeated bits eat full input cost on every turn.

Merlin is a small MCP server that intercepts chunk-level repetition before content reaches the model. Standard stdio JSON-RPC. Three MCP tools: merlin_dedupe, merlin_dedupe_file, merlin_savings_summary.

Install for Cursor (v0.1.3, just shipped):

curl -LO https://github.com/corbenicai/merlin-community/releases/latest/download/merlin-community.zip

unzip merlin-community.zip && cd merlin-community

python shared/install_helpers.py cursor enable

Writes a stdio MCP entry to ~/.cursor/mcp.json. Restart Cursor. Same one-command install path that Claude Desktop / Claude Code / OpenClaw users already had — Cursor is the new addition in v0.1.3.
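
For reference, a sketch of the shape such an entry takes. The top-level `mcpServers` key with `command` and `args` is Cursor's config format; the server name `merlin` and the command path are placeholders, not what the installer actually writes, and `add_stdio_server` is a hypothetical helper.

```python
import json


def add_stdio_server(config: dict, name: str, command: str, args: list[str]) -> dict:
    """Return a copy of an mcp.json config with one stdio server entry added."""
    servers = dict(config.get("mcpServers", {}))
    servers[name] = {"command": command, "args": args}
    return {**config, "mcpServers": servers}


if __name__ == "__main__":
    # Placeholder path and server name, for illustration only.
    updated = add_stdio_server({}, "merlin", "/path/to/merlin-mcp", [])
    print(json.dumps(updated, indent=2))
```

Merging into a copy rather than overwriting the file keeps any servers that are already configured.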

Honest caveats:

Community tier hard-enforces caps: 50 MB/run, 200 MB/day, 2 GB/month. The community binary refuses over-quota work with a clear message — verified yesterday on a 51 MB test file. Hobbyist usage doesn't hit these; sustained commercial pipelines do in 2-3 days.

Doesn't fix cross-session cache invalidation. If your agent loop fragments sessions and replays full history (looking at you, fragmented Hermes sessions), that's an orchestration problem above Merlin's MCP layer, not solvable on the wire.

The C++ engine is closed-source and Pro-only (separate product). Community edition is the simpler same-interface build, ~250 KB Windows x64 binary in the release. Linux + macOS in flight.
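
A rough sketch of how caps like the ones in the first caveat can be checked before a run. This illustrates the policy only and is not the binary's actual logic; it assumes 1 MB = 1,000,000 bytes and that usage counters are tracked elsewhere. `Usage` and `check_caps` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

MB = 1_000_000  # assumption: decimal megabytes


@dataclass
class Usage:
    today_bytes: int = 0
    month_bytes: int = 0


def check_caps(run_bytes: int, usage: Usage) -> Optional[str]:
    """Return a refusal message if the run would exceed a cap, else None."""
    if run_bytes > 50 * MB:
        return "refused: run exceeds 50 MB/run"
    if usage.today_bytes + run_bytes > 200 * MB:
        return "refused: would exceed 200 MB/day"
    if usage.month_bytes + run_bytes > 2000 * MB:
        return "refused: would exceed 2 GB/month"
    return None


if __name__ == "__main__":
    print(check_caps(51 * MB, Usage()))
    print(check_caps(10 * MB, Usage(today_bytes=100 * MB, month_bytes=500 * MB)))
```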

Repo: github.com/corbenicai/merlin-community

Zero telemetry. GitHub stars are literally the only signal we get that someone's using this, so if you find it useful, ⭐ helps. Issues tracker is open for feature requests, bug reports, and honest critique.

[EDIT: CORRECTION ON HOW THIS WORKS IN CURSOR - See comments. I incorrectly framed this as an automatic context proxy, but it is an MCP tool that requires explicit LLM calls. It does NOT automatically reduce Cursor's internal context bills.]

u/MindPsychological140 — 7 days ago

Built an MCP dedup tool for Cowork, sub-agent context overlap was burning ~25% of my input tokens

Been running Cowork for a few weeks on a research project (3-4 sub-agents in parallel, shared planning doc, long-running). Noticed my input token bills were growing way faster than the actual scope of the work. Reason: sub-agents kept seeing overlapping context — the planning doc, the same file excerpts, shared user instructions — every turn.

Built a small local-first MCP tool (Merlin) that dedupes chunks before they hit the model. Measured ~25% chunk-level dedup on a typical Cowork session, up to 71% on RAG-heavy workflows. That's wasted input I was already paying for.

The community edition is MIT-licensed, runs locally (no cloud, zero telemetry), and exposes two MCP tools (merlin_dedupe, merlin_savings_summary) plus a $ saved indicator. Install: python shared/install_helpers.py claude_desktop enable, restart Claude Desktop, done.

Repo: github.com/corbenicai/merlin-community

Honest caveats:

  • C++ engine is closed-source and Pro-only (separate product). Community edition is a simpler single-threaded build — fine for most individual workloads.
  • Caps at 50 MB/run, 200 MB/day, 2 GB/month for the community tier.
  • v0.1.2, Windows x64 binary in releases. Linux + macOS follow.

Curious if anyone else here has measured their Cowork token overhead. I would love to know if my 25% number generalizes.

u/MindPsychological140 — 7 days ago

22M-passage analysis: 22-71% of LLM context is redundant (arXiv papers + open-source implementation released)

Just published two arXiv preprints analyzing context redundancy in production LLM pipelines, along with an open-source C++ implementation.

**Headline finding**: Across 22.2M passages from real-world LLM workloads (agent sessions, RAG pipelines, long conversations), 22-71% of the context sent to the LLM is byte-level duplicate. You pay for that on every API call.

**Papers**:

- Empirical analysis (22M passages): https://arxiv.org/abs/2605.09990

- Engine architecture: https://arxiv.org/abs/2605.09611

**What we built**:

A deterministic, byte-exact deduplication engine (Merlin) that strips duplicate chunks before the LLM call. 100% mathematical equivalence to a reference Python `set()` operation, verified across the full corpus. Implemented in C++ to bypass GIL/GC overhead of the standard Python approach.
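
The equivalence claim can be phrased as a property check: the chunks that survive dedup must have the same members as `set()` of the input, and no survivor may appear twice. A minimal sketch, with a hypothetical order-preserving `dedupe` standing in for the engine (this is not the verification harness used for the papers):

```python
import random


def dedupe(chunks: list[bytes]) -> list[bytes]:
    """Hypothetical stand-in for the engine: keep first occurrences, in order."""
    return list(dict.fromkeys(chunks))


def matches_set_reference(chunks: list[bytes]) -> bool:
    """True if dedupe() output has the same members as set() and no repeats."""
    out = dedupe(chunks)
    return set(out) == set(chunks) and len(out) == len(set(chunks))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the check is reproducible
    for _ in range(1000):
        sample = [bytes([rng.randrange(4)]) for _ in range(rng.randrange(20))]
        assert matches_set_reference(sample)
    print("ok")
```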

**Performance**:

- 244 KB binary, only Windows system DLLs as runtime deps

- Independent integrators (EOSE Labs) measured ~1µs median in-process latency on consumer hardware

- 100% local, verifiable with `strings binary | grep -i http` (returns nothing)

**Open source release**:

- **MIT-licensed**: integration glue (MCP server, VSCode extension, Claude Code hook, install scripts for Claude Desktop / Claude Code / OpenClaw)

- **Free Windows binary**: community-tier engine within caps (50 MB/run · 200 MB/day · 2 GB/month)

- Pro tier with multi-threaded engine is separate

**Repo**: https://github.com/corbenicai/merlin-community

**Day-1 adoption**: Within 24 hours of public release, an external team (EOSE Labs) integrated it into their production pipeline and published benchmarks: https://pemos.ca/pemgraphs

**Discussion welcome on**:

- Deterministic byte-exact dedup vs probabilistic (MinHash/LSH) for context filtering

- How others are measuring context redundancy in production

- Stack-fit with existing RAG/agent setups
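
To frame the first discussion point, a small sketch of the trade-off: exact matching treats a one-word change as a brand-new chunk, while a similarity measure such as Jaccard over word shingles (the quantity MinHash approximates) still scores the pair as close. The function names are hypothetical and this is not part of the released engine.

```python
def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Set of k-word shingles; MinHash estimates Jaccard over sets like this."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two shingle sets (1.0 means identical sets)."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    x = "the quick brown fox jumps over the lazy dog"
    y = "the quick brown fox jumps over the lazy cat"
    print("exact match:", x == y)
    print("jaccard:", round(jaccard(x, y), 2))
```

Exact dedup would keep both sentences; a fuzzy filter with a threshold below the printed score would drop one, which is where determinism and losslessness are traded away.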

u/MindPsychological140 — 7 days ago
▲ 0 r/ResearchML+1 crossposts

Hi everyone, I’m Sietse Schelpe, an independent researcher from Belgium working on AI infrastructure. I have finished two companion papers about practical efficiency improvements for large language model inference and retrieval-augmented generation pipelines.

The work is pre-registered, uses public benchmarks, and focuses on making real-world LLM serving more efficient while keeping full quality.

Because this is my first time submitting to arXiv, I’m looking for someone with endorsement rights in cs.LG who would be willing to endorse me.

The link is here:
https://arxiv.org/auth/endorse?x=7K7DOH

If the link doesn’t work, just go to arxiv.org/auth/endorse.php and enter code 7K7DOH

I would really appreciate any help. Thank you so much for your time, and I’m open to any feedback.

Regards,
Sietse

u/MindPsychological140 — 12 days ago