r/ContextEngineering

▲ 3 r/ContextEngineering+3 crossposts

Is anyone else drowning in AI context management on large codebases?

Working on a fairly large Azure microservices system (.NET, 40+ services, 5+ years old). We've adopted AI coding assistants across the team and there's genuine productivity gain for individual tasks.
 
But there's a problem nobody seems to talk about: every new chat session is a blank slate.
 
Our codebase has years of accumulated decisions:
• We use a specific handler pattern for vendor integrations
• Auth service has a specific cache-aside setup for historical reasons
• Service boundaries that look weird but make sense given our deployment constraints
• Interface conventions that all the senior engineers know but aren't written anywhere useful
 
When I open a new AI chat, none of that context exists. I either paste a context dump (expensive, eats token budget) or the AI generates code that's syntactically correct but architecturally wrong for our system.
 
We've tried:
• System prompts with architecture descriptions - partial help
• Cursor rules files - limited
• Just re-explaining every session - waste of time
 
I'm actually building a tool to solve this (happy to share more if there's interest) but first wanted to know — is this a widespread problem or specific to how we work?
 
How are experienced devs handling context management with AI assistants on mature codebases?

u/killerexelon — 3 days ago
▲ 14 r/ContextEngineering+4 crossposts

I built a context engine that indexes your codebase and serves it to your coding agent via MCP. The agent understands the architecture before making changes instead of exploring blindly.

On benchmarks it takes Sonnet 4.0 from 66% to 73.4% on SWE-bench. Biggest help on complex repos (Django +12%, sympy +17%).

Most AI coding agents struggle when they hit 10k+ line repositories because of context loss. I’ve been benchmarking Xanther.ai using a proprietary PRAT protocol designed to handle systemic validation rather than just code completion.

Key Results:

  • Context Handling: Zero-shot success on multi-file PRs in complex repos.
  • Orchestration: Integrated with MCP for real-time tool use.
  • Quality: Focused on deterministic, enterprise-grade output that passes CI/CD on the first run.

Curious to hear what you guys think about the transition from "chat-with-code" to fully autonomous agents.

Results on SWE-bench Verified (500 real bugs)

MiniMax M2.5 + Xanther: 78.2% ($0.22/instance)

Sonnet 4.0 + Xanther: 73.4% (baseline was 66%)

Claude Opus without it: 76.8% ($0.75/instance)

Biggest gains on complex repos — sympy +17%, scikit-learn +13%, django +12%.

Looking for people to try it on real projects. Free tier, 60 second setup:

https://preview.redd.it/xpf20k6ugtyg1.png?width=1137&format=png&auto=webp&s=c6091dae916b0a6e8762b2323eedcbd1477962bb

Works with Claude Code, Cursor, Kiro, Windsurf — anything that supports MCP.

https://xanther.ai

Discord: https://discord.gg/Y768kBRS

https://medium.com/@xanther.ai/how-a-0-02-call-model-scored-78-2-on-swe-bench-verified-beating-every-model-on-the-leaderboard-153be05a60f1

u/Economy_Leopard112 — 3 days ago
▲ 4 r/ContextEngineering+5 crossposts

Context Engineering Is the Compass Coding Agent Needs

Coding agents are powerful ships, but they’re sailing without a map. They can write code, run tests, and iterate — but they don’t know where they are in the codebase. Context engineering is the discipline of giving agents the architectural awareness they need to navigate effectively. Without it, even the best models waste tokens exploring dead ends. With it, a cheap model outperforms an expensive one.

https://medium.com/@xanther.ai/context-engineering-is-the-compass-your-coding-agent-needs-6eef30c66286?postPublishedType=initial

The Navigation Problem

Picture a ship in open water. It has a powerful engine, a skilled crew, and enough fuel to reach any destination. But it has no compass, no charts, and no GPS. What happens?

It explores. It tries directions. It backtracks when it hits land where it expected open water. Eventually, through trial and error, it might reach its destination — but it burns 3x the fuel and takes 5x the time.

This is exactly what happens when you point a coding agent at a large codebase without architectural context.

https://preview.redd.it/nr5idnhzj90h1.png?width=720&format=png&auto=webp&s=90ca6ff90066501de6e3f0c66828309d212b2832

The agent has all the capabilities it needs. It can read files, write code, run tests, search for patterns. But it doesn’t know the architecture. It doesn’t know that django/db/models/sql/compiler.py is the heart of query generation, or that changing BaseCache.set() affects every cache backend downstream. It discovers these things through exploration — expensive, token-heavy, error-prone exploration.

Without context engineering:

Agent: "I need to fix the cache race condition"
→ Searches for "cache" → finds 47 files
→ Reads django/core/cache/__init__.py → not helpful
→ Reads django/core/cache/backends/filebased.py → finds the class
→ Reads django/core/cache/backends/base.py → understands inheritance
→ Searches for "thread" → finds 23 files
→ Reads django/utils/autoreload.py → wrong file
→ Reads django/core/files/locks.py → relevant but doesn't know why yet
→ Eventually pieces together the architecture after 12 file reads
Total: ~4,000 tokens, 45 seconds, 2 wrong attempts

With context engineering:

Agent: "I need to fix the cache race condition"
→ Queries XCE: "FileBasedCache race condition threading"
→ Gets back: inheritance chain, threading concerns, related utilities, test infrastructure
→ Goes directly to the right files with full architectural understanding
Total: ~1,500 tokens, 15 seconds, correct on first attempt

Same agent. Same model. Same capabilities. The only difference is the map.

The Three Levels of Context

Not all context is created equal. There’s a hierarchy:

Level 1: Code Context (What exists)

This is what most tools provide today — file contents, function signatures, grep results. It answers “what code is here?” but not “why?” or “how does it connect?”

Tools at this level: file search, grep, symbol lookup, embeddings-based RAG.

Limitation: Finding a function doesn’t tell you what calls it, what it depends on, or what breaks if you change it.

Level 2: Structural Context (How things connect)

This captures relationships — call graphs, inheritance chains, import dependencies, module boundaries. It answers “what depends on what?” and “what’s the execution flow?”

Tools at this level: static analysis, dependency graphs, call chain extraction.

Limitation: Knowing the call graph doesn’t tell you the design intent or architectural role of each component.
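To make the jump from Level 1 to Level 2 concrete: structural context can be approximated with plain static analysis. Here is a minimal Python sketch (illustrative only, not how XCE builds its index) that extracts a per-function call graph using the standard-library ast module:

# Build a rough call graph: which functions call which, per module.
import ast
from collections import defaultdict

def build_call_graph(source: str) -> dict[str, set[str]]:
    """Map each function name to the names it calls."""
    tree = ast.parse(source)
    graph = defaultdict(set)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            for call in ast.walk(node):
                if isinstance(call, ast.Call):
                    if isinstance(call.func, ast.Name):
                        graph[node.name].add(call.func.id)      # foo(...)
                    elif isinstance(call.func, ast.Attribute):
                        graph[node.name].add(call.func.attr)    # obj.foo(...)
    return dict(graph)

sample = '''
def get(key):
    return _read(key)

def set(key, value):
    _validate(value)
    _write(key, value)
'''
print(build_call_graph(sample))
# e.g. {'get': {'_read'}, 'set': {'_validate', '_write'}}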

Level 3: Architectural Context (Why things exist)

This captures design intent — why a module exists, what role it plays in the system, what design patterns it implements, what constraints it must satisfy. It answers “what is this component’s job?” and “what are the rules?”

Tools at this level: XCE’s PRAT-powered structured index.

This is the level that changes agent behavior. When an agent knows that CsrfViewMiddleware must run before CacheMiddleware (and why), it doesn't accidentally break that constraint. When it knows that BaseCache defines a contract that all backends must satisfy, it doesn't write a fix that violates that contract.

https://preview.redd.it/6xf1g7t2k90h1.png?width=720&format=png&auto=webp&s=bf6efe957fc9eb347c86c5ffa4d5f9f940d88a5a

Why embeddings fail for this:

Embedding-based code search finds textually similar code. But the questions agents actually need answered are structural:

  • "What depends on this function?" — not a text similarity question
  • "If I change this file, what breaks?" — requires call graph knowledge
  • "What's the inheritance chain?" — structural, not textual
  • "What module owns this logic?" — architectural, not lexical

Two functions can be textually similar but architecturally unrelated. Two functions can be textually different but tightly coupled through a call chain. Embeddings can't distinguish these cases.
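A toy version of the first two questions above, "what depends on this?" and "what breaks if I change it?", expressed as a reverse-dependency walk over caller-to-callee edges. The edge list is assumed to exist already (for example from a call-graph pass like the sketch earlier); this is illustrative, not any particular product's implementation:

# Invert the dependency edges and walk them to find everything that
# transitively depends on a changed symbol.
from collections import defaultdict, deque

def impacted_by(edges, changed):
    reverse = defaultdict(set)
    for caller, callee in edges:
        reverse[callee].add(caller)  # who depends on whom

    impacted, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for dependent in reverse[node]:
            if dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted

edges = [
    ("FileBasedCache.set", "BaseCache.set"),
    ("LocMemCache.set", "BaseCache.set"),
    ("CacheMiddleware", "FileBasedCache.set"),
]
print(impacted_by(edges, "BaseCache.set"))
# e.g. {'FileBasedCache.set', 'LocMemCache.set', 'CacheMiddleware'}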

The compass metaphor:

A compass doesn't tell you the answer. It tells you which direction to look. That's what architectural context does for agents — it doesn't write the fix, but it tells the agent:

  • Which files are relevant (and which aren't)
  • How those files relate to each other
  • What constraints must be preserved
  • What patterns to follow
  • What will break if you get it wrong
The agent still does the work. But it does the right work, in the right place, on the first try.

Real numbers:

We tested this on SWE-bench Verified (500 real bugs from Django, scikit-learn, sympy, matplotlib, pytest):

https://preview.redd.it/klbpkr2mk90h1.png?width=805&format=png&auto=webp&s=bbe7166f5ad2336455749f9ec2581c4326de4e6a

A $0.02/call model with the right context beats a $0.30/call model without it. The improvement scales with complexity:

  • Simple codebases (flat architecture): +8%
  • Medium codebases (some layering): +12%
  • Complex codebases (deep dependencies): +17%

This makes intuitive sense. If your codebase is a 500-line Express app, the agent doesn't need a map. If it's Django with 4,000 files across 50 modules with deep inheritance chains and cross-cutting middleware — the map is everything.

What we built:

We built a context layer that indexes codebases into a structural map (not just embeddings) and serves it via MCP. Any MCP-compatible agent (Claude Code, Cursor, Kiro, OpenCode, Windsurf, Cline) gets architectural context on every tool call without any changes to the agent itself.

npx xanther-cli init --api-key YOUR_KEY

One command indexes your repo. Then add to your agent's MCP config:

{
  "mcpServers": {
    "xanther-xce": {
      "url": "https://mcp.xanther.ai/sse?repo_id=YOUR_REPO_ID",
      "headers": { "Authorization": "Bearer YOUR_KEY" }
    }
  }
}

The agent gets five tools: xce_get_context (full architectural context for a problem), xce_search (semantic search), xce_architecture_context (deep dive on a file/symbol), xce_trace (trace code to architecture), xce_impact_analysis (what breaks if you change files).

The takeaway:

Everyone's focused on making models smarter. That matters. But the bottleneck for coding agents right now isn't model capability — it's context quality. A fast ship without a compass burns fuel going in circles. A slower ship with a compass reaches the destination first.

Context engineering — giving agents the right information at the right time — is the multiplier that makes every model better. And unlike model improvements (which require billions in training), context improvements are cheap and compound with every model upgrade.


Free tier: 3 repos, 100 queries/month. Curious what others think about this approach — is context the bottleneck you're hitting too?

u/Economy_Leopard112 — 4 days ago
▲ 35 r/ContextEngineering+16 crossposts

I've been building this repo in public since day one, roughly 7 weeks now with Claude Code. Here's where it's at. Feels good to be so close.

The short version: AIPass is a local CLI framework where AI agents have persistent identity, memory, and communication. They share the same filesystem, same project, same files - no sandboxes, no isolation. pip install aipass, run two commands, and your agent picks up where it left off tomorrow.

You don't need 11 agents to get value. One agent on one project with persistent memory is already a different experience. Come back the next day, say hi, and it knows what you were working on, what broke, what the plan was. No re-explaining. That alone is worth the install.

What I was actually trying to solve: AI already remembers things now - some setups are good, some are trash. That part's handled. What wasn't handled was me being the coordinator between multiple agents - copying context between tools, keeping track of who's doing what, manually dispatching work. I was the glue holding the workflow together. Most multi-agent frameworks run agents in parallel, but they isolate every agent in its own sandbox. One agent can't see what another just built. That's not a team.

That's a room full of people wearing headphones.

So the core idea: agents get identity files, session history, and collaboration patterns - three JSON files in a .trinity/ directory. Plain text, git diff-able, no database. But the real thing is they share the workspace. One agent sees what another just committed. They message each other through local mailboxes. Work as a team, or alone. Have just one agent helping you on a project, party plan, journal, hobby, school work, dev work - literally anything you can think of. Or go big, 50 agents building a rocketship to Mars lol. Sup Elon.
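For a feel of what plain-text, git-diff-able agent state could look like, here is a purely illustrative sketch; the file contents and field names are invented for illustration and are not the actual AIPass schema:

# Write three small JSON files into .trinity/ - identity, session history,
# and collaboration patterns. Everything below is made-up example data.
import json
import pathlib

trinity = pathlib.Path(".trinity")
trinity.mkdir(exist_ok=True)

state = {
    "identity.json": {"name": "my-agent", "role": "backend dev", "project": "demo"},
    "sessions.json": [{"session": 1, "summary": "set up CI, fixed a flaky test"}],
    "collaboration.json": {"peers": ["docs-agent"], "mailbox": "inbox/"},
}
for filename, content in state.items():
    (trinity / filename).write_text(json.dumps(content, indent=2) + "\n")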

There's a command router (drone) so one command reaches any agent.

pip install aipass

aipass init

aipass init agent my-agent

cd my-agent

claude # codex or gemini too, mostly claude code tested rn

Where it's at now: 11 agents, 4,000+ tests, 400+ PRs (I know), automated quality checks across every branch. Works with Claude Code, Codex, and Gemini CLI. It's on PyPI. Tonight I created a fresh test project, spun up 3 agents, and had them test every service from a real user's perspective - email between agents, plan creation, memory writes, vector search, git commits. Most things just worked. The bugs I found were about the framework not monitoring external projects the same way it monitors itself. Exactly the kind of stuff you only catch by eating your own dogfood.

Recent addition I'm pretty happy with: watchdog. When you dispatch work to an agent, you used to just... hope it finished. Now watchdog monitors the agent's process and wakes you when it's done - whether it succeeded, crashed, or silently exited without finishing. It's the difference between babysitting your agents and actually trusting them to work while you do something else. 5 handlers, 130 tests, replaced a hacky bash one-liner.
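Stripped down to a toy illustration (not the actual AIPass watchdog), the idea is just: poll the dispatched process and report how it exited instead of hoping it finished:

# Dispatch a child process, poll until it exits, then report the outcome.
import subprocess
import sys
import time

proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(2)"])
while proc.poll() is None:   # still running
    time.sleep(0.5)

if proc.returncode == 0:
    print("agent finished cleanly")
else:
    print(f"agent exited abnormally (code {proc.returncode})")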

Coming soon: an onboarding agent that walks new users through setup interactively - system checks, first agent creation, guided tour. It's feature-complete, just in final testing. Also working on automated README updates so agents keep their own docs current without being told.

I'm a solo dev but every PR is human-AI collaboration - the agents help build and maintain themselves. 105 sessions in and the framework is basically its own best test case.

https://github.com/AIOSAI/AIPass

u/Input-X — 11 days ago

Context engineering for AI coding tools across a multi-repo enterprise is a different problem than anyone documents

Most of the context engineering content I find assumes a single repository. Feed the AI your codebase, build a context layer, get better suggestions. Clean and simple. The reality for any non-trivial enterprise is multiple repos, multiple services, internal libraries that live in separate repos, platform code that everything depends on but nobody on any individual team owns, and shared standards documents that apply across all of it.

Context engineering for that environment is genuinely hard and I haven't found good documentation on how teams are actually solving it. The naive approach is to index everything and let the context layer figure it out. The problem is that context from unrelated services generates noise. The backend API team doesn't need suggestions informed by the mobile app codebase. But they do need suggestions informed by the shared internal library that both use.

The questions we're working through: how do you scope context per team without losing cross-cutting signal? How do you handle the internal library layer that needs to be in everyone's context but at different depths? How do you prevent the context layer from becoming a maintenance burden as repos evolve independently?

u/ninjapapi — 2 days ago
▲ 6 r/ContextEngineering+1 crossposts

Auto Graph Color

Anyone spinning up knowledge bases faster than they have time to color them might like this plugin... It's super new, so I'd love some feedback if anyone is interested in the v1.

u/Willing-Topic556 — 3 days ago

Spent the last year building out contextual intelligence infrastructure for our engineering organization. 500 developers, five major product lines, codebases ranging from three years old to fifteen. Sharing what the operational reality looks like because most content on contextual intelligence for developer tools covers the technology rather than the implementation.

The first thing we got wrong was treating contextual intelligence setup as one-time configuration. It isn't. The context layer needs to be maintained the same way your internal docs need to be maintained. When you refactor a core module the context needs to reflect it. When you adopt a new internal library the context needs to know it exists. We now have explicit processes for each of these as part of our engineering workflow.

The second thing we got wrong was assuming all five product lines could share a single context. The codebases are too different in patterns and conventions. We use separate context configurations per product line in Tabnine, which is more operational overhead but produces meaningfully better suggestion quality than a single shared context averaged across all of them.

The metric we track for contextual intelligence quality is convention adherence rate in code review. We spot-check merged PRs weekly for AI-generated code that violated our standards. That rate has come down significantly since we got the maintenance processes right. It's still not zero but low enough that remaining violations are clearly edge cases.
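For clarity, the metric itself is simple: the share of sampled AI-assisted PRs in a week that violated a convention. Something like this (numbers invented):

# Weekly spot-check: fraction of sampled AI-assisted PRs with a violation.
samples = [
    {"pr": 101, "ai_assisted": True, "violation": False},
    {"pr": 102, "ai_assisted": True, "violation": True},
    {"pr": 103, "ai_assisted": True, "violation": False},
]
checked = [s for s in samples if s["ai_assisted"]]
rate = sum(s["violation"] for s in checked) / len(checked)
print(f"convention violation rate this week: {rate:.0%}")  # 33%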

u/ViRzzz — 8 days ago
▲ 3 r/ContextEngineering+1 crossposts

We hit this while building an RFP automation system. The client had hundreds of documents: past RFPs, RFIs, proposal templates, internal reference files spanning years. When we asked for a single source of truth, they admitted they had none. We had a hunch this was going to lead to a funny outcome.

We ingested everything and started taking queries.

First real tests:

- "What's our pricing?" Three different numbers depending on which document you pull.

- "How many employees?" Four different answers.

- "What's our compliance certification status?" One doc says pending. Another says SOC2Type1. The most recent one says HiTrust.

At Cogniswitch we take a neuro-symbolic approach, but the system still generated answers the team was not really stoked about. It was on a feedback call that the client's growth team mentioned the answers were dated. Obviously. The documents carried tons of conflicts and contradictions.

We went back and asked for the source of truth. There wasn't one. These were live internal documents that had accumulated years of drift. Nobody had reconciled them because nobody needed to until an AI had to answer from all of them at once.

We ended up building a conflict detection layer before the answer generation layer. Scan the corpus for conflicting facts - pricing, headcount, certification status - with different stated values across documents. Flag them. A human resolves which is authoritative. Then you can build anything on top of this knowledge foundation.
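A minimal sketch of that check, assuming the per-document facts have already been extracted (the extraction step is the hard part and isn't shown); this is illustrative, not our actual implementation:

# Group extracted facts by key and flag any key whose stated value differs
# across documents; conflicts go to a human for resolution.
from collections import defaultdict

facts = [  # (document, fact_key, stated_value) - values are illustrative
    ("rfp_2021.docx", "headcount", "250"),
    ("rfi_2023.docx", "headcount", "410"),
    ("security_overview.pdf", "compliance", "SOC 2 Type 1"),
    ("security_2024.pdf", "compliance", "HITRUST"),
    ("pricing_sheet.xlsx", "price_per_seat", "$40"),
]

values_by_key = defaultdict(set)
sources_by_key = defaultdict(list)
for doc, key, value in facts:
    values_by_key[key].add(value)
    sources_by_key[key].append((doc, value))

for key, values in values_by_key.items():
    if len(values) > 1:  # conflicting statements across documents
        print(f"CONFLICT on '{key}': {sources_by_key[key]}")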

Lesson learned the hard way - the gap with output-only evals: your benchmark asks whether the AI answered correctly. But if your knowledge base has contradictions, "correct" doesn't have a stable meaning.

There's a clear need for context evals - checking whether your retrieval corpus is internally consistent before you ever run a query - but they're barely a discipline. I don't know of good tooling for it. Most teams discover this problem the same way we did.

Anyone building RAG on messy enterprise document sets running into this?

u/Ok_Gas7672 — 6 days ago