u/Economy_Leopard112

▲ 4 r/ContextEngineering+5 crossposts

Context Engineering Is the Compass Your Coding Agent Needs

Coding agents are powerful ships, but they’re sailing without a map. They can write code, run tests, and iterate — but they don’t know where they are in the codebase. Context engineering is the discipline of giving agents the architectural awareness they need to navigate effectively. Without it, even the best models waste tokens exploring dead ends. With it, a cheap model outperforms an expensive one.

https://medium.com/@xanther.ai/context-engineering-is-the-compass-your-coding-agent-needs-6eef30c66286?postPublishedType=initial

The Navigation Problem

Picture a ship in open water. It has a powerful engine, a skilled crew, and enough fuel to reach any destination. But it has no compass, no charts, and no GPS. What happens?

It explores. It tries directions. It backtracks when it hits land where it expected open water. Eventually, through trial and error, it might reach its destination — but it burns 3x the fuel and takes 5x the time.

This is exactly what happens when you point a coding agent at a large codebase without architectural context.

https://preview.redd.it/nr5idnhzj90h1.png?width=720&format=png&auto=webp&s=90ca6ff90066501de6e3f0c66828309d212b2832

The agent has all the capabilities it needs. It can read files, write code, run tests, search for patterns. But it doesn’t know the architecture. It doesn’t know that django/db/models/sql/compiler.py is the heart of query generation, or that changing BaseCache.set() affects every cache backend downstream. It discovers these things through exploration — expensive, token-heavy, error-prone exploration.

Without context engineering:

Agent: "I need to fix the cache race condition"
→ Searches for "cache" → finds 47 files
→ Reads django/core/cache/__init__.py → not helpful
→ Reads django/core/cache/backends/filebased.py → finds the class
→ Reads django/core/cache/backends/base.py → understands inheritance
→ Searches for "thread" → finds 23 files
→ Reads django/utils/autoreload.py → wrong file
→ Reads django/core/files/locks.py → relevant but doesn't know why yet
→ Eventually pieces together the architecture after 12 file reads
Total: ~4,000 tokens, 45 seconds, 2 wrong attempts

With context engineering:

Agent: "I need to fix the cache race condition"
→ Queries XCE: "FileBasedCache race condition threading"
→ Gets back: inheritance chain, threading concerns, related utilities, test infrastructure
→ Goes directly to the right files with full architectural understanding
Total: ~1,500 tokens, 15 seconds, correct on first attempt

Same agent. Same model. Same capabilities. The only difference is the map.

The Three Levels of Context

Not all context is created equal. There’s a hierarchy:

Level 1: Code Context (What exists)

This is what most tools provide today — file contents, function signatures, grep results. It answers “what code is here?” but not “why?” or “how does it connect?”

Tools at this level: file search, grep, symbol lookup, embeddings-based RAG.

Limitation: Finding a function doesn’t tell you what calls it, what it depends on, or what breaks if you change it.

Level 2: Structural Context (How things connect)

This captures relationships — call graphs, inheritance chains, import dependencies, module boundaries. It answers “what depends on what?” and “what’s the execution flow?”

Tools at this level: static analysis, dependency graphs, call chain extraction.

Limitation: Knowing the call graph doesn’t tell you the design intent or architectural role of each component.
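
To make Level 2 concrete, here's a minimal sketch of how inheritance and call edges can be pulled out of a single Python file with the standard library (Python 3.9+ for ast.unparse). This illustrates the idea only; it is not how XCE builds its index:

import ast
import sys

def structural_map(path: str) -> dict:
    # Sketch only: one file's worth of "Level 2" structure using the stdlib.
    source = open(path, encoding="utf-8").read()
    tree = ast.parse(source, filename=path)
    inherits, calls = {}, {}
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            # class -> declared bases: one link in the inheritance chain
            inherits[node.name] = [ast.unparse(b) for b in node.bases]
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # function -> names it calls: outgoing call edges
            calls[node.name] = sorted({ast.unparse(c.func)
                                       for c in ast.walk(node)
                                       if isinstance(c, ast.Call)})
    return {"inherits": inherits, "calls": calls}

if __name__ == "__main__":
    # e.g. python structural_map.py django/core/cache/backends/filebased.py
    print(structural_map(sys.argv[1]))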

Level 3: Architectural Context (Why things exist)

This captures design intent — why a module exists, what role it plays in the system, what design patterns it implements, what constraints it must satisfy. It answers “what is this component’s job?” and “what are the rules?”

Tools at this level: XCE’s PRAT-powered structured index.

This is the level that changes agent behavior. When an agent knows that CsrfViewMiddleware must run before CacheMiddleware (and why), it doesn't accidentally break that constraint. When it knows that BaseCache defines a contract that all backends must satisfy, it doesn't write a fix that violates that contract.

https://preview.redd.it/6xf1g7t2k90h1.png?width=720&format=png&auto=webp&s=bf6efe957fc9eb347c86c5ffa4d5f9f940d88a5a

Why embeddings fail for this:

Embedding-based code search finds textually similar code. But the questions agents actually need answered are structural:

  • "What depends on this function?" — not a text similarity question
  • "If I change this file, what breaks?" — requires call graph knowledge
  • "What's the inheritance chain?" — structural, not textual
  • "What module owns this logic?" — architectural, not lexical

Two functions can be textually similar but architecturally unrelated. Two functions can be textually different but tightly coupled through a call chain. Embeddings can't distinguish these cases.

The compass metaphor:

A compass doesn't tell you the answer. It tells you which direction to look. That's what architectural context does for agents — it doesn't write the fix, but it tells the agent:

  • Which files are relevant (and which aren't)
  • How those files relate to each other
  • What constraints must be preserved
  • What patterns to follow
  • What will break if you get it wrong

The agent still does the work. But it does the right work, in the right place, on the first try.

Real numbers:

We tested this on SWE-bench Verified (500 real bugs from Django, scikit-learn, sympy, matplotlib, pytest):

https://preview.redd.it/klbpkr2mk90h1.png?width=805&format=png&auto=webp&s=bbe7166f5ad2336455749f9ec2581c4326de4e6a

A $0.02/call model with the right context beats a $0.30/call model without it. The improvement scales with complexity:

  • Simple codebases (flat architecture): +8%
  • Medium codebases (some layering): +12%
  • Complex codebases (deep dependencies): +17%

This makes intuitive sense. If your codebase is a 500-line Express app, the agent doesn't need a map. If it's Django with 4,000 files across 50 modules with deep inheritance chains and cross-cutting middleware — the map is everything.

What we built:

We built a context layer that indexes codebases into a structural map (not just embeddings) and serves it via MCP. Any MCP-compatible agent (Claude Code, Cursor, Kiro, OpenCode, Windsurf, Cline) gets architectural context on every tool call without any changes to the agent itself.

npx xanther-cli init --api-key YOUR_KEY

One command indexes your repo. Then add to your agent's MCP config:

{
  "mcpServers": {
    "xanther-xce": {
      "url": "https://mcp.xanther.ai/sse?repo_id=YOUR_REPO_ID",
      "headers": { "Authorization": "Bearer YOUR_KEY" }
    }
  }
}

The agent gets five tools: xce_get_context (full architectural context for a problem), xce_search (semantic search), xce_architecture_context (deep dive on a file/symbol), xce_trace (trace code to architecture), xce_impact_analysis (what breaks if you change files).
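
Under the hood these are ordinary MCP tool calls. As a sketch, the JSON-RPC request an MCP client sends for the first tool would look roughly like this (the tools/call envelope comes from the MCP spec; the "query" argument name is an assumption, not XCE's documented schema):

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "xce_get_context",
        # argument name assumed for illustration; check the server's tool schema
        "arguments": {"query": "FileBasedCache race condition threading"},
    },
}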

The takeaway:

Everyone's focused on making models smarter. That matters. But the bottleneck for coding agents right now isn't model capability — it's context quality. A fast ship without a compass burns fuel going in circles. A slower ship with a compass reaches the destination first.

Context engineering — giving agents the right information at the right time — is the multiplier that makes every model better. And unlike model improvements (which require billions in training), context improvements are cheap and compound with every model upgrade.

Links:

Free tier: 3 repos, 100 queries/month. Curious what others think about this approach — is context the bottleneck you're hitting too?

reddit.com
u/Economy_Leopard112 — 4 days ago
▲ 3 r/ContextEngineering+4 crossposts

Why your coding agent reads 12 files to fix a bug that needs 3 — and how to fix it

I've been digging into why AI coding agents burn so many tokens on what should be straightforward tasks. Traced through 500 real bugs from SWE-bench Verified (Django, scikit-learn, sympy, matplotlib, pytest) and found a consistent pattern: agents spend 30-40% of their tokens just figuring out where things are.

Here's a concrete example from Django (bug #16379 — FileBasedCache crashes with FileNotFoundError on concurrent access):

https://preview.redd.it/bgr9rdjlb90h1.png?width=1600&format=png&auto=webp&s=81527ffd8ecfe2860268e25afbfea334db835917

What a human developer does:

  1. Reads the issue — recognizes it's a race condition in the file cache backend
  2. Knows from experience that Django cache backends inherit from BaseCache
  3. Opens filebased.py directly, checks _write() for unprotected file ops
  4. Sees locks.py exists for exactly this purpose
  5. Writes fix. 3 files, 5 minutes.

What an AI agent does (without architectural context):

  1. Reads the issue
  2. Searches for "FileNotFoundError" — gets 47 matches across the codebase
  3. Opens storage.py — wrong file, wastes tokens
  4. Opens base.py — wrong file again
  5. Opens locks.py — relevant but doesn't know why yet
  6. Searches for "FileBasedCache" — finds it in filebased.py
  7. Reads the whole file but doesn't understand the BaseCache contract
  8. Writes a fix that catches FileNotFoundError but breaks cache invalidation
  9. Test fails
  10. Now opens base.py to understand the base class
  11. Opens __init__.py to understand the framework
  12. Opens cache.py — not needed but agent doesn't know that
  13. Finally understands the hierarchy, rewrites the fix
  14. Test passes. 12 files read, ~4,000 tokens, 2 attempts, 45 seconds.

What an agent does with structural context (via MCP):

  1. Reads the issue
  2. Calls xce_get_context("FileBasedCache FileNotFoundError concurrent access")
  3. Gets back a structured response:
    1. BaseCache → FileBasedCache inheritance chain
    2. The _write() method uses tempfile.mkstemp() + os.rename() (race window)
    3. locks.py exists for file locking
    4. Test infrastructure at tests.py
  4. Understands the full picture immediately
  5. Writes fix: wraps the file op in try/except and uses the proper locking pattern (the try/except half is sketched after this list)
  6. Test passes first try. 3 files read, ~1,500 tokens, 1 attempt, 15 seconds.
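
For step 5, the try/except half of the fix is easy to sketch. The snippet below is a simplified illustration of the pattern rather than the verbatim Django patch, and the on-disk layout in the comment is an assumption: open the file directly instead of checking for existence first, and treat FileNotFoundError as a cache miss.

import pickle
import time

def has_key(fname: str) -> bool:
    # Race-safe existence check for a file-based cache entry (sketch only).
    # Opening directly, rather than calling os.path.exists() and then open(),
    # removes the window in which another thread or process can delete the
    # file, which is exactly the FileNotFoundError race in the issue.
    try:
        with open(fname, "rb") as f:
            expiry = pickle.load(f)  # assumed layout: expiry timestamp stored first
            return expiry is None or expiry > time.time()
    except FileNotFoundError:
        return False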

https://preview.redd.it/u0std3kfb90h1.png?width=1600&format=png&auto=webp&s=7b4996ebf5788e43bd0a8f65a52237b71c54307c

Why this happens:

The agent doesn't have a map. It doesn't know that FileBasedCache inherits from BaseCache. It doesn't know that locks.py is the utility for exactly this problem. It doesn't know that changing cache behavior requires respecting the BaseCache contract. So it explores. And most of that exploration is wasted.

The numbers across 500 instances:

Metric                   Without context   With context   Improvement
Avg files explored       12.3              4.1            ~67% fewer
Avg tokens per task      4,200             1,800          ~57% reduction
First-attempt success    35%               72%            +37 points
Avg time to solution     45s               18s            ~60% faster

The improvement scales with codebase complexity:

  • pytest (flat architecture): +8% resolve rate
  • django (layered MVC + ORM): +12%
  • scikit-learn (deep inheritance): +13%
  • sympy (cross-module dependencies): +17%

Makes sense — if your codebase is a simple Express app, the agent can figure it out. If it's a 300K-line framework with 50 modules and deep inheritance chains, it needs the map.

Why embeddings aren't enough:

Most code search tools use embedding similarity — convert code to vectors, find similar vectors. This works for "find the login function" but fails for architectural questions:

  • "What depends on this function?" — embeddings can't answer this
  • "If I change this file, what breaks?" — requires structural knowledge
  • "What's the inheritance chain?" — text similarity doesn't capture this

You need structural relationships, not text similarity.
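
As a toy illustration of the gap, a structural index can answer "what calls this function?" directly, which no amount of text similarity provides. A minimal sketch over a single package, stdlib only and not the XCE implementation:

import ast
from collections import defaultdict
from pathlib import Path

def reverse_call_graph(package_dir: str) -> dict:
    # Map callee name -> set of "file:function" callers: the reverse
    # dependency edges behind "if I change this, what breaks?"
    callers = defaultdict(set)
    for path in Path(package_dir).rglob("*.py"):
        tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        for func in ast.walk(tree):
            if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
                continue
            for call in ast.walk(func):
                if isinstance(call, ast.Call):
                    # keep only the trailing attribute, so self.cache.set -> "set"
                    callee = ast.unparse(call.func).split(".")[-1]
                    callers[callee].add(f"{path.name}:{func.name}")
    return dict(callers)

# e.g. reverse_call_graph("django/core/cache").get("set")
# -> every function in the package that calls .set()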

What we built:

An MCP server that indexes codebases into a structural map and serves architectural context to any compatible agent. One command to set up:

npx xanther-cli init --api-key YOUR_KEY

Then add to your agent's MCP config:

{
  "mcpServers": {
    "xanther-xce": {
      "url": "https://mcp.xanther.ai/sse?repo_id=YOUR_REPO_ID",
      "headers": { "Authorization": "Bearer YOUR_KEY" }
    }
  }
}

Works with Claude Code, Cursor, Kiro, OpenCode, Windsurf, Cline — anything that supports MCP.

SWE-bench results:

  • MiniMax M2.5 + XCE: 78.2% (would be #1 on the official leaderboard, costs $0.22/instance)
  • Claude 4.5 Opus without context: 76.8% (costs $0.75/instance)
  • Sonnet 4.0 + XCE: 73.4% (vs 66% baseline — a 7.4 point jump)

A cheap model with the right context beats an expensive model without it.

Links:

Free tier: 3 repos, 100 queries/month. No credit card.

Happy to answer questions about the methodology or the architecture.

reddit.com
u/Economy_Leopard112 — 3 days ago
▲ 1 r/kiroIDE+1 crossposts

Why your coding agent reads 12 files to fix a bug that needs 3 — and how to fix it

Traced through a real Django bug (FileBasedCache race condition) to see what happens:

Without architectural context:

  • Agent searches for "FileBasedCache" — finds it
  • Reads 8 irrelevant files trying to understand the cache framework
  • Writes a fix that breaks the cache contract
  • Test fails
  • Reads 4 more files, finally understands the hierarchy
  • Rewrites fix. 12 files, 4,000 tokens, 2 attempts.

With architectural context (via MCP):

  • Agent queries for context on "FileBasedCache concurrent access"
  • Gets back: inheritance chain, locking utilities, race condition patterns
  • Writes correct fix immediately

https://preview.redd.it/upq8gd0s460h1.png?width=1600&format=png&auto=webp&s=34ce0fd7554d05be8fd931c0bc56bf5038829e42

3 files, 1,500 tokens, 1 attempt.

This pattern repeats across 500 SWE-bench instances. Agents waste 30-40% of tokens on exploration because they don't have a map of the codebase.

We built a tool that indexes your codebase and serves structural context to any MCP-compatible agent (OpenCode, Claude Code, Cursor, Kiro, Windsurf). Free tier: 3 repos, 100 queries/month.

Full analysis: https://medium.com/@xanther.ai/why-ai-coding-agents-waste-30-of-their-tokens-and-how-to-fix-it-4560ffd2cbb9

Try it: https://xanther.ai

npm: https://www.npmjs.com/package/xanther-cli

Discord: https://discord.gg/YaBekKpR

reddit.com
u/Economy_Leopard112 — 4 days ago

https://reddit.com/link/1t2asy0/video/hred5psiiuyg1/player

I built a context engine that indexes your codebase and serves it to your coding agent via MCP. The agent understands the architecture before making changes instead of exploring blindly.

On benchmarks it takes Sonnet 4.0 from 66% to 73.4% on SWE-bench. Biggest help on complex repos (Django +12%, sympy +17%).

Most AI coding agents struggle when they hit 10k+ line repositories because of context loss. I’ve been benchmarking Xanther.ai using a proprietary PRAT protocol designed to handle systemic validation rather than just code completion.

Key Results:

  • Context Handling: Zero-shot success on multi-file PRs in complex repos.
  • Orchestration: Integrated with MCP for real-time tool use.
  • Quality: Focused on deterministic, enterprise-grade output that passes CI/CD on the first run.

Curious to hear what you guys think about the transition from "chat-with-code" to fully autonomous agents.

Results on SWE-bench Verified (500 real bugs)

MiniMax M2.5 + Xanther: 78.2% ($0.22/instance)

Sonnet 4.0 + Xanther: 73.4% (baseline was 66%)

Claude Opus without it: 76.8% ($0.75/instance)

Biggest gains on complex repos — sympy +17%, scikit-learn +13%, django +12%.

Looking for people to try it on real projects. Free tier, 60-second setup.
Works with Claude Code, Cursor, Kiro, Windsurf — anything that supports MCP.

https://xanther.ai

Discord: https://discord.gg/Y768kBRS

https://medium.com/@xanther.ai/how-a-0-02-call-model-scored-78-2-on-swe-bench-verified-beating-every-model-on-the-leaderboard-153be05a60f1

https://preview.redd.it/wwwj39jciuyg1.png?width=1080&format=png&auto=webp&s=c9188d2d170ec47fd07b5d67696bea2205392ba6

