▲ 11 r/agency

Any agency owners thinking of building, or who have already built, a fully agentic product or agency?

I’m thinking of going this route but want to see if there are any success case studies. A few customers have been asking for our services to be faster and cheaper, so this may be a fit for them.

reddit.com
u/maid113 — 4 days ago
▲ 4 r/ClaudeCode+1 crossposts

I benchmarked Opus 4.6 vs 4.7 on organizational memory retrieval. 4.6 wins, and the failure modes are fascinating.

I've built an organizational memory system that I’m using across multiple customers and on my own systems. It's a temporal knowledge graph that gives agents persistent memory across sessions and tracks decisions, contradictions, and causal chains across work streams.

To validate it properly, I built the OMB (Organizational Memory Benchmark). It simulates a fictional company, "Meridian Systems": a 148-person B2B SaaS business with $16.1M ARR and 15 key personas across 6 departments (Executive, Engineering, Product, Sales, Customer Success, Finance). Data is generated in 3 rounds per month: department-internal artifacts (Slack, email, decisions, tickets, code commits), cross-department interactions (escalations, meetings, customer emails), then chaos injection (angry customers, board interference, the intern breaking things, regulatory surprises).
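The generation loop is straightforward to picture; here's an illustrative skeleton (round names, departments, and artifact kinds are from the description above, everything else is made up for the sketch):

```python
import random

DEPARTMENTS = ["Executive", "Engineering", "Product",
               "Sales", "Customer Success", "Finance"]
ROUNDS = ["department_internal", "cross_department", "chaos_injection"]

def generate_month(month: int, rng: random.Random) -> list[dict]:
    """One simulated month: 3 rounds, each touching every department."""
    artifacts = []
    for round_name in ROUNDS:
        for dept in DEPARTMENTS:
            artifacts.append({
                "month": month,
                "round": round_name,
                "department": dept,
                "kind": rng.choice(["slack", "email", "decision",
                                    "ticket", "commit"]),
            })
    return artifacts
```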

The timeline includes things like a VP Sales promising a custom integration the CTO says conflicts with the migration architecture, a CFO approving a $200K budget then freezing it 4 months later, a junior engineer quietly violating an architectural RFC that doesn't get caught for 2 months, and an acquisition offer only 3 people know about. Plus an AWS outage, actual code with bugs for different products, etc.

We planted specific ground truth: contradictions, multi-hop decision chains, and knowledge gaps (things that are NOT formally tracked anywhere, where the correct answer is "nowhere"). These are the hard questions.
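Concretely, a planted question looks something like this (field names are my illustration, not the benchmark's actual schema); the knowledge-gap questions are the ones whose only correct answer is "nowhere":

```python
# Illustrative ground-truth entries, keyed by category.
questions = [
    {
        "category": "contradiction",
        "question": "Did Sales and Engineering agree on the custom integration?",
        "expected": "VP Sales promised it; CTO said it conflicts "
                    "with the migration architecture",
    },
    {
        "category": "knowledge_gap",
        "question": "Where is RFC compliance formally tracked?",
        "expected": "nowhere",  # nothing is formally tracked; saying so IS the answer
    },
]

def is_gap_question(q: dict) -> bool:
    return q["expected"] == "nowhere"
```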

Yesterday Opus 4.7 dropped, so I ran both models on 84 hard questions from the many we have, just to get a quick gauge. Same retrieval prompt, same knowledge graph, same questions, same hooks, same MCP server. Apples to apples.
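The harness shape for an apples-to-apples run is simple: only the model id varies. A hedged sketch (`ask_model` and `grade` stand in for whatever client call and grader you use; both are assumptions here, not my actual code):

```python
import time

def run_benchmark(model_id, questions, ask_model, grade):
    """Run one model over the fixed question set; everything except
    model_id is shared across runs."""
    correct, latencies = 0, []
    for q in questions:
        start = time.perf_counter()
        answer = ask_model(model_id, q["question"])
        latencies.append(time.perf_counter() - start)
        if grade(answer, q["expected"]):
            correct += 1
    return {
        "model": model_id,
        "accuracy": correct / len(questions),
        "avg_latency_s": sum(latencies) / len(latencies),
    }
```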

Results:

Opus 4.6: 81.3%

Opus 4.7: 75.8%

Per-category breakdown across all 6 categories:

Where 4.6 wins:

• Contradiction detection (spotting conflicting statements across departments): 4.6: 88.1% vs 4.7: 69.0%

• Multi-hop reasoning (following causal chains across months): 4.6: 81.0% vs 4.7: 61.9%

• Cross-department tracing (following info flow across teams): 4.6: 100% vs 4.7: 90.5%

• Decision tracing (explaining WHY something was decided): tied at 90.5%

Where 4.7 wins:

• Temporal ordering (sequencing events correctly): 4.7: 81.0% vs 4.6: 64.3%

Knowledge gap detection was close (4.7: 61.9% vs 4.6: 64.3%) but 4.7's conciseness actually helped it avoid fabricating documents on 2 questions where 4.6 over-elaborated and invented tracking that doesn't exist.

The speed difference is massive. 4.6 averages 61.5 seconds per question, 4.7 averages 20.5 seconds. 4.6 produces answers that are about 2.5x longer. It does more searches, follows more branches, names more specific dates and people. That thoroughness is why it wins on the hard categories, but it's also why it's 3x slower and more expensive.

The most interesting failure mode: when the correct answer is "no formal documentation exists," both models confidently invent documents. 4.7 does it more aggressively, fabricating specific page counts, exact dates, and author names. Two questions have a 0% success rate across ALL 7 configurations we tested (including Sonnet, both Opus models with and without prompt variants). Every model invents a document that doesn't exist. The training-reward bias is real. Models trained to always provide answers will make things up rather than say "I don't know."
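Grading these gap questions is mostly about catching fabrication markers. A heuristic sketch of how a grader might score them (the patterns are illustrative assumptions, not the benchmark's grader): an answer that admits nothing is tracked scores correct; one that names a specific document, page count, or author scores as fabricated.

```python
import re

FABRICATION_PATTERNS = [
    r"\b\d+\s*pages?\b",                  # invented page counts
    r"\bauthored by\b",                   # invented authors
    r"\b(doc|document|spec)\s*#?\d+\b",   # invented document ids
]
ADMISSION_PATTERNS = [
    r"no formal", r"not documented", r"nowhere", r"does not exist",
]

def grade_gap_answer(answer: str) -> str:
    """Score an answer to a question whose correct answer is
    'no formal documentation exists'."""
    text = answer.lower()
    if any(re.search(p, text) for p in ADMISSION_PATTERNS):
        return "correct"
    if any(re.search(p, text) for p in FABRICATION_PATTERNS):
        return "fabricated"
    return "unclear"
```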

What's wild is that Anthropic specifically claims 4.7 has "10%+ improvement in recall" and is "better at file-system-based memory." On their benchmarks (coding tasks, academic exams), 4.7 wins 12 of 14 categories. But organizational memory retrieval is a different workload. Tracing a decision chain from a sales promise in February through an architecture conflict in June to a VP threatening to leave requires depth, not speed.

Anyone else benchmarking models on their actual workload vs the published benchmarks? Curious what others are finding.

reddit.com
u/maid113 — 6 days ago
▲ 11 r/agency

Built this free tool to extract knowledge out of business owners’ heads and build their knowledge base and workflow maps. It also shows where AI fits

Hey everyone, a lot of business owners struggle to document everything going on in their business. I built an AI interviewer that maps your business for you in real time through an interview: it asks you questions and also shows you where AI fits.

At the end you get a portal with your entire business documented.

It’s free. Test it out, I’d love your feedback.

centralwize.com
u/maid113 — 7 days ago