u/FoxFire17739

How to properly benchmark a context/memory solution

I want to benchmark my own memory tool. So far I have done a bunch of runs in Codex headless mode using --json.

https://developers.openai.com/codex/noninteractive

You can fire a prompt and everything is recorded end-to-end: how many tool calls were made, what was called, the inputs and outputs, how long the prompt took, and how many tokens were consumed.
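A minimal sketch of how such a JSONL run log could be summarized. The event shapes here (`"tool_call"`, `"token_count"`, `total_tokens`) are illustrative assumptions, not the documented Codex CLI schema:

```python
import json

def summarize(jsonl_text):
    """Tally tool calls and token usage from a JSONL event stream.

    The event types below ("tool_call", "token_count") are assumed
    placeholders, not the actual Codex output format.
    """
    tool_calls = 0
    tokens = 0
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("type") == "tool_call":
            tool_calls += 1
        elif event.get("type") == "token_count":
            tokens += event.get("total_tokens", 0)
    return {"tool_calls": tool_calls, "tokens": tokens}

# Example with two fake events:
sample = "\n".join([
    '{"type": "tool_call", "name": "shell", "args": {"cmd": "ls"}}',
    '{"type": "token_count", "total_tokens": 1234}',
])
print(summarize(sample))  # {'tool_calls': 1, 'tokens': 1234}
```

Summing per run like this makes it easy to compare vanilla vs. memory-layer runs on the same prompt.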

For small codebases under 100 files I know my tool loses against vanilla on speed, and the answers were of the same quality.

But when I ran it on a 350-file codebase, Codex using my memory layer outperformed vanilla in both speed and response quality. The prompt was about discovery and figuring out the architecture.

What I expected was only that the answers would be better. I assumed there would always be a performance tax, because my system banks on sidecar files: every code file has its own sidecar that you can find at the same path, just in a parallel folder.
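The path mapping described above can be sketched like this (the parallel folder name `.memory` and the `.md` suffix are illustrative assumptions, not the tool's actual layout):

```python
from pathlib import PurePosixPath

def sidecar_path(code_file, memory_root=".memory"):
    """Map a code file to its sidecar in a parallel folder.

    Same relative path, different root, plus a markdown suffix.
    Folder name and suffix are placeholder choices.
    """
    p = PurePosixPath(code_file)
    return str(PurePosixPath(memory_root) / p.parent / (p.name + ".md"))

print(sidecar_path("src/auth/session.py"))
# .memory/src/auth/session.py.md
```

Because the mapping is purely path-based, the agent can derive the sidecar location without any search step.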

What was funky was the README.md. In the 350-file case the file was mostly correct and should have been a bigger help for the Codex run that couldn't rely on the memory layer. But at several points it still jumped to the wrong conclusions about my code and claimed that an old code path is the mature, current one. That was really weird. I took the README.md out, and of course: same issue.

And no matter how often I ran it, vanilla would stubbornly take the wrong path and insist the outdated one is correct. Codex using my memory knew the correct path every single time. When it gets to the old code parts it "finds" a note right beside them saying that this code is a dead end. The README.md might already be deeply buried in the context at that point, so it doesn't matter much, and I feel that note is what keeps the agent reliable. So that part I know for sure.

But I don't know if I can trust the "performance" numbers. Sure, the Codex tool measures deterministically, and the run with my tool was faster on the analysis prompt; I could tell that even without the tool. But that doesn't mean I can draw the right conclusions from it. I have a hint, nothing more.

**So if you were in my shoes what would you test next and what tools would you use?**

I am certainly going to try a larger codebase from GitHub and use older tickets that have been solved recently. And I will publish the run artifacts and the generated memory artifacts in a separate GitHub repo, so everyone can just download the memory and test it against that code repo themselves without building one from scratch. I think that would make the whole thing repeatable for everyone.

But other than that I am open to suggestions regarding methodology.

For anyone interested, you can check my repo here. It is still in alpha, and there is one major open issue: I want to make the coordination folder the only runtime artifact. But that is an ergonomics thing; the memory system is fully operational.

https://github.com/Foxfire1st/agents-remember-md

u/FoxFire17739 — 15 hours ago

I built a self-documenting system that turns that knowledge into team infrastructure.

The problem with AI coding agents is not that they need more intelligence.

It’s that they work like a new teammate who never got onboarded.

The real project knowledge is rarely in one clean place. Often it lives in the heads of two guys who are the company's last line of defense against a bad decision that is about to ruin everyone's day.

So I tried to turn that knowledge into infrastructure.

Something that doesn't get stale. That documents while you focus on your tasks. That you can version and branch, and even run in a different repo than your code, while it still stays true to the code. And most importantly, something that is open to read, review, and learn from, not some black box that is clankers-only.

I pinned the memory repo in the README.md so you can see for yourself whether it would add value to your system.

u/FoxFire17739 — 5 days ago

I am a software developer of 9 years and I wanted a system that lets agents remember more reliably the quirks of a code file and how it relates to other parts of the application, even if the code itself doesn't directly connect there.

That's why I built this: Github-Repo

And there are so many little things that never fit into a single AGENTS.md file, and you don't want to stuff them all in there. Instead you want agents to get only the little piece that is relevant when they open a code file.

That is why I made this path-based, so finding the card is brain-dead easy for the agent. Each card also tracks the code file's git commit hash, so it's very easy for the agent to detect staleness. I have a dedicated skill that, when fired, checks all onboardings for staleness in less than a second, creates a report, and gets any drifted file updated before the agent ends up reading outdated stuff.
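The staleness check above could look roughly like this. This is a sketch under the assumption that each card records the hash of the last commit that touched its code file; the card format and the `git log` lookup are my guesses at the mechanism, not the repo's actual implementation:

```python
import subprocess

def last_commit(path):
    """Hash of the last commit that touched `path` (needs a git checkout)."""
    out = subprocess.run(
        ["git", "log", "-1", "--format=%H", "--", path],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

def find_stale(cards, current=last_commit):
    """Return code files whose recorded hash no longer matches the repo's.

    `cards` maps code file path -> commit hash stored in its sidecar card.
    `current` is injectable so the check can be tested without a repo.
    """
    return [path for path, recorded in cards.items() if current(path) != recorded]

# Demo with a fake lookup instead of a real repo:
fake_head = {"src/a.py": "abc123", "src/b.py": "def456"}
cards = {"src/a.py": "abc123", "src/b.py": "000000"}  # b.py has drifted
print(find_stale(cards, current=fake_head.get))  # ['src/b.py']
```

Comparing stored hashes against `git log` output is cheap, which is consistent with scanning all cards in under a second.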

Once this is set up, all of it works on the side without you doing anything other than working on your tasks. What you discuss with the agent in chat, what you explain and clarify, now ends up in these files instead of getting wasted. You explain once and never again.

u/FoxFire17739 — 13 days ago

Modern coding agents look superhuman one moment, then hit you with a divine stroke of idiocy the next.

On a small task, an AGENTS.md file, a few prompt rules, and a strong model can feel almost magical. That creates the illusion that the agent already “knows the codebase.” In larger systems, that illusion breaks. The agent does not actually know your architecture, your hidden invariants, your migration scars, or the strange rules everyone on the team has learned the hard way. It only knows what the repository makes legible.

That is why the failures are so weird. The output looks plausible. The edit is clean. The regression is real.

A single top-level instruction file can point the agent in the right direction, but it cannot reappear exactly when the agent needs it. Once the agent is deep in a file, the relevant context is no longer naturally in front of it. Recovering it becomes an explicit search problem: expensive, uncertain, and easy to skip.

It is like handing someone a city map at the train station and taking it away before they start walking. The problem is not that they never saw the map; it is that the map is gone when they need it at the next turn.

That's where I started, with a simple premise: important project knowledge should not have to be hunted down. If it is not local, structured, and discoverable, then for the agent it effectively does not exist.

So the way forward is to make that missing context visible before the agent has to guess.

What do you think about this problem?

If you want to see my approach to the problem you can do so here.

https://github.com/Foxfire1st/agents-remember-md

u/FoxFire17739 — 14 days ago

I made this system so my agents don't trip when they work on my multi-repo workspace. And I want them to figure out on their own how certain things are connected, instead of me repeating that stuff or stuffing it all into a single markdown file.

That's where my onboarding files come in. Now whenever my agents open a code file, a companion file pops up at its side, telling them about all the bits they wouldn't get just from reading the code. It works. At least better than "Do no mistakes!".

Let me know what you think.

https://github.com/Foxfire1st/agents-remember-md

u/FoxFire17739 — 15 days ago