u/42cyy

We're building a personal intelligence OS where memory is the foundation but the product is the experience layer on top. We're not in the same category as mem0, supermemory, or openmemory, which are all building memory infrastructure for developers and doing genuinely great work in that space.

We run internal evals constantly to prevent regressions as we iterate (V0 to V1), test different model and architecture choices, and catch edge cases. But we haven't run public benchmarks like LongMemEval yet. The honest reason: we're a small team and the plan was to run public benchmarks closer to V1 when the architecture was more stable.

An investor recently asked for head-to-head LongMemEval results against mem0, supermemory, and openmemory before moving forward. Fair ask. We're going to do it. But it raised some questions I'd love this community's input on:

  1. How are people approaching public evals while still in active development? Running them on a moving target seems wasteful, but waiting until "ready" can mean never running them.

  2. Cost-effective approaches? I'm planning to run our system on LongMemEval_S using the same methodology mem0 and supermemory used for their published numbers, then compare directly against those published results rather than re-running all four systems myself. Anyone done this and hit issues?

  3. Manipulability of benchmarks. Everyone in this space knows these can be gamed: prompt tuning, judge model selection, ingestion granularity, dataset curation. How seriously should anyone (us, investors, users) actually take a single benchmark number? What would a more honest and useful eval framework look like?

  4. For builders not in the memory infra category, how do you communicate that you're using memory as a foundation rather than competing on memory infrastructure benchmarks? The category distinction matters but technical reviewers default to "show me the numbers."

  5. Subset vs full runs. Has anyone published or seen credible results from running 50-100 questions instead of the full 500 to validate the harness first? Does the community treat partial runs as legitimate or dismiss them?
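One thing I've been chewing on for question 2: even if the methodology matches exactly, a gap against a published number can be pure sampling noise. A quick two-proportion z-test shows how big a gap has to be before it means anything. All numbers below are made up for illustration; they are not real scores for any of these systems.

```python
import math

def two_prop_z(c1, n1, c2, n2):
    """Two-proportion z-test: is the gap between two accuracies
    larger than sampling noise alone would explain? |z| > 1.96
    is roughly significant at the 95% level."""
    p1, p2 = c1 / n1, c2 / n2
    p = (c1 + c2) / (n1 + n2)  # pooled proportion under the null
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical: we score 72% on 500 LongMemEval_S questions; a
# published baseline reports 68% on the same 500 questions.
z = two_prop_z(360, 500, 340, 500)
print(f"z = {z:.2f}")  # ~1.38, i.e. a 4-point gap is within noise
```

On 500 questions, a 4-point difference doesn't clear the significance bar, which is sobering given how much marketing weight single-point differences get.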
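And to partially answer my own question 5 with arithmetic: the honest way to present a partial run is with a confidence interval, which makes the cost of the smaller sample explicit. A sketch using the Wilson score interval (accuracies here are invented for illustration):

```python
import math

def wilson_ci(correct, total, z=1.96):
    """Wilson score 95% confidence interval for an accuracy estimate."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return center - half, center + half

# Same observed 70% accuracy on a 100-question subset vs. a full
# 500-question LongMemEval_S run.
lo100, hi100 = wilson_ci(70, 100)
lo500, hi500 = wilson_ci(350, 500)
print(f"n=100: {lo100:.3f}-{hi100:.3f}")  # ~0.604-0.781
print(f"n=500: {lo500:.3f}-{hi500:.3f}")  # ~0.658-0.739
```

An 18-point-wide interval on 100 questions is fine for validating the harness, but it can't separate systems that are within a few points of each other, which is exactly the head-to-head situation investors ask about.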

Not asking anyone to do our homework. Just want to learn from people who've navigated this. Happy to share back what we learn from running the evals.

Thanks.
