u/alameenswe

The reason some AI assistants feel smart and others feel dumb has nothing to do with the model

There's a framing that dominates almost every AI evaluation I've seen: which model is powering it?

GPT-5? Claude? Gemini? The implicit assumption is that smarter model = better product.

I think this is mostly wrong, and it's leading teams to optimize the wrong thing.

The frontier models available today are, for most practical purposes, comparable. They're all extraordinarily capable. The variance in user experience between products isn't primarily driven by which model sits underneath.

What actually determines whether an AI assistant feels intelligent — whether it gets better over time, personalizes meaningfully, earns user trust — is whether it has memory.

Not in a vague sense. Concretely: does the agent retain structured context across sessions? Does it remember your preferences without being reminded every time? Can it reference what you discussed three weeks ago?

An agent with no memory treats every user as a stranger on every visit. The best model in the world, configured this way, will feel worse than a less capable model that actually knows who you're talking to.

Three things worth building memory around:

  1. Preferences and style — how the user likes to communicate, what format they want, what to avoid
  2. History and context — what they've worked on, what's been decided, what's been tried
  3. Goals and constraints — what they're actually trying to accomplish and what limits them
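As a concrete sketch, the three categories map to a simple structured record per user. Field names here are illustrative, not a prescription for any particular store:

```typescript
// Illustrative shape for a per-user memory record covering the
// three categories above (preferences, history, goals).
interface UserMemory {
  preferences: {
    tone?: string;          // e.g. "terse, no filler"
    format?: string;        // e.g. "bullet points over prose"
    avoid?: string[];       // topics or patterns to steer clear of
  };
  history: {
    summary: string;        // what was worked on, decided, or tried
    updatedAt: string;      // ISO timestamp of last consolidation
  }[];
  goals: {
    objective: string;      // what the user is actually trying to accomplish
    constraints: string[];  // limits: budget, deadline, stack, policy
  }[];
}
```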

When all three are present, "which model are you using?" becomes a secondary question.

Curious if others have noticed this in practice — whether the memory architecture of a tool has meaningfully affected your experience with it more than the underlying model.

u/alameenswe — 18 hours ago

Why we stopped using vector-only retrieval for agent memory (and what we use instead)

When we first built persistent memory into our agent pipeline, we went with vector search: pgvector, cosine similarity, retrieve top-k on each turn. Standard setup, works well, easy to reason about.

It held up fine during development. Started failing in predictable ways in production.

The failure modes we hit:

Exact keyword recall. User asks "what API key prefix did I set for staging?" The stored memory has sk-stg-0041 in it. Vector search on "API key prefix staging" will sometimes surface this — but as the memory store grows and you have dozens of API-related entries, the similarity scores cluster too tightly for reliable ranking. The specific identifier isn't semantically encoded in the embedding. BM25 finds it trivially.

Rare proper nouns. Any specific framework name, company name, or custom identifier that the embedding model hasn't seen enough of doesn't cluster cleanly. Vector search on "Graphiti" doesn't reliably retrieve memories containing the word "Graphiti" unless it happens to sit near semantically similar tokens. BM25 handles this trivially: it's an exact term match against the inverted index.

Density at scale. Vector search degrades as the store grows. More memories = more neighbors = noisier retrieval. You can add metadata filtering (by user, recency, topic) but it's a mitigation, not a fix. The precision tail keeps getting worse.

The fix: hybrid retrieval with RRF

We now run vector search and BM25 (via PostgreSQL tsvector) in parallel and merge using Reciprocal Rank Fusion.

```typescript
const [vectorResults, bm25Results] = await Promise.all([
  vectorSearch(query, userId),
  keywordSearch(query, userId)
]);
return reciprocalRankFusion(vectorResults, bm25Results);
```

RRF formula: score(d) = Σᵢ 1 / (k + rankᵢ(d)), where rankᵢ(d) is d's 1-based rank in result list i and k = 60. Results appearing in both lists get boosted; results ranking high in one list but absent from the other still surface.
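A minimal sketch of the fusion step, assuming each input list is ordered best-first and each result carries a stable `id` (names are illustrative, not our actual code):

```typescript
type Ranked = { id: string; score?: number };

// Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank),
// where rank is d's 1-based position in that list. k = 60 is the
// conventional default; it damps the influence of any single list.
function reciprocalRankFusion(...lists: Ranked[][]): Ranked[] {
  const k = 60;
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((item, i) => {
      const rank = i + 1;
      scores.set(item.id, (scores.get(item.id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id, score]) => ({ id, score }));
}
```

Because ranks, not raw scores, feed the formula, there's no need to normalize cosine similarity against BM25 scores, which live on incompatible scales.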

The tsvector column is kept updated via a PostgreSQL trigger so there's no separate indexing pipeline.
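A minimal version of that trigger, assuming a `memories` table with a `content` text column and a `tsv` tsvector column (names are illustrative); `tsvector_update_trigger` is a built-in PostgreSQL helper:

```sql
-- Keep tsv in sync with content on every write; no external indexing job.
CREATE TRIGGER memories_tsv_update
  BEFORE INSERT OR UPDATE ON memories
  FOR EACH ROW
  EXECUTE FUNCTION tsvector_update_trigger(tsv, 'pg_catalog.english', content);

-- GIN index so ranked keyword queries stay fast as the store grows.
CREATE INDEX memories_tsv_idx ON memories USING gin (tsv);
```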

Running both queries concurrently means the latency hit is ~max(vector_latency, bm25_latency), not the sum. In practice, both run fast enough that the retrieval step stays well under 100ms at p95.

For higher-stakes retrieval (e.g. customer support where a wrong recall causes a real problem), we add a cross-encoder reranker over the top 20 candidates. Adds 30–80ms but meaningfully improves precision on single-hop factual queries.
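A sketch of that reranking stage. The scorer below is a hypothetical stand-in: a real cross-encoder scores each (query, document) pair jointly with a model, which is exactly why it's too slow to run over the whole store but cheap over 20 candidates:

```typescript
type Candidate = { id: string; text: string };

// Hypothetical stand-in for a cross-encoder; real implementations call a
// reranking model, not word overlap. Kept trivial so the shape is clear.
async function crossEncoderScore(query: string, text: string): Promise<number> {
  const q = new Set(query.toLowerCase().split(/\s+/));
  return text.toLowerCase().split(/\s+/).filter(w => q.has(w)).length;
}

// Rerank only the top-N fused candidates by pairwise relevance score.
async function rerank(
  query: string,
  candidates: Candidate[],
  topN = 20
): Promise<Candidate[]> {
  const head = candidates.slice(0, topN);
  const scored = await Promise.all(
    head.map(async c => ({ c, s: await crossEncoderScore(query, c.text) }))
  );
  return scored.sort((a, b) => b.s - a.s).map(x => x.c);
}
```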

Anyone else gone down this path? Curious what retrieval setups people are running at scale.

u/alameenswe — 18 hours ago