u/MeetVege

Shared RAG index with metadata filters started cracking around 30 tenants

We've been doing customer-facing RAG for about a year. Each customer uploads their own docs, and they only see results from their own corpus.

Started in a single Pinecone index with namespaces per tenant. Worked fine through the first 10 or so customers, then namespace count itself became an ops headache, so we flipped to a single namespace and a tenant_id metadata filter on every query. That carried us to maybe customer 18. Then a few things started getting weird.

Recall got noticeably worse for tenants with smaller corpora. I don't have a great theory for why, but my hunch is that hybrid scoring inside a giant shared index starts being dominated by the term distribution of larger tenants. If 80% of your docs are from three big customers, and a fourth customer searches a term that's common in their own docs but rare in the shared corpus, BM25 weights end up looking strange. The vector side was less obviously broken. With top-K retrieval and a metadata filter, small-corpus tenants were sometimes getting fewer than K candidates back at all, which then fed a reranker that didn't have enough to work with.
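One way the BM25 hunch could play out, as a minimal sketch. It assumes the sparse side computes IDF over the whole shared corpus rather than per tenant, which depends on how the encoder was fitted. The document counts are made up for illustration, and `bm25_idf` is the common Lucene-style formula, not anything specific to Pinecone.

```python
import math


def bm25_idf(doc_freq: int, num_docs: int) -> float:
    """Lucene-style BM25 IDF: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical small tenant: 1,000 docs, 400 of them contain the term.
tenant_docs, tenant_df = 1_000, 400

# Hypothetical shared index: 100,000 docs, and the term barely appears
# outside the small tenant's docs.
shared_docs, shared_df = 100_000, 410

idf_tenant = bm25_idf(tenant_df, tenant_docs)  # roughly 0.92
idf_shared = bm25_idf(shared_df, shared_docs)  # roughly 5.50

print(f"IDF inside the tenant's own corpus: {idf_tenant:.2f}")
print(f"IDF inside the shared corpus:       {idf_shared:.2f}")
```

Under these assumptions the shared index treats a term that appears in 40% of the tenant's docs as highly discriminative, so it can crowd out terms that are actually rare within that tenant's corpus.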

The other issue was operational. A reindex of any single tenant's docs meant reprocessing them inside the shared ingestion pipeline. Updates to one customer's content sometimes stalled because of an ingestion job from a different customer. Not a great look when the customer with the slow job is also the one paying the most. Granted, that one isn't really an index-topology problem. You could parallelize workers and keep the index shared. But the two failure modes started compounding, and the simplest fix for both at once looks like per-tenant everything.
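The "keep the index shared but stop one tenant's job from blocking another" option could look something like the sketch below. It is a toy round-robin scheduler over per-tenant queues, not a description of any real pipeline. The tenant names and job IDs are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

Job = Tuple[str, str]  # (tenant_id, job_id)


def round_robin(jobs: List[Job]) -> List[Job]:
    """Reorder a FIFO backlog so each tenant gets one job per pass."""
    queues: Dict[str, Deque[str]] = {}
    for tenant, job_id in jobs:
        queues.setdefault(tenant, deque()).append(job_id)

    order: List[Job] = []
    while queues:
        # list() so tenants can be dropped while iterating.
        for tenant in list(queues):
            order.append((tenant, queues[tenant].popleft()))
            if not queues[tenant]:
                del queues[tenant]
    return order


# A big tenant queued four jobs before a small tenant queued one.
backlog = [("big", "b1"), ("big", "b2"), ("big", "b3"), ("big", "b4"), ("small", "s1")]
schedule = round_robin(backlog)
print(schedule)
```

In plain FIFO order the small tenant's job runs fifth. With per-tenant queues it runs second, without touching the index topology at all.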

So now I'm trying to decide whether to flip to per-tenant isolated indexes. The downside is obvious. Thirty separate indexes to keep an eye on, plus you're paying for storage thirty times instead of once. You also lose the ability to do cross-tenant analytics, which we do use occasionally for product decisions.

What I keep going back and forth on is whether this is an architectural question or just a "your shared index needs better scoring" question. At 30 tenants both stories are plausible. At 100 I don't know which one breaks first, and the migration cost of switching topologies later is not small.

Mostly trying to figure out how other people drew the line.

reddit.com
u/MeetVege — 1 day ago

Shared RAG index with metadata filters started cracking around 30 tenants

We've been doing customer-facing RAG for about a year. Each customer uploads their own docs, and they only see results from their own corpus.

Started in a single Pinecone index with namespaces per tenant. Worked fine through the first 10 or so customers, then namespace count itself became an ops headache, so we flipped to a single namespace and a tenant_id metadata filter on every query. That carried us to maybe customer 18. Then a few things started getting weird.

Recall got noticeably worse for tenants with smaller corpora. I don't have a great theory for why, but my hunch is that hybrid scoring inside a giant shared index starts being dominated by the term distribution of larger tenants. If 80% of your docs are from three big customers, and a fourth customer searches a term that's common in their own docs but rare in the shared corpus, BM25 weights end up looking strange. The vector side was less obviously broken. With top-K retrieval and a metadata filter, small-corpus tenants were sometimes getting fewer than K candidates back at all, which then fed a reranker that didn't have enough to work with.
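The "fewer than K candidates" symptom is what you would expect if the filter is applied after the nearest-neighbor cut rather than before it. Whether a given vector store behaves this way depends on its implementation, so treat this as an illustration of the general mechanism only. The scores and tenant names are invented.

```python
from typing import List, Tuple

Hit = Tuple[str, float]  # (tenant_id, similarity score)


def post_filter_top_k(hits: List[Hit], tenant_id: str, k: int) -> List[Hit]:
    """Take the global top-K first, then drop other tenants' hits."""
    top = sorted(hits, key=lambda h: h[1], reverse=True)[:k]
    return [h for h in top if h[0] == tenant_id]


def pre_filter_top_k(hits: List[Hit], tenant_id: str, k: int) -> List[Hit]:
    """Restrict to the tenant first, then take the top-K of what is left."""
    own = [h for h in hits if h[0] == tenant_id]
    return sorted(own, key=lambda h: h[1], reverse=True)[:k]


# 95 hits from a big tenant with uniformly high scores, 5 from a small one.
hits: List[Hit] = [("big", 0.90 - i * 0.001) for i in range(95)]
hits += [("small", s) for s in (0.95, 0.50, 0.40, 0.30, 0.20)]

post = post_filter_top_k(hits, "small", k=3)
pre = pre_filter_top_k(hits, "small", k=3)
print(len(post), len(pre))
```

With post-filtering the small tenant gets one candidate instead of three, because the big tenant's hits occupy the rest of the global top-K. Over-fetching (asking for more than K and trimming afterwards) is a common workaround when true pre-filtering is not available.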

The other issue was operational. A reindex of any single tenant's docs meant reprocessing them inside the shared ingestion pipeline. Updates to one customer's content sometimes stalled because of an ingestion job from a different customer. Not a great look when the customer with the slow job is also the one paying the most. Granted, that one isn't really an index-topology problem. You could parallelize workers and keep the index shared. But the two failure modes started compounding, and the simplest fix for both at once looks like per-tenant everything.

So now I'm trying to decide whether to flip to per-tenant isolated indexes. The downside is obvious. Thirty separate indexes to keep an eye on, plus you're paying for storage thirty times instead of once. And you lose the ability to do cross-tenant analytics, which we do use occasionally.

Been prototyping with Denser Retriever for the last couple of weeks partly because its data model treats a knowledge base as a first-class resource with its own ID, and you create one through the same API you upload docs to. Per-tenant KB ends up being one POST per customer signup, which is the cleanest take on this I've come across. Not sure I've stress-tested it enough at scale yet to claim anything beyond "the ergonomics are easier."

The thing I'm still stuck on is whether this is an architectural question or just a "your shared index needs better scoring" question. At 30 tenants both stories are plausible. At 100 I don't know which one breaks first.

Mostly trying to figure out how other people drew the line.

u/MeetVege — 1 day ago

How are you actually monitoring a RAG pipeline in prod? Inherited one and there's basically nothing to look at

Maybe this is a known thing but it keeps catching me off guard.

We have a RAG service running an internal assistant. Lived with the AI/ML team for a year and change, just got moved to my team in the last reorg. Code runs, embeddings get computed on schedule, vector store updates. From a pipeline perspective it looks like a normal cron'd job, exit zero or paged.

Then I started asking the things I ask about every other data asset and the answers were either bad or didn't exist. Freshness SLA? "Whatever the cron is." So 6 hours, sometimes more if a batch hangs. Quality monitoring? Don't really have any. Users complain in Slack and that's the signal. If an embedding job half fails and leaves a doc in a weird state in the index, would anyone notice? Long pause, then "eventually I guess." No view on whether retrieval is trending differently this week vs last.
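A freshness check along these lines is one way to turn "whatever the cron is" into something you can alert on. This is a sketch under assumptions: it presumes you can get a last-modified time per source doc and a last-embedded time per doc from somewhere, and the doc IDs, timestamps, and the 6-hour SLA below are placeholders.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def stale_docs(
    source_updated: Dict[str, datetime],
    embedded_at: Dict[str, datetime],
    now: datetime,
    sla: timedelta,
) -> List[str]:
    """Docs whose latest source version is older than the SLA but not in the index."""
    stale = []
    for doc_id, updated in source_updated.items():
        embedded = embedded_at.get(doc_id)
        if embedded is not None and embedded >= updated:
            continue  # index already reflects the latest version
        if now - updated > sla:
            stale.append(doc_id)
    return sorted(stale)


utc = timezone.utc
now = datetime(2024, 1, 1, 12, 0, tzinfo=utc)
source_updated = {
    "a": datetime(2024, 1, 1, 11, 0, tzinfo=utc),  # changed 1h ago, not embedded yet
    "b": datetime(2024, 1, 1, 2, 0, tzinfo=utc),   # changed 10h ago, embedding is older
    "c": datetime(2024, 1, 1, 2, 0, tzinfo=utc),   # changed 10h ago, re-embedded since
    "d": datetime(2024, 1, 1, 1, 0, tzinfo=utc),   # changed 11h ago, never embedded
}
embedded_at = {
    "b": datetime(2024, 1, 1, 1, 0, tzinfo=utc),
    "c": datetime(2024, 1, 1, 3, 0, tzinfo=utc),
}

breaches = stale_docs(source_updated, embedded_at, now, sla=timedelta(hours=6))
print(breaches)
```

Doc "a" is unembedded but still inside the SLA window, so it is not flagged. That distinction is what separates a freshness SLA from "did the cron run."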

Coming from dbt + Airflow + a few years of pushing Great Expectations on every team I can find, this feels like 2014. No row count equivalent. The closest thing in scope is "did the embedding job exit zero." That's it.
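The nearest thing to a row count check here is probably a chunk count reconciliation: how many chunks the chunker produced per doc versus how many vectors the index actually holds for that doc. A rough sketch, assuming both counts are obtainable. The doc names and counts are hypothetical.

```python
from typing import Dict, Tuple


def reconcile(
    expected: Dict[str, int], indexed: Dict[str, int]
) -> Dict[str, Tuple[int, int]]:
    """Return {doc_id: (expected_chunks, indexed_chunks)} for every mismatch."""
    mismatches = {}
    for doc_id in expected.keys() | indexed.keys():
        want, have = expected.get(doc_id, 0), indexed.get(doc_id, 0)
        if want != have:
            mismatches[doc_id] = (want, have)
    return mismatches


expected = {"handbook.pdf": 40, "faq.md": 12, "pricing.md": 5}
indexed = {"handbook.pdf": 40, "faq.md": 7, "old-policy.md": 9}

report = reconcile(expected, indexed)
for doc_id in sorted(report):
    print(doc_id, report[doc_id])
```

This catches three cases at once: a half-written doc (faq.md), a doc that never landed (pricing.md), and an orphan still in the index after its source was removed (old-policy.md).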

Started pulling raw logs into DuckDB to look and some of it was rough. Roughly 7% of live queries are serving an answer despite the top retrieval score being well below where I'd want it. No abstention, no flag, no escalation. The model just keeps talking and the user takes it on faith.
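That "answered despite a weak top score" number can be tracked as a metric rather than a one-off query. A minimal sketch in plain Python, assuming each log record carries the top retrieval score and whether an answer was served. The field names and the 0.5 threshold are assumptions, not anything from the actual logs.

```python
from typing import Dict, List, Union

Record = Dict[str, Union[float, bool]]


def low_confidence_answer_rate(logs: List[Record], min_score: float) -> float:
    """Share of all queries that were answered with top_score below min_score."""
    if not logs:
        return 0.0
    flagged = sum(1 for r in logs if r["answered"] and r["top_score"] < min_score)
    return flagged / len(logs)


logs: List[Record] = [
    {"top_score": 0.82, "answered": True},
    {"top_score": 0.31, "answered": True},   # weak retrieval, answered anyway
    {"top_score": 0.28, "answered": False},  # weak retrieval, abstained
    {"top_score": 0.77, "answered": True},
]

rate = low_confidence_answer_rate(logs, min_score=0.5)
print(f"{rate:.0%} of queries answered on weak retrieval")
```

The same predicate could gate the response path (abstain or escalate below the threshold) instead of only being reported after the fact.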

The other thing bugging me: none of the metrics the AI team uses are useful for this in production. They care about precision@k on an offline eval set they hand-curated almost a year ago. I care about how the thing is actually behaving in prod this week, which they have no view on.
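For the "this week vs last" view, even a crude comparison of top retrieval scores across two windows gives a production signal that an offline eval set cannot. A sketch only: the median, the 10% tolerance, and the sample scores are arbitrary choices for illustration.

```python
from statistics import median
from typing import Sequence, Tuple


def score_drift(
    last_week: Sequence[float], this_week: Sequence[float], max_drop: float = 0.10
) -> Tuple[float, bool]:
    """Relative drop in median top score, and whether it exceeds max_drop."""
    base, current = median(last_week), median(this_week)
    drop = (base - current) / base if base else 0.0
    return drop, drop > max_drop


last_week = [0.80, 0.82, 0.78]
this_week = [0.60, 0.62, 0.58]

drop, alert = score_drift(last_week, this_week)
print(f"median top score dropped {drop:.0%}, alert={alert}")
```

A drop like this does not say what broke (stale index, changed query mix, embedding model swap), only that something is worth looking at before users report it.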

Tons written about RAG retrieval quality, almost nothing about RAG observability as a thing you actually operationalize. Would honestly be glad to hear from anyone who's built a real monitoring layer here. Otherwise it kind of feels like we're all running on user-complaint-driven signal.

u/MeetVege — 8 days ago

About 60% of my invoices now settle in USDC because half my clients are in places where wire transfers are either expensive or just stuck for days. Cool in theory.

In practice the off-ramp is where everything breaks. Wise flagged my account twice in a quarter once it noticed regular crypto deposits. Revolut's crypto sell limits are fine until rent is actually due. I fall back on a local exchange in Tbilisi or Bangkok depending on where I am, but spread + withdrawal fees end up around 2-3% before I even get cash, and ATM fees bite again.

I keep hearing nomads say they just hold stables and "spend directly," which sounds great until I try to figure out what that actually means at a Tesco self-checkout.

Tried a couple of CEX-backed cards (Crypto.com, BitPay style) but the custody piece always made me uneasy after FTX. The self-custody side seems newer and I haven't really pressure-tested any of them.

Mostly the bit I keep getting stuck on is the last mile from chain to checkout without giving up custody or eating 3% twice.
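For what "eating 3% twice" works out to: percentage fees taken in sequence compound, so the total is slightly under the simple sum. A quick sketch, where the 1,000 invoice amount is just an example figure.

```python
def effective_fee(*fees: float) -> float:
    """Total fraction lost when each fee is taken from what the previous one left."""
    kept = 1.0
    for fee in fees:
        kept *= 1 - fee
    return 1 - kept


total = effective_fee(0.03, 0.03)
invoice = 1_000.00  # example amount
print(f"effective fee: {total:.2%}")
print(f"left from {invoice:.2f}: {invoice * (1 - total):.2f}")
```

Two 3% hits cost about 5.91% overall, so roughly 59 out of every 1,000 invoiced goes to the off-ramp.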

u/MeetVege — 14 days ago

Just got back from my first osaka trip, 3 nights in feb. had a great time and dotonbori absolutely lived up to it the first night, the energy is real. did kuromon market, osaka castle, took the day in nara, ate way too much takoyaki, all good stuff.

but by night 2 i kept getting this feeling that the parts i was seeing were the very top layer and there was a whole other osaka i wasn't getting to. couple things made me think that:

we wandered into tenjinbashisuji one afternoon almost by accident and ended up eating at this small place off a side street where nobody spoke english and it was the best meal of the trip. completely different vibe from the main tourist drag. then on the last night we got tired of the dotonbori area and went looking for somewhere quieter, ended up in what i later learned people call ura-namba, tiny standing bars, alleys, locals after work. felt like we'd been one block away from this for 2 days without knowing.

and shinsekai i didn't make it to but everyone whose opinion i trust says i screwed up by skipping it.

so basically i'm realizing 3 nights wasn't enough and i'm already thinking about coming back. for people who know osaka well, what's the layer past the obvious stuff? specific neighborhoods, specific kinds of places, things that take a couple visits to find? happy to pin a list for whenever i make it back.

u/MeetVege — 23 days ago