u/coldoven

Help wanted: Should PII redaction be a mandatory pre-index stage in RAG pipelines?

We’re experimenting with enforcing PII redaction as a structural ingestion stage in a local/open-source RAG pipeline.

A lot of stacks effectively do:

raw docs -> chunk -> embed -> retrieve -> mask output

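But if docs contain emails, names, phone numbers, employee IDs, etc., the vector index is already derived from sensitive data. Retrieval-time masking only changes what gets rendered to the user; the stored chunks still hold the raw values, and the embeddings are still computed from them.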

We’re testing a stricter pipeline:

docs -> docs__pii_redacted -> chunk -> embed

This reduces the attack surface of the index itself instead of relying on output filtering.
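
To make the stricter pipeline concrete, here is a rough sketch of the idea. It is not the code from the prototype; `redact_pii`, `chunk`, `build_index`, and the two regex patterns are hypothetical names chosen for illustration, and `embed` is whatever embedding function you already use. Two regexes are nowhere near real PII detection (they will miss names and employee IDs entirely), so treat this as the shape of the pipeline only.

```python
import re
from typing import Callable, List, Sequence, Tuple

# Illustrative patterns only; real PII detection needs far more than two regexes.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace matched emails and phone numbers with typed placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def chunk(text: str, size: int = 200) -> List[str]:
    """Naive fixed-width chunking, a stand-in for a real splitter."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_index(
    docs: Sequence[str],
    embed: Callable[[str], List[float]],
) -> List[Tuple[str, List[float]]]:
    """docs -> docs__pii_redacted -> chunk -> embed."""
    index = []
    for doc in docs:
        # Redaction runs first, so chunking and embedding only see redacted text.
        for piece in chunk(redact_pii(doc)):
            index.append((piece, embed(piece)))
    return index
```

The point of the ordering is that both the stored chunk text and the vectors are computed from the redacted string, so nothing matched by the detector reaches the index. Anything the detector misses still gets indexed, which is why detection quality matters so much here.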

Open-source prototype, not at all close to production-ready:
https://github.com/mloda-ai/rag_integration

We’re especially looking for feedback on:

  • whether pre-index redaction is actually the right boundary
  • recall degradation vs privacy tradeoffs
  • better PII detection approaches
  • failure modes we’re missing
u/coldoven — 10 hours ago

Is it a mistake to treat PII filtering as a retrieval-time step instead of an ingestion constraint in RAG?

It seems like RAG pipelines often do:

raw docs -> chunk -> embed -> retrieve -> mask output

But if documents contain emails, phone numbers, names, employee IDs, etc., the vector index is already derived from sensitive data.

The alternative is to redact at ingestion, before anything is chunked or indexed:

docs -> docs__pii_redacted -> chunk -> embed

Invariant: unsanitized text never gets chunked or embedded.
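
One way to make that invariant harder to violate by accident is to give redacted text its own type, so the chunker refuses plain strings. This is only a sketch under my own assumptions: `RedactedText`, `redact`, and `chunk` are hypothetical names, not taken from the linked repo, and the single email regex stands in for a real detector.

```python
import re
from dataclasses import dataclass
from typing import List

# Single illustrative pattern; a real detector would cover far more PII types.
_EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


@dataclass(frozen=True)
class RedactedText:
    """Marker type for text that has passed through redact()."""
    value: str


def redact(raw: str) -> RedactedText:
    """The only intended way to obtain a RedactedText from raw input."""
    return RedactedText(_EMAIL_RE.sub("[EMAIL]", raw))


def chunk(text: RedactedText, size: int = 200) -> List[RedactedText]:
    """Fixed-width chunking that rejects anything that is not RedactedText."""
    if not isinstance(text, RedactedText):
        raise TypeError("chunk() only accepts RedactedText; call redact() first")
    return [
        RedactedText(text.value[i:i + size])
        for i in range(0, len(text.value), size)
    ]
```

With this, `chunk("alice@example.com")` raises `TypeError`, while `chunk(redact("ping alice@example.com"))` works. Python will not stop someone from constructing `RedactedText(raw)` by hand, so this guards against accidents rather than deliberate bypass, and it says nothing about whether the redaction itself caught everything.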

This seems safer from a data-lineage / attack-surface perspective, especially for local or enterprise RAG systems.

Or am I wrong?

Example: https://github.com/mloda-ai/rag_integration/blob/main/demo.ipynb
