
Help wanted: Should PII redaction be a mandatory pre-index stage in RAG pipelines?
We’re experimenting with enforcing PII redaction as a structural ingestion stage in a local/open-source RAG pipeline.
A lot of stacks effectively do:
raw docs -> chunk -> embed -> retrieve -> mask output
But if the docs contain emails, names, phone numbers, employee IDs, etc., the vector index is already derived from sensitive data. Retrieval-time masking only changes what gets rendered to the user; it does nothing about what the index already holds.
We’re testing a stricter pipeline:
docs -> docs__pii_redacted -> chunk -> embed
This reduces the attack surface of the index itself instead of relying on output filtering.
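To make the boundary concrete, here is a minimal sketch of the redact-before-chunk ordering. It is not the prototype's implementation: the regex patterns, the `EMP-` ID format, and the `redact`/`chunk`/`ingest` helper names are all assumptions made up for illustration, and regexes alone will miss plenty (note the name "Jane" survives).

```python
import re

# Hypothetical patterns for illustration only. Real PII detection needs more
# than regexes: names, addresses, and free-form IDs are not covered here.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "EMPLOYEE_ID": re.compile(r"\bEMP-\d{4,}\b"),  # assumed ID format
}

def redact(text: str) -> str:
    """Replace each pattern match with a typed placeholder such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

def chunk(text: str, size: int = 200) -> list[str]:
    """Naive fixed-width chunking; stands in for whatever chunker you use."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def ingest(docs: list[str]) -> list[str]:
    """docs -> docs__pii_redacted -> chunk. Only the output is ever embedded."""
    chunks = []
    for doc in docs:
        chunks.extend(chunk(redact(doc)))
    return chunks

docs = ["Contact Jane at jane.doe@example.com or +1 555 010 9999. Badge EMP-00123."]
index_input = ingest(docs)
print(index_input[0])
```

The point of the ordering is that the embedder and the vector store only ever see `index_input`, so a leaked or dumped index contains placeholders rather than the raw values. Typed placeholders (instead of deleting the span) keep some signal for retrieval, which is one knob in the recall vs privacy tradeoff.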
Open-source prototype, nowhere near production-ready:
https://github.com/mloda-ai/rag_integration
We’re especially looking for feedback on:
- whether pre-index redaction is actually the right boundary
- recall degradation vs privacy tradeoffs
- better PII detection approaches
- failure modes we’re missing