
Independent eval of openai/privacy-filter vs GLiNER on 600 PII samples. The model is much better than naive benchmarks make it look.
OpenAI dropped Privacy Filter last month under Apache 2.0, and I wanted to see how it actually stacks up against the other serious open-weight option for PII detection, GLiNER large-v2.1. I ran a full head-to-head on 600 labeled samples from ai4privacy (400 English, 200 split across French, German, Spanish, Italian, and Dutch).
The headline finding is that openai/privacy-filter is genuinely strong, but you'd never know it from a quick benchmark.
Here's why:
openai/privacy-filter is a token classifier with a GPT-style BPE tokenizer. BPE folds the preceding space into most tokens, so when you decode token boundaries back to character offsets, spans start one character early compared to human annotations. Score the model with strict exact-span matching, the obvious first thing to do, and it looks much worse than it is: almost every "miss" is actually a correct detection with a one-character offset.
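Here's a minimal sketch of the artifact (function names are mine, not from the eval code): the decoded span absorbs the space before the entity, so strict matching fails even though the detection is correct, and trimming leading whitespace recovers the match.

```python
def strict_match(pred, gold):
    # Exact (start, end) equality -- the naive scoring rule.
    return pred == gold

def normalize_span(text, start, end):
    # Trim leading whitespace that the BPE token boundary absorbed.
    while start < end and text[start].isspace():
        start += 1
    return (start, end)

text = "Contact John Smith today."
gold = (8, 18)   # "John Smith", as a human would annotate it
pred = (7, 18)   # " John Smith" -- one character early after BPE decoding

print(strict_match(pred, gold))                         # False
print(strict_match(normalize_span(text, *pred), gold))  # True
```

Same detection, two verdicts, depending entirely on whether you account for the tokenizer.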
The numbers tell the story:
| Model | Strict F1 | Boundary F1 |
|---|---|---|
| GLiNER large-v2.1 | 0.367 | 0.416 |
| openai/privacy-filter | 0.155 | 0.498 |
The 0.34 gap between strict and boundary F1 for openai/privacy-filter is entirely a tokenizer artifact, not real misses. Once you score with boundary overlap (any character overlap with the correct label), the model wins overall.
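For concreteness, boundary-overlap scoring can be sketched like this (my reconstruction of the rule described above, not the exact eval code): a prediction counts as a hit if it shares any character with a gold span of the same label.

```python
def overlaps(a, b):
    # Half-open (start, end) spans overlap iff each starts before the other ends.
    return a[0] < b[1] and b[0] < a[1]

def boundary_f1(preds, golds):
    # preds/golds: lists of (start, end, label) spans.
    tp = sum(any(p[2] == g[2] and overlaps(p, g) for g in golds) for p in preds)
    fp = len(preds) - tp
    fn = sum(not any(p[2] == g[2] and overlaps(p, g) for p in preds) for g in golds)
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if golds else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

preds = [(7, 18, "PERSON"), (30, 45, "EMAIL")]
golds = [(8, 18, "PERSON"), (30, 45, "EMAIL"), (50, 60, "PHONE")]
print(round(boundary_f1(preds, golds), 3))  # 0.8
```

The off-by-one PERSON span scores as a hit here; under strict matching it would count as both a false positive and a false negative.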
Per category on boundary scoring (English):
- EMAIL: openai 0.99, GLiNER 0.73
- PHONE: openai 0.67, GLiNER 0.51
- PERSON: openai 0.69, GLiNER 0.62
- DATE: openai 0.27, GLiNER 0.26
- ADDRESS: GLiNER 0.39, openai 0.37
EMAIL is essentially solved: 0.987 F1 in English, 1.000 across the multilingual set.
A few other things worth knowing if you're considering deploying it:
- It's faster than GLiNER on CPU (~2.8 vs ~1.1 samples/sec) thanks to MoE sparse activation: 1.5B total params, but only 50M active per forward pass.
- Multilingual performance is actually stronger than English on boundary scoring. Counterintuitive given the model card flags non-English as a risk, but the numbers are what they are.
- The model is more conservative than GLiNER: higher precision, lower recall. If you're building a redaction pipeline where missing PII is unacceptable, GLiNER's recall-heavy profile may be a better fit. If false positives break downstream parsing, openai/privacy-filter wins.
- It needs `trust_remote_code=True` and the dev branch of transformers right now; the model class hasn't landed in a stable release yet. Mildly annoying but not a blocker.
- The eight categories are fixed (person, address, email, phone, url, date, account_number, secret). For anything outside that, you'd need GLiNER's zero-shot interface.
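One way to make the precision/recall tradeoff explicit when picking between the two models is an F-beta score with beta > 1, which weights recall more heavily. A quick sketch with made-up numbers (not from the eval), just to show the mechanics:

```python
def f_beta(precision, recall, beta=2.0):
    # F-beta generalizes F1; beta=2 weights recall twice as much as precision,
    # which fits redaction pipelines where misses are the expensive failure.
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A conservative model (high precision, lower recall)...
print(round(f_beta(0.9, 0.5), 3))  # 0.549
# ...vs a recall-heavy model, under F2 weighting.
print(round(f_beta(0.6, 0.8), 3))  # 0.75
```

Under plain F1 these two profiles score much closer; F2 makes the recall-heavy model the clear pick for redaction, which is the point of the bullet above.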
Two openai/privacy-filter categories (account_number and secret) had no equivalent gold labels in ai4privacy and were excluded from scoring. Evaluating those would take a finance- or credentials-heavy dataset.
Full writeup, code, predictions, and all CSVs in the comments below 👇
Disclosure: I work on Neo AI Engineer, and the eval pipeline was built by Neo from a single prompt. I reviewed the methodology and validated the results before publishing. The numbers and findings stand on their own; happy to talk about the agent side separately if anyone's interested.