u/gvij

Independent eval of Openai/privacy-filter vs GLiNER on 600 PII samples. The model is much better than naive benchmarks make it look
▲ 33 r/LocalLLM+3 crossposts

OpenAI dropped Privacy Filter last month under Apache 2.0 and I wanted to see how it actually stacks up against the other serious open weight option for PII detection, GLiNER large-v2.1. Ran a full head to head on 600 labeled samples from ai4privacy (400 English, 200 across French, German, Spanish, Italian, Dutch).

The headline finding is that openai/privacy-filter is genuinely strong, but you'd never know it from a quick benchmark.

Here's why:

openai/privacy-filter is a token classifier with a GPT-style BPE tokenizer, which folds the preceding space into most tokens. When you decode token boundaries back to character offsets, nearly every span ends up off by one character compared to the human annotation. Score the model with strict exact span matching, which is the obvious first thing to do, and it looks much worse than it is. Almost every "miss" is actually a correct detection with a one-character offset.

The numbers tell the story:

Model                    Strict F1    Boundary F1
GLiNER large-v2.1        0.367        0.416
openai/privacy-filter    0.155        0.498

The 0.34 strict-to-boundary gap for openai/privacy-filter is entirely a tokenizer artifact, not real misses. Once you score with boundary overlap (any character overlap with the correct label), the model wins overall.
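
For anyone who wants to reproduce the distinction, here's a minimal sketch of the two matching modes as I understand them (not the exact harness code; the span dicts and field names are just for illustration):

```python
# Two scoring modes: strict exact-span matching vs boundary overlap.
def strict_match(pred, gold):
    # exact span + label match: a one-character tokenizer offset counts as a miss
    return (pred["label"] == gold["label"]
            and pred["start"] == gold["start"]
            and pred["end"] == gold["end"])

def boundary_match(pred, gold):
    # any character overlap with the correct label counts as a hit
    return (pred["label"] == gold["label"]
            and pred["start"] < gold["end"]
            and gold["start"] < pred["end"])

gold = {"label": "EMAIL", "start": 10, "end": 28}
pred = {"label": "EMAIL", "start": 9, "end": 28}   # off by one from the BPE leading space
print(strict_match(pred, gold), boundary_match(pred, gold))  # False True
```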

Per category on boundary scoring (English):

  • EMAIL: openai 0.99, GLiNER 0.73
  • PHONE: openai 0.67, GLiNER 0.51
  • PERSON: openai 0.69, GLiNER 0.62
  • DATE: openai 0.27, GLiNER 0.26
  • ADDRESS: GLiNER 0.39, openai 0.37

EMAIL is essentially solved. 0.987 F1 in English, 1.000 across the multilingual set.

A few other things worth knowing if you're considering deploying it:

  • It's faster than GLiNER on CPU (~2.8 vs ~1.1 samples/sec) thanks to the MoE sparse activation. 1.5B total params but only 50M active per forward pass.
  • Multilingual performance is actually stronger than English on boundary scoring. Counterintuitive given the model card flags non-English as a risk, but the numbers are what they are.
  • The model is more conservative than GLiNER. Higher precision, lower recall. If you're building a redaction pipeline where missing PII is unacceptable, GLiNER's recall heavy profile may be a better fit. If false positives break downstream parsing, openai/privacy-filter wins.
  • It needs trust_remote_code=True and the dev branch of transformers right now; the model class hasn't landed in a stable release yet. Mildly annoying but not a blocker (minimal loading sketch after this list).
  • The eight categories are fixed (person, address, email, phone, url, date, account_number, secret). For anything outside that you'd need GLiNER's zero shot interface.
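
For reference, a minimal loading sketch assuming the standard transformers token-classification interface; the pipeline task and aggregation_strategy here are my assumptions, not something from the model card:

```python
# Minimal sketch: load openai/privacy-filter with the dev branch of transformers.
# pip install git+https://github.com/huggingface/transformers
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "openai/privacy-filter"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForTokenClassification.from_pretrained(model_id, trust_remote_code=True)

# aggregation_strategy="simple" is an assumption; adjust to whatever the model card recommends
ner = pipeline("token-classification", model=model, tokenizer=tok, aggregation_strategy="simple")
print(ner("Contact Jane Doe at jane.doe@example.com or +1 555 0100."))
```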

Two openai/privacy-filter categories (account_number and secret) had no equivalent gold labels in ai4privacy and were excluded from scoring. A finance or credentials heavy dataset would be needed to evaluate those.

Full writeup, code, predictions, and all CSVs in the comments below 👇

Disclosure: I work on Neo AI Engineer, and the eval pipeline was built by Neo from a single prompt. I reviewed the methodology and validated the results before publishing. The numbers and findings stand on their own; happy to talk about the agent side separately if anyone's interested.

u/gvij — 1 day ago

Open opportunities to contribute to openclaw right now - two bugs biting people in production, two features the community's been asking for

Was going through AI Signals today, which tracks open issues across 300+ trending AI/ML repos. Pulled up openclaw (366k stars) and these four stood out. Sharing for anyone who wants to contribute or just wants to know what's coming.

1. Images sent through channels never reach the model. On Discord, Telegram, Feishu, and OpenWebUI, when a user sends an image, the channel adapter strips it and passes only text to the model. Vision-capable models on the other end respond as if no image was provided, or hallucinate descriptions. The fix is extending the channel adapter interface to pass image URLs and base64 payloads through. Multiple issues are open on this; #23452 is the clearest summary.

2. WhatsApp and Telegram die permanently on a brief DNS blip. The reconnect logic on both channel adapters only handles mid-session disconnects. If a DNS failure hits during the initial connection, the error escapes the retry loop entirely and the channel exits with no further reconnect attempts. The gateway keeps running; the channel is just dead. People are running external watchdog scripts to restart it. Issues #2198 and #13506.

3. No way to see what's holding the session lock. The session store uses .jsonl.lock files and they get stuck regularly: large sessions, cron load, parallel agents, config reloads. When it happens, all models fail with "session file locked (timeout 10000ms)" and your only option is to manually kill the lock file or restart the gateway. The lock file already contains pid and createdAt, but there's no CLI command to surface that. An openclaw locks command reading that metadata and showing active waiters would give operators something to actually work with (rough sketch after this list). Issues #31489 and #11950.

4. No parallel coordination between agents. The current session_spawn and send tools only support hierarchical delegation: one agent passes work down and waits. There's no way for multiple agents to collaborate on a task simultaneously or share state. Issue #43367 shows people trying to run parallel coding agents and hitting config overwrites and lock contention on top of the architecture limitation.
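
To make the locks idea from point 3 concrete, here's a rough sketch of what such a command could do, based only on what the issues describe (a .jsonl.lock file containing pid and createdAt); the session directory layout and lock-file schema are assumptions:

```python
# Hypothetical `openclaw locks` behavior: read lock metadata and check the holder.
import json, os, sys
from pathlib import Path

def pid_alive(pid: int) -> bool:
    try:
        os.kill(pid, 0)          # signal 0 only checks whether the process exists
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True              # exists but owned by another user

def show_locks(session_dir: str) -> None:
    for lock in Path(session_dir).glob("*.jsonl.lock"):
        try:
            meta = json.loads(lock.read_text())
        except (OSError, json.JSONDecodeError):
            print(f"{lock.name}: unreadable lock file")
            continue
        pid = meta.get("pid")
        print(f"{lock.name}: pid={pid} createdAt={meta.get('createdAt')} "
              f"holder_alive={pid is not None and pid_alive(pid)}")

if __name__ == "__main__":
    show_locks(sys.argv[1] if len(sys.argv) > 1 else ".")
```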

Issues and AI Signals link in the comments below 👇

u/gvij — 2 days ago
▲ 1 r/github

Copilot SDK requires you to start a session just to list available Agents, Skills, or MCP configs - no enumeration API yet

If you're building on the Copilot SDK and want to show users what agents or skills are available before they start a conversation, you're stuck. The only way to enumerate them right now is to create a full session first.

The VS Code team flagged this in issue #1161 because they want to surface these in the UI pre-session. Makes sense. Feels like a pretty fundamental gap for anyone building tooling on top of the SDK.

SDK is in public preview so hopefully this gets prioritized. Anyone else running into this while building extensions or integrations?

Issue link in comments below.

u/gvij — 2 days ago
🔥 Hot ▲ 801 r/LocalLLM+2 crossposts

Qwen 3.6 27B BF16 vs Q4_K_M vs Q8_0 GGUF evaluation

Evaluated Qwen 3.6 27B across BF16, Q4_K_M, and Q8_0 GGUF quant variants with llama-cpp-python using Neo AI Engineer.

Benchmarks used:

  • HumanEval: code generation
  • HellaSwag: commonsense reasoning
  • BFCL: function calling

Total samples:

  • HumanEval: 164
  • HellaSwag: 100
  • BFCL: 400

Results:

BF16

  • HumanEval: 56.10% (92/164)
  • HellaSwag: 90.00% (90/100)
  • BFCL: 63.25% (253/400)
  • Avg accuracy: 69.78%
  • Throughput: 15.5 tok/s
  • Peak RAM: 54 GB
  • Model size: 53.8 GB

Q4_K_M

  • HumanEval: 50.61% (83/164)
  • HellaSwag: 86.00% (86/100)
  • BFCL: 63.00% (252/400)
  • Avg accuracy: 66.54%
  • Throughput: 22.5 tok/s
  • Peak RAM: 28 GB
  • Model size: 16.8 GB

Q8_0

  • HumanEval: 52.44% (86/164)
  • HellaSwag: 83.00% (83/100)
  • BFCL: 63.00% (252/400)
  • Avg accuracy: 66.15%
  • Throughput: 18.0 tok/s
  • Peak RAM: 42 GB
  • Model size: 28.6 GB

What stood out:

Q4_K_M looks like the best practical variant here. It keeps BFCL almost identical to BF16, drops about 5.5 points on HumanEval, and is still only 4 points behind BF16 on HellaSwag.

The tradeoff is pretty good:

  • 1.45x faster than BF16
  • 48% less peak RAM
  • 68.8% smaller model file
  • nearly identical function calling score

Q8_0 was a bit underwhelming in this run. It improved HumanEval over Q4_K_M by ~1.8 points, but used 42 GB RAM vs 28 GB and was slower. It also scored lower than Q4_K_M on HellaSwag in this eval.

For local/CPU deployment, I would probably pick Q4_K_M unless the workload is heavily code-generation focused. For maximum quality, BF16 still wins.

Evaluation setup:

  • GGUF via llama-cpp-python (loading sketch after this list)
  • n_ctx: 32768
  • checkpointed evaluation
  • HumanEval, HellaSwag, and BFCL all completed
  • BFCL had 400 function calling samples
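
For anyone wanting to reproduce the setup, a minimal loading sketch with llama-cpp-python; the model path, thread count, and chat-completion call are placeholders, not the actual harness Neo built:

```python
# Minimal GGUF loading sketch matching the setup above (n_ctx=32768).
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3.6-27b-q4_k_m.gguf",  # hypothetical local path to the quant
    n_ctx=32768,                            # context length used in the eval
    n_threads=16,                           # tune to your CPU
    verbose=False,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a function that reverses a string."}],
    max_tokens=512,
    temperature=0.0,                        # deterministic decoding for benchmarking
)
print(out["choices"][0]["message"]["content"])
```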

This evaluation was done using Neo AI Engineer, which built the GGUF eval setup, handled checkpointed runs, and consolidated the benchmark results. I manually reviewed the outcome as well.

Complete case study with benchmarking results, approach, and code snippets in the comments below 👇

u/gvij — 4 days ago
🔥 Hot ▲ 192 r/claude+4 crossposts

Kimi K2.6 vs Claude Opus 4.7 on autonomous coding tasks

Ran a small head-to-head eval between Kimi K2.6 and Claude Opus 4.7 on 10 hard reasoning, coding, and analysis tasks.

Setup:

  • Kimi: moonshotai/kimi-k2.6
  • Opus: anthropic/claude-opus-4.7
  • Both via OpenRouter
  • Judge: GPT-5.4
  • A/B anonymized judging (sketch after this list)
  • 10 tasks total
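
The A/B anonymization was just a per-task blind assignment; here's a minimal sketch of the idea (not the exact harness code, and the judge call is a hypothetical placeholder):

```python
# Blind A/B assignment: the judge only ever sees "Response A" / "Response B".
import random

def blind_pair(kimi_answer: str, opus_answer: str):
    pair = [("kimi", kimi_answer), ("opus", opus_answer)]
    random.shuffle(pair)                       # judge can't infer the model from position
    mapping = {"A": pair[0][0], "B": pair[1][0]}
    prompt = (f"Response A:\n{pair[0][1]}\n\n"
              f"Response B:\n{pair[1][1]}\n\n"
              "Which response is better, A or B? Answer with a single letter.")
    return prompt, mapping                     # keep the mapping to un-blind afterwards

prompt, mapping = blind_pair("...", "...")
# verdict = call_judge(prompt)                 # hypothetical GPT-5.4 call via OpenRouter
# winner = mapping[verdict]
```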

Results:

  • Kimi wins: 6
  • Opus wins: 4
  • Ties: 0
  • Avg judge score: Opus 8.0, Kimi 7.2
  • Avg latency: Opus 29.7s, Kimi 496.8s
  • Avg total tokens: Opus 3,561, Kimi 14,297

The interesting part is that Kimi won more tasks, but Opus had the higher average score.

Kimi was stronger on tasks where exhaustive reasoning and detailed coverage mattered. It won the Zebra puzzle, causal inference, Redis rate limiter, production memory leak debugging, autonomous vehicle ethics, and Alzheimer’s trial critique.

Opus was much faster, more concise, and more reliable. It won the St. Petersburg paradox, distributed ID generator, query optimization, and repeated duopoly game theory task.

Kimi also had two bad failure cases: one upstream JSONDecodeError from OpenRouter/Moonshot, and one response that spent around 21k completion tokens in reasoning but never emitted final content. Opus completed all 10 tasks cleanly.

My takeaway:

Kimi K2.6 is surprisingly strong when it completes properly, especially for deep reasoning and long-form implementation tasks.

But Opus 4.7 is much faster and more predictable. For interactive coding agents, Opus still feels safer. For slower offline evals or deep analysis, Kimi looks very interesting.

The eval was performed by Neo AI Engineer.

Complete breakdown of the evaluation along with approach, code, and prompts in the comments below 👇

This was a small eval, only 10 tasks, so don’t treat this as a full benchmark. But the result was interesting enough to share.

u/gvij — 4 days ago
▲ 43 r/Qwen_AI

Qwen 3.6 27B vs Qwen 3.6 35B-A3B vs Gemma 4 models: throughput on H100

Ran a serving benchmark on 8 small and mid-size models on a single H100 80GB to figure out which ones are actually worth running in production.

Setup:

- vLLM 0.19.1, vllm bench serve (example invocation after this list)

- 100 prompts per run, 128 in / 128 out tokens

- Concurrency: 1, 4, 8, 16

- Metrics: throughput (tok/s) and TTFT (ms)
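
For reference, the kind of invocation this maps to; flag names are from memory and may differ across vLLM versions (check `vllm bench serve --help`), and the model id is a placeholder:

```bash
# 1) serve the model (separate terminal)
vllm serve Qwen/Qwen3.6-35B-A3B-FP8 --port 8000

# 2) run the serving benchmark: 128 in / 128 out, 100 prompts, concurrency 16
vllm bench serve \
  --model Qwen/Qwen3.6-35B-A3B-FP8 \
  --port 8000 \
  --dataset-name random \
  --random-input-len 128 \
  --random-output-len 128 \
  --num-prompts 100 \
  --max-concurrency 16
```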

Throughput at c=16 (tok/s):

- Gemma 4 E2B-it: 3180

- Gemma 4 E4B-it: 2015

- Qwen 3.6 35B-A3B-FP8: 1243

- Gemma 4 26B-A4B-it: 1033

- Qwen 3.6 35B-A3B: 718

- Qwen 3.6 27B-FP8: 557

- Qwen 3.6 27B: 439

- Gemma 4 31B-it: 226

Three findings:

  1. Small expert models dominate. Gemma E2B hit 14x the throughput of Gemma 31B dense on the same GPU. TTFT under load: 55 ms vs 4.1 seconds. Architecture is eating parameter count for serving workloads.

  2. FP8 is a bigger win on MoE than dense. Qwen 35B-A3B FP8 vs BF16: +73% throughput. Qwen 27B dense FP8 vs BF16: +27%. MoE benefits more because expert weight movement through HBM is the bottleneck, and FP8 halves that traffic. For MoE on H100, FP8 should be the default now.

  3. Dense 30B-class models don't serve on a single H100. Gemma 31B dense TTFT goes from 130 ms at c=1 to 4159 ms at c=16. Treat it as a batch model, not a serving model.

Who should use what (just my personal preference, you should run your own evals):

- Latency-sensitive chat: Gemma 4 E2B-it

- High-throughput batch: Gemma E2B-it, or E4B if you need more capability

- Quality + speed balance: Qwen 3.6 35B-A3B in FP8 (~1,200 tok/s)

- Skip dense 27B and 31B unless you have a specific reason

Disclosure: The complete experimentation setup, evaluation and analysis was performed end to end by Neo AI Engineer based on my initial task prompt and then I evaluated the final outcome manually.

u/gvij — 6 days ago

GitHub trending tracker built for contributors. Shows open-issue counts alongside growth so you can find projects you can actually help with

The workflow this solves: I want to contribute to open source, I check GitHub trending, I see what's popular, but I have no idea which of those repos has a contributor-friendly issue queue. So I open tabs, drill into Issues, scan for help-wanted labels, get tired, close everything.

This tool shows both axes in one view. Top 360 repos in AI/ML and SWE, sorted by stars / forks / 24h growth / momentum. Each row pulls live open-issue counts from GitHub split into features, bugs, and enhancements.
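
For the curious, per-label open-issue counts can be pulled from the GitHub search API along these lines; this is my guess at the approach, and the label names used for bucketing are assumptions:

```python
# Count open issues per label via the GitHub search API (total_count field).
import requests

def open_issue_count(repo: str, label: str, token: str = "") -> int:
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"   # unauthenticated search is heavily rate-limited
    query = f"repo:{repo} is:issue is:open label:{label}"
    resp = requests.get("https://api.github.com/search/issues",
                        params={"q": query, "per_page": 1}, headers=headers)
    resp.raise_for_status()
    return resp.json()["total_count"]

for label in ("bug", "enhancement", "feature"):        # assumed bucket labels
    print(label, open_issue_count("huggingface/transformers", label))
```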

The pattern that emerges when you put both axes together:

  • Megaprojects (Linux, React, transformers) are popular but have tight issue queues. Hard to break in.
  • Stagnant repos have lots of open issues but no momentum. Your PR sits forever.
  • Mid-size rising repos with healthy issue counts are the actual contributor sweet spot. Visible work, responsive maintainers, real entry points.

This tool makes that third category easy to find.

A few examples from today's data:

  • openclaw: AI assistant repo, +572 stars in 24h, 913 open enhancements
  • everything-claude-code: agent harness, +1.1k stars in 24h, 145 open enhancements
  • ollama: +75 stars, 28 open issues, very active maintainer team

Project link is in the comments below 👇

Built by Neo AI Engineer. Posting here because the contributor-flow angle felt like a fit for this subreddit.

u/gvij — 8 days ago