
Edits to call out some information:
- All local models use `Q4_K_M` quantization with the `llama.cpp` engine (see the sketch after this list)
- The main factor contributing to the difference from Qwen's official post (59% vs 38%) is probably the benchmark task timeout, followed by quantization, harness, inference engine, etc.
- We expect this can be improved a lot with some prompt/harness/llama.cpp tuning
- Updated the diagram
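To make the setup concrete, here is a minimal sketch of querying a locally served Q4_K_M model through llama.cpp's OpenAI-compatible endpoint. The model file, port, and prompt are placeholders, not our actual harness configuration.

```python
import requests

# Assumes a llama.cpp server is already running with a Q4_K_M GGUF, e.g.:
#   llama-server -m some-coder-Q4_K_M.gguf -c 16384 --port 8080
# (model file, context size, and port are illustrative placeholders)
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "user",
             "content": "Write a bash one-liner that counts files in the current directory."},
        ],
        "temperature": 0.2,
        "max_tokens": 512,
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```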
We ran open-weight 27B–32B models on Terminal-Bench 2.0 (89 tasks, terminal-bench-2.git @ 69671fb) through our agent harness. The best result was Qwen 3.6-27B at 38.2% (34/89) under the default per-task timeout, the same constraint the public leaderboard uses (Qwen's official post uses a more relaxed config). We deliberately kept the TB official leaderboard's default setup because we wanted an apples-to-apples number against the verified leaderboard.
We also ran a separate token-speed experiment on consumer hardware. MoE models still deliver an order of magnitude (15x) higher throughput than dense models of similar size.
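For the token-speed comparison, the rough recipe is to time a fixed generation against each locally served model and divide the generated token count by wall-clock time. Below is a sketch under the assumption that both models are exposed via llama.cpp's OpenAI-compatible server; the ports and model names are illustrative, not our actual test matrix.

```python
import time
import requests

def tokens_per_second(endpoint: str, prompt: str, max_tokens: int = 256) -> float:
    """Rough decode-throughput estimate: generated tokens / wall-clock time.

    Wall-clock time includes prompt processing, so this slightly understates
    pure decode speed; good enough for an order-of-magnitude comparison.
    """
    start = time.monotonic()
    resp = requests.post(
        f"{endpoint}/v1/chat/completions",
        json={
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
            "temperature": 0.0,
        },
        timeout=600,
    )
    resp.raise_for_status()
    elapsed = time.monotonic() - start
    generated = resp.json()["usage"]["completion_tokens"]
    return generated / elapsed

# Compare a dense and an MoE model served on different ports (placeholder names).
for name, url in [("dense-32b", "http://localhost:8080"),
                  ("moe-a3b", "http://localhost:8081")]:
    rate = tokens_per_second(url, "Summarize what Q4_K_M quantization does.")
    print(f"{name}: {rate:.1f} tok/s")
```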
The interesting part isn't 38.2% in absolute terms — current verified SOTA is ~80% (GPT-5.5 / Opus 4.6 / Gemini 3.1 Pro). The interesting part is what 38.2% maps to in time.
Anchoring on model release dates of verified leaderboard entries:
- Terminus 2 + Claude Opus 4.1 (released Aug 2025): 38.0%
- Terminus 2 + GPT-5.1-Codex (Nov 2025): 36.9%
- Claude Code + Sonnet 4.5 (Sep 2025): 40.1%
- Codex CLI + GPT-5-Codex (Sep 2025): 44.3%
So today's best runnable-offline coding model lands roughly where the hosted frontier was in late 2025 — about a 6–8 month lag. That's the first time this has been close enough to matter for real deployments (regulated environments, air-gapped, on-prem CI, batch workloads).
More details on our blog: https://antigma.ai/blog/2026/04/24/offline-coding-models