
WOZCODE just showed up on Terminal-Bench 2.0 on Hugging Face
Our newest Terminal-Bench 2.0 submission, powered by Claude Opus 4.7, reached 80.2% accuracy across 89 tasks with 5 attempts per task (445 total trials), and the run has passed validation.
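As a quick sanity check on the numbers above, the sketch below works through the trial arithmetic. It assumes the reported accuracy is the simple mean over all trials (the submission does not spell out the aggregation method, so that is an assumption), and rounds to the nearest whole trial:

```python
# Sanity-check the reported Terminal-Bench 2.0 numbers.
# Assumption: 80.2% is the mean success rate over all individual trials.
tasks = 89
attempts_per_task = 5
reported_accuracy = 0.802

total_trials = tasks * attempts_per_task  # 89 * 5 = 445

# Implied number of successful trials, rounded to the nearest integer.
implied_passes = round(reported_accuracy * total_trials)

print(total_trials)    # 445
print(implied_passes)  # ~357 successful trials under this assumption
```

Under that assumption, 80.2% of 445 trials corresponds to roughly 357 successful runs.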
We view this result as meaningful for three reasons:
- Evaluation depth: the score reflects repeated performance across a broad task set, not a single-pass run.
- Execution realism: Terminal-Bench tests agents in terminal-based workflows where success depends on tool use, state management, multi-step reasoning, and reliable completion under realistic constraints.
- Validation rigor: passing validation matters because reproducibility and benchmark integrity are critical when evaluating agent systems.
As the space matures, we believe the most important progress will come from systems that are not only capable but also consistent and dependable in real operating environments. This result is a strong step in that direction for WOZCODE.
Submission details:
https://huggingface.co/datasets/harborframework/terminal-bench-2-leaderboard/discussions/148