
Something interesting dropped this week in the agentic AI space. Kevin Gu from Third Layer Team open-sourced 'AutoAgent' — an open-source library for autonomously improving an agent harness on any domain. The idea is straightforward: instead of manually iterating on system prompts and tool definitions, a meta-agent does the iteration for you overnight.
It modifies agent.py — the single file containing the system prompt, tool definitions, and orchestration logic — runs the benchmark, checks the score, keeps the change if it helped, reverts if it didn't, and repeats.
The human's only job is writing program.md, a plain Markdown file that tells the meta-agent what kind of agent to build.
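The keep-or-revert loop described above is essentially greedy hill climbing. Here's a minimal toy sketch of that pattern — the `evaluate` and `propose` functions below are hypothetical stand-ins (a real run would rewrite agent.py with an LLM and score it against a benchmark), not AutoAgent's actual code:

```python
import random

def evaluate(params: float) -> float:
    """Toy benchmark: returns a score in [0, 1], highest when params is near 0.7.
    Stand-in for 'run the benchmark on the current agent.py'."""
    return max(0.0, 1.0 - abs(params - 0.7))

def propose(params: float, rng: random.Random) -> float:
    """Toy meta-agent edit: a small random perturbation.
    Stand-in for 'LLM rewrites the agent file'."""
    return params + rng.uniform(-0.2, 0.2)

def hill_climb(params: float, steps: int = 200, seed: int = 0):
    """Propose a change, keep it if the score improved, revert otherwise."""
    rng = random.Random(seed)
    best_score = evaluate(params)
    for _ in range(steps):
        candidate = propose(params, rng)
        score = evaluate(candidate)
        if score > best_score:
            params, best_score = candidate, score  # keep the change
        # else: discard the candidate, i.e. revert to the last good version
    return params, best_score
```

The design point is that the loop never needs to understand *why* an edit helped — the benchmark score is the only feedback signal, which is what makes the approach domain-agnostic.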
In a 24-hour run, it reached #1 on SpreadsheetBench (96.5%) and the top GPT-5 score on TerminalBench (55.1%). Every other entry on those leaderboards was hand-tuned by humans.
A few things worth noting for devs thinking about this:
-- On the architecture: Tasks follow Harbor's open format and run inside Docker containers, so the approach is domain-agnostic. Any task you can express as a numeric score (0.0–1.0) becomes something the meta-agent can optimize against.
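The post doesn't spell out Harbor's schema, but the key contract is just "task in, float out." A hypothetical verifier illustrating what a 0.0–1.0 score function might look like (names and logic are mine, not Harbor's):

```python
def score_task(expected: list[str], produced: list[str]) -> float:
    """Hypothetical verifier: fraction of expected outputs the agent matched.
    Returns a score in [0.0, 1.0], the shape the meta-agent optimizes against."""
    if not expected:
        return 1.0  # vacuously perfect: nothing was required
    hits = sum(1 for e, p in zip(expected, produced) if e == p)
    return hits / len(expected)

# Partial credit: 2 of 3 lines match.
partial = score_task(["a", "b", "c"], ["a", "x", "c"])
```

Anything you can reduce to this shape — unit tests passed, cells correct, commands succeeded — becomes an optimizable target.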
-- On model pairing: Community discussion around the project has surfaced an interesting observation — when a Claude meta-agent optimized a Claude task agent, it seemed to diagnose failure modes more accurately than when optimizing a GPT-based agent. The researchers called it "model empathy." It's an early empirical observation, not a formal result, but worth keeping in mind when choosing your meta-agent.
-- On what this changes practically: The shift isn't dramatic in terms of tooling — you still write prompts, define tasks, and review outputs. What changes is the iteration loop. Rather than running that loop manually, you delegate it.
The repo is MIT-licensed. Requirements are Docker, Python 3.10+, and uv.
