u/syedshad

EvoSkill: Automatic Self-Improvement Tool for AI Agents [open source]
r/claudeskills


When working with agents, we spend a lot of time hand-tuning prompts and skills, so we built EvoSkill to automate that loop for agents like Claude Code!

Our EvoSkill loop, per iteration:

  • Runs the agent on a benchmark, collects failure traces
  • Proposes skill or prompt mutations aimed at specific failure modes
  • Scores mutations on held-out data, maintains a frontier of top-N programs
  • Tracks everything as git branches for reproducibility

Each "program" is a (system prompt, skill set) pair, and the algorithm runs for a configurable number of iterations.
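The loop above can be sketched roughly as follows. This is a minimal illustration, not EvoSkill's actual API: the names `run_benchmark`, `propose_mutations`, and `score` are hypothetical stand-ins for the evaluate / mutate / select stages.

```python
import heapq

# Hypothetical sketch of one EvoSkill-style iteration. A "program" here is
# whatever representation you use for a (system prompt, skill set) pair.

def evolve(frontier, run_benchmark, propose_mutations, score, top_n=4):
    """One iteration: evaluate, mutate around failures, keep the top-N programs."""
    candidates = list(frontier)
    for program in frontier:
        failures = run_benchmark(program)        # collect failure traces
        for mutant in propose_mutations(program, failures):
            candidates.append(mutant)            # targeted prompt/skill mutations
    # Score all candidates on held-out data and retain the best N as the
    # new frontier (git branching for reproducibility is omitted here).
    return heapq.nlargest(top_n, candidates, key=score)
```

Running this for a configurable number of iterations, with each candidate checkpointed as a git branch, gives the reproducible evolution loop described above.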

Results so far, with Claude Code and Opus 4.5:

  • OfficeQA: 60.6% → 68.1%
  • SealQA: 26.6% → 38.7%
  • BrowseComp: 43.5% → 48.8%, using a skill evolved on SealQA and transferred zero-shot

The transfer result is the one that surprised us — it suggests at least some of the evolved skills capture general strategies rather than benchmark-specific tricks. Caveat: it's a single benchmark pair, and both are browsing-heavy reasoning tasks, so transfer between them makes sense.

Honest limitations:

  1. You need a good benchmark and a reasonable scoring function — if those are weak, the loop has no reliable signal for selecting genuine improvements over noise.
  2. Evolution burns lots of API tokens, so the cost/benefit depends on how much you'll reuse the resulting skills.
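To make limitation 1 concrete, here's the kind of held-out scoring function the loop depends on. This is a hypothetical example (exact-match accuracy over QA pairs), not EvoSkill's built-in scorer — if your real metric is noisier than this, mutation selection degrades accordingly.

```python
# Hypothetical held-out scorer: case-insensitive exact-match accuracy.
# Real benchmarks usually need fuzzier matching or LLM-based grading.
def score(program_answers: list[str], gold_answers: list[str]) -> float:
    correct = sum(
        a.strip().lower() == g.strip().lower()
        for a, g in zip(program_answers, gold_answers)
    )
    return correct / len(gold_answers)
```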

EvoSkill works best with Claude Code, and has also been tested with the OpenCode SDK, OpenHands, Goose, and Codex CLI.

This is the first release from our “AI evolution” lab, so please give it a try and tell us what you think, especially if you’ve used tools like DSPy / GEPA!

P.S. vLLM / Ollama support coming soon!
