r/machinelearningnews

Something interesting dropped this week in the agentic AI space: Kevin Gu from the Third Layer Team has open-sourced 'AutoAgent', a library for autonomously improving an agent harness on any domain.

The idea is straightforward: instead of manually iterating on system prompts and tool definitions, a meta-agent does the iteration for you overnight.

It modifies agent.py — the single file containing the system prompt, tool definitions, and orchestration logic — runs the benchmark, checks the score, keeps the change if it helped, reverts it if it didn't, and repeats.
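That keep-or-revert loop is essentially greedy hill climbing over a single file. A minimal sketch of the idea, where `run_benchmark` and `propose_edit` are illustrative stand-ins rather than AutoAgent's actual API:

```python
import shutil

def optimize(agent_path, run_benchmark, propose_edit, iterations=10):
    """Greedy hill climbing on one file: apply an edit, re-score,
    keep the edit only if the benchmark score improved."""
    best_score = run_benchmark(agent_path)
    for _ in range(iterations):
        shutil.copy(agent_path, agent_path + ".bak")      # snapshot before editing
        propose_edit(agent_path)                          # meta-agent rewrites the file
        score = run_benchmark(agent_path)
        if score > best_score:
            best_score = score                            # change helped: keep it
        else:
            shutil.move(agent_path + ".bak", agent_path)  # change hurt: revert
    return best_score
```

The snapshot-then-revert step is what makes the loop safe to run unattended: a bad edit can never make the score worse than the best version found so far.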

The human's only job is writing program.md, a plain Markdown file that tells the meta-agent what kind of agent to build.

In a 24-hour run, it reached #1 on SpreadsheetBench (96.5%) and the top GPT-5 score on TerminalBench (55.1%). Every other entry on those leaderboards was hand-tuned by humans.

A few things worth noting for devs thinking about this:

-- On the architecture: Tasks follow Harbor's open format and run inside Docker containers, so the approach is domain-agnostic. Any task you can express as a numeric score (0.0–1.0) becomes something the meta-agent can optimize against.

-- On model pairing: Community discussion around the project has surfaced an interesting observation — when a Claude meta-agent optimized a Claude task agent, it seemed to diagnose failure modes more accurately than when optimizing a GPT-based agent. The researchers called it "model empathy." It's an early empirical observation, not a formal result, but worth keeping in mind when choosing your meta-agent.

-- On what this changes practically: The tooling shift isn't dramatic: you still write prompts, define tasks, and review outputs. What changes is the iteration loop. Rather than running that loop manually, you delegate it.
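The "numeric score" requirement from the architecture bullet is the whole contract: all the meta-agent needs from a task is a function that maps an agent's output to a float in [0.0, 1.0]. A hypothetical grader sketch (this is an illustration, not Harbor's actual schema):

```python
def grade_cells(expected: list, produced: list) -> float:
    """Fraction of expected spreadsheet cells the agent reproduced,
    as a graded score in [0.0, 1.0]. Missing cells count as wrong."""
    if not expected:
        return 1.0
    correct = sum(1 for e, p in zip(expected, produced) if e == p)
    return correct / len(expected)
```

A graded score like this gives the optimizer a gradient to climb; a binary pass/fail score works too, but makes partial progress invisible.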

The repo is MIT-licensed. Requirements are Docker, Python 3.10+, and uv.

Full analysis: https://www.marktechpost.com/2026/04/05/meet-autoagent-the-open-source-library-that-lets-an-ai-engineer-and-optimize-its-own-agent-harness-overnight/

Repo: https://github.com/kevinrgu/autoagent/tree/main

u/ai-lover — 2 hours ago

Simulating Thought vs. Sustaining It

If large language models generate text by selecting tokens from probability distributions, then what appears as reasoning is, at its core, a sequence of statistically guided steps rather than a process of internally constructing arguments in the way we intuitively understand thinking. Each token follows from the previous ones, conditioned by learned patterns, not by an evolving internal commitment to a line of thought. What we perceive as structure—arguments, chains, logic—is therefore not necessarily something being built in real time, but something being expressed because similar structures existed in the training data.

This distinction becomes clearer when looking at how these systems operate during generation. There is no autonomous goal formation, no persistent internal state that carries over beyond the current interaction, and no self-modification during inference. The model does not decide to pursue a line of reasoning and then update itself as it progresses. Instead, it produces a trajectory through a space of possible continuations, one token at a time. The coherence we observe is real, but it is local and conditional, not the result of a stable internal process unfolding over time.

This is also why common interventions—better prompting, assigning roles, or adding more context—eventually reach their limits. These techniques can shape the distribution from which tokens are selected, making outputs more consistent, more aligned, or more constrained. But they do not alter the underlying mechanism. They do not introduce persistence, they do not create durable commitments, and they do not enable the system to carry a structured state forward across interactions. They operate entirely on the surface level, refining what is produced without changing how production fundamentally works.

If something like thinking is to be taken seriously in a non-metaphorical sense, then additional properties would be required. There would need to be a form of persistent state—representations that endure beyond a single generation pass. There would need to be update dynamics, meaning the system can modify that state based on outcomes, not just produce outputs but change its own future behavior in a causally meaningful way. And there would need to be constraint binding, where commitments—plans, goals, invariants—actually restrict what can happen next, rather than merely being described in text.

None of these properties exist within the standard token generation process itself. Where they begin to appear is not inside the model’s forward pass, but in the surrounding architecture: external memory systems, tool use, iterative loops that plan, execute, and revise, or slower processes like fine-tuning that adjust parameters over time. In such configurations, traces of persistence and state evolution can emerge, but they are distributed across the system rather than located within the act of token selection itself.
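To make that distinction concrete, here is a minimal sketch of where such persistence could live: not inside the model call, which stays stateless, but in an external memory the loop reads before and writes after each generation. `generate` is a stand-in for any stateless LLM call, and the schema is invented for illustration:

```python
import json
import os

def run_with_memory(task, generate, memory_path="memory.json"):
    """Wrap a stateless model call with externally persisted state.
    The persistence and update dynamics live in this loop, not in
    the model's forward pass."""
    # Persistent state: representations that endure across sessions.
    if os.path.exists(memory_path):
        with open(memory_path) as f:
            state = json.load(f)
    else:
        state = {"lessons": []}

    # The model call itself remains stateless; it is merely
    # conditioned on the accumulated state.
    output = generate(task, state["lessons"])

    # Update dynamics: fold the outcome back into durable state,
    # changing future behavior across interactions.
    state["lessons"].append({"task": task, "outcome": output})
    with open(memory_path, "w") as f:
        json.dump(state, f)
    return output
```

Note that the "thinking-like" properties here belong to the harness, not the model: delete the memory file and the system reverts to pure single-pass generation.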

This leads directly to the central question: can a system that does not maintain or update internal state across sessions meaningfully be said to think? Within a single interaction, it can produce outputs that resemble coherent reasoning. But across interactions, without persistence, there is no accumulation, no stabilization, no continuity of an internal process. What exists is a highly refined simulation of the form of thinking, not the maintenance of a thinking process itself.

From this perspective, the issue is not one of control—writing better prompts, defining clearer roles, or providing richer context. Those approaches remain confined to shaping outputs. The deeper question is about mechanism: where state resides, whether it can persist, and whether it can be transformed over time under constraints.

In that sense, what is often interpreted as thinking is better understood as the production of structured outputs without structurally bound internal states. The system does not fail at thinking; it was never designed to sustain a thinking process in the first place.
