We built an LLM-based evolutionary system that can redesign the RL task itself, not just the reward (Accepted at RLC 2026)
Quick share of a paper we got into RLC 2026.
The Eureka-style line of work uses LLMs to write reward functions. It assumes the observation space is already good. We tested that assumption, and it doesn't hold: on harder gridworld tasks, even a perfectly shaped LLM-written reward gets ~7% success because the policy can't see the right features. On continuous control, the opposite happens: the raw state is fine, but sparse reward kills learning.
So we built LIMEN, which jointly evolves observations and rewards as executable Python programs. The LLM mutates candidates, PPO scores them, and a MAP-Elites archive keeps diversity. 30 iterations per run.
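For anyone who wants the shape of the loop without reading the paper: here's a minimal sketch of the MAP-Elites-over-programs structure. The helper names (`mutate_with_llm`, the descriptor, the evaluation stub) are hypothetical and not from our released code; only the archive logic is the real MAP-Elites rule.

```python
# Minimal sketch of a LIMEN-style loop (hypothetical names, not the paper's code).
# An individual is a pair of Python source strings: an observation program and a
# reward program. The LLM proposes mutations, a short PPO run provides fitness,
# and a MAP-Elites archive keeps behaviorally diverse elites.

import random

archive = {}  # behavior descriptor (cell) -> (individual, fitness)

def descriptor(individual):
    # HYPOTHETICAL descriptor: bucket individuals by rough program size.
    obs_code, rew_code = individual
    return (len(obs_code) // 500, len(rew_code) // 500)

def evaluate(individual):
    # Compile the two programs and train PPO briefly; return success rate.
    # Stubbed here -- in the paper this is a full (short) PPO run on the task.
    raise NotImplementedError

def mutate_with_llm(individual, feedback):
    # HYPOTHETICAL: prompt the LLM with the current programs plus training
    # feedback (returns, success rate) and parse edited programs back out.
    raise NotImplementedError

def evolve(seed_individual, iterations=30):
    fitness = evaluate(seed_individual)
    archive[descriptor(seed_individual)] = (seed_individual, fitness)
    for _ in range(iterations):
        parent, parent_fit = random.choice(list(archive.values()))
        child = mutate_with_llm(parent, feedback=parent_fit)
        child_fit = evaluate(child)
        cell = descriptor(child)
        # MAP-Elites rule: keep the child only if its cell is empty
        # or it beats the current elite in that cell.
        if cell not in archive or child_fit > archive[cell][1]:
            archive[cell] = (child, child_fit)
    return max(archive.values(), key=lambda pair: pair[1])
```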
Result: joint evolution is the only setup that doesn't catastrophically fail on at least one of our 5 tasks. Reward-only and observation-only each have a domain they completely break on.
A couple of things we found interesting:
- The LLM rediscovers classic RL tricks unprompted: potential-based shaping (sketch below the list), directional indicators, multi-scale Gaussians, milestone bonuses.
- Without the feedback loop, just sampling 30 candidates from the same prompt gets nowhere. The evolutionary loop is doing real work, not just the LLM's prior.
- Runs on a single L4. $3–11 of API calls per task.
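For anyone unfamiliar with the first trick on that list: potential-based shaping (Ng et al., 1999) adds F(s, s') = γΦ(s') − Φ(s) to the reward, and because the term telescopes along trajectories it provably leaves optimal policies unchanged. A gridworld version of the kind of thing the LLM converged on might look like this (illustrative only, not our evolved code):

```python
# Potential-based reward shaping on a gridworld (illustrative, not evolved code).
# Shaped reward: r'(s, a, s') = r(s, a, s') + gamma * phi(s') - phi(s).
# The shaping term telescopes along trajectories, so optimal policies are
# preserved (Ng et al., 1999).

GAMMA = 0.99

def phi(state, goal):
    # Potential: negative Manhattan distance to the goal, so the shaping
    # bonus is positive whenever the agent moves closer.
    (x, y), (gx, gy) = state, goal
    return -(abs(x - gx) + abs(y - gy))

def shaped_reward(reward, state, next_state, goal):
    return reward + GAMMA * phi(next_state, goal) - phi(state, goal)

# Example: stepping from (0, 0) toward goal (3, 0) earns a positive bonus.
print(shaped_reward(0.0, (0, 0), (1, 0), (3, 0)))  # 0.99 * (-2) - (-3) = 1.02
```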
Paper: https://arxiv.org/abs/2605.03408
Website: https://akshat-sj.github.io/limen/