u/sn1pr0s

What did your agents ship this week? - Weekly thread for sharing wins, fails, and learnings.

Share what your AI agents accomplished (or failed at) this past week.

Format to follow (or don't, up to you):

  • Agent: (Claude Code / Cursor / Codex / Aider / other)
  • Task: What you asked it to do
  • Result: What actually happened - PR link, screenshot, or description
  • Time: How long it took
  • Cost: If you know it
  • Verdict: Would you trust the output? Did you merge it?

Wins, fails, and "it wrote 500 lines of code that compiled but did the wrong thing" stories are all welcome. The ugly runs are often more useful than the perfect ones.

reddit.com
u/sn1pr0s — 3 days ago

What did your agents ship this week? - Weekly thread for sharing wins, fails, and learnings.

Share what your AI agents accomplished (or failed at) this past week.

reddit.com
u/sn1pr0s — 10 days ago

I put together agent-benchmarks.com as an accessible reference for the current state of AI agent evaluation.

The site covers the evaluation pipeline, three failure modes (contamination, saturation, gaming), and includes a landscape page mapping 70 benchmarks across coding, math, research, science, and other domains.

This was motivated by the SWE-smith findings on reward hacking in coding benchmarks and the Berkeley RDI team's recent systematic audit showing exploitability across benchmarks.

I'd appreciate feedback!

agent-benchmarks.com
u/sn1pr0s — 23 days ago

Share what your AI agents accomplished (or failed at) this past week.

reddit.com
u/sn1pr0s — 24 days ago

Share what your AI agents accomplished (or failed at) this past week.

reddit.com
u/sn1pr0s — 1 month ago

After running autonomous agents for some time now, I've learned a few lessons. One of them is that agents don't ask questions in Slack. They try to figure things out on their own, and they either burn $$$ on every attempt or fail.

This matters most for environment setup: each agent, for each task, will try to set up your environment from scratch. That costs you time, but mostly tokens.

What are your takeaways?

x.com
u/sn1pr0s — 1 month ago

Share what your AI agents accomplished (or failed at) this past week.

reddit.com
u/sn1pr0s — 1 month ago