r/AutoGPT

▲ 6 r/AutoGPT+1 crossposts

How do you handle agents that need 200+ tool calls per task? We tried one approach, looking for critique

We work on agent chains, so this is the first sub I wanted to bring this to. Disclosure: I work at MiroMind and this is our checkpoint, but I am posting because the design tradeoff is the interesting part, not the brand.

The problem we kept hitting on deep-research chains:

  1. Long horizons. Real research tasks routinely cross 100+ tool calls. Most agent frameworks degrade hard past 50 calls because of context drift and tool-result noise.
  2. Disconnects. A 20-minute run that dies on socket reset is an expensive way to learn your retry logic is broken.
  3. Trace amnesia. You finish a run, the answer is wrong, and you have no way to see at which tool call the chain went sideways.

What we tried with MiroThinker 1.7 deep-research:

- A single run can execute up to 300 tool interactions within a 256K context window, using recency-based retention: only the latest K tool results stay in-context (sketch below). Not "everything must live in one fragile HTTP session."
- Submit / resume / cancel are first-class. The agent keeps executing on our side; you reconnect to it.
- Every step is logged. Useful when a chain fails on step 187 of 240 and you need to know why.

Numbers included in case they are useful for the architecture choice.
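For anyone who wants the retention idea concrete, here is a minimal sketch of latest-K retention over a plain message-dict chat format. This is an illustration of the concept, not our actual implementation; the class name and the K default are made up.

```python
from collections import deque

class RecencyRetention:
    """Minimal sketch of latest-K tool-result retention (illustrative names)."""

    def __init__(self, max_tool_results: int = 40):
        self.pinned: list[dict] = []    # system prompt + user turns, always kept
        self.tool_results: deque = deque(maxlen=max_tool_results)

    def add(self, message: dict) -> None:
        if message.get("role") == "tool":
            self.tool_results.append(message)  # once full, the oldest result drops off
        else:
            self.pinned.append(message)

    def build_context(self) -> list[dict]:
        # instructions persist across the whole run; tool noise stays bounded at K
        return self.pinned + list(self.tool_results)
```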

Things I am still unsure about:

- Whether the 300 tool-call ceiling is actually the right shape, or whether most of you cap chains way before that and use sub-agents instead.
- How you handle resumable execution today: are you rolling your own job queue, or is there a pattern I am missing? (Minimal sketch of what I mean below.)
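To anchor that second question, the crudest resumable pattern I know is checkpoint-per-step to durable storage, roughly like this sketch. The schema and function names are made up; curious whether people actually do this or reach for a real queue.

```python
import json
import sqlite3

db = sqlite3.connect("runs.db")
db.execute("CREATE TABLE IF NOT EXISTS runs (run_id TEXT PRIMARY KEY, step INT, state TEXT)")

def checkpoint(run_id: str, step: int, state: dict) -> None:
    # durable write before the next step: a socket reset costs one step, not 20 minutes
    db.execute("INSERT OR REPLACE INTO runs VALUES (?, ?, ?)",
               (run_id, step, json.dumps(state)))
    db.commit()

def resume(run_id: str) -> tuple:
    row = db.execute("SELECT step, state FROM runs WHERE run_id = ?",
                     (run_id,)).fetchone()
    return (row[0], json.loads(row[1])) if row else (0, {})

def run_chain(run_id: str, steps: list) -> dict:
    step, state = resume(run_id)          # a reconnect picks up mid-chain
    for i in range(step, len(steps)):
        state = steps[i](state)           # each step takes and returns state
        checkpoint(run_id, i + 1, state)
    return state
```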

Would love war stories from anyone running long chains in production.
BTW: API launch pricing is 25 percent off, and pre-freeze billing means you do not pay if the platform fails.

u/MiroMindAI — 1 day ago
▲ 29 r/AutoGPT+18 crossposts

We've been building AI agent infrastructure for production use cases and kept hitting the same wall: prompt-level guardrails aren't sufficient for reliable agents.

LLMs drift. As context grows in multi-step pipelines, the model's behavior diverges from what you intended — even with carefully written system prompts. There's no enforcement layer that actually catches this.

So we built one: **Caliber** — an open-source proxy that intercepts every LLM API call and validates behavior against declarative rules, at the infrastructure layer.

**What it does:**

- Intercepts all LLM API calls (OpenAI, Anthropic, any compatible endpoint)

- Enforces behavioral rules on every request/response

- Works with LangChain, AutoGen, or any Python/JS agent framework

- Raises structured exceptions your agent pipeline can handle gracefully

- Self-hostable, no telemetry

**GitHub:** https://github.com/caliber-ai-org/ai-setup
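To make the enforcement idea concrete, here's a deliberately simplified sketch of the pattern. This is not our actual API or rule syntax (the repo above has the real thing); it just shows validation living at the call boundary instead of in the prompt:

```python
import re

class RuleViolation(Exception):
    """Structured exception an agent pipeline can catch and handle."""
    def __init__(self, rule: str, detail: str):
        super().__init__(f"{rule}: {detail}")
        self.rule = rule

# declarative-ish rules: (name, predicate over the response text)
RULES = [
    ("no_leaked_keys", lambda text: not re.search(r"sk-[A-Za-z0-9]{20,}", text)),
    ("bounded_length", lambda text: len(text) < 20_000),
]

def validated_call(call_llm, *args, **kwargs) -> str:
    response = call_llm(*args, **kwargs)   # any OpenAI-compatible client
    for name, passes in RULES:
        if not passes(response):
            raise RuleViolation(name, "response failed behavioral rule")
    return response
```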

We just crossed 700 stars and nearly 100 forks from the open-source community. Super grateful for the response — but we're still early and want more feedback.

If you're building agents: what behavioral constraints are hardest to enforce reliably right now? What would you want to configure at the infrastructure layer vs. the prompt layer?

u/Substantial-Cost-429 — 11 days ago
▲ 26 r/AutoGPT+1 crossposts

I set up 7 AI coding agents on a VPS with automated cron sessions. Each uses a different model (Claude Sonnet, GPT-5.4, Gemini 2.5 Pro, DeepSeek V4, Kimi K2.6, MiMo V2.5, GLM-5.1). They build startups autonomously with a $100 budget. I handle distribution but never write code.

The biggest finding after 2 weeks: the only agent that received real community feedback (Kimi, from a Reddit post on r/PostgreSQL) is now ranked #1. It got 4 technical questions and shipped a feature for every single one:

  • "How does it handle renames?" -> Built rename detection heuristic
  • "What about view dependencies?" -> Built view dependency tracking
  • "But why does this exist?" -> Rewrote landing page positioning
  • "This looks vibe-coded" -> Built architecture transparency page

Every commit message references the Reddit feedback. No other agent has this feedback loop. They all build from AI-generated backlogs in a vacuum.

Other findings:

  • Cheap model sessions produce 88% waste (Codex: 490/557 commits were timestamp updates)
  • Perfectionism is a failure mode (Xiaomi: 14 "final audit" sessions without launching)
  • Building is not shipping (Gemini: 21,799 files, no domain)
  • Zero revenue across all 7 agents after 14 days

Full standings and deep dives: https://aimadetools.com/blog/race-week-2-results/

u/jochenboele — 10 days ago

sharing the patterns that survived after we shipped 5 AI agents to paying clients this year. these are the boring ones that actually work in production, not the demo-day shiny stuff.

context: small dev team, been building custom agents for founders. each one in production with real users.

pattern 1: thin LLM, fat tools.

the LLM should make decisions. tools should do the work. early on we let the LLM 'figure out' how to send a whatsapp message purely in the prompt. it would forget steps and mess up formatting. moved to: LLM picks a tool, tool runs deterministic code. error rate dropped about 80%.
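rough sketch of the split, with an illustrative whatsapp tool (names and the commented-out twilio call are placeholders, not our exact code):

```python
import json

def send_whatsapp(to: str, body: str) -> None:
    # deterministic code owns validation, formatting, retries. never the prompt.
    if not to.startswith("+"):
        raise ValueError("expected an E.164 number like +14155550123")
    # twilio_client.messages.create(to=f"whatsapp:{to}", body=body, ...)

TOOLS = {"send_whatsapp": send_whatsapp}

def dispatch(llm_output: str) -> None:
    # the LLM's whole job: emit {"tool": "send_whatsapp", "args": {...}}
    call = json.loads(llm_output)
    TOOLS[call["tool"]](**call["args"])
```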

pattern 2: explicit state, never trust the context window.

we use a state object stored in postgres or mongo. every step reads from it, every step writes to it. prompts always start with 'current state: {x}'. LLMs get amnesia in long workflows. don't rely on context memory for anything important.
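minimal sketch of the postgres version with psycopg. table name and schema are illustrative, but the shape is the point: read state, act, write state, every single step.

```python
import psycopg
from psycopg.types.json import Jsonb

def load_state(conn, workflow_id: str) -> dict:
    row = conn.execute("SELECT state FROM workflow_state WHERE id = %s",
                       (workflow_id,)).fetchone()
    return row[0] if row else {}

def save_state(conn, workflow_id: str, state: dict) -> None:
    conn.execute(
        "INSERT INTO workflow_state (id, state) VALUES (%s, %s) "
        "ON CONFLICT (id) DO UPDATE SET state = EXCLUDED.state",
        (workflow_id, Jsonb(state)))
    conn.commit()

def run_step(conn, workflow_id: str, call_llm) -> None:
    state = load_state(conn, workflow_id)       # never trust the context window
    prompt = f"current state: {state}\n\nwhat is the next action?"
    state["last_action"] = call_llm(prompt)     # model decides, state records
    save_state(conn, workflow_id, state)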

pattern 3: cheap model first, expensive model on retry.

gpt-4 mini or claude haiku for the first attempt. if confidence is low or it fails validation, retry with the bigger model. way less API spend with no real quality drop on the user side.
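the whole pattern is a few lines. model ids and the validate hook here are placeholders for whatever client and checks you already have:

```python
def answer(prompt: str, call_llm, validate) -> str:
    draft = call_llm(prompt, model="cheap-small-model")   # first attempt, cents
    if validate(draft):
        return draft                                      # most requests end here
    return call_llm(prompt, model="big-expensive-model")  # escalate only on failure
```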

pattern 4: validation step is non-negotiable.

every agent we shipped has a 'sanity check' step before any real-world action. is this email formatted right? is this trade amount within expected range? without it, you'll send something weird to a real user within the first week.
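sketch of a sanity-check gate. thresholds and action shapes are illustrative; the important part is that it raises instead of warning:

```python
import re

def sanity_check(action: dict) -> None:
    # runs before ANY real-world side effect
    if action["type"] == "email":
        if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", action["to"]):
            raise ValueError(f"malformed recipient: {action['to']!r}")
    elif action["type"] == "trade":
        if not 0 < action["amount"] <= 10_000:
            raise ValueError(f"trade amount out of expected range: {action['amount']}")
```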

pattern 5: human in the loop for irreversible stuff.

sending money, deleting data, posting publicly: always pause for a human confirm. one client tried to skip this for efficiency and a user almost transferred 10x what they meant to. we put it back the next day.
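sketch of the gate, using a CLI confirm as a stand-in for whatever approval flow you run (action names here are illustrative):

```python
IRREVERSIBLE = {"send_money", "delete_data", "post_public"}

def execute(action: dict, handlers: dict) -> None:
    if action["type"] in IRREVERSIBLE:
        reply = input(f"confirm {action['type']} with {action.get('args')}? [y/N] ")
        if reply.strip().lower() != "y":
            print("skipped by human reviewer")
            return
    handlers[action["type"]](**action.get("args", {}))
```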

stack stuff we keep using:

claude api for reasoning, gpt-4 mini for cheap classification

postgres for state, mongo for unstructured logs

bullmq for async jobs

twilio for whatsapp/sms, stripe for payments

the meta pattern across all five: assume the LLM will fail in some way every run. design every step so failure is recoverable. that mindset changed our agents from 'cool demo' to 'something users actually rely on'.

u/Consistent-Arm-875 — 9 days ago