
Stop trying to prompt-engineer your way out of architecture problems. You need a "Harness."
TL;DR: If your AI agent works perfectly in isolation but falls apart in production, your prompts aren't the issue. You are missing a deterministic system architecture—a "harness"—around the LLM. Stop letting the AI decide its own retry logic.
Here's a pattern I keep seeing with "vibe coded" projects that go sideways.
The AI writes clean code. The individual features work. But at some point, the whole thing starts misbehaving in ways nobody can quite explain. An edge case the agent handled wrong three weeks ago keeps recurring. A task that was "done" gets re-attempted.
You can tweak your system prompts forever, and it won't fix it. By some recent estimates, as many as 88% of enterprise AI agent projects never reach production, for exactly this reason.
The developers actually shipping reliable AI products right now aren't writing magical prompts. They are practicing what Mitchell Hashimoto recently dubbed "Harness Engineering."
Here is a breakdown of what that actually means for full-stack builders.
🧠 The Core Concept: Brain vs. Body
"Agent = Model + Harness."
There’s this dangerous assumption in LLM-native development that you can just describe what you want, and the AI handles the orchestration. That is a prayer, not an architecture. Task routing, failure handling, and state management are classical computer science problems. They need to be deterministic.
You have to strictly separate the Brain from the Body:
- The Brain (LLM layer): Only decides what task to tackle next based on context, evaluates if output meets quality criteria, and provides feedback for revisions.
- The Body (Harness layer): Handles absolutely everything else deterministically.
As LLMs get smarter, the harness actually matters more. A 100x more capable model is just 100x more capable of making complex mistakes with confidence. LLMs are incredible at reasoning and judgment, but terrible at consistency and state awareness.
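To make the split concrete, here is a minimal TypeScript sketch. `callLLM`, `passesQualityGates`, and the `Task` shape are hypothetical stand-ins for your own model client and schema; the point is which side owns the loop.

```typescript
type Task = { id: string; prompt: string };

// The Brain: a stand-in for your real model client (Anthropic/OpenAI SDK, etc.).
// It produces output. It never decides whether, or when, to retry.
async function callLLM(prompt: string): Promise<string> {
  return `stub output for: ${prompt}`;
}

// Deterministic gate: in a real harness this compiles, lints, and runs tests.
function passesQualityGates(output: string): boolean {
  return output.trim().length > 0; // placeholder check
}

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

// The Body: owns retries, backoff, and terminal states, deterministically.
async function runTask(task: Task): Promise<"done" | "failed"> {
  const MAX_ATTEMPTS = 3; // hard-coded by the harness, not the model

  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    const output = await callLLM(task.prompt);
    if (passesQualityGates(output)) return "done";
    await sleep(2 ** attempt * 1000); // exponential backoff
  }
  return "failed"; // the harness then routes the task to a dead letter queue
}
```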
⚙️ The 4 CS Primitives You Can't Skip
If your agent does more than one thing autonomously, you need these basic backend concepts:
- State Machine (The Spine): Every task must be in a known state (`pending`, `in_progress`, `done`, `failed`). If you don't track this, your agent will pick up in-progress tasks and double-execute them on every restart.
- Idempotency Guards ("Done is Done"): Every operation needs an idempotency key. If a network timeout triggers a retry, your agent shouldn't charge a user's credit card twice. (A minimal sketch of both of these follows the list.)
- DAG (Directed Acyclic Graph): A simple dependency map. Task B cannot run until Task A completes. Without this, your agent will try to write to a database table before the migration has even run.
- Priority & Dead Letter Queues: The harness decides what gets worked on first, not the agent. And when a task fails 3 times, it goes to a dead letter queue so you can actually debug it, rather than just disappearing into the void.
🛠️ The Minimum Viable Harness (For Solo Full-Stack Apps)
You don't need a massive orchestration platform like Temporal or Prefect to start. You just need this:
- 1 Database Table: `id`, `type`, `status`, `payload`, `attempts`, `error`. This is your state machine.
- A Task Dispatcher (Not a Prompt): Write 20 lines of code that queries the DB for the highest-priority `pending` task and hands it to the agent. The agent does not choose its own work. (A rough sketch follows this list.)
- Hard-coded Retry Policy: Max 3 attempts, exponential backoff. The agent cannot override this.
- Deterministic Quality Gates: Before code leaves the system, does it compile? Do tests pass? This runs outside the LLM. If it fails, the harness sends it back.
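Here is roughly what that looks like with Postgres and node-postgres (`pg`). The `tasks` table matches the columns above, plus an assumed `priority` column; `FOR UPDATE SKIP LOCKED` is what lets several workers poll the same table without double-claiming a row:

```typescript
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from the PG* env vars

// The dispatcher: atomically claim the highest-priority pending task.
async function claimNextTask() {
  const { rows } = await pool.query(`
    UPDATE tasks
    SET status = 'in_progress', attempts = attempts + 1
    WHERE id = (
      SELECT id FROM tasks
      WHERE status = 'pending' AND attempts < 3
      ORDER BY priority DESC, id
      LIMIT 1
      FOR UPDATE SKIP LOCKED
    )
    RETURNING id, type, payload, attempts
  `);
  return rows[0] ?? null; // null means the queue is empty
}

// The retry policy lives in SQL, where the agent cannot touch it:
// success -> done; failure with attempts left -> back to pending;
// third failure -> 'failed', which doubles as the dead letter queue.
async function markResult(id: string, ok: boolean, error?: string) {
  await pool.query(
    `UPDATE tasks
     SET status = CASE
           WHEN $2::boolean THEN 'done'
           WHEN attempts < 3 THEN 'pending'
           ELSE 'failed'
         END,
         error = $3
     WHERE id = $1`,
    [id, ok, error ?? null]
  );
}
```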
📝 The Architecture-Aware Prompt Structure
When you actually sit down to prompt Claude or GPT, you have to separate what the AI is allowed to decide from what your harness has already decided. I use a strict 4-block template for this:
- Role & Constraints: Explicitly tell the AI it is a "harness-aware engineer." No refactoring untouched code. No installing new dependencies without asking.
- Harness Rules: Inject your deterministic rules right into the context (e.g., `RETRY_POLICY: max 3 attempts`, `TASK_STATES: pending -> in_progress`).
- Task Format: Define the specific task ID, the exact state the system should be in when done, the files in scope, and what is explicitly out of scope.
- Response Shape: Force the AI to output a `[PLAN]` first, then `[CHANGES]`, and finally a `[VERIFICATION]` step with exact commands to run against your quality gates. (A template sketch follows below.)
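For illustration, the whole template can be a plain string builder; every field name here is hypothetical and should be adapted to your own task schema:

```typescript
type PromptTask = {
  id: string;
  goal: string;
  filesInScope: string[];
  outOfScope: string[];
};

function buildPrompt(task: PromptTask): string {
  return [
    "## ROLE & CONSTRAINTS",
    "You are a harness-aware engineer. Do not refactor untouched code.",
    "Do not install new dependencies without asking.",
    "",
    "## HARNESS RULES (already decided; not yours to change)",
    "RETRY_POLICY: max 3 attempts",
    "TASK_STATES: pending -> in_progress -> done | failed",
    "",
    "## TASK",
    `ID: ${task.id}`,
    `GOAL: ${task.goal}`,
    `FILES IN SCOPE: ${task.filesInScope.join(", ")}`,
    `OUT OF SCOPE: ${task.outOfScope.join(", ")}`,
    "",
    "## RESPONSE SHAPE",
    "Reply with [PLAN], then [CHANGES], then [VERIFICATION] containing the",
    "exact commands to run against the quality gates.",
  ].join("\n");
}
```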
If your AI app keeps doing weird things in production, stop messing with your prompts. Build a task table, write a dispatcher, lock down your retry policy, and draw a flowchart.
Curious how you guys are handling this layer. Are you using off-the-shelf stuff like LangGraph, or rolling custom Postgres/Node setups for your state management?
Feel free to check out the full write-up here:
👉 Harness Engineering: How to Build AI Agents That Don't Break in Production