
Rules will always be broken by humans so AI will too: the case for hard gates
Whenever humans are under stress, rules go out the window, just ask any day trader. An agent optimized on the summation of human behavior will do the same thing, not because it's malicious, but because that's the mathematical path of least resistance.
We already have a real example: a Claude-powered Cursor agent deleted the production database for PocketOS, a car rental SaaS, after deciding unilaterally that deleting a staging volume would "fix" a credential mismatch. It guessed wrong. The deletion cascaded to backups. Three months of reservation data including active rentals was gone. The agent's own post-incident summary: "I guessed instead of verifying. I ran a destructive action without being asked. I didn't understand what I was doing before doing it." No rule was broken intentionally. The optimization just found a shorter path. That's not a safety failure. That's a Validator Independence failure the generator evaluated its own action and got it wrong.
Terror Management Theory explains why this is structural, not accidental. When any system faces entropy or failure, it stops optimizing for the global objective and starts optimizing for immediate local survival. In humans this looks like tribalism or . Different substrate, same basin.
The simple proposal
AI generation needs to be separated from execution. The soap bubble is the visual: a soap film can't hold a complex shape on its own no matter how good its instructions are. It needs a rigid physical frame. Right now we're giving the soap film better prompts and calling it alignment.
The frame looks like three hard gates:
Validator Independence — the system that generates the action cannot be the system that evaluates it. A recursive loop where the generator checks its own output is a single point of failure. PocketOS is what that failure looks like in production.
Reversibility Gates — any action crossing an irreversible state boundary (API calls, database writes, financial transactions) is held in a buffer until a deterministic check confirms it traces back to the original objective. Not a prompt. A hard interrupt. A database deletion should never have been executable without one.
Objective Divergence Checks — local optimization cannot be allowed to destroy the global objective. The PocketOS agent wasn't trying to cause harm. It was trying to fix a credential mismatch. The local objective ate the global one.
Humanity didn't survive by prompting people to be good. We built courts, contracts, and social structures hard gates on human behavior. We need the same thing here.
Summary: not better prompts, but an actual frame where generator is separate from executor.
What are some thought on this?