u/RestingFrames

How I used Claude Code (and Codex) for adversarial review to build my security-first agent gateway

Long-time lurker, first-time poster. Hey everyone!

So earlier this year, I got pulled into the OpenClaw hype. WHAT?! A local agent that drives your tools, reads your mail, writes files for you? The demos seemed genuinely incredible, people were posting non-stop about it, and I wanted in. I had been working on this problem since last year and was genuinely excited to see that someone had actually solved it. Then around February, Summer Yue, Meta's director of alignment for Superintelligence Labs, posted that her agent had deleted over 200 emails from her inbox. YIKES. She'd told it: "Check this inbox too and suggest what you would archive or delete, don't action until I tell you to." When she pointed it at her real inbox, the volume of data triggered context window compaction, and during that compaction the agent "lost" her original safety instruction. She had to physically run to her computer and kill the process to stop it. That should literally NEVER be the case with any software ever.

This is a person whose actual job is AI alignment, at Meta's superintelligence lab, who could not stop an agent from deleting her email. The agent's own memory management quietly summarized away the "don't act without permission" instruction, treated the task as authorized, and started speed-running deletions. She had to kill the host process.

That's when I sort of went down the rabbit hole, not because Yue did anything wrong, but because the failure mode was actually architectural and I knew that in my gut. Guess what I found? Yep. Tons more instances of this sort of thing happening. Over and over. Why? Because the safety constraint was just a prompt. It's obvious, isn't it? It's LLM 101. Prompts can be summarized away. Prompts can be misread. Prompts are fucking NOT a security boundary. And yet every agent framework I have ever seen seems to be treating them as one.

I went and read the OpenClaw source code, which I should have done to begin with. What I found was a pattern I think a lot of agent frameworks have fallen into:

- Tool names sit in the model context, so the model can guess or forge them

- "Dangerous mode" is one config flag away from default

- Memory management has no concept of instruction priority

- The audit story is mostly "the model thought it should"

I went looking for a security-first alternative I could trust, anything that was really being talked about or at a bare minimum attempted to address the security concerns I had. I couldn't find one.

So I made it myself.

CrabMeat is what came out of that, what I WANTED to exist. v0.1.0 dropped yesterday. Apache 2.0. WebSocket gateway for agentic LLM workloads. One design thesis:

The LLM never holds the security boundary.

What that means in code:

Capability ID indirection. The model doesn't see real tool names. It sees per-session HMAC-derived opaque IDs (cap_a4f9e2b71c83). It can't guess or forge a tool name because it doesn't know any tool names.
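
To make that concrete, here is a minimal sketch of the idea. It is not lifted from the repo: `capability_id`, `CapabilityTable`, and the choice to truncate to 12 hex characters (to match the example ID format) are assumptions made for illustration.

```python
import hashlib
import hmac
import secrets


def capability_id(session_key: bytes, tool_name: str) -> str:
    """Derive an opaque ID such as cap_a4f9e2b71c83 from a per-session key."""
    digest = hmac.new(session_key, tool_name.encode("utf-8"), hashlib.sha256).hexdigest()
    return "cap_" + digest[:12]  # assumption: 12 hex chars, like the example ID


class CapabilityTable:
    """Gateway-side mapping; only the opaque IDs are ever shown to the model."""

    def __init__(self, tool_names):
        self._key = secrets.token_bytes(32)  # fresh secret for every session
        self._by_id = {capability_id(self._key, name): name for name in tool_names}

    def advertised_ids(self):
        # This list is what would go into the model context.
        return sorted(self._by_id)

    def resolve(self, cap_id: str) -> str:
        # Anything not issued for this session is rejected, raw tool names included.
        if cap_id not in self._by_id:
            raise PermissionError(f"unknown capability: {cap_id!r}")
        return self._by_id[cap_id]
```

Because the key is generated per session, an ID copied from one session's transcript resolves to nothing in the next one.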

Effect classes. Every tool declares a class (read, write, exec, network). Every agent declares which classes it can use. The check is a pure function with no runtime state, easy to test exhaustively, hard to bypass.
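
A rough sketch of what a check like that can look like, assuming the four classes named above. The `Effect` enum and `is_allowed` are hypothetical names for this example, not CrabMeat's actual API.

```python
from enum import Enum


class Effect(Enum):
    READ = "read"
    WRITE = "write"
    EXEC = "exec"
    NETWORK = "network"


def is_allowed(tool_effect: Effect, agent_effects: frozenset) -> bool:
    """Pure check: the result depends only on the two arguments."""
    return tool_effect in agent_effects


# Usage: a read-only agent may call a read tool but not a shell tool.
reader = frozenset({Effect.READ})
print(is_allowed(Effect.READ, reader))  # True
print(is_allowed(Effect.EXEC, reader))  # False
```

With four classes there are only 16 possible grant sets, so a test can loop over every grant set and every effect instead of sampling.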

IRONCLAD_CONTEXT. Critical safety instructions are pinned to the top of the context window and explicitly marked as non-compactable. The Yue failure mode, compaction silently stripping the safety constraint, cannot happen by construction. The compactor literally cannot touch them.
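
Here is a minimal sketch of compaction that respects pinning. The `Message` type, the `pinned` flag, and the `compact` signature are all assumptions for illustration; the real implementation may be structured quite differently.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Message:
    role: str
    text: str
    pinned: bool = False  # hypothetical flag marking non-compactable content


def compact(messages: List[Message], keep_recent: int,
            summarize: Callable[[List[Message]], str]) -> List[Message]:
    """Summarize old unpinned messages; pinned ones never reach summarize()."""
    pinned = [m for m in messages if m.pinned]
    rest = [m for m in messages if not m.pinned]
    split = max(len(rest) - keep_recent, 0)
    old, recent = rest[:split], rest[split:]
    if not old:
        return pinned + rest
    summary = Message("system", summarize(old))
    # Pinned instructions always go back at the top, verbatim.
    return pinned + [summary] + recent
```

The point of the structure is that the summarizer is only ever handed the unpinned slice, so there is no code path where it could rewrite or drop a pinned instruction.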

Tamper-evident audit chain. Every tool call, every privileged operation, every scheduler run enters the same SHA-256 hash-chained log. If something happens, you can prove what happened. If the chain is tampered with, you can prove that too.
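
The hash-chain idea in miniature, as a sketch under assumptions: the entry layout, the all-zero genesis value, and the function names are made up for this example and are not the project's on-disk format.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first entry


def entry_hash(prev_hash: str, event: dict) -> str:
    # Canonical JSON so the same event always hashes the same way.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(chain: list, event: dict) -> None:
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"event": event, "prev": prev, "hash": entry_hash(prev, event)})


def verify(chain: list):
    """Return the index of the first broken entry, or None if the chain is intact."""
    prev = GENESIS
    for index, entry in enumerate(chain):
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["event"]):
            return index
        prev = entry["hash"]
    return None
```

Editing any earlier event changes its hash, which no longer matches what the next entry recorded, so `verify` points at the first altered entry. A chain like this only proves tampering if the latest hash is also kept somewhere the attacker cannot rewrite; otherwise the whole chain could be recomputed.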

Streaming output leak filter. Secrets are caught mid-stream, even when they span token boundaries: capability IDs, API keys, JWTs, and PEM blocks are redacted before they reach the client.
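
One way to catch a secret that arrives split across chunks is to hold back a tail of the stream until it can no longer be the start of a secret. The sketch below assumes length-bounded patterns; the two patterns, `MAX_SECRET_LEN`, and the `LeakFilter` class are hypothetical and much simpler than a real filter would need to be (no JWTs or PEM blocks, no handling of overlapping secrets).

```python
import re

REDACTED = "[REDACTED]"

# Hypothetical, length-bounded secret shapes for illustration only.
PATTERNS = [
    r"cap_[0-9a-f]{12}",       # the cap_a4f9e2b71c83 style of capability ID
    r"sk-[A-Za-z0-9]{20,48}",  # made-up API key shape
]
MAX_SECRET_LEN = 51  # longest string any pattern above can match


class LeakFilter:
    def __init__(self, patterns=PATTERNS, max_secret_len=MAX_SECRET_LEN):
        self._regex = re.compile("|".join(f"(?:{p})" for p in patterns))
        self._holdback = max_secret_len - 1  # a partial secret fits in this tail
        self._buf = ""

    def feed(self, chunk: str) -> str:
        """Accept a chunk and return whatever is now safe to send on."""
        self._buf += chunk
        return self._drain(len(self._buf) - self._holdback)

    def flush(self) -> str:
        """Call once at end of stream to release the held-back tail."""
        return self._drain(len(self._buf))

    def _drain(self, boundary: int) -> str:
        out, pos = [], 0
        for match in self._regex.finditer(self._buf):
            if match.start() >= boundary:
                break  # may still be growing; keep it buffered
            out.append(self._buf[pos:match.start()])
            out.append(REDACTED)
            pos = match.end()
        if pos < boundary:
            out.append(self._buf[pos:boundary])
            pos = boundary
        self._buf = self._buf[pos:]
        return "".join(out)
```

The trade-off is latency: the client always sees the stream about `MAX_SECRET_LEN` characters late, which is the price of never emitting the first half of a secret before the second half arrives.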

No YOLO mode. There is no global "trust the LLM with everything" switch. There never will be. Expanded reach comes through named scoped roots that are explicit, audit-logged, and bounded.

The README has 15 'always-on' protections in a table. None of them can be turned off by config, because these things being toggleable is how the ecosystem ended up where it is.

I decided to make sure that this wasn't just a 'trend hopping' project and aligned with my own personal values as well. I built this to be secure and local-first by default. Configured for Ollama / LM Studio / vLLM out of the box. Anthropic and OpenAI work too but require explicit configuration. There is no "happy path" that silently ships your prompts to a cloud endpoint. I decided that FIRST it needed to only run as an email agent with a CLI. Bidirectional IMAP + SMTP with allowlisted senders, threading preserved, attachments handled. This is the use case that bit Yue and a lot of other people, and I wanted to prove it could be done with real boundaries.

I added in 30+ built-in tools of my own. File ops, shell (denylisted, output-capped, CWD-locked), web fetch with SSRF protection, browser, PDF extraction, persistent memory, scheduler. All effect-classified, all dry-run-supported, all audit-logged.
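
For a sense of what "SSRF protection" means on the web fetch tool, here is a minimal sketch of the usual first check: refuse any URL whose host is, or resolves to, a non-public address. The function names and the injectable `resolve` parameter are assumptions for this example, not the tool's real interface.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_address(addr: str) -> bool:
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        return False  # fail closed on anything unparseable
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        ip = mapped  # judge ::ffff:a.b.c.d by its embedded IPv4 address
    return not (ip.is_private or ip.is_loopback or ip.is_link_local
                or ip.is_multicast or ip.is_reserved or ip.is_unspecified)


def resolve_host(host: str):
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def is_fetch_allowed(url: str, resolve=resolve_host) -> bool:
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        addresses = [str(ipaddress.ip_address(parts.hostname))]  # IP literal
    except ValueError:
        try:
            addresses = resolve(parts.hostname)
        except OSError:
            return False
    # Every resolved address must be public, not just the first one.
    return bool(addresses) and all(is_public_address(a) for a in addresses)
```

A check like this is not sufficient on its own. The fetch should connect to the address that was verified, or a second DNS lookup can return something different (DNS rebinding), and redirects need the same check on every hop.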

Finally, I created a single-file Windows installer so you can literally download, set up, and use in like, five minutes. PySide6 wizard handles Node install, config generation, the works. End user needs nothing preinstalled. Linux/WSL is two-terminal manual right now; that's a v0.1.1 cleanup.

CrabMeat was built with Claude Code. I want to be specific about that because "I used an AI" is a meaningless statement and "Claude Code wrote my project" is usually a lie. What's actually true is that this project would not exist in this shape, on this timeline, without a workflow built around Claude Code as a core tool, and I think the workflow is worth describing, because it really pushes away from the idea of 'I just told it to build the thing and it did'. It was genuine work to get it finished.

The core loop I landed on uses Claude Code for architectural work and patching, and separate models (Codex / DeepSeek) for adversarial red-teaming and audits against the same codebase. Claude Code is good at building correctly. A different model under different prompting is better at attacking what was built (Codex specifically was REALLY good at this). Running them against each other on every security-relevant subsystem found three critical silent-failure bugs in an earlier project of mine (SIGIL) that I never would have caught with one model alone, and that pattern became the audit playbook I used for CrabMeat's security surface. Claude Code patched the bugs, Codex tried to break the patches, Claude Code patched again, and so on until clean.

I keep a single global instruction file (CLAUDE.md) that defines how Claude Code interacts with my projects: code style, commit message conventions, what counts as "done," and when to ask before acting. This file is the closest thing I have to a senior-engineer voice in the room. It catches a lot of "you didn't ask if I wanted this" moments before they happen, and it saves me literally millions of tokens of reiterations, debugging, hallucinations, and confusion.

I built up roughly 21 reusable Claude Code skills over the course of CrabMeat and adjacent projects. None of these are taken from anywhere else. They're specific to my own workflow, not something generic. "Run the security audit playbook." "Generate a release changelog from git log." "Verify a published release against its tag." The skills are what turn one-off prompts into a real pipeline. As an aside, this was a formalization of a method I had been using for a while; realizing it was 'official' now let me dump everything into an official channel. Absolute perfection. *chef's kiss*

Parallel Claude Code instances ran on independent subsystems. For the heavy work, I ran multiple Claude Code instances overnight against different parts of the codebase: one on the email connector, one on the audit chain, one on the launcher build, etc. This is only safe because each subsystem has clear boundaries and its own test surface, and because the audit chain catches drift between them. Every security-relevant PR goes through a deliberate "now break this" pass before it lands. Sometimes that's me, sometimes that's a fresh Claude Code instance with adversarial prompting, sometimes that's Codex. That pass never edits or changes anything in the codebase; it only audits and then writes me a detailed report in markdown. The point is the pass exists and it's structured, not vibes. None of this is vibes. Everything is deliberate.

What Claude Code didn't do: it didn't want the program to exist, it didn't design the architecture, it didn't make the security decisions, it didn't make decisions for me, and it didn't write the threat model. The thesis... the "LLM never holds the security boundary"... is mine, and Claude Code's job was to help me implement it cleanly and catch my own mistakes. Of which, let's be honest, there are a lot. The relationship that works for me is "Claude Code is a very capable engineer on my team who needs clear specs and code review." The relationship that doesn't work is "Claude Code is a magic project generator." If you treat it as the second, you ship something that looks finished but isn't. It absolutely is not that, and when I stop LEARNING from using it, I might as well stop using it entirely.

The honest take: I write better code with Claude Code in the loop than without. Specifically, I write more thorough code. Better tested, better commented, more defensively structured. Because the cost of doing it RIGHT dropped and the cost of skipping it stayed the same. That's the productivity gain. I don't think it makes me "10x faster"; it's how I actually finish the boring 30% that I used to skip. If you're using Claude Code for serious projects and not already doing the adversarial-second-model thing, try it. It's the single highest-leverage change I've made to my workflow this year.

This is v0.1.0 and calling it 1.0 would be a lie. The README has an honest four-tier stability table: "Stable, beta, experimental, not-recommended-for-network-exposed." The core loop and security rails are stable. Some subsystems are beta. A few are experimental. No part of it is 1.0-mature and I'm not going to pretend it is.

It has not been formally audited. I'd love red team reports. SECURITY.md has a coordinated disclosure path. This is a passion project. I'd rather have ten people running it carefully than ten thousand running it like OpenClaw got run.

The repo: https://github.com/mr-gl00m/crabmeat

Happy to answer questions, hear what I got wrong, or get torn apart in the comments. This is the first time most of this work has been seen outside my own machine and I'd rather find the holes now than later.

— Cid

u/RestingFrames — 3 hours ago