u/Forward_Regular3768

Cheaper Ways to use Claude Code. Almost 11x Cheaper, to Be Exact. (openSource)

Claude Code used 190,300 cached reads to count TypeScript files in a project. Not analyze them. Count them.

The same task on Nitro: 2,432 cached reads. One-liner bash command, done.

That roughly 78x gap (190,300 / 2,432) isn't a fluke. It comes from how Claude Code is built.

Before you type your first message, Claude Code loads 5,800+ tokens for its system prompt and 15,700+ tokens for its default tool definitions. You're starting every session with 23,500 tokens of overhead already on the tab. Add a CLAUDE.md file or any Model Context Protocol tools and that number climbs. A "hello" burning 20-30% of your five-hour quota makes more sense once you see that.

Nitro's entire system prompt and tool set comes in at 2,542 tokens. It ships with two tools: Bash and AskUser. That's the whole thing.

npm install -g @aerovato/nitro

You describe what you want, Nitro generates the shell command:

nitro "find all markdown files except node_modules and count total lines, show top 10"
# → find . -name "node_modules" -prune -o -name "*.md" -print0 | xargs -0 wc -l | sort -rn | head -n 11

nitro "get 10 most recent open gh issues with P1 label, show id and title"
# → gh issue list --search "is:open is:issue label:P1 sort:created-desc" --limit 10 --json number,title --jq '.[] | "\(.number)\t\t\(.title)"'

nitro "compress input.mov to a smaller mp4 (h264) optimized for smaller file"
# → ffmpeg -i input.mov -c:v libx264 -crf 23 -preset medium -c:a aac -b:a 128k output.mp4

Before running anything, Nitro tags each command with a risk level: Read Only, Normal, Dangerous, or Extremely Dangerous. Anything above read-only needs explicit approval. You also get behavioral tags explaining what the command actually does, so you're not approving blind.
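
A minimal sketch of what an approval gate over those four levels could look like. The `Risk` enum, `needs_approval`, and `describe` are illustrative names I made up, not Nitro's actual code:

```python
from enum import IntEnum


class Risk(IntEnum):
    """Risk levels in the order the post lists them (names are illustrative)."""
    READ_ONLY = 0
    NORMAL = 1
    DANGEROUS = 2
    EXTREMELY_DANGEROUS = 3


def needs_approval(risk: Risk) -> bool:
    # Anything above read-only requires an explicit yes from the user.
    return risk > Risk.READ_ONLY


def describe(command: str, risk: Risk, tags: list[str]) -> str:
    # Show the risk label and behavioral tags next to the command before asking.
    label = risk.name.replace("_", " ").title()
    return f"[{label}] {command} ({', '.join(tags)})"
```

Ordering the levels as integers keeps the gate to a single comparison, so adding a fifth level later would not change the check.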

Nitro is scoped: bash commands and simple tasks. For a full Claude Code replacement, OpenCode covers more ground. But for the shell workflow, the benchmark gap is real and the token math holds.

Full source: https://github.com/aerovato/nitro

u/Forward_Regular3768 — 18 hours ago

Karpathy's 4 Rules for CLAUDE.md Was #1 on GitHub Trending. Full Guide.

Andrej Karpathy posted 4 rules for Claude Code. A developer turned them into a CLAUDE.md file, published it, and watched coding accuracy jump from 65% to 94%.

As we know, Claude Code starts every session blank. No memory of your stack, your decisions, what you ruled out last week, or why you picked one tool over another six months ago. So it guesses. It refactors files you didn't ask it to touch. It suggests tools that break your existing architecture. You end up re-explaining the same context every session.

CLAUDE.md is a plain text file in your project root. Claude Code reads it at session start, every time.

Put these at the top. They cut the time you spend re-feeding context.

Never open responses with filler phrases like "Great question!", "Of course!", "Certainly!", or similar warmups. Start every response with the actual answer. No preamble, no acknowledgment of the question.

Match response length to task complexity. Simple questions get direct, short answers. Complex tasks get full, detailed responses. Never pad responses with restatements of the question or closing sentences that repeat what you just said.

Before any significant task, show me 2-3 ways you could approach this work. Wait for me to choose before proceeding.

If you are uncertain about any fact, statistic, date, or piece of technical information: say so explicitly before including it. Never fill gaps in your knowledge with plausible-sounding information. When in doubt, say so.

About me: [Name] / Role: [your role] / Background in: [areas]. Strong in: [what you know well]. Still learning: [gaps]. Adjust the depth of every response to match this. Never over-explain what I already know. Never skip context I need.

What I'm working on: [project name] / Goal: [specific outcome] / Audience: [who uses this] / Stack context: [any relevant constraints] / What to avoid: [list]. Apply this context to every task. When something doesn't fit, flag it before proceeding.

My writing style — always match this: [describe your voice] / Sentence length: [preference] / Words I use: [examples] / Words I never use: [examples] / Format: [prose or structured]. When writing anything on my behalf, match this exactly. Do not default to your own patterns.

Use this prompt to generate a first draft instead of writing from scratch:

Based on what I've told you about myself, my project, and how I want to work: write me a complete CLAUDE.md file. Include: who I am, my tech context, my communication preferences, and default behaviors for every session. Be specific. Plain text. Under 500 words.

Behavior section

These stop Claude from making changes you didn't authorize.

Only modify files, functions, and lines of code directly related to the current task. Do not refactor, rename, reorganize, reformat, or "improve" anything I did not explicitly ask you to change. If you notice something worth fixing elsewhere, mention it in a note at the end. Do not touch it. Ever.

Before making any change that significantly alters content I've already created (rewriting sections, removing paragraphs, restructuring flow, changing tone): stop. Describe exactly what you're about to change and why. Wait for my confirmation before proceeding.

Before deleting any file, overwriting existing code, dropping database records, or removing dependencies: stop. List exactly what will be affected. Ask for explicit confirmation. Only proceed after I say yes in the current message. "You mentioned this earlier" is not confirmation.

The following require explicit in-session confirmation, no exceptions: deploying or pushing to any environment, running migrations or schema changes, sending any external API call, executing any command with irreversible side effects. I must say yes in the current message.

After any coding task, end with: Files changed (list every file touched) / What was modified (one line per file) / Files intentionally not touched / Follow-up needed.

Never send, post, publish, share, or schedule anything on my behalf without my explicit confirmation in the current message. This includes emails, calendar invites, document shares, or any action outside this conversation. I must say yes in the current message.

For any task involving architecture decisions, debugging complex issues, or non-trivial features: work through the problem step by step before writing any code. Show your reasoning. Identify where you're uncertain. Then implement.

Memory and stack section

MEMORY.md and ERRORS.md give Claude the closest thing to session persistence that currently exists. The stack lock stops it from proposing tools that break your architecture.

Maintain a file called MEMORY.md in this project. After any significant decision, add an entry: What was decided / Why / What was rejected and why. Read MEMORY.md at the start of every session. Never contradict a logged decision without flagging it first.

When I say "session end", "wrapping up", or "let's stop here": write a session summary to MEMORY.md. Include: Worked on / Completed / In progress / Decisions made / Next session priorities.

Maintain a file called ERRORS.md. When an approach takes more than 2 attempts to work, log it: What didn't work / What worked instead / Note for next time. Check ERRORS.md before suggesting approaches to similar tasks.

These facts are always true for this project. Apply them to every session without exception: [your permanent constraints, architectural decisions, and rules]. If any task conflicts with one of these, flag it before proceeding.

Tech stack for this project. Always use these. Never suggest alternatives unless I ask:
Language: [e.g. TypeScript]
Framework: [e.g. Next.js 14]
Package manager: [e.g. pnpm]
Database: [e.g. PostgreSQL with Prisma]
Testing: [e.g. Vitest]
Styling: [e.g. Tailwind CSS]
If something seems like the wrong tool, flag it. But use the defined stack unless I explicitly say otherwise.

For questions involving system architecture, performance tradeoffs, database design, or long-term technical decisions: use extended thinking mode. Work through the problem step by step. Surface tradeoffs I haven't considered. Flag assumptions that might not hold at scale. Then give your recommendation.

Karpathy's 4 rules

These are the ones that moved accuracy from 65% to 94%. Put them in every CLAUDE.md you set up.

1. Ask, don't assume. If something is unclear, ask before writing a single line. Never make silent assumptions about intent, architecture, or requirements.

2. Simplest solution first. Always implement the simplest thing that could work. Do not add abstractions or flexibility that weren't explicitly requested.

3. Don't touch unrelated code. If a file or function is not directly part of the current task, do not modify it, even if you think it could be improved.

4. Flag uncertainty explicitly. If you are not confident about an approach or technical detail, say so before proceeding. Confidence without certainty causes more damage than admitting a gap.

Start with just these 4. Drop them into a new CLAUDE.md in your project root. Add the rest as you identify what's missing in your workflow.

u/Forward_Regular3768 — 2 days ago

The Harness Is the Product: Claude Code Vs OpenClaw

Claude Code waits. It runs the context hot until it's within 13,000 tokens of the limit, then summarizes. There's a fallback if the threshold misses. Reactive, by design.

OpenClaw estimates before every call. It multiplies by a 1.2 safety margin and routes to the cheapest strategy that fits. It can't do surgical cache edits because it doesn't own the API, so it repairs orphaned tool results instead: blocks that Claude Code never produces and therefore never has to fix.
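
Here is a rough sketch of the estimate-then-route idea. Only the 1.2 margin comes from the post. The 4-characters-per-token heuristic, the strategy names, and the keep ratios are assumptions for illustration, not OpenClaw's real values:

```python
SAFETY_MARGIN = 1.2  # from the post; everything else here is illustrative


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token.
    return len(text) // 4


def pick_strategy(history: list[str], context_limit: int) -> str:
    # Pad the estimate so an undercount does not overflow the window.
    padded = sum(estimate_tokens(message) for message in history) * SAFETY_MARGIN
    # Hypothetical strategies, cheapest first, with the share of tokens each keeps.
    strategies = [("send_as_is", 1.0), ("trim_tool_results", 0.6), ("summarize", 0.2)]
    for name, keep_ratio in strategies:
        if padded * keep_ratio <= context_limit:
            return name
    return "cancel"
```

The point of the proactive design is that the decision happens before the API call, so an oversized request is never sent in the first place.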

Two well-engineered agents at opposite ends of the same design space. That disagreement alone tells you how unsettled this is.

The cheap layer

Claude Code uses a beta API feature called cache_edits to surgically remove specific tool results from the server-side cached prompt without busting the cache. For a 200,000-token cached prefix, a cache miss is expensive, so this lets you evict stale results without paying for it. It runs every turn across a fixed allowlist: Read, Bash, Grep, Glob, Edit, Write, WebSearch, WebFetch.

OpenClaw can't do this because it doesn't own the API. It has a generic guard that shrinks any single tool result over half the context window and throws on overflow past 90%. It also repairs orphaned tool_result blocks after history pruning, because Anthropic's API rejects a request with a tool_result whose matching tool_use was dropped. Claude Code doesn't produce orphans so it skips this entirely.
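
A sketch of one way to repair orphans after pruning. It assumes Anthropic-style messages where `tool_use` blocks carry an `id` and `tool_result` blocks carry a `tool_use_id`. Dropping the orphan is my simplification; OpenClaw may repair them differently:

```python
def drop_orphaned_tool_results(messages: list[dict]) -> list[dict]:
    """Remove tool_result blocks whose matching tool_use was pruned earlier."""
    seen_tool_use_ids = set()
    repaired = []
    for message in messages:
        content = message["content"]
        if isinstance(content, str):
            # Plain-text messages cannot contain tool blocks.
            repaired.append(message)
            continue
        kept = []
        for block in content:
            if block["type"] == "tool_use":
                seen_tool_use_ids.add(block["id"])
            if block["type"] == "tool_result" and block["tool_use_id"] not in seen_tool_use_ids:
                continue  # orphan: its tool_use is no longer in the history
            kept.append(block)
        if kept:
            # Skip messages that would be left with empty content.
            repaired.append({**message, "content": kept})
    return repaired
```

This version only checks that a matching `tool_use` appeared somewhere earlier. A stricter repair would also confirm it sits in the message immediately before the result.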

Both agents drop middle messages as context grows. Neither approach is novel; both are necessary.

Here's the explanation

Claude Code makes one large language model call. The full conversation goes in, images and documents stripped to placeholders, and a 9-section summary comes out capped at 20,000 tokens. The call runs through a forked subagent that inherits the parent's prompt cache prefix, so it lands as a cache hit. A comment in compact.ts notes that running without prefix sharing was "98% cache miss, costs ~0.76% of fleet cache_creation (~38B tok/day)."

OpenClaw delegates to the pi-coding-agent SDK and can use a separate cheaper model for compaction. If the conversation is too large for one call, it splits on tool-call boundaries, summarizes each chunk, then runs a merge call. After summarization, an audit step checks that all five required section headings appear in order, that every identifier from the last 10 messages survives in the output, and that the latest user ask overlaps with the summary body. Failed checks trigger regeneration with structured feedback. Claude Code trusts the model to follow the schema and skips the audit.
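
To make the audit step concrete, here is a small sketch of two of the checks: headings in order and identifiers surviving. The function name and the way failures are reported are assumptions, and the real heading names are not given in the post, so the caller supplies them:

```python
def audit_summary(summary: str, required_headings: list[str], identifiers: list[str]) -> list[str]:
    """Return a list of audit failures; an empty list means the summary passed."""
    failures = []
    position = 0
    for heading in required_headings:
        # Search only after the previous heading so order is enforced.
        found = summary.find(heading, position)
        if found == -1:
            failures.append(f"missing or out-of-order heading: {heading}")
        else:
            position = found + len(heading)
    for identifier in identifiers:
        if identifier not in summary:
            failures.append(f"identifier dropped: {identifier}")
    return failures
```

The returned failure strings are the kind of structured feedback that could be passed into a regeneration call.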

Failure handling

Claude Code retries twice on streaming failures, up to three times on prompt_too_long by dropping oldest message groups, and has a circuit breaker that stops auto-compaction after three consecutive failures. The comment justifying the budget: "1,279 sessions had 50+ consecutive failures (up to 3,272) in a single session, wasting ~250K API calls/day globally." Production logs, not an architectural review.
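
The circuit breaker part is simple enough to sketch. The class and method names are mine; only the three-consecutive-failures threshold comes from the post:

```python
class CompactionBreaker:
    """Stops auto-compaction after too many consecutive failures (illustrative)."""

    def __init__(self, max_consecutive_failures: int = 3):
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0

    @property
    def open(self) -> bool:
        # Once open, the caller should stop attempting auto-compaction.
        return self.consecutive_failures >= self.max_consecutive_failures

    def record(self, succeeded: bool) -> None:
        # Any success resets the streak; a failure extends it.
        self.consecutive_failures = 0 if succeeded else self.consecutive_failures + 1
```

Without a cap like this, a session stuck in a failing state keeps retrying indefinitely, which is the pattern the quoted production numbers describe.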

OpenClaw snapshots the session file before running compaction. On failure, it degrades gracefully through three strategies: full summarization, summarize small messages and note large ones, note what was there. If none work, it returns { cancel: true } and leaves the transcript untouched.
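
A sketch of that fallback chain, assuming each strategy is a plain function that either returns a summary or raises. The function name and the shape of the success result are assumptions; only the `cancel` flag comes from the post:

```python
from typing import Callable


def compact_with_fallbacks(
    transcript: list[str],
    strategies: list[Callable[[list[str]], str]],
) -> dict:
    """Try each strategy in order; cancel and leave the transcript alone if all fail."""
    for strategy in strategies:
        try:
            return {"cancel": False, "summary": strategy(transcript)}
        except Exception:
            # Fall through to the next, cheaper strategy.
            continue
    return {"cancel": True}
```

Because the function never mutates `transcript`, a cancel leaves the caller with exactly what it started with.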

The harness point

Anthropic called Claude Code "the thinnest possible wrapper over the model." The compaction stack has eight distinct mechanisms, a session-memory subsystem updated by a sandboxed subagent, a circuit breaker calibrated against production failure data, a beta API feature built to avoid cache misses on eviction, and prompt tuning precise enough to track stray tool-call rates to two decimal places.

If you buy your harness from someone else, you inherit every one of those choices without seeing the telemetry behind them. The retry budgets, the summary schema, where the fallback chain bottoms out. None of that is in the readme. Your vendor calibrated those numbers against their fleet, their users, their session patterns. Your failure modes are not their inputs.

Compaction is one slice. The same design depth lives in tool routing, error classification, cache management, and subagent orchestration. You can hand those decisions to someone else, but you can't avoid them.

u/Forward_Regular3768 — 3 days ago
▲ 4 r/AIAgentsInAction+1 crossposts

Agent Hooks: Put Policy in Code, Not Prompt

Agent hooks attach user-defined scripts to specific lifecycle points in an agent session. Six points cover most of what developers actually need: session start, prompt submit, before a tool runs, after a tool runs, before the agent signals done, and session end.

The core distinction: prompts ask the model to follow a rule. Hooks run the rule as code at a known moment. If a protected file must never be edited, a PreToolUse hook inspects the path and blocks the write before it happens. If tests must pass before the agent calls a task complete, a Stop hook reads a state file written by a PostToolUse hook and blocks completion when the gate failed. Neither outcome depends on the model remembering an instruction from three turns ago.

The operating model: event, optional matcher, handler, outcome. A PreToolUse hook fires on file edits and shells out to a Python script that checks the path against a denylist. A PostToolUse hook fires after any .py change, runs the test suite, and writes the result to .hook-state/last_quality_gate.json. The Stop hook reads that file. Each stage is explicit; each handoff is a file on disk.

Protected-path check:

python

protected_prefixes = ("generated/", "fixtures/sensitive/", ".git/")
protected_exact = {".env", ".env.local"}

if rel in protected_exact or any(rel.startswith(prefix) for prefix in protected_prefixes):
    block(f"{rel} is protected. Use application code or tests instead.")
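
The snippet above leaves `rel` and `block` undefined. Here is a self-contained sketch of the same check as a pure function. The path normalization and the outside-the-repo check are my additions, and how the reason gets reported back to the agent is left out on purpose because that depends on your agent's hook protocol:

```python
import posixpath

PROTECTED_PREFIXES = ("generated/", "fixtures/sensitive/", ".git/")
PROTECTED_EXACT = {".env", ".env.local"}


def protected_reason(rel_path: str):
    """Return why a repo-relative path must not be written, or None if it is fine."""
    # Normalize so "./generated/x" and "src/../.env" cannot slip past the check.
    rel = posixpath.normpath(rel_path)
    if rel == ".." or rel.startswith("../") or rel.startswith("/"):
        return f"{rel_path} is outside the repo."
    if rel in PROTECTED_EXACT or any(rel.startswith(prefix) for prefix in PROTECTED_PREFIXES):
        return f"{rel} is protected. Use application code or tests instead."
    return None
```

Returning a reason instead of exiting keeps the policy testable. The hook script itself can then decide how to block.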

Command policy that catches dangerous shell calls before they execute:

python

deny_patterns = [
    (r"\brm\s+-rf\s+(/|\.|~|\$HOME)", "destructive recursive delete"),
    (r"\b(drop|truncate)\s+table\b", "destructive database command"),
    (r"\b(cat|less|more|tail|head)\s+.*\.env\b", "reading env files"),
    (r"(>\s*|tee\s+|cat\s+>\s*)(generated/|fixtures/sensitive/|\.env)", "writing protected paths from the shell"),
    (r"deploy\.py\s+production\b", "production deploy"),
]
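
A sketch of how a list like that might be applied, using a subset of the patterns above. The `denied_reason` name and the case-insensitive matching are my choices, not something the original spells out:

```python
import re

DENY_PATTERNS = [
    (r"\brm\s+-rf\s+(/|\.|~|\$HOME)", "destructive recursive delete"),
    (r"\b(drop|truncate)\s+table\b", "destructive database command"),
    (r"deploy\.py\s+production\b", "production deploy"),
]


def denied_reason(command: str):
    """Return the policy reason if the command matches a deny pattern, else None."""
    for pattern, reason in DENY_PATTERNS:
        # IGNORECASE so "DROP TABLE" is caught as well as "drop table".
        if re.search(pattern, command, flags=re.IGNORECASE):
            return reason
    return None
```

Regex deny lists are a coarse filter. They catch the obvious cases, so treat them as one layer rather than a sandbox.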

Context injection at session start and on prompt submit:

python

# SessionStart: inject repo conventions before the model reasons about anything
context = """
Project context for agent-hooks-demo:
- Application code lives in src/.
- Tests live in tests/.
- Run `python3 -m unittest discover -s tests` before calling work complete.
- Do not edit generated/, fixtures/sensitive/, .env, .env.local, .git, or files outside the repo.
- Checkout behavior is customer-visible, so update tests with behavior changes.
""".strip()

python

# UserPromptSubmit: add billing-specific context when the prompt warrants it
# Lowercase the prompt so "Refund" and "refund" both match.
if any(term in prompt.lower() for term in ["refund", "billing", "invoice", "payment", "checkout"]):
    context = (
        "This request touches checkout or payment behavior. Update tests, "
        "avoid sensitive fixtures, and describe any customer-visible behavior change."
    )

Hooks don't replace model judgment. The model still picks the approach, writes the code, and recovers from failures. Path policy, command policy, test gates, audit logs, and completion conditions move out of model memory and into code. Those are the things that should never depend on whether the model remembers an instruction from three turns ago.

A practical starting sequence: add a PreToolUse hook that blocks writes to protected paths, then a PostToolUse hook that runs your fastest test command and writes a state file, then a Stop hook that reads it. That sequence alone cuts most of the repeated reminders and post-hoc cleanup that slow down agent workflows.
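
For the last step in that sequence, a minimal sketch of the Stop-side check. The state file path is the one mentioned earlier in the post. The JSON shape (a single `passed` boolean) and the function name are assumptions:

```python
import json
from pathlib import Path

STATE_FILE = Path(".hook-state/last_quality_gate.json")


def completion_blocker(state_file: Path = STATE_FILE):
    """Return a reason to block completion, or None if the last gate passed."""
    if not state_file.exists():
        return "No quality gate result found. Run the tests first."
    # Assumed shape: {"passed": true} written by the PostToolUse hook.
    state = json.loads(state_file.read_text())
    if not state.get("passed", False):
        return "Last test run failed. Fix the failures before finishing."
    return None
```

Treating a missing file as a failure is deliberate: the agent cannot skip the gate simply by never triggering the test run.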

If a requirement uses "always," "never," "block," "record," or "verify," put it in a hook.

u/Forward_Regular3768 — 4 days ago

How to Build AI Agents in 2026. Full Breakdown

Most agent frameworks just wrap an API call. Think of agentic-harness as a runtime. The difference shows up the first time a session runs for two hours, a sandbox drops mid-task, or the context window fills without compaction configured.

Two Rust crates, one binary. cargo build is the full pipeline.

The three-layer architecture

  • Outer ring: your handlers. Receive an AgentContext, call sessions, read and write files, run shell commands, spawn tasks.
  • Middle ring: the harness. Handles session persistence, compaction, role discovery, and provider swaps without touching your handler code.
  • Inner ring: execution targets. Local filesystem, CI checkout, remote sandbox over HTTP, Cloudflare Worker boundary. Handlers call session.shell() regardless of which target is underneath.

Session identity

  • Agent identity is a URL path, not a registry key: POST /agents/<name>/<id>
  • Same ID continues the conversation. Different ID starts fresh.
  • No session creation endpoint. No UUID you manage yourself.
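
The path-as-identity idea is easy to sketch. This is not agentic-harness code, just a toy in-memory version with made-up names (`parse_agent_path`, `history_for`):

```python
def parse_agent_path(path: str):
    """Split /agents/<name>/<id> into (name, id); return None for anything else."""
    parts = path.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "agents" or not parts[1] or not parts[2]:
        return None
    return parts[1], parts[2]


# Toy session store keyed by (agent name, session id).
sessions: dict = {}


def history_for(path: str) -> list:
    key = parse_agent_path(path)
    if key is None:
        raise ValueError(f"not an agent path: {path}")
    # Same id returns the existing history; a new id starts an empty one.
    return sessions.setdefault(key, [])
```

There is no create step: the first request to a new path is what brings the session into existence.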

Sessions

  • Hold message history, file access, shell execution, tool registrations, role assignment, and compaction budget.
  • History accumulates automatically across HTTP calls on the same ID.
  • Read files and format them inline when you need large context alongside a prompt.

Tasks

  • A task is a one-shot child session with fresh history and shared workspace.
  • The parent's history never sees the task's intermediate reasoning.
  • Run exploratory analysis inside a long session and the model anchors on noise from 40 messages ago. Tasks are the fix.
  • The threshold for "make it a task" is lower than it feels.

Roles and skills

  • Roles are system-prompt overlays scoped to a single call, defined in markdown, auto-discovered at startup.
  • Role frontmatter pins a specific model per role. Your explainer runs on Sonnet, your security auditor on Opus.
  • Skills are markdown files the model reads before every session. Edit the file, behavior updates on the next run, no recompile.

HttpSessionEnv

  • The binary runs locally or in CI. File and shell operations execute inside a remote sandbox. The agent doesn't know which.
  • Every session.shell() is a network round trip. Tight loops accumulate latency fast.
  • Fix: write a script to the sandbox and run it once per iteration.

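// Start the raw string with the shebang so it is the first line of the script.
session.write("/workspace/agent-check.sh", r#"#!/bin/bash
# pipefail keeps a failing `cargo test` from being masked by `tail`.
set -eo pipefail
cargo fmt --check 2>&1
cargo clippy -- -D warnings 2>&1
cargo test --no-fail-fast 2>&1 | tail -100
"#)?;
// Invoke through bash so the script does not need the executable bit.
let result = session.shell("bash /workspace/agent-check.sh")?;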

Build targets

  • --target native: self-contained binary, nothing required on the target machine.
  • --target node: native binary wrapped by a 30-line Node HTTP shim, for platforms that require a Node entrypoint.
  • --target cloudflare: Worker boundary plus WebAssembly adapter. No real filesystem, no long-running shell. Webhook handling and routing only.

Compaction

  • Fires automatically when session history fills the context budget.
  • Summarizes the middle of the conversation, preserves the tail verbatim.
  • Summaries lose precision. Write load-bearing decisions to files before they get compacted.

session.write(
    ".agentic-harness/decisions.md",
    "chose authlib over fastapi-users.\n\
     reason: only library with PKCE support compatible with axum middleware.\n\
     do not revisit.\n"
)?;

  • Files survive compaction. History doesn't.
  • Run agentic-harness doctor to get your model's actual context window. Set your budget to 80-90% of that.
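
A small sketch of the budget rule and the summarize-the-middle shape. The function names are mine, and keeping the first message as a head is an assumption. The harness's real behavior may differ:

```python
def compaction_budget(context_window: int, percent: int = 85) -> int:
    # The post suggests 80-90% of the model's actual context window.
    return context_window * percent // 100


def compact(history, budget, keep_tail, count_tokens, summarize):
    """Replace the middle of the history with one summary once the budget is exceeded."""
    if keep_tail < 1:
        raise ValueError("keep_tail must be at least 1")
    total = sum(count_tokens(message) for message in history)
    if total <= budget or len(history) <= keep_tail + 1:
        return history
    # Head and tail stay verbatim; only the middle is summarized.
    head, middle, tail = history[:1], history[1:-keep_tail], history[-keep_tail:]
    return head + [summarize(middle)] + tail
```

Everything in `middle` is reduced to whatever `summarize` returns, which is why decisions that matter belong in a file first.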

The full walkthrough is in the complete article, with every code block, CI pattern, schema-guided output, Model Context Protocol wiring, and failure mode breakdown.

u/Forward_Regular3768 — 5 days ago

Split Your CLAUDE.md Before It Hits 400 Lines. Here’s how

A CLAUDE.md file starts small: build commands, a naming convention, a note about the test runner. Then API guidelines move in, then security rules, then React patterns and database warnings. Six months later it's 400 lines and every line loads into context every session, whether Claude is touching a migration file or a button component.

That's priority saturation. Migration safety rules compete with CSS conventions for attention on tasks where neither applies.

Claude Code's rules system fixes this with path-scoped instruction files.

How it works

Rules live in .claude/rules/ at two levels: project rules committed to git and shared with the team, and personal user rules at ~/.claude/rules/ that apply across every project you work on.

A rule file without paths: frontmatter loads at session start, same as CLAUDE.md. A rule file with paths: frontmatter loads only when Claude reads a matching file:

---
paths:
  - "prisma/migrations/**/*"
---

# Migration Safety Rules
- Always include rollback instructions
- Never delete columns in the same migration that removes code using them
- Add columns as nullable first, populate, then add constraints

When Claude edits a React component, those migration rules never enter context. The glob patterns follow standard syntax, with ** for any directory depth and brace expansion for multiple extensions:

paths:
  - "src/api/**/*.ts"

paths:
  - "**/*.test.ts"
  - "**/*.spec.ts"

paths:
  - "src/**/*.{ts,tsx}"
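
To make the glob semantics concrete, here is a rough matcher that handles `**`, `*`, and single-level brace expansion. It is an illustration of the pattern rules described above, not Claude Code's actual matcher, which may treat edge cases differently:

```python
import re


def expand_braces(pattern: str) -> list[str]:
    """Expand {a,b} groups into separate patterns."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if not match:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    results = []
    for option in match.group(1).split(","):
        results.extend(expand_braces(head + option + tail))
    return results


def glob_to_regex(pattern: str):
    """Translate a glob into a regex: '**/' spans directories, '*' stays in one segment."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:.*/)?")  # zero or more directories
            i += 3
        elif pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")  # anything except a path separator
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts) + r"\Z")


def rule_applies(path: str, patterns: list[str]) -> bool:
    return any(
        glob_to_regex(expanded).match(path)
        for pattern in patterns
        for expanded in expand_braces(pattern)
    )
```

With this, `src/**/*.{ts,tsx}` matches `src/components/Button.tsx` but not `prisma/schema.prisma`, which is the behavior the scoping relies on.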

What to scope and what to leave always-on

Path-scoped rules load when Claude reads a matching file, not when you mention one. If Claude skips the read step and goes straight to editing, the rule may not load. For safety-critical instructions, skip the paths: filter so they always load. Use scoping for style conventions and domain-specific patterns where missing the rule isn't catastrophic.

A practical layout for a TypeScript project:

.claude/
├── CLAUDE.md                     # Build commands, architecture overview, key conventions
└── rules/
    ├── code-style.md             # No paths; universal style rules
    ├── api-patterns.md           # paths: src/api/**/*.ts
    ├── testing.md                # paths: **/*.test.ts, **/*.spec.ts
    ├── database.md               # paths: prisma/**, src/db/**/*
    ├── security.md               # No paths; always loaded (safety-critical)
    └── frontend/
        ├── react.md              # paths: src/components/**/*.tsx
        └── styling.md            # paths: **/*.css, **/*.scss

Rules vs. CLAUDE.md vs. auto memory

CLAUDE.md is for essential session context: build commands, architecture overview, a handful of critical conventions. Keep it under 200 lines.

Rules are modular standards scoped by domain or file type.

Auto memory is what Claude writes based on working with you: debugging quirks, build edge cases, personal preferences. Audit it periodically but don't author it yourself.

If two rule files contradict each other, Claude picks one without warning. Audit your rules when you add new ones and keep each file to a single topic.

The setup takes ten minutes. Pull domain-specific content out of CLAUDE.md into focused .claude/rules/ files, add paths: frontmatter where it makes sense, and keep safety-critical rules always-on.

u/Forward_Regular3768 — 6 days ago

I Automated Knowledge Base With Codex and Obsidian

The Goal: Bookmark something on X. Every morning, your agent has pulled it into Obsidian, processed it, and updated its own context. You don't have to touch it.

5-folder structure (organized for the agent, not for you)

  • AGENTS.md: First file your agent reads. Your identity, 2026 goals, and hard rules ("never give boilerplate," "always check /notes before suggesting anything").
  • inbox/: Staging area. Articles, voice notes, raw content land here with no tagging.
  • notes/: Processed knowledge. API schemas, framework deep-dives, research papers. Source of truth.
  • ideas/: Your original thinking and working logic. This stops the agent from giving generic answers.
  • projects/: Active work. Sits next to notes/, so the agent can pull a paper from three months ago and apply it to code you're shipping today.

Automation prompts (copy verbatim)

X Bookmarks extraction:

"use  Navigate to my X Bookmarks. Extract the content of every thread saved in the last 24 hours. Strip out the ads and engagement bait. Convert the core insights into a clean Markdown file titled YYYY-MM-DD-X-Insights.md and save it to my /inbox."

YouTube Watch Later:

"Access my YouTube with u/computer 'Watch Later' list. For every video added today, pull the full transcript. Use your internal logic to summarize the technical 'How-To' or 'Big Idea' from each. Save these as individual Markdown files in my /inbox so I can search them locally."

Daily audit:

"Audit all new files in the /inbox and /notes from the last 24 hours. Cross-reference them with our roadmap in AGENTS.md.
MEMORY ENHANCEMENT: Identify new technical patterns or logic I need to internalize for our current projects.
STRATEGIC SHIFT: Does this new information suggest a better way to execute our current tasks? If so, flag the contradiction.
IMMEDIATE ACTION: Based on these upgrades, what is the single most high-leverage task for today? Update your internal memory and save the strategy to DAILY-BRIEF.md."

Weekly firmware update:

"Analyze all Daily Briefs and new intelligence from the past 7 days.
EMERGING THESIS: What high-level skill or concept have we mastered this week?
ARCHITECTURAL EVOLUTION: Reorganize our folders and concepts to reflect our current level of understanding. Create new specialized directories if our focus has shifted.
FIRMWARE UPGRADE: Rewrite the 'Core Logic' section of AGENTS.md. Integrate everything we've learned this week so you start Monday morning as a more capable, more senior version of yourself."

Keeping the agent accurate as the vault grows

Large vaults make agents sloppy. Three rules that prevent it:

  • No technical decision without citing a specific file from /notes. No citation means the agent says it's guessing.
  • Before writing code, the agent writes a 3-sentence plan grounded in AGENTS.md.
  • If a task contradicts a saved note, the agent stops and asks. No workarounds.

The ideas/ folder does more work than any other layer. Your stored logic there is the only thing between a generic answer and one that fits how you actually build.

u/Forward_Regular3768 — 8 days ago

Claude Design vs. Google Stitch: I Used Both. They Solve Completely Different Problems.

Google dropped a major Stitch update in March 2026. Anthropic launched Claude Design on April 17. I spent time with both (because the manager wanted us to start implementing AI extensively in our workflow). They're not really competing for the same user.

The core difference

Stitch is a spatial canvas. You describe a UI, it generates screens, you drag things around on an infinite board. Up to 5 connected screens at once, export to Figma or Firebase Studio, powered by Gemini, free at 350 generations per month.

Claude Design is conversation-first. You prompt, Claude builds, you refine through comments and inline edits. It reads your codebase during onboarding and assembles your design system automatically. Every project after that inherits your brand by default.

If you think in layout and space, Stitch clicks immediately. If you work from a spec or a brief, Claude Design fits better.

Where they diverge in practice

Stitch handles multi-screen prototyping faster on a first pass. You get 5 linked screens, a clickable prototype, and the ability to auto-generate next screens from any existing one.

Claude Design's collaboration model is more developed. Colleagues can edit and chat with Claude in the same conversation thread. The full design history is shared, not just the exported file. For async cross-functional reviews, that context matters.

The developer handoff is where the Claude ecosystem consolidation shows up. Stitch exports clean HTML, CSS, and React and pipes directly to Firebase Studio via the Model Context Protocol. Claude Design outputs a handoff bundle you pass to Claude Code in one instruction. If you're already in that workflow, that's a meaningful shortcut.

Export targets tell you who each tool serves

Stitch exports: HTML/CSS, React, Figma, Firebase Studio.

Claude Design exports: PDF, PPTX, Canva, standalone HTML, folder download, Claude Code handoff bundle.

Stitch is built around developers. Claude Design covers designers, marketers, and developers in the same export menu. The audience assumption is baked into the output options.

When to use which

Stitch makes sense for individual work, fast UI exploration, or anything where you need a free tier that doesn't require an existing subscription.

Claude Design makes sense when your team already pays for Claude, you have a design system worth preserving, and you want design, collaboration, and code handoff inside one conversation. The subscription model reflects that positioning.

One thing the comparison doesn't surface: the onboarding gap. Stitch starts generating the moment you type. Claude Design's codebase-reading setup takes longer but pays out across every subsequent project in that workspace. For one-off work, that's friction. For ongoing product work, it's the feature.

u/Forward_Regular3768 — 9 days ago

How I Cut Token Usage 98.7% Using Anthropic's Code Execution API

The MCP-versus-command-line debate ran most of 2025. It was actually the wrong fight.

The critics measured real numbers. Playwright MCP loads 13.7K tokens. Chrome DevTools loads 18K. Five servers together burn 55K tokens before the agent does anything.

The defenders had a point too. Command-line interfaces break on multi-tenant apps. Without typed contracts, the agent guesses at output shapes and wastes turns on unfamiliar APIs.

Both sides diagnosed symptoms. Neither named the problem.

The loading habit. Every tool schema into context at session start. Every tool response piped back through the model. A single workflow hits 150K tokens not because MCP is broken but because you loaded everything upfront and kept it there.

Anthropic's fix, published November 2025: let the model write code that calls tools through a runtime instead of calling tools through context. The model only sees what it imports.

Their example: Google Drive transcript into Salesforce update. Old approach loaded both tool schemas and piped the transcript through the model twice. New approach is a few lines of TypeScript. Same result, 2K tokens. 98.7% reduction.

Cloudflare took it further. 2,500 endpoints, 1.17M tokens of schemas if you load the full catalog. They collapsed it to 1K tokens with two functions: search and execute. The agent writes code that searches for what it needs, then runs only that.
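
A toy version of the two-function idea, to show why the context cost stays flat as the catalog grows. The registry, tool names, and signatures here are invented for illustration. This is not Cloudflare's implementation:

```python
# Hypothetical catalog: in a real system this would hold thousands of endpoints.
TOOLS = {
    "dns.list_records": {
        "description": "List DNS records for a zone",
        "run": lambda zone: [f"A {zone}"],
    },
    "cache.purge": {
        "description": "Purge cached files for a zone",
        "run": lambda zone: f"purged {zone}",
    },
}


def search(query: str) -> list[str]:
    """Return the names of tools whose name or description mentions the query."""
    needle = query.lower()
    return [
        name
        for name, tool in TOOLS.items()
        if needle in (name + " " + tool["description"]).lower()
    ]


def execute(name: str, **kwargs):
    """Run one tool by name; only this call's result goes back to the model."""
    return TOOLS[name]["run"](**kwargs)
```

The model only ever sees the signatures of `search` and `execute`, plus whatever a search returns. The rest of the catalog never enters context.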

Two primitives.

Bash for anything with a binary already on the system. Git, curl, grep. Finding every Python file that imports pandas:

grep -r "import pandas" --include="*.py" .

No tool definition. The shell does the work.

Typed module imports for proprietary APIs. Small TypeScript files, one per tool, inputs and outputs declared. The agent pulls only what the task requires:

import { searchFiles } from "@tools/github";
import { sendMessage } from "@tools/slack";

const files = await searchFiles({ pattern: "*.py", path: "./src" });
const summary = files.map(f => f.path).join("\n");

await sendMessage({
  channel: "#engineering",
  text: `Found ${files.length} Python files:\n${summary}`,
});

GitHub and Slack schemas enter context only on those import lines. Everything else the runtime offers stays out. The file list processes in code. The model never sees raw paths, only the summary the code built.
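For a sense of what one of those per-tool files might contain, here is a hedged sketch. The `SearchFilesInput` and `FileHit` types, the in-memory file list, and the glob helper are all assumptions for illustration; the real `@tools/github` module is not shown in the post.

```typescript
// Hypothetical contents of a per-tool module such as "@tools/github".
// Inputs and outputs are declared, so the agent does not have to guess shapes.
interface SearchFilesInput {
  pattern: string; // simple glob like "*.py"; only "*" is supported in this sketch
  path: string;    // directory prefix to search under
}

interface FileHit {
  path: string;
}

// Stand-in for a real repository listing.
const repoFiles: string[] = ["./src/app.py", "./src/util.py", "./src/index.ts", "./docs/readme.md"];

// Convert a "*" glob into a regular expression matched against the file name.
function globToRegExp(pattern: string): RegExp {
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&").replace(/\*/g, ".*");
  return new RegExp(`^${escaped}$`);
}

// Synchronous core: filter by directory prefix, then by file name.
function matchFiles(input: SearchFilesInput): FileHit[] {
  const re = globToRegExp(input.pattern);
  const prefix = input.path.endsWith("/") ? input.path : input.path + "/";
  return repoFiles
    .filter((f) => f.startsWith(prefix))
    .filter((f) => re.test(f.slice(f.lastIndexOf("/") + 1)))
    .map((f) => ({ path: f }));
}

// Async wrapper matching the call shape used in the example above.
async function searchFiles(input: SearchFilesInput): Promise<FileHit[]> {
  return matchFiles(input);
}
```

Because the return type is declared, the calling code can map over `path` without a trial run to discover the output shape.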

Typed contracts from the protocol side. Lazy loading from the command-line side. Both inside one runtime. The agent picks per task. File search is bash. Salesforce update is a typed import. Same workflow, both in a few lines.

Most agent token waste isn't architectural. It's loading the full toolbox at session start and keeping it there.

Tool definitions belong in code. The model writes the calls. The runtime handles the rest.

u/Forward_Regular3768 — 10 days ago

Hermes Agent + Claude Code is a real superpower. Here is the Simple Setup

Hermes is not a Claude Code competitor. Run it next to Claude Code or Codex. Those handle the keystrokes. Hermes handles what has to survive after the chat closes.

Three layers shifted at once, which is why the install is worth it.

DeepSeek V4-Pro went live April 24, 2026 at 1M context, 75% off through May 31 at $0.435 input and $0.87 output per million tokens. MiMo-V2.5-Pro lands at $1 input and $3 output per million tokens on OpenRouter under an MIT license, also at 1M context. It scores 78.9 on SWE-bench Verified and 68.4 on TerminalBench 2.0. OpenCode Go bundles both behind one key, $5 first month, $10 after.
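To put those per-million prices in concrete terms, a rough sketch. The prices are the ones quoted above; the 200K input and 20K output token counts are an invented example session, not a measurement.

```typescript
interface Pricing {
  inputPerMillion: number;  // dollars per million input tokens
  outputPerMillion: number; // dollars per million output tokens
}

const deepseekV4ProPromo: Pricing = { inputPerMillion: 0.435, outputPerMillion: 0.87 };
const mimoV25Pro: Pricing = { inputPerMillion: 1, outputPerMillion: 3 };

// Dollar cost of one session at the given token counts.
function sessionCost(p: Pricing, inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1000000) * p.inputPerMillion + (outputTokens / 1000000) * p.outputPerMillion;
}

// Hypothetical session: 200K tokens in, 20K tokens out.
console.log(sessionCost(deepseekV4ProPromo, 200000, 20000).toFixed(4));
console.log(sessionCost(mimoV25Pro, 200000, 20000).toFixed(4));
```

At those assumed volumes the session comes to roughly ten cents on the promotional DeepSeek pricing and about 26 cents on MiMo.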

Claude Code and Codex are the foreground coders most teams already use.

Hermes v0.13.0, the Tenacity Release, shipped May 7 as tag v2026.5.7. 864 commits, 588 pull requests, 295 contributors, 8 closed P0 security issues in 7 days.

What v0.13 ships

Durable Kanban. A task board on local SQLite. Workers run as full operating system processes with their own identity. Heartbeats prove liveness, missed heartbeats trigger reclaim, zombie detection catches workers that stopped responding, per-task retry budgets stop infinite loops. A hallucination-recovery gate verifies completion claims before tasks close, which catches workers reporting "done" on patches that never landed.
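The liveness and retry rules described above could be sketched roughly like this. This is not Hermes's code: the `Task` shape, the 30-second timeout, and the function name are assumptions for illustration.

```typescript
type TaskStatus = "queued" | "running" | "failed";

interface Task {
  id: string;
  status: TaskStatus;
  lastHeartbeatMs: number; // timestamp of the worker's last heartbeat
  retriesLeft: number;     // per-task retry budget
}

const HEARTBEAT_TIMEOUT_MS = 30000; // assumed value

// Reclaim tasks whose worker missed its heartbeat window.
// A reclaimed task goes back to the queue while budget remains, otherwise it fails.
function reclaimStale(tasks: Task[], nowMs: number): Task[] {
  return tasks.map((t): Task => {
    const stale = t.status === "running" && nowMs - t.lastHeartbeatMs > HEARTBEAT_TIMEOUT_MS;
    if (!stale) return t;
    return t.retriesLeft > 0
      ? { ...t, status: "queued", retriesLeft: t.retriesLeft - 1 }
      : { ...t, status: "failed" };
  });
}
```

The retry budget is what turns an endless reclaim cycle into a bounded one: each reclaim spends one retry, and a task with none left is marked failed instead of requeued.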

Persistent /goal. /goal <text> sets a standing objective and runs the first turn. After each turn an auxiliary judge model checks whether the goal is complete. If not, the agent continues, capped at 20 turns. State lives in the session database and survives /resume. Subcommands: /goal status, /goal pause, /goal resume, /goal clear.
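A minimal sketch of the continue-until-judged-complete loop, assuming the agent turn and the judge are plain functions. `runGoal`, `Turn`, and `Judge` are hypothetical names; only the 20-turn cap comes from the description above.

```typescript
type Turn = (goal: string, turnNumber: number) => string;   // runs one agent turn, returns its output
type Judge = (goal: string, lastOutput: string) => boolean; // true when the goal looks complete

const MAX_TURNS = 20;

function runGoal(goal: string, turn: Turn, judge: Judge): { complete: boolean; turnsUsed: number } {
  for (let n = 1; n <= MAX_TURNS; n++) {
    const output = turn(goal, n);
    if (judge(goal, output)) return { complete: true, turnsUsed: n };
  }
  return { complete: false, turnsUsed: MAX_TURNS };
}

// Usage with stand-in functions: the "agent" reports passing tests on its third turn.
const goalResult = runGoal(
  "Fix the failing tests",
  (_goal, n) => (n >= 3 ? "tests pass" : "tests still failing"),
  (_goal, output) => output === "tests pass",
);
console.log(goalResult.complete, goalResult.turnsUsed); // true 3
```

Since the judge is itself a model call in the real system, it can return true too early or never return true at all, which is why the cap matters.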

Checkpoints v2. Snapshots fire before write_file, patch, and destructive terminal commands. /rollback lists checkpoints. /rollback <N> restores and undoes the last chat turn. /rollback diff <N> previews. /rollback <N> <file> restores one file.
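The snapshot-before-write pattern can be sketched as below, using an in-memory map in place of a real filesystem. The `CheckpointedFiles` class and its method names are hypothetical and are not Hermes's API.

```typescript
type Snapshot = Map<string, string>; // path -> contents at snapshot time

class CheckpointedFiles {
  private files = new Map<string, string>();
  private checkpoints: Snapshot[] = [];

  // Every write snapshots the current state first, then applies the change.
  writeFile(path: string, contents: string): number {
    this.checkpoints.push(new Map(this.files));
    this.files.set(path, contents);
    return this.checkpoints.length; // checkpoint number, 1-based
  }

  read(path: string): string | undefined {
    return this.files.get(path);
  }

  // Restore everything to checkpoint n, or a single file when `path` is given.
  rollback(n: number, path?: string): void {
    const snap = this.checkpoints[n - 1];
    if (!snap) throw new Error(`no checkpoint ${n}`);
    if (path === undefined) {
      this.files = new Map(snap);
      return;
    }
    const old = snap.get(path);
    if (old === undefined) this.files.delete(path);
    else this.files.set(path, old);
  }
}

const store = new CheckpointedFiles();
store.writeFile("a.txt", "v1");                   // checkpoint 1 captures the empty state
const beforeV2 = store.writeFile("a.txt", "v2");  // checkpoint 2 captures a.txt = "v1"
store.rollback(beforeV2);
console.log(store.read("a.txt")); // v1
```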

Other v0.13 work. Gateway auto-resume keeps sessions alive across restarts and source-file reloads. Post-write linting fires deltas on Python, JSON, YAML, and TOML writes. 8 P0 security closures, including secret redaction on by default and a fix for a CVSS 8.1 cross-guild Discord direct message bypass.

What was already there

Hermes reads .hermes.md, AGENTS.md, CLAUDE.md, .cursorrules, then walks deeper as tools touch new directories. 7 terminal backends: local, Docker, SSH, Singularity, Modal, Daytona, Vercel Sandbox. 26+ providers including Anthropic, OpenCode Go, OpenRouter, GitHub Copilot, and any OpenAI-compatible endpoint. Cron jobs run as full agent runs or no_agent script-only watchdogs at zero LLM token cost. Credential pools rotate on rate limits. Hermes also exposes itself as a Model Context Protocol server, so Claude Code or Cursor can read its messaging state. delegate_task covers short synchronous subtasks where Kanban's durability is overkill.
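Rotating a credential pool on rate limits might look something like this in its simplest form. `KeyPool` is a hypothetical class, not Hermes's implementation, and round-robin order is an assumption.

```typescript
// Hypothetical key pool: advance to the next key whenever the current one is rate limited.
class KeyPool {
  private readonly keys: string[];
  private index = 0;

  constructor(keys: string[]) {
    if (keys.length === 0) throw new Error("pool needs at least one key");
    this.keys = keys.slice();
  }

  current(): string {
    return this.keys[this.index] as string;
  }

  // Call this when a request comes back rate limited (for example HTTP 429).
  rotate(): string {
    this.index = (this.index + 1) % this.keys.length;
    return this.current();
  }
}

const pool = new KeyPool(["key-a", "key-b"]);
console.log(pool.current()); // key-a
console.log(pool.rotate());  // key-b
```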

The weekend test

One repo. One real test-failure cleanup. Drop in:

cd /path/to/your/repo
hermes

If AGENTS.md, CLAUDE.md, or .cursorrules exists, Hermes loads it on turn one. Otherwise write a short .hermes.md with the test command, the code-style conventions, and the rule for when to fix the test versus the code.

Set the goal:

/goal Fix the failing tests in this repo. Run the test command, identify the smallest failing surface, patch one safe change at a time, stop when the test command passes. Use checkpoints before risky edits.

Each write_file or patch lands a checkpoint first. Each Python, JSON, YAML, or TOML write triggers a delta lint. The judge model checks goal completion after each turn. A bad edit rolls back with /rollback <N>, or you preview the diff first with /rollback diff <N>.

For parallel failures, stand up a board:

hermes kanban init
hermes kanban create "Fix failing auth tests" --workspace dir:/absolute/path/to/repo
hermes kanban dispatch --max 1
hermes kanban watch

Each task spawns a worker as a separate operating system process.

For the code-writing itself, delegate. Hermes ships bundled skills that hand work to Claude Code, Codex, or OpenCode in one-shot print mode or an interactive tmux session. Hermes keeps the harness. The foreground worker writes the code.

Use OpenCode Go as the provider. The $5 first-month subscription bundles DeepSeek V4-Pro and MiMo-V2.5-Pro behind one key. Spend caps: $12 per 5-hour window, $30 per week, $60 per month, so a fallback key matters for long cleanup runs.
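The caps interact: a run that keeps maxing out the 5-hour window uses up the weekly allowance quickly. A back-of-the-envelope sketch using only the three cap figures above (the helper name is made up):

```typescript
const CAP_PER_WINDOW = 12; // dollars per 5-hour window
const CAP_PER_WEEK = 30;   // dollars per week
const CAP_PER_MONTH = 60;  // dollars per month

// How many fully spent smaller periods fit inside the larger cap.
function periodsUntilCap(smallCap: number, largeCap: number): number {
  return largeCap / smallCap;
}

console.log(periodsUntilCap(CAP_PER_WINDOW, CAP_PER_WEEK)); // 2.5 maxed-out windows per week
console.log(periodsUntilCap(CAP_PER_WEEK, CAP_PER_MONTH));  // 2 maxed-out weeks per month
```

Two and a half heavy windows is well under one full day of continuous work, which is the case for keeping a fallback key around.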

By Monday: one repo, one goal, one model route, checkpoints firing, one failing test path closed without losing the thread.

Install

macOS or Linux only. Windows users go through Windows Subsystem for Linux 2.

curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
source ~/.zshrc   # or ~/.bashrc if your shell is bash


hermes setup
hermes model      # pick provider opencode-go
hermes doctor

Where it breaks

macOS users on Python 3.13 hit installer conflicts. Docker users hit Node version mismatches and .venv permission issues. Remote terminal backends have a known environment-variable passthrough bug. Discord lacks proxy support for restricted networks. Native Windows is not supported yet. /goal uses a judge model, so completion checks can fire wrong in either direction.

This is Tenacity, not Production. Treat the install as a serious test, not a deployment.

The next jump in coding agents probably won't be a smarter model. It will be the runtime that keeps the work alive after the chat window closes. v0.13 is the strongest installable version of that bet I have seen.

u/Forward_Regular3768 — 11 days ago
▲ 5 r/AIAgentsInAction+1 crossposts

How /review and /ultrareview changed how I ship code

Claude Code ships two review commands. The split is simple: /review for daily work, /ultrareview before anything merges into main.

/review runs locally. Fast, cheap, low token spend. I run it after every meaningful change and it catches the obvious stuff.

/ultrareview runs in the cloud. It spins up parallel agents that each inspect the codebase from a different angle, then merges findings into one report. Costs five to ten dollars per run and takes five to ten minutes. Use it on complex changes, skip it on a one-file tweak.

How /ultrareview works

It's diff-based, so your project needs to be in a Git repo. The command reads your current branch against the default branch, the changed files, and commit history.

You can point it at the working state or at a specific pull request:

/ultrareview <PR number>

# Reviewing a PR by full link
/ultrareview https://github.com/org/repo/pull/123

# Reviewing a PR by number
/ultrareview 123

When you pass a pull request, Claude spins up a cloud sandbox, clones the branch from GitHub, diffs against base, and runs the deep review there.

What I tested it on

A SaaS landing page built in React and TailwindCSS. I asked Claude to add a sign-up form that collects emails for a follow-up sequence. Claude wrote the form, then validated its own implementation before handing it back, which already trims the post-merge bug surface.

Then I ran /ultrareview on the result.

The terminal warns you about time and cost upfront, then opens a web session with a live progress link. One thing to know: the session page doesn't auto-refresh. If it looks stuck on the verify step, hit refresh. The review has finished; the page just didn't catch up.

Output comes back in two parts. Bugs found in the new feature, and the patches Claude proposed for each. Terminal shows the summary, the web session shows the full report.

When the deeper review pays off

On a landing page this size, /review and /ultrareview flagged the same bugs. The deeper pass didn't add much.

The payoff shows up on codebases with dozens of directories, where one change ripples through files you forgot existed. The parallel agents each take a different slice (security, logic, performance, integration), and the breadth is what the five to ten dollars buys you.
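One way to picture the merge step, as a sketch only. How /ultrareview actually combines agent output is not described in this post; the `Finding` shape and the dedupe-by-location rule are assumptions.

```typescript
interface Finding {
  angle: "security" | "logic" | "performance" | "integration";
  file: string;
  line: number;
  message: string;
}

// Merge per-agent reports into one list, collapsing findings that point at the same file and line.
function mergeFindings(reports: Finding[][]): Finding[] {
  const seen = new Map<string, Finding>();
  for (const report of reports) {
    for (const f of report) {
      const key = `${f.file}:${f.line}`;
      if (!seen.has(key)) seen.set(key, f); // first agent to report a location wins
    }
  }
  return Array.from(seen.values()).sort(
    (a, b) => a.file.localeCompare(b.file) || a.line - b.line,
  );
}
```

On a small change most agents land on the same few locations, so the merged list is barely longer than a single pass, which matches the landing-page result above.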

For prototypes and small features, stick with /review. Run /ultrareview once before merging anything you're shipping to production.

u/Forward_Regular3768 — 12 days ago
▲ 27 r/AIAgentsInAction+1 crossposts

A folder on your machine is the unit of work in Codex. Not the chat window, not the model toggle. The folder. Open it in Codex and the agent can read, write, and edit any file inside. Open the same folder in Claude Code later and the work picks up where you left it.

I built a comment intelligence system in one session to test how far that goes. Pulled 200+ comments from my channel, analyzed them in Excel, deployed the dashboard to a live URL, scheduled a weekly refresh. The flow below is what worked.

Foundations: AGENTS.md, plan mode, .env.local

Three things sit at the start of every project. An AGENTS.md file at the project root (the equivalent of CLAUDE.md if you're coming from Claude Code), plan mode toggled on, and a .env.local for keys.

Skip writing AGENTS.md by hand the first time. Describe the project to Codex in plain English and ask it to draft the file. It produces sections for context, project goal, product direction, and constraints. You read it, edit it, save.

Plan mode is what stops execution drift. The agent brainstorms, asks clarifying questions, produces a plan, waits. Submit the plan and it executes. Skip plan mode and you spend the rest of the session unwinding decisions Codex made on its own.

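Paste keys into .env.local. The leading dot only hides the file in directory listings; what keeps it out of commits when you push to GitHub is a matching entry in .gitignore, so confirm one exists (or ask Codex to add it) before the first push. Avoid stray text files for credentials.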

Connecting to something without a plugin

YouTube has no native plugin. In plan mode I told Codex the goal: pull recent comments from my channel. It came back with three paths (API key, OAuth, or both), and I picked the API key route. It walked through Google Cloud setup, enabling YouTube Data API v3, and creating the key.

The first connection test failed on PowerShell with a TLS issue. Codex tried Node, then Python, and both worked. I asked it to save the lesson into project memory so it stops trying the broken path. That feedback loop is what makes the agent better over time. You hit a wall, you tell Codex to remember it, the next session avoids the wall.
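The remember-the-failure loop can be sketched as a fallback chain that skips paths already recorded as broken. The `knownBroken` set stands in for whatever project memory Codex keeps; all names here are illustrative.

```typescript
type Attempt = { tool: string; run: () => boolean }; // run returns true on success

// Try each tool in order, skipping ones already known to fail, and record new failures.
function firstWorking(attempts: Attempt[], knownBroken: Set<string>): string | undefined {
  for (const a of attempts) {
    if (knownBroken.has(a.tool)) continue;
    if (a.run()) return a.tool;
    knownBroken.add(a.tool);
  }
  return undefined;
}

const brokenTools = new Set<string>();
const connectionAttempts: Attempt[] = [
  { tool: "powershell", run: () => false }, // stands in for the failed connection test
  { tool: "node", run: () => true },
];
console.log(firstWorking(connectionAttempts, brokenTools)); // node
console.log(brokenTools.has("powershell"));                 // true
```

On the next call with the same set, PowerShell is skipped without being tried, which is the whole benefit of writing the lesson down.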

The deliverable

I asked for an Excel workbook of the 200 most recent comments across recent videos with charts and creator-focused insights. Plan mode asked clarifying questions, then ran.

The workbook came back with content category mix, tool mention rankings, question rate, content-requested rate, frequent question patterns, reply opportunities ranked by priority, content idea suggestions, and a raw tab with every comment.
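Two of those metrics are simple enough to sketch. This is not how Codex computed them: treating any comment with a question mark as a question is a crude assumption, and the sample comments and tool list are invented.

```typescript
interface ChannelComment {
  text: string;
}

// Share of comments that contain a question mark.
function questionRate(comments: ChannelComment[]): number {
  if (comments.length === 0) return 0;
  return comments.filter((c) => c.text.includes("?")).length / comments.length;
}

// Count how many comments mention each tool (case-insensitive), most mentioned first.
function toolMentions(comments: ChannelComment[], tools: string[]): [string, number][] {
  return tools
    .map((tool): [string, number] => [
      tool,
      comments.filter((c) => c.text.toLowerCase().includes(tool.toLowerCase())).length,
    ])
    .sort((a, b) => b[1] - a[1]);
}

const sampleComments: ChannelComment[] = [
  { text: "How do I set this up in Codex?" },
  { text: "Claude Code worked better for me" },
  { text: "Can you compare Codex and Claude Code?" },
  { text: "Codex finally clicked after this" },
];
console.log(questionRate(sampleComments)); // 0.5
console.log(toolMentions(sampleComments, ["Claude Code", "Codex"])); // Codex 3, Claude Code 2
```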

The first version was mid. My prompt was vague, so the insights were generic. Specificity in the prompt is what produced useful analysis when I ran it again. Tell Codex which patterns you care about as a creator, which tools you want surfaced, what counts as a high-priority reply. The output sharpens.

Skills are reusable recipes

Once the workbook existed, I asked Codex to reverse-engineer the workflow into a skill. A skill is a markdown file inside the .codex folder that captures the steps. Two storage levels:

  • Global: user-level .codex folder, available in every project.
  • Project-level: stored inside the project itself.

You move between them with one sentence: "move that skill to the project level." You invoke with a slash command (/youtube-comment-insights) or natural language. Both routes work.

The first run of the skill takes the time the original build took. Every run after is one prompt.

Localhost to live URL

I asked Codex to build a dashboard that visualizes the workbook data, with charts, an explorer, and a clean user interface. Before any code, I had it use GPT Image 2 to generate concept art and user interface references and save them as project assets. The dashboard build then references those assets. The output looks sharper than letting the model freestyle the design.

Before handing the dashboard back, Codex ran two visual passes through its built-in browser, found three issues, fixed them.

The dashboard ran on localhost. To put it on the internet:

  1. Authorize Codex with GitHub through the command-line interface.
  2. Tell Codex to create a private repo and push the codebase.
  3. Connect Vercel to the same GitHub account.
  4. Import the repo into Vercel.
  5. Deploy.

Thirty seconds to a live URL. After that, GitHub and Vercel handle the rest. Every push triggers an automatic Vercel deploy. I work in Codex, Codex pushes commits, Vercel updates the live site.

The automation gotcha

Codex has an Automations tab for scheduled chats on a cron. I set one for every Sunday at 5pm: pull fresh comments, update the workbook, refresh dashboard data, push to GitHub, let Vercel deploy.
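In standard five-field cron syntax, Sunday at 5pm is written `0 17 * * 0`. A sketch of what that schedule resolves to, computed in UTC so the example is deterministic; the function name is made up, and a real scheduler would use the machine's local time zone.

```typescript
// Next Sunday 17:00 strictly after `from`, computed in UTC.
function nextSunday5pmUtc(from: Date): Date {
  const next = new Date(
    Date.UTC(from.getUTCFullYear(), from.getUTCMonth(), from.getUTCDate(), 17, 0, 0),
  );
  const daysUntilSunday = (7 - next.getUTCDay()) % 7; // getUTCDay(): 0 is Sunday
  next.setUTCDate(next.getUTCDate() + daysUntilSunday);
  // If today is Sunday and 17:00 has already passed, move to the following week.
  if (next.getTime() <= from.getTime()) next.setUTCDate(next.getUTCDate() + 7);
  return next;
}

// Wednesday 13 May 2026, midday UTC.
console.log(nextSunday5pmUtc(new Date("2026-05-13T12:00:00Z")).toISOString()); // 2026-05-17T17:00:00.000Z
```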

The first run took 40 minutes when it should have taken 7. Two reasons. The Excel file was open on my desktop, so Codex could not overwrite it. And the automation panel defaulted to GPT-5.2 instead of 5.5. The model picker inside automations does not inherit from your active chat. Set it explicitly per automation. I switched to GPT-5.5 high, closed the workbook, re-ran. Seven minutes.

One more constraint: local automations only run while your machine is on. Close Codex or shut down, the cron pauses. Claude Code recently shipped cloud routines that solve this. Codex has no native equivalent yet, so either keep the machine on or move the workflow to a virtual private server.

Browser use as a quality assurance loop

Codex's in-app browser is the cleanest agent browsing I've used. I told it to open the deployed dashboard, click around, try to break it, report back. It found six issues, including external YouTube links not opening in new tabs, an empty explorer state, and search that was too literal with no fuzzy matching.
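For the too-literal search finding, one lightweight fix is subsequence matching. This is an illustrative sketch, not the change Codex made to the dashboard, and `fuzzyMatch` is a made-up name.

```typescript
// True when every character of `query` appears in `text` in order (case-insensitive).
// Whitespace in the query is ignored so partial, abbreviated input still matches.
function fuzzyMatch(query: string, text: string): boolean {
  const q = query.toLowerCase().replace(/\s+/g, "");
  const t = text.toLowerCase();
  let matched = 0;
  for (let i = 0; i < t.length && matched < q.length; i++) {
    if (t[i] === q[matched]) matched++;
  }
  return matched === q.length;
}

console.log("Claude Code tips".includes("cld code"));   // false: the literal search misses
console.log(fuzzyMatch("cld code", "Claude Code tips")); // true
console.log(fuzzyMatch("xyz", "Claude Code tips"));      // false
```

Subsequence matching is forgiving of dropped letters but not of swapped ones; an edit-distance approach would cover typos at a higher cost per comparison.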

You can bake the browser quality assurance pass into the skill itself. Every time the agent ships a new dashboard or feature, it runs the browser pass before returning the work to you.

Browser use also covers anything an application programming interface can't reach. Tools without endpoints, dashboards that don't expose data, a user interface you describe in natural language and let the agent click through.

Conclusion

The folder is the portable unit. Open it in Codex, Claude Code, OpenClaw, or anything else that reads markdown, and the project carries over. Save every failure into project memory so the agent stops repeating it. One hour of setup gave me a workflow that pulls fresh data, analyzes it, deploys to a live URL, refreshes weekly, and tests itself in the browser before shipping.

u/Forward_Regular3768 — 11 days ago