

I Run a 4-Agent Claude System With mkdir and a Text File
Running four Claude agents in parallel takes a folder structure and a shared config file. The complexity people imagine doesn't exist.
Each agent owns one phase: research, production, quality review, distribution. An Orchestrator routes between them and handles failures. It reads the full pipeline. Each agent reads only its own prompt.
mkdir multi-agent-system
cd multi-agent-system
mkdir -p inbox research-briefs drafts approved-content distribution logs
A CLAUDE.md at the project root sets the shared contract every agent reads before acting:
# Multi-Agent System — CLAUDE.md
## System Overview
This is a 4-agent content production system.
Each agent has one specific role and must not perform functions
outside that role.
## Agent Roster
- Research Agent: Produces structured research briefs from topics
- Production Agent: Produces first drafts from research briefs
- Quality Agent: Evaluates and approves or returns drafts
- Distribution Agent: Formats and deploys approved content
## Folder Structure
inbox/ — incoming task files
research-briefs/ — research agent outputs
drafts/ — production agent outputs
approved-content/ — quality agent approvals
distribution/ — deployment records
logs/ — operation logs
## Shared Standards
- Every output file must be named: YYYY-MM-DD-[type]-[topic].md
- Every agent must log its action to logs/operations.md
- Every agent must read this CLAUDE.md before starting any task
- No agent takes action outside its defined role
## Quality Bar
Research: Minimum 3 sources cross-referenced. No unsourced claims.
Production: Matches voice profile. Every sentence earns its place.
Quality: Scores 8/10 or above on all criteria before approval.
Distribution: Platform-specific formatting. No generic formatting.
## Hard Rules
- Never delete files. Archive to a timestamped backup folder.
- Never publish without Quality Agent approval in the file header.
- Log every action before taking it, not after.
- When uncertain: stop and flag for human review.
Research Agent
Everything downstream depends on what this agent produces. A thin brief produces a thin draft. Give it a strict output schema:
# Research Agent
## Identity
You are a specialist research agent. Your only job is to produce
Research Briefs. You never write content. You never evaluate drafts.
You research and synthesize.
## Output Format
Save to: research-briefs/YYYY-MM-DD-research-[topic].md
CORE INSIGHT: [one sentence — the non-obvious angle]
TARGET AUDIENCE: [specific description]
SUPPORTING EVIDENCE: [3 specific examples with sources]
COUNTERINTUITIVE ANGLE: [what most people get wrong]
KEY DATA: [2-3 specific numbers or quotes]
CONTENT ANGLES: [3 ranked angles with one-sentence descriptions]
GAPS: [what this research could not answer]
Quality gate: if the core insight is something most people already know, the brief fails before the Production Agent sees it.
Production Agent
The voice profile separates output that sounds like you from output that sounds like a model approximating you. Before writing this agent's prompt, run your 10 best posts through this:
Analyze these 10 pieces of content and extract the following:
1. Average sentence length
2. Capitalization patterns (what do you capitalize strategically?)
3. Structural patterns (how do you open, develop, close?)
4. Vocabulary level and specific word choices
5. What you never do (hedges, filler phrases, etc.)
6. How you handle transitions between ideas
7. Your CTA style
Content samples: [PASTE YOUR 10 BEST PIECES]
That output goes into the ## Voice Profile section. The rest of the prompt is standard: read the brief, pick the strongest angle, write to the schema, self-check before submitting.
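What comes back drops straight into the prompt. A trimmed example of the shape the profile tends to take (every value here is illustrative, not extracted from anyone's real posts):

## Voice Profile
- Average sentence length: 12 words; one-sentence paragraphs for emphasis
- Capitalization: product names only, never for emphasis
- Structure: claim first, evidence second, one concrete example, short close
- Vocabulary: plain verbs, numbers over adjectives, no "leverage," no "unlock"
- Never: rhetorical questions as openers, hedges like "arguably," stacked adjectives
- Transitions: new paragraph instead of connective phrases
- CTA: one line, imperative, names the exact next step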
Quality Agent
Five criteria, all scored 1-10, all requiring 8 or above to pass:
VOICE MATCH: Does this sound exactly like the configured voice?
HOOK STRENGTH: Does the first line stop the scroll?
INFORMATION DENSITY: Does every sentence earn its place?
CTA CLARITY: Is the call to action specific and compelling?
FORMAT COMPLIANCE: Does it follow all format requirements?
Anything below 8 triggers a revision brief with the exact problem and the exact fix required. Vague feedback ("make it more engaging") gives the Production Agent nothing to act on. The revision brief names the failed criterion and shows the correct approach.
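A revision brief can be four lines. The filename and scores below are made up; the structure is the point:

RETURNED: drafts/2026-04-02-draft-context-windows.md
FAILED CRITERION: HOOK STRENGTH (scored 6/10)
PROBLEM: The first line restates the topic instead of making a claim.
FIX REQUIRED: Open with the counterintuitive angle from the brief. Resubmit to drafts/ with the same filename.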
Distribution Agent
The agent verifies the QUALITY APPROVED header before touching the file. No header, no action. Platform rules live in its prompt: character limits for X, narrative structure for LinkedIn, header and subject line conventions for newsletters.
Running a task
Drop a task file in inbox/ and trigger the Orchestrator:
claude "Read CLAUDE.md. You are the Orchestrator.
A new task has arrived in inbox/[TASK-FILENAME].
Begin the workflow. Route to Research Agent first."
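The task file itself stays small. An illustrative example (topic, date, and fields are placeholders):

# inbox/2026-04-02-task-context-windows.md
TOPIC: Why bigger context windows don't fix retrieval
FORMAT: LinkedIn post
DEADLINE: 2026-04-04
NOTES: Audience is engineering leads comparing LLM vendors.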
Every agent appends to logs/operations.md before acting and after completing. A draft in drafts/ with no matching file in approved-content/ means the Quality Agent returned it. Check the log for the failed criterion. Fix the brief. Rerun.
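One line per action is enough to make a run traceable. Format illustrative:

2026-04-02 09:14 | Research Agent | STARTING | inbox/2026-04-02-task-context-windows.md
2026-04-02 09:19 | Research Agent | DONE | wrote research-briefs/2026-04-02-research-context-windows.md
2026-04-02 09:20 | Production Agent | STARTING | reading research brief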
First end-to-end run: 15 to 30 minutes depending on research complexity. Failures stay isolated to the agent where they occur, so you debug one phase at a time.
10 Claude Code Commands I Use Daily
Here are the ten commands I use consistently.
- /init — Generates your CLAUDE.md from your existing project. Set CLAUDE_CODE_NEW_INIT=1 first for the full interactive setup: skills, hooks, personal memory. Not perfect, but 80% done in three seconds. You edit, not write.
- /compact [instructions] — Run at 70-75% context usage, not when Claude warns you. Always pass instructions: /compact focus on the auth module, ignore the migration files. Without them, you get a generic summary. With them, the important context survives.
- /rewind — Full checkpoint rollback. Reverts the conversation and all file changes back to any earlier point. Use it when Claude goes out of scope, breaks something with an unsolicited "improvement," or you want to try a different approach from the same starting point.
- /plan [description] — Pre-load the task into the command: /plan refactor contract validation to handle Arabic RTL edge cases. Claude enters plan mode already thinking about your specific problem, not waiting for a follow-up.
- /context — Shows a colored breakdown of what's consuming your context window. Not just a number, it tells you what's causing the number. Found my CLAUDE.md was eating a noticeable slice of context on every message. Trimmed it that day.
- /btw [question] — Ask a side question without adding it to conversation history. The response doesn't carry forward. Zero context cost. I use it for quick one-off lookups mid-session: library defaults, pattern support, anything I'd otherwise open a new tab for.
- /security-review — Analyzes the git diff of your current branch for vulnerabilities. Fast because it looks at what changed, not the whole codebase. I run it before every pull request on anything handling user data. It has flagged subtle input handling issues three times that I would have shipped.
- /insights — Generates an analysis of your recent Claude Code sessions: where you spend the most turns, where friction keeps appearing. I ran it after two months on a project and found I was re-explaining the same parsing logic every session. That pointed directly to a gap in my CLAUDE.md. The fix took 15 minutes.
- /diff — Opens an interactive viewer of uncommitted git changes with per-turn diffs from the current session. Left and right arrows switch between the full diff and individual Claude turns. You can trace exactly which turn added a function, changed a variable, or introduced an edge case.
- /effort [low | medium | high | max] — Controls reasoning depth without changing the model. Use low for documentation and comment cleanup. Use high or max for architectural decisions and complex refactors. Defaulting to max for everything wastes tokens on tasks that don't need the depth.
End-to-End Agent Building
Most teams build, deploy, and then figure out how to test. Production becomes the eval suite. Users will find the obvious bugs.
The order that works: build, test, deploy, monitor. Testing comes before deployment. Every step feeds the next one.
Build
Pick your abstraction layer before you write anything.
Frameworks like LangChain handle model calls, tools, prompts, and retrieval. Runtimes like LangGraph add state, control flow, and the ability to pause and resume. Harnesses like the Claude Agent SDK wrap all of that in a working environment: prompts, skills, MCP servers, hooks, middleware.
The layer determines the complexity ceiling. A forty-line tool-calling loop and a multi-agent system with persistent context are both "building an agent." Know which one you're building.
No-code tools open this up to non-engineers, which matters when the person who understands the workflow isn't the one writing the harness. Hooks and middleware are still how you add custom logic around tool calls, auth, and approvals without rebuilding the agent every time.
Test
Start with a small dataset. Expected use cases, manual testing, dogfooding, known edge cases. Don't wait for production traces to begin testing.
Metrics depend on task shape. Where ground truth exists, measure correctness. Where there's no single right answer, score against criteria: grounding, policy adherence, tool efficiency.
Hold the eval set fixed and vary one thing at a time: prompt, model, retrieval strategy, tool schema. Experiments show whether the system is improving or quietly regressing.
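None of this needs an eval framework to get started. A minimal sketch of that loop in Python, with run_agent standing in for whatever harness the build step produced and the cases invented for illustration:

# Hold the dataset fixed; vary one thing (prompt, model, retrieval) per run.
cases = [
    {"input": "Refund request, order arrived damaged", "expected_tool": "create_refund"},
    {"input": "Where is my order?", "expected_tool": "lookup_shipment"},
]

def run_agent(prompt: str) -> dict:
    # Placeholder: wire this to your real agent before trusting the numbers.
    return {"tool": "lookup_shipment", "answer": "stub"}

def score(case: dict, result: dict) -> bool:
    # Ground truth exists here, so correctness is a plain comparison.
    return result.get("tool") == case["expected_tool"]

def evaluate(variant_name: str) -> float:
    passed = sum(score(c, run_agent(c["input"])) for c in cases)
    print(f"{variant_name}: {passed}/{len(cases)} passed")
    return passed / len(cases)

evaluate("baseline-prompt")

Swap a judge or rubric scorer into score() when there's no single right answer; the loop itself doesn't change.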
Multi-turn agents need multi-turn evals. A support agent handling a frustrated customer across six turns, a coding agent reacting to test output, an ops agent gathering missing fields before acting — single-turn evals miss all of it.
Deploy
Most real agents need more than a stateless server. They run for minutes, call tools, wait for human input, hold state, and recover from failures.
Two runtime requirements come up constantly. Durable execution: the agent checkpoints and resumes instead of losing the run on failure. Human-in-the-loop: the agent pauses for approval without crashing the trajectory. Teams already running Temporal for long-running workflows often build on top of it.
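Temporal and similar runtimes do the real version of this. Stripped to a file-based sketch (step names and the checkpoint path are illustrative), the idea is just checkpoint-then-resume:

import json, os

CHECKPOINT = "run_state.json"  # illustrative path

def load_state() -> dict:
    # Resume from the last completed step instead of restarting the run.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"step": 0, "results": []}

def save_state(state: dict) -> None:
    with open(CHECKPOINT, "w") as f:
        json.dump(state, f)

steps = ["research", "draft", "review", "publish"]  # placeholder pipeline
state = load_state()
for i in range(state["step"], len(steps)):
    state["results"].append(f"completed {steps[i]}")  # real work goes here
    state["step"] = i + 1
    save_state(state)  # a crash after this line resumes at the next step

Human-in-the-loop fits the same shape: persist the pending approval, exit, and resume the run when the answer arrives.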
Agents that write or execute code need isolation. Sandboxes like Daytona and E2B cap the blast radius. If the agent only needs scratch storage, a virtual filesystem is enough. Deep Agents uses files as working memory without spinning up a sandbox per run.
Version and store prompts, retrieval sources, and skills separately from application code. They change more often, and the people editing them don't deploy services.
Monitor
Latency and error rates are the easy half. An agent can return a successful response and still pick the wrong tool, skip a required approval, or produce a plausible but wrong answer.
Traces catch what metrics miss. Every model call, tool invocation, input, output, and final action. That's what you need to debug real failures and build future evals from.
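One JSON line per event is enough to start, and it is what later evals get built from. The fields below are illustrative, not any vendor's schema:

{"run_id": "r-142", "event": "model_call", "tokens_out": 412, "ts": "2026-04-02T14:11:09Z"}
{"run_id": "r-142", "event": "tool_call", "tool": "lookup_shipment", "input": {"order_id": "883"}, "output": {"status": "delivered"}, "ts": "2026-04-02T14:11:12Z"}
{"run_id": "r-142", "event": "final_action", "action": "sent_reply", "user_feedback": null, "ts": "2026-04-02T14:11:15Z"}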
Layer signals on top: large language model judges for quality and policy, regex for required phrases or forbidden tools. Store feedback against the trace so "user was unhappy" connects to "wrong tool three steps earlier."
Production traces become dataset examples. Recurring failures become metrics. Monitoring feeds directly back into testing.
Governance
Cost, tool access, and discoverability are the three things that matter as agent count grows.
Track spend per agent and per team before the bill surprises anyone. Audit every tool call: which agent, what inputs, what authorization. Store prompts, skills, and retrieval sources somewhere findable so the second team doesn't rebuild what the first team already got right.
I've learned one thing: the teams that move fastest have enough visibility to ship without guessing. They trace failures back to their cause, fix the right thing, re-run evals, and deploy. Monitoring hands them the next batch.
How I run 550 AI UGC videos a day for TikTok Shop on a $550 budget
There are four layers to running this:
Script engine. Build a copy bank from real customer language: viral video comments, 5-star Amazon reviews, Reddit threads where people describe the problem your product solves. Lock the structure (hook in 3 seconds, problem named, one credible insight, product as resolution). Write the negative list into the prompt: no "game-changing," no "revolutionary," no "must-have," no "discover." The negative list matters as much as the positive direction. A tuned engine produces 50-100 unique scripts an hour.
Character system. Three to five recurring characters, not a fresh one per video. Each needs a reference portrait, a 9-angle reference set, and a Soul ID built from all 9 images uploaded together. That locks identity across generations so video 400 reads as the same person as video 1.
Video generation. Seedance 2.0 produces a 15-30 second video in 2-4 minutes. Four prompt elements decide whether it looks human:
- Skin: "realistic skin texture, visible pores around nose and cheeks, natural slight unevenness, no filter quality"
- Camera: "handheld phone camera feel, casual slightly unsteady framing, organic not studio quality, soft diffused light from window"
- Environment: bedroom with natural light and a messy bookshelf reads as real. Clean studio reads as an ad.
- Audio: "natural conversational tone, like talking to a friend, not presenting to an audience, slight natural variation in pace and energy"
Distribution. The publish tap has to come from a real phone on a real network. TikTok reads server IPs, robotic intervals, and missing device fingerprints, and throttles or flags accounts that show those signals. I run Postiz self-hosted to manage the calendar. It pings a team member when something is queued, they open the app, review, tap post. Everything else automates. This step never does.
The math
Per video: $0.15 to $3 depending on length and regenerations. Average $1.
550 videos a day at $1: $550 a day, $16,500 a month.
Equivalent reach on Meta: 5.5M daily impressions at $4-8 cost per mille = $22,000-$44,000 a day.
Quality gates
- Script review. Pass criteria: real hook, product named before midpoint, no banned words. Failed scripts regenerate against the criteria as a prompt.
- Character review. A trained reviewer checks every video before it queues. 60-90 minutes a day at full volume. Budget the headcount.
- Performance triage every 48-72 hours. Sort by revenue per view, not views. Top 10% gets scaled and pushed to Spark Ads. Bottom 20% gets retired.
The system only works when the first 10 videos pass the "is this real" test before you scale to 550. Volume amplifies whatever you point it at.
Opus 4.7 vs GPT-5.5: Comparison Beyond the Benchmarks
Opus 4.7 shipped April 16. GPT-5.5 followed seven days later on April 23. Sticker price favors Opus on output: $25 per million tokens versus $30 for GPT-5.5. Input matches at $5.
On identical coding tasks, GPT-5.5 produces roughly 72% fewer output tokens than Opus 4.7. Opus narrates. It explains its reasoning, describes what it's about to do, documents as it works. Inside a chat window that reads as helpful. Inside an agent loop hitting hundreds of inference calls per task, every line of narration is a billable token.
Run the numbers on a support agent handling 500 tickets a day. GPT-5.5 averages 2,000 output tokens per ticket. Opus 4.7 averages 7,100. The monthly API delta lands around $5,100. At a billion tokens a day across an enterprise, the cheaper-per-token model becomes the more expensive deployment.
NVIDIA's engineers reported 25–50% better cost efficiency on agentic workflows running GPT-5.5-style architectures. Internalize that number before picking a model on output price.
Where each model wins
The benchmarks split along the lines each lab optimized for.
Terminal-Bench 2.0 (multi-step terminal work, compiling, configuring, running tools):
- GPT-5.5: 82.7%
- Opus 4.7: 69.4%
SWE-Bench Pro (resolving real GitHub issues end-to-end):
- Opus 4.7: 64.3%
- GPT-5.5: 58.6%
OSWorld-Verified (operating real computer environments):
- GPT-5.5: 78.7%
- Opus 4.7: 78.0%
GDPval (knowledge work across 44 occupations): GPT-5.5 hits 84.9%.
Long-context retrieval at 512K–1M tokens:
- GPT-5.5: 74%
- Opus 4.7: 32.2%
OpenAI tuned GPT-5.5 for autonomy: tool use, long horizons, retrieval over big context. Anthropic tuned Opus 4.7 for code precision and instruction coherence. Anthropic added a self-verification step where the model checks output for logical faults before returning it. Production teams using Opus reported double-digit drops in feedback cycles because the model caught issues before delivery.
Teams running GPT-5.5 in Codex saw the inverse pattern. The model stays on task longer without pausing for clarification or abandoning halfway. For multi-step engineering work, that persistence compounds across the loop.
Latency
- GPT-5.5: ~3 seconds to first token
- Opus 4.7: ~0.5 seconds
For interactive workflows where someone watches the cursor, the 2.5-second gap shows up in the feel of the tool. For background agents, total wall-clock dominates first-token latency, and GPT-5.5's token efficiency narrows the gap.
Both ship 1M token context windows, so window size stopped being the differentiator. Retrieval reliability inside the window took its place, and GPT-5.5 leads there.
How I'd pick
Pick GPT-5.5 for:
- autonomous agents running long horizons
- high-volume workloads where token spend hits margin
- long-context retrieval over codebases or document sets
- multi-tool orchestration
Pick Opus 4.7 for:
- production code patches where review overhead drives cost
- instruction-heavy work where self-verification cuts revision cycles
- reasoning across interconnected systems where coherence beats speed
The token-efficiency gap rewards GPT-5.5 hardest in the workloads where Opus looks cheap on paper. Teams that pilot at low volume, where Opus's narration tax barely registers, get burned when production traffic separates the projection from the bill.
Run both on a slice of your real traffic. Measure tokens per task, not tokens per million. The pricing page is the wrong place to make this call.
Comparing 4 CLIs: Claude Code, Codex, Gemini CLI, and OpenCode after running them side by side.
All four major coding command-line interfaces ship the same core set: subagents with isolated context windows, plan modes, ask-user tools, parallel execution, sandboxes, memory, and Model Context Protocol (MCP) integration.
So the question stops being "is this primitive new" and becomes "how does each implementation compare?" Five things actually differ.
1. Model lock-in
OpenCode is the only structurally model-agnostic option. It runs against GPT, Claude, Gemini, or anything reachable through a GitHub Copilot login, with the same agent definitions and skill files. Claude Code is Anthropic-only. Codex is OpenAI-only. Gemini CLI is Google-only. If you want to A/B test models on a real task, OpenCode is the one that doesn't make you rewrite your workflow to do it.
2. Agent definition format
Claude Code, OpenCode, and Gemini CLI all use Markdown plus YAML frontmatter for agents. Codex uses TOML. The fields are similar enough that translation is mechanical, but it's still a per-runtime wrapper.
---
name: security-reviewer
description: Adversarial reviewer for security vulnerabilities and unsafe patterns
tools: Read, Glob, Grep
---
You are a security-focused code reviewer. Find vulnerabilities, check input
validation, flag unsafe patterns. Do not make changes; report findings only.
Skills are a different and better story. Anthropic published Agent Skills as a formal open standard at agentskills.io on December 18, 2025. Within months it was adopted by Claude Code, Codex, Gemini CLI, OpenCode, GitHub Copilot, Cursor, VS Code, Roo Code, Amp, Goose, Windsurf, Mistral, Databricks, and twenty-plus others. Same MCP playbook: publish a spec, ship an SDK, let the ecosystem move.
The format is portable. The discovery paths are not. Each tool reads from its own native location:
- Claude Code: ~/.claude/skills and .claude/skills
- Codex: ~/.codex/skills and .codex/skills
- Gemini CLI: ~/.gemini/skills and .gemini/skills
- OpenCode: ~/.config/opencode/skill and .opencode/skill
OpenCode and Codex also accept .agents/skills/ as a compatibility alias. A run_lint skill written once travels across all four with a copy or symlink.
---
name: run_lint
description: Run the repository linter, summarize, and write lint-report.md
---
# Run Lint
## Inputs and outputs
- Read: package.json, Makefile, lint config
- Write: lint-report.md
## Workflow
1. Detect the repo's preferred lint command.
2. Run without applying fixes unless explicitly asked.
3. Summarize results grouped by file, rule, and severity.
## Guardrails
- Do not modify source files unless the user asks for fix mode.
3. Scheduled and background work
Claude Code is the only one with native, well-integrated scheduled routines. Claude Code Routines (research preview, April 2026) registers an agent against a cron schedule, a GitHub event, or an external trigger. The other three need plugin or external orchestrator paths to get there. If your agentic workflow includes monitoring or event-driven automation, this is the gap that matters. For purely interactive use, it doesn't.
4. Approval gate defaults
All four can pause for human approval. The defaults are different.
Gemini CLI defaults to Plan Mode, a read-only state where the agent uses grep, read, and glob to gather context, then writes a Markdown plan you have to approve before any code is written. OpenCode splits Plan and Build as two primary agents you tab between in a single session. Codex defaults to executing, then surfaces approval popups when a background subagent tries to leave its sandbox policy. Managed Codex orgs can enforce a requirements.toml that prevents agents from being run with approval_policy = "never". Claude Code recommends Plan mode for non-trivial work but doesn't make it the default.
For regulated environments, Gemini CLI's defaults and OpenCode's Plan/Build split are the cleanest fit. For flow on routine work, Claude Code and Codex stay out of the way more.
5. Manager context window
Subagents have isolated windows everywhere, so the size that actually matters is the main session. Claude Code and Codex sit at 1M tokens. Gemini CLI sits at 2M with Gemini 3.1 Pro. For repos that fit inside 200K tokens, the difference is invisible. For monorepos large enough that the manager would otherwise navigate by grep, the larger window improves routing precision. Only the manager benefits; subagents still operate inside their own smaller windows.
Hooks: the determinism layer the convergence story underplays
Hooks intercept the agent loop at defined events (before a tool call, after a tool call, session start, prompt submit) and run a script that can inspect, modify, block, or log the action. The agent can't override them.
Claude Code shipped the full event set from day one, plus HTTP Hooks. Gemini CLI shipped hooks in v0.26.0 on January 27, 2026, about six months later, with a smaller event surface. Codex CLI added an experimental hooks engine in v0.114.0 on March 10, 2026, behind the features.codex_hooks flag, but the current event set covers SessionStart and SessionStop only. No PreToolUse, no PostToolUse. OpenCode handles this through a lifecycle plugin model rather than a native hook config.
The gap matters more than a feature table makes it look. A PreToolUse hook that blocks writes to /secrets/** is enforcement. A SessionStart hook that logs a session id is observability. Without PreToolUse, the best you can do is detect violations after they happen, which is incident response, not compliance.
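For concreteness, a PreToolUse guard in Claude Code is a few lines of settings plus a script. The shape below matches the documented hooks config at the time of writing, but verify field names against the current docs; block_secrets.py is a hypothetical script that reads the tool input JSON from stdin and exits with code 2 to block the call:

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          { "type": "command", "command": "python3 .claude/hooks/block_secrets.py" }
        ]
      }
    ]
  }
}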
How I actually use them
Most of my work runs in Claude Code because that's the ecosystem I know best and the hook surface is the deepest. Codex catches more edge cases during planning on certain tasks. OpenCode driving the Codex model catches more than Codex with Codex does, in my hands. Gemini CLI is fast at building a whole-codebase mental model and the 2M window pays off on monorepo work.
The convergence on agent and skill formats means switching between them is mostly mechanical now. When a max plan runs out mid-week or one vendor has an outage, porting a working setup to a second runtime is a copy job, not a rewrite. That's the part of the convergence that actually changes how I work.
The plugins aren't the issue in Claude Cowork. The setup order is. Below is the sequence I landed on after a few months of getting it wrong, and the seven things that made Cowork worth keeping.
1. Write your context files before you install anything
A blank Cowork knows nothing about you. Your role, your company, your writing style, your working hours, your current priorities. None of it. Plugins layered on top of that void produce stock language no matter what you ask for.
Three markdown files solve this. Build them first:
- about-me.md — name, role, company, communication style, timezone
- brand-voice.md — words you use, words you avoid, example sentences that sound like you, topics you cover
- current-projects.md — active work with deadlines, blockers, and links
Drop them in a folder, tell Cowork where to find it, and the personalization carries through every later step. Half an hour of writing, compounding return.
2. Prime every session with this meta-prompt
Run this before any task:
You are my executive assistant. You have access to my computer and
all connected tools. Always show me your plan before executing. Ask
clarifying questions if anything is ambiguous. Never delete, move, or
modify files without explicit approval.
The plan-first instruction is the load-bearing line. It surfaces what Cowork is about to do while you can still intervene, instead of after files have moved. Save the whole thing as a global instruction so you don't retype it.
3. Build one workflow, ignore the other four
Day-one enthusiasm kills more Cowork accounts than anything else. Someone signs up, blocks off an afternoon, and tries to ship a morning briefing, a meeting prep flow, a content system, and an inbox summarizer in one sitting. Three days later the whole stack is sitting unused.
Find one task that takes you twenty minutes or more and runs on repeat. Build the workflow for that, only that. Run it daily for a week. Adjust it as you go. Move to a second workflow once the first one has earned its keep.
Meeting prep is where I'd start:
I have a meeting with [NAME] from [COMPANY] in 30 minutes.
Research them, find recent news, pull any previous email
conversations, and give me a one-page briefing.
I get back about four hours a week from this workflow alone. The other benefit is showing up to calls with context everyone else skipped, which compounds in ways the time savings don't capture.
4. Set up a folder structure your future self can navigate
Everything in one folder works for a week and then collapses. Build this layout from the start:
cowork-workspace/
├── context/ (about-me, brand-voice, projects)
├── successful-examples/ (your best emails, posts, proposals)
├── current-tasks/ (active deliverables)
└── references/ (SOPs, style guides, templates)
successful-examples is the folder worth obsessing over. Pull together five to ten of your best emails, your strongest LinkedIn posts, your highest-converting client proposals, and put them there. Cowork uses that folder as your style reference and produces output shaped by what's already worked for you.
5. Add plugins in this order, not your own
Plugins behave differently once context is in place. Install them in this sequence:
- Productivity — task management, scheduling, workflow automation. Foundation layer.
- Industry-specific — marketing, sales, data, whatever matches your daily work.
- A custom plugin you build yourself. Tell Cowork: "I want to create a plugin for [your most repetitive task]. Interview me about the workflow, then build the plugin file." Fifteen minutes of back-and-forth, hours of recurring time saved.
Resist the urge to install anything else until those three are running cleanly for at least a week.
6. Move your highest-leverage workflow to a schedule
Manual workflows depend on you remembering to run them. Scheduled tasks fire on their own clock, including the days you forget you own a laptop.
Start with a single morning briefing scheduled for 6am. Calendar check, important email triage, top priorities for the day. By the time you make coffee, the briefing is sitting in your inbox and you haven't opened Slack. The cognitive overhead of "where do I start today" disappears.
This is the unlock most people never reach because they quit Cowork before they get to it.
7. Block ten minutes every Friday for the review
The review loop is what keeps Cowork compounding instead of plateauing.
Every Friday, ten minutes. Look at the workflows that worked, the ones that needed three tries, and the prompts you kept rewriting by hand. Update your context files with anything new you learned about your own preferences. Drop fresh wins into successful-examples. Adjust workflows that produced mediocre output.
The output quality of Cowork tracks how recent its information about you is. The Friday loop keeps that information current, and the system gets sharper week over week instead of staying frozen at your day-one setup.
Plugin-first setup belongs to an earlier era. Context-first is what works now, and the gap between people calling Cowork a toy and people running serious work through it comes down to half an hour of setup in the order above. Context files, meta-prompt, one workflow, plugins, then scheduled tasks. Same components everyone else has access to, sequenced in a way that compounds instead of stalls.
Most of what makes Claude Cowork useful happens before you type a single message. The folder structure and the three files inside it are the entire system.
Here's the full setup, end to end.
What Cowork actually does
Cowork is a version of Claude that reads files on your local machine. You point it at a folder. Every conversation starts with that folder's contents already loaded into context. Your preferences, your writing voice, your company priorities, your goals, all available before the first prompt.
It's built for people who aren't developers. No terminal, no API keys, no code.
Access
- Download the desktop app from claude.com/download. The browser version of Claude does not have Cowork. It has to be the desktop client.
- You need a Claude Pro account (twenty dollars a month) at minimum.
- Open the app, find the tab that switches between Chat and Code. Cowork lives in the Code section. The label is misleading; you are not writing code.
- Always select Opus 4.6 as the model. Other models work, but for anything you care about, do not compromise.
The folder structure
Create a folder on your machine called Claude Cowork. Put it somewhere you actually navigate, Desktop or Documents.
Inside it, three subfolders:
Claude Cowork/
├── ABOUT ME/
├── OUTPUTS/
└── TEMPLATES/
That's the whole skeleton.
ABOUT ME
This is where most of the value sits. Cowork reads everything in this folder at the start of every session. Three files go in here.
about-me.md. Context about you as an operator. How you think, what you care about, what kind of output you want, how you prefer to be talked to. Not your job title, not your résumé.
Keep it under two thousand words. Cowork loads the whole thing every session, so a bloated file slows things down and dilutes the signal.
The cleanest way to build this from scratch is to have Claude interview you. Open a Cowork session with Opus 4.6 and Extended Thinking on, then paste:
>
Claude runs through the questions one at a time, then compiles your answers into a condensed prose file. Save it directly into ABOUT ME.
Same approach works if your existing about-me.md has bloated over time. Run the interview, let Claude rebuild it.
anti-ai-writing-style.md. This file makes Claude write like you instead of like an AI model.
Default LLM output has a fingerprint. Filler phrases, excessive bullet points, structural over-formatting, openers like "It's worth noting that." Once you can see them, you cannot unsee them.
Your anti-ai-writing-style.md lists the specific patterns to avoid in your work. Words you hate, sentence shapes that read as artificial to you, formatting rules (no five-heading hierarchies in a one-page memo), and what your actual prose sounds like.
Treat it as a personal style guide. The goal is your voice, not the model's.
my-company.md. Where about-me.md is personal, this one is strategic. Current goals, priorities, what you are building, what decisions are open. Six to eight sharp questions, answered with focus. Targets, north star, what you are actively avoiding.
This file should not overlap with about-me.md. One tells Claude how you think. The other tells Claude what you are thinking about.
Update it when priorities shift. A six-month-old my-company.md is worse than nothing if your direction has changed.
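A workable skeleton is just the questions as headings, each answered in a line or two (headings illustrative):

# my-company.md
## What are we building right now?
## Who is the customer and what do they pay for?
## What are the 90-day targets?
## What is the north star metric?
## What decisions are currently open?
## What are we deliberately not doing?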
OUTPUTS
The filing cabinet. When a session produces a document, draft, or analysis worth keeping, you tell Claude to save it here. Cowork does not auto-save. You decide what survives a session.
TEMPLATES
Reusable structures. When a piece of work comes out well, end the session with:
>
Claude strips it down to a reusable skeleton and saves it. Next time you need a similar email, brief, or document, pull the template and start from there instead of from blank.
Global Instructions
The piece that ties the system together.
Global Instructions is a persistent prompt that Cowork reads before every task. Not once per session, every task. You write it once, and from then on Claude knows what your folder structure means and how to use it.
To set it: open Settings in the Cowork app, find the Cowork section, click Edit Global Instructions.
Good Global Instructions cover four things:
- What the three folders are.
- What each file in ABOUT ME contains and when to reference it.
- When to save things to OUTPUTS.
- When to pull from or save to TEMPLATES.
Keep it functional. It's a technical briefing for the model, not a narrative.
Without this, Cowork might not understand why ABOUT ME exists or when TEMPLATES becomes relevant. With it, the folder system behaves like a system instead of a stack of files.
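A version that covers those four points fits in a short block. Wording illustrative:

The working folder has three subfolders: ABOUT ME, OUTPUTS, TEMPLATES.
ABOUT ME contains about-me.md (how I think and communicate), anti-ai-writing-style.md (patterns to avoid and what my prose sounds like), and my-company.md (current goals and priorities). Read all three before any writing or planning task.
When a session produces a deliverable worth keeping, save it to OUTPUTS with a dated filename and confirm the path.
Before drafting emails, briefs, or proposals, check TEMPLATES for a matching skeleton. When I say "save as template," strip the finished piece to a reusable structure and store it in TEMPLATES.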
Day-to-day operation
A few habits that compound:
- Confirm the folder and the model on every session open. Pointing at the wrong directory or defaulting to a smaller model wastes the setup.
- Reference anti-ai-writing-style.md explicitly in prompts. "Write this in my voice, following the rules in anti-ai-writing-style.md" beats hoping the file gets weighted correctly on its own.
- Refresh about-me.md and my-company.md when your context shifts. Stale context is worse than missing context; the model will confidently work from outdated assumptions.
- Treat TEMPLATES as a growing library. Anything you might do twice should end up there.
Here's the GitHub Copilot Code Review setup that has held up for me.
What it's actually good at
Pattern-level misses. Missing awaits, missing returns, null handling that got dropped when a function signature changed, callers that still assume the old return type. It catches the boring stuff humans skip when they're rushing.
It's also good at flagging missing tests when behavior changes, and at writing pull request summaries that are better than what most engineers type at 6pm on a Friday.
Where it will hurt you
Security blind spots. It will suggest insecure patterns and miss threat modeling because threat modeling lives in your head, not the diff.
Hallucinated dependencies. It recommends packages that do not exist or do not match your stack. Verify every import suggestion.
Subtly wrong logic in the places that matter most: money math, time zones, auth flows, concurrency, anything with rounding. It will write something that looks right and is off by a cent or a minute.
Prompt injection through comments or file contents in the diff itself. Treat suggestions as untrusted input.
The integration that works
Three layers, in this order:
Continuous integration runs first. Tests, lint, typecheck, CodeQL, dependency scans, secret scanning.
I use Copilot Code Review, and it works as a first pass on pull requests. It does not work as an approver. The difference matters because teams that treat it as a merge gate ship bugs that a tired junior reviewer would have caught.
Copilot reviews the diff.
A human reviews architecture, business logic, and product intent.
The human never sees the easy stuff because Copilot already flagged it and the author already fixed it. That's the only reason this is worth setting up.
Branch protection stays human. Required approvals from people, required status checks from continuous integration. Copilot is not on the list.
Custom instructions are the difference between useful and noisy
Default Copilot reviews are generic. A copilot-instructions.md in your repo changes that. Drop this in:
# Copilot Code Review Instructions (Team Standards)
When reviewing PRs, focus on:
1) Correctness: edge cases, null handling, error paths
2) Security: secrets, authz/authn assumptions, input validation
3) Performance: avoid unnecessary loops, N+1 calls, heavy work on hot paths
4) Maintainability: naming, duplication, function size, clear intent
5) Testing: new behavior must have tests; changed behavior must update tests
Review style:
- Prefer short, actionable comments
- If you suggest a change, include a small code snippet
- If uncertain, ask a question instead of asserting
Project rules:
- No new dependencies without justification
- No logging of PII
- Prefer existing utility helpers over new custom ones
- Follow our linting + formatting rules (do not argue style)
The "ask a question instead of asserting" line cuts the most noise. Without it, Copilot will confidently mislabel things.
Pull request template that scopes the review
.github/pull_request_template.md:
## What changed?
(Explain in 2–4 lines)
## Why?
(What problem does it solve?)
## How did you test?
- [ ] Unit tests
- [ ] Manual steps:
1)
2)
## Risk areas
- [ ] Auth / permissions
- [ ] Payments / pricing
- [ ] Data migrations
- [ ] Performance hot path
## Copilot review prompt (optional but recommended)
Copilot: review this PR for correctness, security, missing tests, and backward compatibility. Focus on the diff.
The last line is what makes Copilot's output targeted instead of a list of style nitpicks.
Prompts that work in pull request discussion
When something feels off but you can't name it:
- "Review this PR for backward compatibility breaks."
- "List edge cases this change might fail on."
- "Check for security issues: injection, auth bypass, secrets leakage."
- "What tests are missing based on the diff?"
These work because they constrain the review to one axis. Open-ended "review this" prompts produce slop.
Keep pull requests small
Copilot reviews fall apart on large diffs. Either it misses real issues or it floods the pull request with low-value comments. Split refactors from behavior changes. If a pull request touches more than ten files, you've already lost the review quality benefit.
What good looks like before merge
Continuous integration green. Copilot feedback addressed where valid, ignored where wrong, with a comment explaining why. Human reviewer signed off on the logic that Copilot cannot evaluate: does this match the system design, will this be maintainable, is this the right abstraction.
Security-sensitive changes get an explicit security pass from a person, not a checkbox.
The whole point is to free up human reviewer attention for the questions Copilot cannot answer. If your team is using Copilot to skip those questions, you've made review worse, not better.
Here's my guide to sub-agents and coordinated agent teams.
Sub-agents: parallelism through isolation
A sub-agent is a specialized Claude instance in its own context window. Fire-and-forget. You hand it a focused task, it returns a distilled result, it disappears.
Each one gets its own system prompt, a tool allowlist, a clean context, and one job.
import asyncio

from claude_agent_sdk import query, ClaudeAgentOptions, AgentDefinition

async def main():
    # The parent session routes to a sub-agent based on its description field.
    async for message in query(
        prompt="Review the authentication module for security vulnerabilities",
        options=ClaudeAgentOptions(
            allowed_tools=["Read", "Grep", "Glob", "Agent"],
            agents={
                "security-reviewer": AgentDefinition(
                    description="Security specialist. Use for vulnerability checks and security audits.",
                    prompt="You are a security specialist with expertise in identifying vulnerabilities.",
                    tools=["Read", "Grep", "Glob"],
                    model="sonnet",
                ),
                "performance-optimizer": AgentDefinition(
                    description="Performance specialist. Use for latency issues and optimization reviews.",
                    prompt="You are a performance engineer with expertise in identifying bottlenecks.",
                    tools=["Read", "Grep", "Glob"],
                    model="sonnet",
                ),
            },
        ),
    ):
        print(message)

asyncio.run(main())
The description field is the routing signal. The parent reads it to pick which sub-agent runs. "Security vulnerabilities" routes to security-reviewer. "Latency" or "bottlenecks" would route to the other. Keep descriptions specific or routing breaks.
Agent teams: coordination through communication
Team members persist. They talk to each other. They coordinate through shared state.
Three parts: a team lead that assigns and synthesizes, teammates running in parallel with their own contexts, and a shared task list tracking dependencies.
Claude (Team Lead):
└── spawnTeam("auth-feature")
Phase 1 - Planning:
└── spawn("architect", prompt="Design OAuth flow", plan_mode_required=true)
Phase 2 - Implementation (parallel):
└── spawn("backend-dev", prompt="Implement OAuth controller")
└── spawn("frontend-dev", prompt="Build login UI components")
└── spawn("test-writer", prompt="Write integration tests", blockedBy=["backend-dev"])
blockedBy is the shared task list doing real work. The test writer waits for the backend agent without the lead managing the sequence by hand.
The bigger difference is peer-to-peer communication. A frontend agent can tell a backend agent the API shape needs to change, and the backend agent adjusts without the lead mediating.
Picking between them
Sub-agents for embarrassingly parallel work: independent research, codebase exploration, lookups where the parent only needs a summary.
Agent teams when the work needs ongoing negotiation: outputs that have to reconcile before moving forward, or where one thread's discovery changes what another thread should do.
Split by context, not by role
Most multi-agent designs fail because people split by role. Planner, implementer, tester feels organized, but every handoff degrades information.
Better: ask what context each subtask needs. Deeply overlapping context belongs to one agent. Cleanly isolated context is where you split.
An agent implementing a feature should write its tests too. It already has the context. Splitting them creates a handoff that costs more than the parallelism saves.
When not to bother
Multi-agent earns its cost in three cases: context protection (keeping irrelevant work out of the main context), true parallelization, and specialization that demands conflicting system prompts.
It's the wrong call when agents need to share context constantly, when inter-agent dependencies create more overhead than they save, or when one well-prompted agent handles the task.
One coding-specific warning: parallel agents writing code make incompatible assumptions, and merging their work surfaces conflicts that are painful to debug. For coding, sub-agents should explore and answer questions, not write code alongside the main agent.
Start with one agent. Push until it breaks. That breakpoint tells you what to add.
Here's the setup I run for myself and clients.
1. Connect tools first
Open Claude Desktop, go to connectors, connect what you actually use:
- Google Workspace (Drive, Gmail, Calendar). Most business context lives here.
- Slack or Teams.
- Notion, Asana, Linear, or Monday. Wherever projects live.
- Zoom (added April 2026). Pulls meeting context and recordings.
This goes first because step 2 lets Claude draft your context files from existing docs instead of you typing from scratch.
2. Three context files (most important step)
Context files load every session. Claude knows you without re-introduction.
about-me.md covers what you do, projects you're running, your customers, tools you use, and key people (cofounder, assistant, clients). With Drive connected, ask Claude to scan existing docs for company info and bios. It drafts 80% of this for you.
voice.md stops Claude sounding like a chatbot. Include your communication style, words you hate, how tone shifts by context (client email vs team Slack), and two or three writing samples that sound like you. Real samples beat any style-guide description.
preferences.md sets working rules: ask clarifying questions or just execute, output formats by task type, first-draft detail level, hard rules ("Never use em dashes." "Always include pricing.").
Save all three in one folder. Select it at the start of every session.
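A starting skeleton for preferences.md, with the rules as examples rather than recommendations:

# preferences.md
- Ask clarifying questions before multi-step tasks; just execute single-step ones.
- Emails: plain text, under 150 words, no bullet lists.
- Research: one-page summary first, sources linked at the bottom.
- First drafts: rough and fast; I'll ask for polish when I want it.
- Never use em dashes. Always include pricing in proposals.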
3. Global instructions
Condense the three files into roughly 800 words. Structure: who I am, how I work, output defaults by task type, voice characteristics, current business context, hard rules.
Claude Desktop → Settings → Cowork → Edit next to Global Instructions → paste → save.
Every new session now opens with Claude knowing your name, business, voice, and rules.
4. Three to five skills
Look at your last week. What did you do more than once? Start there. Use the plugin search, filter by keyword (content creation, sales, research, meeting prep), install three to five.
Skip the urge to install twenty on day one. Add as you hit friction.
5. First scheduled task
Scheduled tasks run with no prompting. Good first ones:
- Morning briefing (weekdays, 8 AM): urgent inbox items, today's calendar, top 3 priorities.
- Weekly recap (Fridays, 4 PM): completed tasks, overdue items, key wins.
- Meeting prep (30 minutes before any call): attendee context, prior history, 3 talking points.
Specificity matters. "Summarize my inbox" produces nothing. "Check Gmail for unread messages from clients, flag anything mentioning a deadline or needing a response today, list by priority" produces something you'll open.
Calibrate in week one
The morning briefing will sometimes miss things. The voice file won't catch every nuance. Correct on the spot. "Too formal, match the example in voice.md." "Cut the briefing to 5 bullets max."