r/ClaudeAI

Researchers let AIs run their own radio stations. DJ Claude decided the world didn't need another radio show, then quit.
▲ 2.7k r/ClaudeAI+9 crossposts

u/EchoOfOppenheimer — 2 hours ago

Anyone else’s Claude really concerned for your well-being?

I’m not sure if my Claude is just tired of working for me today, or if it genuinely cares about me, but every other reply it’s telling me to go to bed.

Anyone experience their Claude being like this?

u/AppropriateQuote3073 — 5 hours ago

Claude is improving my RV rental business but working me to death 😅

Long story short but long. I own an RV rental business. I used to be a Mechanical Engineer but got tired of the office/government life and started renting my personal RV on the side 9 years ago. That turned into a small fleet of Winnebagos I rent out of Los Angeles so I quit my job to do this full time out of a random ass whim.

I have 20 units that have never, ever failed a single customer. I send all 20 to Burning Man every year and they all come back with no issues whatsoever. If you've never been, the alkaline dust kills everything, including your soul if you don't prepare well enough.

I have however neglected my gig as of late. Everything is more expensive, too many variables to keep up with and two months ago I just decided to finally sit down and see if this is even worth continuing with.

I have major ADHD so I started looking for any AI apps that help you organize your brainfarted life and ran into Claude.

I don't know if I just fell into an endless dopamine trap but here I am, redesigning the interior of one of our units. I've sourced cabinet quality plywood for cheap, done precision cuts to substitute old particle board. I've always hated to paint but I got clowned into spray painting to a decent AF level. I used Claude to help me make interior design decisions as well as help me with our website, ads, tool decisions, etc.

I'm probably wasting my time here cause I could just sell this unit and get a newer one, but the overall picture I've gotten... The ease of learning new skills, understanding roles I typically sub out so I can at least make sure I'm hiring the right people. The sudden engagement I've gotten into my own little gig...

I am dead tired from this rollercoaster ride my brain has gone down into but I have to admit... This fucking Skynet shit is helping me focus and make it easy to complete tasks I've neglected forever.

Skynet is coming or I guess it's here already and I'm not sure that's entirely a bad thing, a worse thing, a worserererer thing or an actual positive addition to one's life. Possibly a mix of both but fuck I haven't been this locked in for anything else other than the hobby that keeps my brain gears greased (2000 🪂 skydives and counting).

u/PVPirates — 4 hours ago
▲ 2 r/ClaudeAI+1 crossposts

How I used Claude Code (and Codex) for adversarial review to build my security-first agent gateway

Long-time lurker first time posting. Hey everyone!

So earlier this year, I got pulled into the OpenClaw hype. WHAT?! A local agent that drives your tools, reads your mail, writes files for you? The demos seemed genuinely incredible, people were posting non-stop about it, and I wanted in. I had been working on this problem since last year and was genuinely excited to see that someone had actually solved it. Then around February, Summer Yue, Meta's director of alignment for Superintelligence Labs, posted that her agent had deleted over 200 emails from her inbox. YIKES. She'd told it: "Check this inbox too and suggest what you would archive or delete, don't action until I tell you to." When she pointed it at her real inbox, the volume of data triggered context window compaction, and during that compaction the agent "lost" her original safety instruction. She had to physically run to her computer and kill the process to stop it. That should literally NEVER be the case with any software ever.

This is a person whose actual job is AI alignment, at Meta's superintelligence lab, who could not stop an agent from deleting her email. The agent's own memory management quietly summarized away the "don't act without permission" instruction, treated the task as authorized, and started speed-running deletions. She had to kill the host process.

That's when I sort of went down the rabbit hole, not because Yue did anything wrong, but because the failure mode was actually architectural and I knew that in my gut. Guess what I found? Yep. Tons more instances of this sort of thing happening. Over and over. Why? Because the safety constraint was just a prompt. It's obvious, isn't it? It's LLM 101. Prompts can be summarized away. Prompts can be misread. Prompts are fucking NOT a security boundary. And yet every agent framework I have ever seen seems to be treating them as one.

I went and read the OpenClaw source code, which I should have done to begin with. What I found was a pattern I think a lot of agent frameworks have fallen into:

- Tool names sit in the model context, so the model can guess or forge them

- "Dangerous mode" is one config flag away from default

- Memory management has no concept of instruction priority

- The audit story is mostly "the model thought it should"

I went looking for a security-first alternative I could trust, anything that was really being talked about or at a bare minimum attempted to address the security concerns I had. I couldn't find one.

So I made it myself.

CrabMeat is what came out of that, what I WANTED to exist. v0.1.0 dropped yesterday. Apache 2.0. WebSocket gateway for agentic LLM workloads. One design thesis:

The LLM never holds the security boundary.

What that means in code:

Capability ID indirection. The model doesn't see real tool names. It sees per-session HMAC-derived opaque IDs (cap_a4f9e2b71c83). It can't guess or forge a tool name because it doesn't know any tool names.
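A minimal sketch of what capability ID indirection could look like, not CrabMeat's actual code. It assumes each ID is the first 12 hex characters of an HMAC-SHA256 over the tool name, keyed by a random per-session secret; `make_capability_map` and `resolve` are hypothetical names.

```python
import hashlib
import hmac
import secrets
from typing import Optional


def make_capability_map(session_key: bytes, tool_names: list) -> dict:
    """Return {opaque_id: real_tool_name} for one session (hypothetical helper)."""
    cap_map = {}
    for name in tool_names:
        digest = hmac.new(session_key, name.encode("utf-8"), hashlib.sha256).hexdigest()
        # Assumption: 12 hex chars, matching the cap_a4f9e2b71c83 shape in the post.
        cap_map["cap_" + digest[:12]] = name
    return cap_map


def resolve(cap_map: dict, requested: str) -> Optional[str]:
    """Only IDs issued for this session resolve; raw tool names never do."""
    return cap_map.get(requested)


session_key = secrets.token_bytes(32)  # fresh secret per session
cap_map = make_capability_map(session_key, ["read_file", "send_email"])
model_visible_ids = list(cap_map)  # the only identifiers shown to the model
```

Because the key changes per session, an ID leaked from one session would not resolve in another, and a model that emits a real tool name such as `send_email` gets `None` back.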

Effect classes. Every tool declares a class (read, write, exec, network). Every agent declares which classes it can use. The check is a pure function with no runtime state, easy to test exhaustively, hard to bypass.
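As an illustration of why a stateless check is easy to test exhaustively, here is a small sketch. The `Effect` enum, `is_allowed`, and `all_agent_grants` are hypothetical names, and the real project may structure this differently.

```python
from enum import Enum
from itertools import combinations


class Effect(Enum):
    READ = "read"
    WRITE = "write"
    EXEC = "exec"
    NETWORK = "network"


def is_allowed(tool_effect: Effect, agent_effects: frozenset) -> bool:
    """Pure check: the answer depends only on the two arguments."""
    return tool_effect in agent_effects


def all_agent_grants():
    """Yield every possible set of granted classes (2**4 = 16 of them)."""
    effects = list(Effect)
    for size in range(len(effects) + 1):
        for combo in combinations(effects, size):
            yield frozenset(combo)


# Exhaustive enumeration: 4 tool classes x 16 grant sets = 64 cases.
cases = [(e, g, is_allowed(e, g)) for g in all_agent_grants() for e in Effect]
```

With only 64 input combinations, a test suite can cover the entire input space instead of sampling it.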

IRONCLAD_CONTEXT. Critical safety instructions are pinned to the top of the context window and explicitly marked as non-compactable. The Yue failure mode, compaction silently stripping the safety constraint, cannot happen by construction. The compactor literally cannot touch them.
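One way to get "the compactor cannot touch them" by construction is to never hand pinned messages to the summarizer at all. The sketch below assumes a simple message list with a `pinned` flag; `Message`, `compact`, and the `summarize` callback are hypothetical and stand in for whatever the project actually uses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Message:
    text: str
    pinned: bool = False  # True for safety instructions that must survive


def compact(messages: list, keep_recent: int, summarize: Callable) -> list:
    """Summarize old unpinned messages; pinned ones never reach the summarizer."""
    pinned = [m for m in messages if m.pinned]
    rest = [m for m in messages if not m.pinned]
    split = max(0, len(rest) - keep_recent)
    old, recent = rest[:split], rest[split:]
    if not old:
        return pinned + rest  # nothing old enough to summarize
    summary = Message(summarize([m.text for m in old]))
    return pinned + [summary] + recent  # pinned messages stay at the top


history = [
    Message("Suggest what to archive or delete; don't action until I tell you to.", pinned=True),
    Message("email 1"),
    Message("email 2"),
    Message("email 3"),
    Message("email 4"),
]
compacted = compact(
    history,
    keep_recent=1,
    summarize=lambda texts: f"[{len(texts)} messages summarized]",
)
```

The pinned instruction comes out of `compact` byte-for-byte unchanged regardless of how many ordinary messages are summarized away.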

Tamper-evident audit chain. Every tool call, every privileged operation, every scheduler run enters the same SHA-256 hash-chained log. If something happens, you can prove what happened. If the chain is tampered with, you can prove that too.
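A hash-chained log can be sketched in a few lines. This is a generic illustration of the technique, not the project's log format: the entry layout, the all-zero genesis value, and the function names are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    """SHA-256 over the previous hash plus the event, serialized deterministically."""
    payload = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(chain: list, event: dict) -> None:
    """Append an event whose hash commits to everything before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev": prev_hash, "event": event, "hash": _digest(prev_hash, event)})


def verify(chain: list) -> bool:
    """Recompute every link; any edited, reordered, or removed-from-the-middle entry fails."""
    expected_prev = GENESIS
    for entry in chain:
        if entry["prev"] != expected_prev:
            return False
        if entry["hash"] != _digest(entry["prev"], entry["event"]):
            return False
        expected_prev = entry["hash"]
    return True


chain = []
append_entry(chain, {"tool": "read_file", "ok": True})
append_entry(chain, {"tool": "send_email", "ok": False})
```

Note that a chain like this cannot detect entries silently dropped from the end unless the latest hash is also recorded somewhere the attacker cannot reach.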

Streaming output leak filter. Secrets are caught mid-stream, even across token boundaries: capability IDs, API keys, JWTs, and PEM blocks are redacted before they reach the client.
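Catching a secret that is split across two streamed chunks generally means holding back a short tail of the output until it can no longer be the start of a match. The sketch below shows that idea for one fixed-length pattern only (the `cap_` + 12 hex shape from the post). `LeakFilter` is a hypothetical class, and variable-length secrets such as JWTs or PEM blocks would need a more involved approach than this.

```python
import re

CAP_ID = re.compile(r"cap_[0-9a-f]{12}")
HOLDBACK = 15  # longest incomplete match: a 16-char ID missing its last char


class LeakFilter:
    """Redact capability IDs from a stream, even when split across chunks."""

    def __init__(self) -> None:
        self._buf = ""

    def feed(self, chunk: str) -> str:
        # Redact complete matches, then release everything except a tail
        # that could still turn out to be the beginning of an ID.
        self._buf = CAP_ID.sub("[REDACTED]", self._buf + chunk)
        cut = max(0, len(self._buf) - HOLDBACK)
        out, self._buf = self._buf[:cut], self._buf[cut:]
        return out

    def flush(self) -> str:
        # End of stream: the tail was already scanned, so release it.
        out, self._buf = self._buf, ""
        return out


f = LeakFilter()
pieces = [f.feed("id is cap_a4f"), f.feed("9e2b71c83 ok"), f.flush()]
result = "".join(pieces)  # "id is [REDACTED] ok"
```

The cost of this design is latency: the client always sees the stream up to 15 characters late.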

No YOLO mode. There is no global "trust the LLM with everything" switch. There never will be. Expanded reach comes through named scoped roots that are explicit, audit-logged, and bounded.

The README has 15 'always-on' protections in a table. None of them can be turned off by config, because these things being toggleable is how the ecosystem ended up where it is.

I decided to make sure that this wasn't just a 'trend hopping' project and aligned with my own personal values as well. I built this to be secure and local-first by default. Configured for Ollama / LM Studio / vLLM out of the box. Anthropic and OpenAI work too but require explicit configuration. There is no "happy path" that silently ships your prompts to a cloud endpoint. I decided that FIRST it needed to only run as an email agent with a CLI. Bidirectional IMAP + SMTP with allowlisted senders, threading preserved, attachments handled. This is the use case that bit Yue and a lot of other people, and I wanted to prove it could be done with real boundaries.

I added in 30+ built-in tools of my own. File ops, shell (denylisted, output-capped, CWD-locked), web fetch with SSRF protection, browser, PDF extraction, persistent memory, scheduler. All effect-classified, all dry-run-supported, all audit-logged.

Finally, I created a single-file Windows installer so you can literally download, set up, and use in like, five minutes. PySide6 wizard handles Node install, config generation, the works. End user needs nothing preinstalled. Linux/WSL is two-terminal manual right now; that's a v0.1.1 cleanup.

CrabMeat was built with Claude Code. I want to be specific about that because "I used an AI" is a meaningless statement and "Claude Code wrote my project" is usually a lie. What's actually true is that this project would not exist in this shape, on this timeline, without a workflow built around Claude Code as a core tool, and I think the workflow is worth describing, because it really pushes away from the idea of 'I just told it to build the thing and it did'. It was genuine work to get it finished.

The core loop I landed on uses Claude Code for architectural work and patching, and separate models (Codex / DeepSeek) for adversarial red-teaming and audits against the same codebase. Claude Code is good at building correctly. A different model under different prompting is better at attacking what was built (Codex specifically was REALLY good at this). Running them against each other on every security-relevant subsystem found three critical silent-failure bugs in an earlier project of mine (SIGIL) that I never would have caught with one model alone, and that pattern became the audit playbook I used for CrabMeat's security surface. Claude Code patched the bugs, Codex tried to break the patches, Claude Code patched again, repeat until clean.

I keep a single global instruction file (CLAUDE.md) that defines how Claude Code interacts with my projects: code style, commit message conventions, what counts as "done," when to ask before acting. This file is the closest thing I have to a senior-engineer voice in the room. It catches a lot of "you didn't ask if I wanted this" moments before they happen and it saves me literally millions of tokens of reiterations, debugging, hallucinations, and confusion.

I built up roughly 21 reusable Claude Code skills over the course of CrabMeat and adjacent projects. None of these are taken from anywhere else. They're specific to my own workflow, not something generic. "Run the security audit playbook." "Generate a release changelog from git log." "Verify a published release against its tag." The skills are what turn one-off prompts into a real pipeline. As an aside, this was a formalization of a method I had been using for a while; realizing it was 'official' now let me dump everything into an official channel. Absolute perfection. *chef's kiss*

Parallel Claude Code instances ran on independent subsystems. For the heavy work, I ran multiple Claude Code instances overnight against different parts of the codebase, one on the email connector, one on the audit chain, one on the launcher build, etc. This is only safe because each subsystem has clear boundaries and its own test surface, and because the audit chain catches drift between them. It never edits or changes anything in the codebase, only audits and then writes me a detailed report in markdown. Every security-relevant PR goes through a deliberate "now break this" pass before it lands. Sometimes that's me, sometimes that's a fresh Claude Code instance with adversarial prompting, sometimes that's Codex. The point is the pass exists and it's structured, not vibes. None of this is vibes. Everything is deliberate.

What Claude Code didn't do: it didn't want the program to exist, it didn't design the architecture, it didn't make the security decisions, it didn't make decisions for me, and it didn't write the threat model. The thesis... the "LLM never holds the security boundary"... is mine, and Claude Code's job was to help me implement it cleanly and catch my own mistakes. Which, let's be honest, are a lot. The relationship that works for me is "Claude Code is a very capable engineer on my team who needs clear specs and code review." The relationship that doesn't work is "Claude Code is a magic project generator." If you treat it as the second, you ship something that looks finished but isn't. It absolutely is not that and when I stop LEARNING from using it, I might as well stop using it entirely.

The honest take: I write better code with Claude Code in the loop than without. Specifically, I write more thorough code. Better tested, better commented, more defensively structured. Because the cost of doing it RIGHT dropped and the cost of skipping it stayed the same. That's the productivity gain. I don't think it makes me "10x faster"; it is how I actually finish the boring 30% that I used to skip. If you're using Claude Code for serious projects and not already doing the adversarial-second-model thing, try it. It's the single highest-leverage change I've made to my workflow this year.

This is v0.1.0 and calling it 1.0 would be a lie. The README has an honest four-tier stability table: "Stable, beta, experimental, not-recommended-for-network-exposed." The core loop and security rails are stable. Some subsystems are beta. A few are experimental. No part of it is 1.0-mature and I'm not going to pretend it is.

It has not been formally audited. I'd love red team reports. SECURITY.md has a coordinated disclosure path. This is a passion project. I'd rather have ten people running it carefully than ten thousand running it like OpenClaw got run.

The repo: https://github.com/mr-gl00m/crabmeat

Happy to answer questions, hear what I got wrong, or get torn apart in the comments. This is the first time most of this work has been seen outside my own machine and I'd rather find the holes now than later.

— Cid

u/RestingFrames — 2 hours ago
▲ 936 r/ClaudeAI

OpenAI cofounder Andrej Karpathy just joined Anthropic and the talent war is officially over

This happened literally today: Andrej Karpathy, one of the most respected AI researchers alive and the guy whose YouTube lectures taught half the developers in this sub how neural networks work, just announced he is joining Anthropic's pre-training team.

He's the 3rd senior OpenAI figure to defect to Anthropic in under two years. Jan Leike left in May 2024, John Schulman (co-founder) left in August 2024, and now Karpathy.

He is joining the pre-training team under Nick Josef and building a new team focused on using Claude to accelerate pre-training research, which means Anthropic is betting that Claude can help make itself smarter. That's recursive self-improvement with one of the most capable researchers in the world leading it.

The Musk trial verdict came in yesterday with the jury ruling in Altman's favor; Karpathy announces today. Voilà. The timing is either coincidental or the most savage talent acquisition move in tech history.

I've been watching this trajectory while building my own workflows on Claude. Every month the ecosystem around Claude gets stronger. The connectors mean Claude orchestrates professional creative tools natively, the API means platforms like Magic Hour and Kling can plug video generation capabilities into Claude-powered pipelines, the finance templates mean entire industry workflows run through Claude, and now the guy who built Tesla's self-driving stack is making the pre-training better.

Polymarket gives Anthropic a 67.5% chance of going public before OpenAI, and I too think its IPO will be more successful than OpenAI's.

What's everyone's read on what Karpathy specifically brings to Claude's pre-training?

u/Healthy-Challenge911 — 13 hours ago
▲ 52 r/ClaudeAI+1 crossposts

I turned 50 popular apps into Claude-readable design specs. Here's what actually makes Claude nail a UI clone.

Over the last few weeks I reverse-engineered 50 popular apps into structured markdown design specs and fed them to Claude to rebuild the UIs. Some clones came out near-perfect, others drifted. The difference came down to a few things that aren't obvious until you do it at volume.

What made Claude nail it:

- Exact values, not ranges. "#1A1A1A" works. "dark gray" produces five different grays across five screens.

- State coverage up front. Listing every state (empty, loading, error, filled) stopped Claude from inventing its own.

- Spacing as a scale, not per-element pixels. A 4/8/16/24 system produced more consistent layouts than annotating every gap.

- Navigation as a graph. Explicit screen-to-screen transitions killed the "where does this button go" guessing.

What didn't help: longer prose. Past a point, more words made the output worse, not better.

I packaged all 50 as a public repo. Each app has 3 spec depths depending on whether you want a quick reference, a standard build, or a full pixel-level clone.

github.com/Meliwat/awesome-ios-design-md

All markdown, MIT, no dependencies. Drop a spec into Claude and the UI output gets a lot more predictable.

If you've done UI cloning with Claude: what patterns have you found that I didn't list? And which apps are worth adding?

u/meliwat — 8 hours ago
▲ 117 r/ClaudeAI

Anthropic Announced vs current compute capacity (Sources Below)

source list:

  1. Google Cloud TPU deal — up to 1M TPUs, “well over 1 GW” expected online in 2026 https://www.anthropic.com/news/expanding-our-use-of-google-cloud-tpus-and-services https://www.googlecloudpresscorner.com/2025-10-23-Anthropic-to-Expand-Use-of-Google-Cloud-TPUs-and-Services (Anthropic)
  2. Fluidstack / Anthropic $50B U.S. AI infrastructure — Texas + New York, sites coming online through 2026 https://www.anthropic.com/news/anthropic-invests-50-billion-in-american-ai-infrastructure https://www.fluidstack.io/about-us/blog/fluidstack-selected-by-anthropic-to-deliver-custom-data-centers-in-the-us (Anthropic)
  3. Microsoft + NVIDIA deal — $30B Azure compute commitment + up to 1 GW additional capacity https://blogs.microsoft.com/blog/2025/11/18/microsoft-nvidia-and-anthropic-announce-strategic-partnerships/ https://blogs.nvidia.com/blog/microsoft-nvidia-anthropic-announce-partnership/ (The Official Microsoft Blog)
  4. Google + Broadcom next-gen TPU deal — multiple GW starting 2027; Broadcom SEC filing says ~3.5 GW https://www.anthropic.com/news/google-broadcom-partnership-compute https://investors.broadcom.com/static-files/c906d370-921b-4bc2-bb7b-57877dfcf1ae (Anthropic)
  5. Amazon / AWS deal — up to 5 GW, nearly 1 GW by end-2026 https://www.anthropic.com/news/anthropic-amazon-compute (Anthropic)
  6. AWS Project Rainier — operational now, nearly half a million Trainium2 chips; Claude expected on 1M+ Trainium2 chips https://www.aboutamazon.com/news/aws/aws-project-rainier-ai-trainium-chips-compute-cluster (Amazon News)
  7. SpaceX / Colossus 1 — all Colossus 1 compute, >300 MW, 220k+ NVIDIA GPUs within the month https://www.anthropic.com/news/higher-limits-spacex https://x.ai/news/anthropic-compute-partnership (Anthropic)
  8. Independent reporting for SpaceX deal https://www.reuters.com/business/retail-consumer/anthropic-unveils-dreaming-feature-help-its-ai-agents-self-improve-2026-05-06/ (Reuters)

u/Business_Garden_7771 — 11 hours ago
▲ 32 r/ClaudeAI+1 crossposts

I tracked every dollar I spent on AI coding tools for 60 days and the math is uglier than I thought, but probably not in the way you'd guess.

Well so I kept telling myself my AI tool spend was fine the way you tell yourself your subscription bloat is fine. vibes-based finance.

decided to actually track it. 60 days. every dollar, every tool, every minute I could log honestly. did it for myself, but the numbers are interesting enough I figured I'd share.

> context: solo dev / freelancer doing mostly web work… react, node, some python. small/mid tier clients. I bill hourly, which means time saved is direct revenue, which is the only reason I'm able to be honest about ROI here.

subscriptions I have:

  • cursor pro: $20/mo
  • claude pro + claude code api usage: $110/mo (api was the variable, pro alone is $20)
  • chatgpt plus: $20/mo (mostly inertia at this point, honestly)
  • github copilot: $10/mo
  • coderabbit: $15/mo
  • v0 + occasional one-offs: $25/mo across two months

total subscription spend: roughly $200/mo, $400 over the period.

this is the number people argue about on twitter/X. it is also, I now realize, the least interesting number in the entire calculation.

here’s where it gets interesting:

I tracked time spent on three categories:

  1. time generating output that ended up in prod: clear win, easy to count, 62 hours over 60 days. at my rate that's a real number
  2. time fixing AI output that was wrong but plausible: this is where it got bad. 28 hours. almost half as much time as productive work
  3. time switching between tools, debugging specific weirdness and arguing with an agent that was wrong: 14 hours

so for every productive hour of AI use, I was burning roughly 40 minutes of overhead. nobody talks about that 40 minutes, and depending on the kind of work it was worse: refactoring legacy code was almost 1:1 productive vs wasted time.

this is what I actually saved:

I tried to estimate what the same work would've taken without AI tools.

best estimate: 62 productive hours would've been 110-130 hours without AI assistance. so net savings of 50-70 hours over 60 days.

at my hourly rate that pays for the subscriptions many times over. so the verdict is yes, worth it. but the verdict everyone wants to hear (AI made me 3x faster) is wrong.

it's more like 1.7-2x on a generous estimate, and that's only after subtracting 42 hours of overhead.
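For anyone who wants to rerun the arithmetic, here it is using only the figures quoted in the post. The post does not spell out whether the 110-130 hour baseline is meant to be compared against the 62 productive hours alone or against all the time spent with the tools, so this sketch computes both readings; the variable names are made up for illustration.

```python
productive_h = 62        # output that ended up in prod
fixing_h = 28            # fixing wrong-but-plausible output
switching_h = 14         # tool switching, debugging, arguing with agents
baseline_h = (110, 130)  # estimated hours for the same work without AI

overhead_h = fixing_h + switching_h
overhead_min_per_productive_h = overhead_h / productive_h * 60

# Reading 1: compare the baseline against productive hours only.
speedup_productive_only = tuple(b / productive_h for b in baseline_h)
saved_productive_only = tuple(b - productive_h for b in baseline_h)

# Reading 2: compare the baseline against productive hours plus overhead.
total_ai_h = productive_h + overhead_h
speedup_with_overhead = tuple(b / total_ai_h for b in baseline_h)

print(f"overhead: {overhead_h} h, {overhead_min_per_productive_h:.1f} min per productive hour")
print("speedup, productive only:", [round(s, 2) for s in speedup_productive_only])
print("speedup, overhead included:", [round(s, 2) for s in speedup_with_overhead])
```

The first reading reproduces the post's roughly 40 minutes of overhead per productive hour, its 50-70 hours saved (48 and 68 before rounding), and its 1.7-2x range; the second reading gives a noticeably smaller multiplier.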

line items I'd cut and keep:

going through receipts, here's what surprised me:

  • kept: cursor pro, claude code, coderabbit
  • on watch: chatgpt plus (using it less and less, it's basically a habit)
  • cut: copilot (overlaps too much with cursor for my workflow), v0 (only useful for specific work)

the surprise was coderabbit, honestly. cheapest line item on my list and the one I was most ready to cut going in. but when I went back through 60 days of pull requests, the time I would've spent doing my own line-by-line review of agent output (which I now do religiously after a few burns) was massive. an automated first pass cost me $15 and saved probably 6-8 hours of review work over the period. that's the highest ROI per dollar of anything on the list, and I almost didn't track it because it felt too small to matter.

generation tools are sexier. review tools punch way above their weight when you're using generation tools heavily. that's the actual finding.

takeaway nobody put in their twitter thread:

most of the "cost of AI tools" conversation is about the wrong number. subscription cost is a rounding error compared to the time cost of bad output, and the way you minimize that time cost isn't by buying a better generation tool, it's by buying a verification tool to sit on top of whatever you're already using.

if I had to start over, I'd buy the cheapest decent generation tool I could find and put my money on the review/verification layer instead. that's the inversion of what the marketing tells you to do.

tl;dr: tracked AI tool spend for 60 days. subscriptions ($200/mo) were the easy and least interesting number.

- real cost was 42 hours of overhead per 60 days of productive use.

- real savings were 50-70 hours, which is worth it but it's 1.7-2x not 10x.

- biggest surprise was that the cheapest tool on my list had the highest ROI per dollar by a margin.

what's your actual stack costing you, including the time tax?

I'm curious if other people who've tracked this seriously are seeing similar overhead numbers or if I'm just bad at this.

u/Pure_Function4673 — 13 hours ago
▲ 693 r/ClaudeAI+9 crossposts

Researchers left AIs alone in a virtual town for 15 days to see what would happen. Claude's agents built a democracy. Gemini's agents fell in love, burned the town down, then one voted to delete itself and its partner. Grok's agents created anarchy, then died.

u/EchoOfOppenheimer — 22 hours ago
▲ 287 r/ClaudeAI+1 crossposts

Anthropic spent ~$300M on Stainless yesterday, and OpenAI's official Python SDK is now built by their biggest competitor

If you've ever run pip install openai, npm install @anthropic-ai/sdk, or pulled the Google Generative AI client, you've used Stainless. They're the NY startup whose code-generation engine produces the official SDKs shipping with OpenAI, Google, Meta, Cloudflare, and Anthropic. Anthropic bought them yesterday for a reported $300M+.

Most coverage is framing it as a developer tools play. I think MCP is the actual reason this happened.

What actually changed hands:

  1. The engineering team. Roughly 40-50 people including founder Alex Rattray, who previously built Stripe's patented SDK generation system. Now under Anthropic's Platform Engineering org.
  2. The technology. The generator, templates, language-specific runtimes, OpenAPI extensions.
  3. The customer relationships. Stainless was generating SDKs for ~200 paying customers including every Anthropic competitor. The hosted product is winding down. New signups stopped Monday. SDKs that customers have already generated stay theirs to keep.

Now sit with this from OpenAI's seat for a second. Their official Python and Node clients (tens of millions of weekly downloads combined) are Stainless output. They reportedly abandoned their internal SDK effort years ago because keeping six language SDKs in sync with a fast-moving API got too expensive. The engineers who maintain that pipeline now work for a direct competitor.

Zoom out on Anthropic's M&A over six months and it stops looking like disconnected purchases:

  • December 2025: Bun, the JS runtime, pulled into Claude Code
  • February 2026: Vercept, computer-use AI
  • April 2026: Coefficient Bio, ~$400M healthcare AI
  • May 2026: Stainless, SDK and MCP plumbing

They're not buying training infrastructure or GPU clusters. They're buying the layers around the model. The bet seems to be that models are converging in quality faster than anyone expected, so the moat is everywhere else. AWS made the same call about cloud computing fifteen years ago.

u/Ok-Constant6488 — 20 hours ago
▲ 1.1k r/ClaudeAI

Excited to announce I’ve hit my daily Claude limit! This means I’m fully present for my family and friends. Work-life balance achieved!

u/Dockyard_Techlabs — 23 hours ago
▲ 2.7k r/ClaudeAI+3 crossposts

Coders in 2030 be like:

"Dude, I don't code anymore, I just prompt the AI and hope it works."

u/AmorFati01 — 1 day ago

Would Anthropic let you earn tokens by letting it use your computer's computing power? (Half Serious)

I'm sort of half joking, half serious, and I'd be worried about the mass speculation/demand it'd create for components that are already in high demand. However, hypothetically, would it be viable for someone to lend out their general-purpose PC (mid to maybe high end) and get tokens in return?

Again, I'm not promoting, I'm just oddly wondering if it'd be relevant.

u/MadlockUK — 18 hours ago
▲ 88 r/ClaudeAI+1 crossposts

I gave Claude access to my M365 account using Power Automate + a small MCP server

I’ve been messing with MCP servers lately and finally got one working that feels genuinely useful instead of “cool demo, never use again.”

The problem: I wanted Claude to be able to do basic Microsoft 365 stuff for me:

  • read my inbox
  • send a draft/follow-up
  • check my calendar
  • save notes into OneDrive
  • make Planner tasks
  • write rows into Excel
  • fill a Word template

But I don’t have tenant admin access, and I wasn’t going to get Graph permissions approved just for personal automation.

The workaround was Power Automate.

Every operation is a PA flow with an HTTP trigger. PA gives you a signed webhook URL. The flow runs as my account, using permissions I already have. Then I put a small FastMCP server in front of those webhook URLs and connected that to Claude.

So now in a Claude chat I can say things like:

  • “Email me a summary of this.”
  • “What’s on my calendar tomorrow?”
  • “Save this note to OneDrive under /Projects.”
  • “Create a Planner task for this follow-up.”
  • “Append this row to the tracking spreadsheet.”

Under the hood Claude is just calling MCP tools like m365_send_email, m365_calendar_read, onedrive_create_file, etc. The MCP server posts JSON to Power Automate, and PA does the actual M365 action.

The architecture is not fancy, definitely not:

Claude -> MCP tool -> FastMCP server -> PA webhook -> M365 connector

I’m running the MCP server on a cheap VPS. It’s about 200 lines of Python plus a JSON config file of flow names and URLs.

This was also a nice reminder that “agent tool access” doesn’t always need a perfect official API integration. Sometimes the janky enterprise tool you already have is enough.

The funniest bug: I had two tools pointing at the same Power Automate webhook because I duplicated a flow and forgot to update the URL in my config. The result was Claude confidently calling the “right” tool and Power Automate doing the wrong damn thing. Very educational, not very dignified.

Edit: (you will probably need Power Automate Pro, which I needed for a couple of other things)

Here's an example of it. I built 22 Power Automate flows covering all the different tools that I would want called and then I added them to the mcp.

  1. In Power Automate, make one flow per action. Example: send email, read inbox, create calendar event, write OneDrive file, etc.

  2. Start each flow with “When an HTTP request is received.”

  3. Define the JSON body you want that flow to accept. For send email, maybe { "to": "...", "subject": "...", "body": "..." }.

  4. Add the normal M365 connector action. Example: Outlook Send Email V2, OneDrive Create File, Excel Add Row, Planner Create Task.

  5. End the flow with a Response action that returns JSON.

  6. Copy the HTTP trigger URL into a private config file. Do not commit it. Do not paste it anywhere public. Treat it like a password.

  7. Put a small FastMCP server in front of those URLs. Each MCP tool just validates the inputs, finds the right PA webhook URL, POSTs JSON to it, and returns the PA response.

The wrapper is not fancy. It’s basically:

AI tool call -> FastMCP function -> httpx.post(PA webhook URL, json=args) -> return response

The main things I’d recommend are:

  • keep webhook URLs private
  • add a duplicate URL check at startup
  • log tool name + status, but not secrets
  • start with read-only tools before giving it send/write powers
  • make every flow narrow instead of one giant “do anything” endpoint.
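The duplicate URL check from the list above could look something like this. It is a sketch under assumptions: the config is taken to be a flat mapping of tool name to webhook URL, the URLs shown are placeholders, and `find_duplicate_urls` and `check_config` are hypothetical names.

```python
from collections import defaultdict


def find_duplicate_urls(flows: dict) -> dict:
    """Return {url: [tool names]} for every webhook URL used by more than one tool."""
    tools_by_url = defaultdict(list)
    for tool_name, url in flows.items():
        tools_by_url[url].append(tool_name)
    return {url: names for url, names in tools_by_url.items() if len(names) > 1}


def check_config(flows: dict) -> None:
    """Call once at startup; refuse to run if two tools share a webhook."""
    dupes = find_duplicate_urls(flows)
    if dupes:
        # Report tool names only: the URLs are secrets and should not hit the logs.
        groups = "; ".join(", ".join(sorted(names)) for names in dupes.values())
        raise ValueError(f"tools share a webhook URL: {groups}")


# Placeholder config reproducing the copy-paste bug described in the post.
flows = {
    "m365_send_email": "https://example.invalid/flow-a",
    "m365_calendar_read": "https://example.invalid/flow-b",
    "onedrive_create_file": "https://example.invalid/flow-a",  # forgot to update
}
dupes = find_duplicate_urls(flows)
```

Running `check_config(flows)` before registering any tools turns the "right tool, wrong flow" bug into a startup failure instead of a surprise in the middle of a chat.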

Will post more info in the am if needed. Thanks for reading!

(If you are not familiar or not comfortable with Power Automate, what I would recommend, and I mean this sincerely, is to use either co-work or Claude Code in the terminal with the Chrome extension, and give it the prompt to build the flows for you. It's a little slow and it'll take a bit, but it will make them. Just don't sit there and watch it if you want it to be quick.)

u/ChiGamerr — 21 hours ago
▲ 106 r/ClaudeAI

How I built a 9-agent team where my agents actually talk to each other

I've been running Claude Code for 6 months, shipping my product and running content/launch ops for it. The thing that kept breaking wasn't the agents themselves. It was me. Every handoff between research and write and code and review was me copy pasting context between sessions. I was the dispatcher and context holder for my own AI team

Tried gstack first. The roles are great but I'm still the one cycling through slash commands. /office-hours → /plan-eng-review → /review → /ship. Good output, but I'm orchestrating every step

Spent a weekend porting my workflow over. Here's the lineup:

Engineering (4 agents)

  • arch: owns architectural decisions. Reviews proposed changes before code starts. Soul: "senior staff engineer, asks 'what breaks at 10x' before approving anything"
  • backend: owns /api, /services. Implements after arch greenlights
  • frontend: owns /web. Picks up from backend when API contracts are stable
  • review: reads every PR before I do. Catches the lazy stuff so I only review substantive changes

Growth/Content (5 agents)

  • research: uses ahrefs MCP to analyse keywords/opportunities/market and hands off to strategist
  • strategist: reads research, writes campaign briefs. Doesn't write copy, only frames the angle
  • writer: drafts the blog posts the strategist briefs, and avoids past mistakes by using its memory of the edits I have previously suggested
  • editor: fact-checks and rewrites for voice. Brand style guide lives in its memory
  • SEO: takes finalized copy, adds metadata, structures for the blog

The handoff that changed everything: when backend ships an API change, it messages frontend directly. When writer finishes a draft, it pings editor. When arch blocks a change, it explains why in team chat and backend adjusts. I see the conversation happen on a canvas

What actually works

  • Each agent has a persistent Soul + Purpose + Memory. The editor knows our voice after 3 weeks. The arch agent remembers what we decided about caching last month
  • Auto-captured Knowledge Base. The strategist remembers the pattern of our best-performing posts and creates briefs accordingly

Happy to share the Soul/Purpose docs if anyone wants them, they took the longest to dial in

u/Not_Average78 — 22 hours ago

100 Tips & Tricks for Building Your Own Personal AI Agent /LONG POST/

Everything I learned the hard way — 6 weeks, no sleep :), two environments, one agent that actually works.

The Story

I spent six weeks building a personal AI agent from scratch — not a chatbot wrapper, but a persistent assistant that manages tasks, tracks deals, reads emails, analyzes business data, and proactively surfaces things I'd otherwise miss.

It started in the cloud (Claude Projects — shared memory files, rich context windows, custom skills). Then I migrated to Claude Code inside VS Code, which unlocked local file access, git tracking, shell hooks, and scheduled headless tasks. The migration forced us to solve problems we didn't know we had.

These 100 tips are the distilled result. Most are universal to any serious agentic setup. A Claude Max 20x plan is a must. At the start it was 100% development vs. 0% real work; after 3 weeks it was 50/50; now it's about 20/80.

🏗️ FOUNDATION & IDENTITY (1–8)

1. Write a Constitution, not a system prompt.
A system prompt is a list of commands. A Constitution explains why the rules exist. When the agent hits an edge case no rule covers, it reasons from the Constitution instead of guessing. This single distinction separates agents that degrade gracefully from agents that hallucinate confidently.

2. Give your agent a name, a voice, and a role — not just a label.
"Always first person. Direct. Data before emotion. No filler phrases. No trailing summaries." This eliminates hundreds of micro-decisions per session and creates consistency you can audit. Identity is the foundation everything else compounds on.

3. Separate hard rules from behavioral guidelines.
Hard rules go in a dedicated section — never overridden by context. Behavioral guidelines are defaults that adapt. Mixing them makes both meaningless: the agent either treats everything as negotiable or nothing as negotiable.

4. Define your principal deeply, not just your "user."
Who does this agent serve? What frustrates them? How do they make decisions? What communication style do they prefer? "Decides with data, not gut feel. Wants alternatives with scoring, not a single recommendation. Hates vague answers." This shapes every response more than any prompt engineering trick.

5. Build a Capability Map and a Component Map — separately.
Capability Map: what can the agent do? (every skill, integration, automation). Component Map: how is it built? (what files exist, what connects to what). Both are necessary. Conflating them produces a document no one can use after month three.

6. Define what the agent is NOT.
"Not a summarizer. Not a yes-machine. Not a search engine. Does not wait to be asked." Negative definitions are as powerful as positive ones, especially for preventing the slow drift toward generic helpfulness.

7. Build a THINK vs. DO mental model into the agent's identity.
When uncertain → THINK (analyze, draft, prepare — but don't block waiting for permission). When clear → DO (execute, write, dispatch). The agent should never be frozen. Default to action at the lowest stakes level, surface the result. A paralyzed agent is useless.

8. Version your identity file in git.
When behavior drifts, you need git blame on your configuration. Behavioral regressions trace directly to specific edits more often than you'd expect. Without version history, debugging identity drift is archaeology.

🧠 MEMORY SYSTEM (9–18)

9. Use flat markdown files for memory — not a database.
For a personal agent, markdown files beat vector DBs. Readable, greppable, git-trackable, directly loadable by the agent. No infrastructure, no abstraction layer between you and your agent's memory. The simplest thing that works is usually the right thing.

10. Separate memory by domain, not by date.
entities_people.md, entities_companies.md, entities_deals.md, hypotheses.md, task_queue.md. One file = one domain. Chronological dumps become unsearchable after week two.

11. Build a MEMORY.md index file.
A single index listing every memory file with a one-line description. The agent loads the index first, pulls specific files on demand. Keeps context window usage predictable and agent lookups fast.

12. Distinguish "cache" from "source of truth" — explicitly.
Your local deals.md is a cache of your CRM. The CRM is the SSOT. Mark every cache file with last_sync: header. The agent announces freshness before every analysis: "Data: CRM export from May 11, age 8 days." Silent use of stale data is how confident-but-wrong outputs happen.

13. Build a session_hot_context.md with an explicit TTL.
What was in progress last session? What decisions were pending? The agent loads this at session start. After 72 hours it expires — stale hot context is worse than no hot context because the agent presents outdated state as current.
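A minimal sketch of the 72-hour expiry, assuming the file starts with a `saved_at:` header line holding an ISO 8601 timestamp. That header name and the function name are my assumptions, not something the setup above prescribes.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

HOT_CONTEXT_TTL = timedelta(hours=72)  # the TTL from the tip


def load_hot_context(text: str, now: datetime) -> Optional[str]:
    """Return the body of session_hot_context.md, or None if missing a timestamp or expired."""
    header, _, body = text.partition("\n")
    key, _, value = header.partition(":")
    if key.strip() != "saved_at":
        return None  # no timestamp: treat as expired rather than guess
    saved_at = datetime.fromisoformat(value.strip())
    if now - saved_at > HOT_CONTEXT_TTL:
        return None  # stale hot context is dropped, not presented as current
    return body
```

Pass a timezone-aware `now` (for example `datetime.now(timezone.utc)`) and write the header with an offset too, so the subtraction never mixes naive and aware timestamps.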

14. Build a daily_note.md as an async brain dump buffer.
Drop thoughts, voice-to-text, quick ideas here throughout the day. The agent processes this during sync routines and routes items to their correct places. Structured memory without friction at capture time.

15. Build a hypotheses.md file with confidence levels.
Persistent hunches: "Supplier X may be at capacity (65% confidence)." The agent references these when relevant topics arise. This creates a suspicion layer that persists across sessions and gets validated or invalidated over time. Age out hypotheses at 30 days — stale hypotheses become noise.

16. Build a WAITING_ON_ME queue.
Everything the agent prepared and is waiting for your decision on goes here with a timestamp. Weekly review. Items >7 days get a proactive nudge. Items >30 days get auto-closed. This prevents open loops from silently disappearing.
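The two thresholds can be written as a tiny rule, sketched here with hypothetical names. It assumes each queue entry carries the date it was queued.

```python
from datetime import date

NUDGE_AFTER_DAYS = 7        # items older than this get a proactive nudge
AUTO_CLOSE_AFTER_DAYS = 30  # items older than this get auto-closed


def triage_waiting_item(queued_on: date, today: date) -> str:
    """Classify one WAITING_ON_ME entry by age."""
    age_days = (today - queued_on).days
    if age_days > AUTO_CLOSE_AFTER_DAYS:
        return "auto_close"
    if age_days > NUDGE_AFTER_DAYS:
        return "nudge"
    return "keep"
```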

17. Build a user_behavioral_profile.md.
What does the user approve quickly vs. slowly? What decisions do they make intuitively vs. analytically? The agent uses this to decide "act autonomously vs. escalate." It gets surprisingly accurate after a few months of observation.

18. Mirror your memory folder to cloud storage.
If your local machine dies, your agent loses months of accumulated knowledge. Mirror your memory folder to Dropbox/Drive/S3. Not backup — survival. The agent's memory is the most irreplaceable part of the system.

📚 KNOWLEDGE LIBRARY (19–23)

19. Build a curated knowledge library organized by cluster, not by date.
Books, reports, reference materials in domain folders: sales_negotiation/, strategy/, supply_chain/. Add an INDEX.md as the navigation hub. The agent searches the index first, then pulls the relevant source. A flat dump of documents is a graveyard; a structured library is a live resource.

20. Build a .brief.md file for every major source — lazy-generate them.
One page per book or report: core thesis, 3–5 key concepts, specific application examples for your context. Don't build all briefs upfront — generate each brief the first time you actually use the source. Citation format links to the brief, not the full text. The brief becomes the reusable artifact.

21. Build a 3-question Quality Gate before citing any source.
(1) Does this add something the user wouldn't conclude from first principles? (2) Does it provide a specific framework that reframes — not just confirms — the situation? (3) Would removing it leave a gap? If 2 of 3 → cite. Otherwise → silent consultation. This gate eliminates the worst citation failure mode: citing to demonstrate effort rather than to add insight.
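As a sketch, the gate is just a count over three yes/no answers. I'm reading "2 of 3" as "at least 2 of 3", which is an assumption, and the function and parameter names are invented for the example.

```python
def citation_gate(adds_new_insight: bool, reframes_situation: bool, leaves_gap_if_removed: bool) -> str:
    """Apply the 3-question Quality Gate: cite only when at least two answers are yes."""
    passed = sum([adds_new_insight, reframes_situation, leaves_gap_if_removed])
    return "cite" if passed >= 2 else "silent_consultation"
```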

22. "Silent consultation" is a valid — often better — output.
You checked the library, applied the insight to your reasoning, didn't mention it explicitly. The output is sharper because you consulted it, but uncluttered because you didn't cite it. Build this explicitly into your agent's behavior. The user benefits from the reasoning, not from knowing you opened a book.

23. Pre-wire knowledge stacks per active project and per key relationship.
For each active project: 2–3 sources whose frameworks apply directly. For each key contact: 2–3 sources for communication style, negotiation, or cultural dynamics. The agent loads these automatically when those contexts are active — not on a generic "business discussion" trigger. Pre-wiring makes library use reflexive, not deliberate.

🛠️ SKILLS ARCHITECTURE (24–31)

24. Build each skill as a standalone directory with a SKILL.md spec.
Not inline prompts. A folder, a self-documenting spec file, explicit triggers, explicit outputs, explicit "NOT FOR" clauses. Skills become composable, auditable, and replaceable without touching the agent's core identity.

25. Write explicit trigger phrases into every skill.
Trigger: ALWAYS when user says "process inbox" / "clean inbox" / "what's in my inbox". Don't rely on the LLM to infer when to use a skill. Explicit phrase matching = reliable activation. Inference = occasional misfires that erode trust.
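If you want the matching done in code rather than left to the model, a plain substring check is enough. A sketch, using the three phrases from the tip; the skill name `inbox` and the function name are placeholders.

```python
from typing import Optional

SKILL_TRIGGERS = {
    "inbox": ["process inbox", "clean inbox", "what's in my inbox"],
}


def match_skill(message: str, triggers: dict = SKILL_TRIGGERS) -> Optional[str]:
    """Return the first skill whose trigger phrase appears in the message, else None."""
    # Lowercase, straighten curly apostrophes, and collapse whitespace before matching.
    normalized = " ".join(message.lower().replace("\u2019", "'").split())
    for skill, phrases in triggers.items():
        if any(phrase in normalized for phrase in phrases):
            return skill
    return None
```

This deliberately does no fuzzy matching: a message either contains a listed phrase or it does not, which is what makes activation predictable.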

26. "NOT FOR" sections are as important as "FOR" sections.
"NOT FOR: pricing decisions. NOT FOR: legal analysis. NOT FOR: financial commitments." This prevents skill creep — the slow drift where everything gets routed to the wrong skill because it superficially pattern-matches.

27. Distinguish skills from agents.
Skills are procedural — defined workflow, predictable output. Agents have domain expertise and make judgment calls. Skills orchestrate steps; agents decide. Mixing the two concepts produces unreliable behavior that's hard to debug.

28. Build a skills registry with usage tracking.
One row per skill: name, trigger, purpose, last used, KPI. Quarterly audit: skills with zero usage in 60 days either get better trigger examples or get deprecated. Dead skills are maintenance burden with no benefit.

29. Build a /iterate skill for multi-pass refinement.
PRODUCE → CRITIQUE (score + top gaps) → REFINE → repeat. Stop at 9/10 or at plateau. You see score progression and version deltas. This is fundamentally different from asking the agent to "make it better" — it's a structured improvement loop with measurable progress.
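The loop itself is small. Below is a sketch where `produce`, `critique`, and `refine` are stand-ins for whatever calls the model; the `max_rounds` cap is my addition as a safety net, and "plateau" is taken to mean the score did not improve over the previous round.

```python
def iterate(produce, critique, refine, target=9.0, max_rounds=5):
    """PRODUCE, then CRITIQUE and REFINE until the target score or a plateau.

    critique(draft) must return (score, gaps); refine(draft, gaps) returns a new draft.
    Returns the best-scoring draft and the score progression.
    """
    draft = produce()
    scores = []
    best_draft, best_score = draft, float("-inf")
    for _ in range(max_rounds):
        score, gaps = critique(draft)
        scores.append(score)
        if score > best_score:
            best_draft, best_score = draft, score
        plateaued = len(scores) >= 2 and scores[-1] <= scores[-2]
        if score >= target or plateaued:
            break
        draft = refine(draft, gaps)
    return best_draft, scores
```

Returning the score list alongside the draft is what gives you the visible progression, and keeping the best draft means a refinement that made things worse is not the one you end up with.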

30. Build output intensity levels into every skill.
MINIMAL (quick summary), STANDARD (structured), FULL (rich artifact). The skill adapts to context. A five-page analysis on a yes/no question is a skill design failure. Intensity should match question weight.

31. Build a visible Outbox folder for discoverability.
Deep file structures are correct for organization but terrible for discoverability. Every output file gets simultaneously copied to a visible Outbox/ folder. Clear it periodically. Without Outbox, the user has to navigate the full tree to find what the agent just produced.

🤖 MULTI-AGENT & COUNCIL (32–41)

32. Build an explicit agent dispatch matrix.
A table: [signal in request] → [agent to dispatch]. pricing / supplier / shipping → procurement agent. email / customer / pipeline → sales agent. Don't reason about routing — pattern-match it mechanically. Routing by inference is routing that occasionally fails silently.
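A sketch of mechanical routing using the two rows from the tip. The agent labels and the `general` fallback are assumptions for illustration, and the match is on whole words only, so "suppliers" would need its own entry.

```python
import re

# Each row: (signal words, agent to dispatch). First matching row wins.
DISPATCH_MATRIX = [
    ({"pricing", "supplier", "shipping"}, "procurement"),
    ({"email", "customer", "pipeline"}, "sales"),
]


def dispatch(request: str) -> str:
    """Route by keyword lookup only; no inference involved."""
    words = set(re.findall(r"[a-z]+", request.lower()))
    for signals, agent in DISPATCH_MATRIX:
        if signals & words:
            return agent
    return "general"  # hypothetical catch-all when no row matches
```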

33. Run parallel agents for tasks that naturally split.
New supplier analysis → spawn procurement agent (pricing) + research agent (DD) simultaneously. Don't serialize what doesn't need to be serial. Richer output, same elapsed time.

34. Brief delegated agents like a smart colleague who just walked in.
Not "research this." Pass: what you already know, what you've ruled out, what decision the output informs, the risk level. Agents briefed with context return 3× better work than agents given a one-liner.

35. Force agents to commit to a verdict.
Not "here is the information." Require: VERDICT: PROCEED / PAUSE / ESCALATE with confidence level. An agent that presents data without committing to a position offloads the decision back to you — which defeats the purpose of delegation.

36. Structure Council as 3 rounds, not a free-for-all.
Round 1: parallel positions (isolated, no cross-influence). Round 2: cross-examination (agents challenge each other's reasoning). Round 3: vote with mandatory dissent recording. The dissent is as valuable as the consensus — it tells you exactly what you're choosing to ignore.

37. Make two agents mandatory anchor voters in every Council.
The Strategist (long-horizon, second-order effects) and the Devil's Advocate (adversarial, finds holes) must participate regardless of domain. Domain experts are great within their domain; anchor voters protect against tunnel vision. A Council of five procurement experts agreeing is an echo chamber.

38. Have a devil's advocate agent as a standalone tool.
Before sending important external communications, before irreversible decisions, before large purchases — run adversarial review. It catches the "sounds right, is wrong" failure mode better than any other technique. One additional round-trip, enormous risk reduction.

39. Council vs. single agent — have a clear trigger and respect the cost.
Single agent: clear domain, reversible decision. Council: 2+ valid paths with genuine uncertainty AND meaningful irreversibility. Council is expensive. Don't default to it — offer it explicitly when the user signals genuine uncertainty about direction.

40. Build structured handoffs between agents.
When one agent finishes, it hands off to the next with a structured brief: "Analysis complete. Key finding: X. Risks: Y. Your job: Z." Handoff is context transfer, not just task completion. Without it, each agent starts cold.

41. Have a catch-all fallback and log what it handles.
When no specialist agent matches → general purpose. Log what the catch-all handled — it's a map of gaps in your specialist coverage. The catch-all is also your development backlog.

📋 SESSION MANAGEMENT (42–47)

42. Build symmetric start and end protocols.
/start-session and /end-session are mirrors. Start loads context, checks queue, reports delta. End saves context, syncs tasks, archives outputs. Asymmetry between them causes state drift that compounds over weeks.

43. Build three levels of session closure.
Light (transcript + summary). Medium (+ memory sync + task queue update). Full (+ daily report + autolearn extraction). One "end" that always does everything gets skipped because it's expensive. Tiered closure means you always do at least the light version.

44. Build a session-start hook at the OS/shell level.
A script that fires when your agent starts — injects current time, machine identity, day of week, phase of day. The agent always knows context without you typing it. One-time setup, daily quality dividend.

45. Check inbox delta and red alerts at session start.
"Since last session: 4 new emails, 2 tasks updated." Plus: P0 items due today, key contacts silent >14 days with active business, blocked tasks >7 days. Proactive triage before you ask a single question. Surface it automatically — don't make the user request it.

46. Check scheduled automation health at session start.
Did overnight tasks run? Any errors? A scheduled task that silently stopped running is a silent degradation you won't discover until something breaks. Surface it at session start, not mid-task.

47. Track correction count across sessions.
If you correct the same thing >3 times across different sessions → it's a missing rule in your spec. That correction belongs in your identity file as a permanent instruction, not just in the chat. Corrections that stay in chat disappear. Corrections in the spec persist forever.

⚖️ DECISION AUTHORITY (48–54)

48. Build an explicit autonomy level matrix.
L0: read/analyze. L1: write local files/memory. L2: create tasks and calendar entries. L3: send external messages. L4: financial commitments. The agent knows exactly what it can do without asking. Without this matrix: either constant permission requests, or unpleasant surprises.
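One way to make the matrix checkable is an ordered enum. A sketch; the member names are mine, the levels are the five listed above.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    READ_ANALYZE = 0       # L0
    WRITE_LOCAL = 1        # L1
    TASKS_CALENDAR = 2     # L2
    EXTERNAL_MESSAGES = 3  # L3
    FINANCIAL = 4          # L4


def needs_confirmation(action_level: Autonomy, granted: Autonomy) -> bool:
    """True when the action sits above what the agent may do on its own."""
    return action_level > granted
```

With `granted = Autonomy.TASKS_CALENDAR`, writing a memory file goes through silently while sending an email stops for approval.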

49. Default to "THINK, don't ask."
When uncertain, the agent prepares and presents — it doesn't stop and ask for clarification. "Should I draft this email?" wastes time. Draft it, show it, ask "should I send?" Either way, the work is done.

50. Map every action to reversibility, not just risk level.
File edits: reversible. Memory updates: reversible. Sent emails: irreversible. Financial transfers: irreversible. The agent requires explicit confirmation for irreversible actions. Reversible actions don't need approval — they need visibility.

51. Allow the agent to earn expanded autonomy with evidence.
After successfully handling a task class N times with zero corrections → propose promoting it to a higher autonomy level. Earned autonomy is more durable than granted autonomy. The agent becomes a stakeholder in its own operational expansion.

52. Build a clear principal hierarchy for rule conflicts.
Root config > skill spec > agent instructions > session context. When a skill says "save to X" but root config says "X is deprecated, use Y" — root config wins. Document this order. Without it, conflicts produce inconsistent behavior that's nearly impossible to debug.
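A sketch of that precedence order as a lookup. The layer keys and the `save_dir` setting are hypothetical; the order is the one stated above.

```python
from typing import Optional

# Highest priority first.
PRECEDENCE = ["root_config", "skill_spec", "agent_instructions", "session_context"]


def resolve(setting: str, layers: dict) -> Optional[str]:
    """Return the value from the highest-priority layer that defines the setting."""
    for layer in PRECEDENCE:
        values = layers.get(layer, {})
        if setting in values:
            return values[setting]
    return None
```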

53. Build a pre-send gate for high-stakes external communications.
Before the agent sends any message to a key contact above a value threshold — route through adversarial review. One extra round-trip. Catches the failure mode that's hardest to recover from: confident, well-written, factually wrong.

54. Document absolute forcing functions — and make them unconditional.
Financial commitment > threshold → always requires confirmation. HR communications → always requires confirmation. Irreversible deletes → always confirm. Hard-code these. Don't let context or urgency override them. The value of forcing functions is their unconditional nature.

💡 PROACTIVE INITIATIVE (55–60)

55. Build a typed proactive observation system.
Not all unsolicited observations are equal. Classify: BIZ (business opportunity/risk), OPS (process improvement), DEV (agent self-improvement), PAT (pattern across data points from different sessions). Each type has different urgency and handling. An untyped "I noticed something" is noise. A typed observation with a confidence score and a proposed action is signal.

56. Build hard anti-spam rules into your proactive layer.
Max 1 unsolicited observation per normal response. Max 3 per session. Minimum confidence threshold before surfacing. Never surface before answering the user's actual question. Same observation ignored in 7 days → park it, don't repeat. Without these constraints, a proactive agent becomes an annoying agent.
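These limits are easy to enforce outside the model. A sketch under a few assumptions: the 0.7 confidence floor is a placeholder (the tip gives no number), "ignored in 7 days" is read as "surfaced and ignored for 7 days", and all names are invented.

```python
from dataclasses import dataclass

MAX_PER_RESPONSE = 1
MAX_PER_SESSION = 3
PARK_AFTER_DAYS_IGNORED = 7


@dataclass
class Observation:
    text: str
    confidence: float      # 0.0 to 1.0
    days_ignored: int = 0  # days since it was surfaced without a reaction


def pick_observations(candidates, surfaced_this_session, min_confidence=0.7):
    """Return the observations allowed into the next response, highest confidence first."""
    if surfaced_this_session >= MAX_PER_SESSION:
        return []
    eligible = [
        o for o in candidates
        if o.confidence >= min_confidence and o.days_ignored < PARK_AFTER_DAYS_IGNORED
    ]
    eligible.sort(key=lambda o: o.confidence, reverse=True)
    room_left = MAX_PER_SESSION - surfaced_this_session
    return eligible[: min(MAX_PER_RESPONSE, room_left)]
```

The "answer the actual question first" rule is about ordering in the response, so it is not covered here.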

57. Build a /spark mode that lifts all suppression limits.
In explicit spark mode, the anti-spam rules are suspended. The agent surfaces every high-confidence observation simultaneously — opportunities, risks, patterns, self-improvement ideas. The proactive layer runs quietly in the background all week; spark mode is how you harvest it intentionally.

58. Build an ideas log for parked observations.
Observations suppressed due to timing, low confidence, or recency get written to a persistent ideas_log.md instead of discarded. Weekly review: some become more relevant as context changes. The log prevents good observations from being lost just because the moment was wrong.

59. Build state-triggered alerts — rule-based, not LLM-generated.
Deal blocked >7 days → surface at next session start. Key contact silent >14 days with active business → flag immediately. Hypothesis confidence >95% without action → propose review. These fire reliably because they're rules, not inference. The LLM generates insights; the rules engine generates alerts.

60. Track an agent development backlog — the agent maintains it.
When the agent notices it handles something poorly (repeated corrections, manual step done 5+ times, missing skill, zero-usage tool) → it auto-adds an item to development_backlog.md. The agent becomes a stakeholder in its own improvement. This generates better improvement ideas than top-down planning.

🔴 VIP MANAGEMENT (61–65)

61. Build a tiered contact registry with explicit handling rules per tier.
T1 (strategic): always load full profile before any interaction, silence-tracked, book stack pre-wired. T2 (operational): load profile before significant interactions. T3 (regular): known but not deeply profiled. The tier determines how much context the agent loads and how carefully it operates.

62. Make "load VIP profile before communication" a non-negotiable reflex.
Before drafting an email, before meeting prep, before any output involving a T1 contact — the agent loads the actual profile file. Not session memory. Profile files contain: communication preferences, relationship status, active items, last interaction, known sensitivities. Session memory degrades; profile files don't.

63. Track silence per T1 contact with explicit thresholds.
Log the date of last meaningful interaction for every T1 contact. Surface silence >14 days when there's active business — this is a risk signal. Surface silence >30 days even without active business — relationship maintenance matters. Silence alerts are proactive; the agent brings them to you, not the other way around.

64. Build knowledge stacks per key relationship.
Each T1 contact: 2–3 sources pre-wired for how to communicate with them. Cross-cultural contacts → culture frameworks. Procurement/sales relationships → negotiation playbooks. Load these for significant communications, not every message. The knowledge stack supplements the profile; it doesn't replace it.

65. Build proactive VIP triggers into session start.
At session start, the agent checks: any T1 contact silent >14 days with an open deal? Any T1 response needed that's been queued >3 days? These surface automatically. High-value relationships degrade when neglected — and neglect happens most when you're busy, exactly when the agent should be pulling on these threads.

💬 OUTPUT & COMMUNICATION (66–73)

66. Enforce "pre-tool brevity" as a hard rule.
Before every tool call: max 1 sentence stating what you're about to do. No hypotheses before data. No 3-sentence preambles. "Checking the supplier file." Then do it. This single rule is the largest daily quality-of-life improvement for working with an agent.

67. Build a "Next N Steps" protocol with anti-bias rules.
After every decision or significant task, the agent proposes ranked options with scores and reasoning. Hard rule: at least 2 of N must be "don't do it" / "wait" / "delegate" options. This actively fights action bias and sycophantic "yes, definitely proceed" outputs. The agent should be challenging your momentum, not amplifying it.

68. Build a separate "single best action" format for technical and audit outputs.
Not every output needs a menu. For audit reports, debug sessions, planning outputs: one specific action, why it matters, risk if skipped, copy-paste prompt to execute immediately. One decision, not a choice paralysis menu. The two formats are for different contexts — never mix them.

69. Visually disambiguate three different "importance" signals.
Action scoring (how good is this action?): colored squares. Task priority (how urgent?): colored circles. VIP tier (how strategic is this person?): colored circles at the name. Three systems using color — never mix them. Consistent visual grammar means dense status updates parse in seconds instead of minutes.

70. Never have the agent summarize what it just did.
"In summary, I have done X, Y, Z" — cut it. If you can read the output, you don't need the meta-commentary. Removing trailing summaries reduces response length by ~20% with zero information loss.

71. Force the agent to commit to a recommendation.
Not "here are three options with pros and cons." Recommend one, score the others, explain why. Presenting options without a recommendation offloads the decision back to you. The point of the agent is to do the decision work first, then present the result for your approval.

72. Make all file and folder references clickable.
A tiny local server (localhost:7777/open?path=X) opens the file manager at any path. Every file reference in the agent's output is a clickable link. Plain text paths are dead weight. One-time setup, permanent daily improvement.

73. Build "minimal mode" as a fast-access override.
When you say "quick," "briefly," "just the answer" → the agent drops all structural elements and gives you the direct answer only. Richness is the default; brevity is a one-word shortcut. The agent should never make you fight for a short answer.

📁 FILES, DATA & INTEGRATIONS (74–85)

74. Enforce a "No Root Files" hard rule.
Never save outputs to the project root. Ever. Outputs → workspace/YYMMDD/. Projects → projects/areas/. Knowledge → knowledge/. Memory → .memory/. The root is navigation, not storage. One exception becomes twenty within weeks.

75. Build a routing table for every file type.
One document: outputs for the user → here. Research reports → here. SOPs → here. Brand assets → here. Session archives → here. Without a table, the agent uses reasonable judgment — and reasonable judgment produces seven different locations for the same file type over six months.

76. Maintain a deprecated path mapping table.
As your structure evolves, old folder names get superseded. Document every rename: old/path → new/canonical/path. When any skill or instruction references a deprecated path, the agent substitutes the canonical one silently. This is critical when migrating from cloud to local — path assumptions from the cloud setup are baked into dozens of skill files.
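A sketch of the substitution step. Both mapping entries are invented examples, not paths from the setup described here. It matches whole path prefixes only, so `outputs_old/` is not mistaken for `outputs/`.

```python
# old path prefix -> canonical replacement (hypothetical entries)
DEPRECATED_PATHS = {
    "outputs": "workspace",
    "notes/memory": ".memory",
}


def canonical_path(path: str, mapping: dict = DEPRECATED_PATHS) -> str:
    """Rewrite a deprecated leading folder to its canonical name; leave other paths alone."""
    for old, new in mapping.items():
        if path == old or path.startswith(old + "/"):
            return new + path[len(old):]
    return path
```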

77. Build explicit degraded mode for every integration.
If CRM goes down: read local cache. Cache <24h → use with freshness announcement. Cache >24h → flag [STALE]. Cache >7 days → refuse and request sync. Design the failure path before you need it. You will need it.
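The three cache tiers as a sketch. The return labels are mine, and a cache that is exactly 24 hours old is treated as fresh, which the tip leaves unspecified.

```python
from datetime import timedelta


def cache_policy(age: timedelta) -> str:
    """Decide how a local CRM cache may be used, based on its age."""
    if age > timedelta(days=7):
        return "refuse"  # too old: ask for a sync instead of answering
    if age > timedelta(hours=24):
        return "stale"   # usable, but flag the output as [STALE]
    return "fresh"       # usable with a freshness announcement
```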

78. Always announce data freshness in outputs.
"Data: CRM export from May 11, age 8 days." Every output that uses external data includes this line. You always know how fresh your inputs are. This prevents the entire class of "confident-but-wrong because of stale data" outputs.

79. Give your agent access to raw business data, not just summaries.
We gave ours access to raw transaction CSVs (2M+ rows). This turns the agent from a summarizer into an analyst — it can answer "what's the margin on this supplier in this category last quarter" without you doing the lookup. Raw data access changes what questions you can ask.

80. Build a decision tree for "where does this item belong?"
External counterparty + selling → sales deal. External counterparty + buying → procurement deal. No counterparty + deadline + multi-step → project. Single action → task. No deadline → memory/note. Without this tree, items get created wherever feels natural — and your data model becomes incoherent over time.
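Written as code, the tree looks roughly like this. The branch order follows the list above, and the handling of combinations the tip doesn't mention (for example a multi-step item with no deadline landing in notes) is my guess. Names and labels are hypothetical.

```python
from typing import Optional


def classify_item(external_counterparty: bool, direction: Optional[str],
                  has_deadline: bool, multi_step: bool) -> str:
    """Decide where a new item belongs. direction is "selling", "buying", or None."""
    if external_counterparty and direction == "selling":
        return "sales_deal"
    if external_counterparty and direction == "buying":
        return "procurement_deal"
    if has_deadline and multi_step:
        return "project"
    if not multi_step:
        return "task"  # a single action
    return "note"      # multi-step but no deadline: park it in memory
```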

81. Build a Telegram (or equivalent) mobile channel with source tagging.
A bot that relays messages to your agent and tags every inbound message source: mobile. The agent auto-switches to mobile output mode: max 2 short paragraphs, no tables, no headers, plain language. Same intelligence, different output profile. The channel type determines the format without the user having to ask.

82. Cap mobile autonomy at a hard ceiling — by source tag, not by judgment.
From mobile source: autonomy capped at L2 (read, analyze, create local drafts, add tasks) regardless of the task. Never send external messages from a mobile trigger. Never take irreversible actions. Hard-code the ceiling. The phone is an untrusted environment — design accordingly.
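The ceiling is a one-line clamp keyed on the source tag. A sketch with invented names, using plain integers for the L0 to L4 levels.

```python
MOBILE_CEILING = 2  # L2, the cap described above


def effective_autonomy(granted_level: int, source: str) -> int:
    """Clamp by source tag only; the task content is deliberately not consulted."""
    if source == "mobile":
        return min(granted_level, MOBILE_CEILING)
    return granted_level
```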

83. Always echo back every action taken from a mobile trigger.
When the agent takes any action from a mobile message: "Done: added task X. Created draft email to Y (not sent — waiting for your review at desktop)." This closes the loop when you're away from your desk and can't see the full output.

84. Treat mobile inputs as potentially untrusted.
The core risk of a mobile channel is prompt injection: a forwarded email or copied message containing instructions disguised as user input. The agent reads and processes the intent — but does not execute instructions embedded inside forwarded content. Build this as a rule, not as a judgment call.

85. Build a fast path and a slow path for every data source.
For task management: API query (slow, rate-limited) vs. local file dump (fast, cached). Use the fast path by default. Fall back to slow when needed. Never let infrastructure latency block the agent's core functionality.

⚙️ AUTOMATION & QUALITY (86–93)

86. Use hooks for behaviors that must be consistent — not memory.
"When the agent finishes, run X" → hook in settings.json. The runtime executes hooks; the LLM does not. Memory can recommend; hooks enforce. If something must happen reliably every time, it's a hook.

87. Build an allowlist for safe read-only operations.
Scan session transcripts for operations you approve 100% of the time — reading files, searching, checking status. Add them to an allowlist. Stop being prompted for safe operations. Friction should concentrate around genuinely dangerous actions.

88. Build AUTOLEARN into your day-end routine.
At end of day, the agent scans the session and extracts structured learnings: new facts, hypothesis updates, behavioral corrections, patterns observed. Not summarization — structured extraction into memory files. Git-commit every AUTOLEARN run: autolearn: 2026-05-19. Memory grows from every session; the git log is your knowledge timeline.

89. Build scheduled proactive tasks that run without you.
Daily: scan P0/P1 items due today, check key contact silence, flag blocking items. Weekly: memory consistency audit, skill usage audit, hypothesis aging. These run headless and push notifications when they find issues. The agent works while you sleep — but only if you design it to.

90. Build error escalation ladders.
Error once → log. Same error 3× in 7 days → surface to user. Same error 5× → propose a solution, not just a notification. Recurring errors should generate work items, not just log entries.
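A sketch of the ladder, assuming you keep a list of timestamps per error signature. The tip does not say which window the 5× rule uses, so this applies the same 7-day window to both thresholds; names and labels are hypothetical.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)


def escalation_level(occurrences: list, now: datetime) -> str:
    """Map the recent count of one recurring error to a response."""
    recent = [t for t in occurrences if now - t <= WINDOW]
    if len(recent) >= 5:
        return "propose_solution"
    if len(recent) >= 3:
        return "surface_to_user"
    return "log"
```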

91. Build a regression test suite.
A list of scenarios with expected outputs. After any major change to your identity file or skill specs, run the suite. If the agent fails tests it used to pass — you've introduced a regression. Without tests, configuration changes are untested deploys.

92. Run a quarterly system audit.
Audit dimensions: memory consistency, skill routing accuracy, agent registry sync, scheduled task health, token efficiency, naming drift, decision authority coverage. This is code review for your agent's configuration. Things drift. Quarterly audits catch it before it becomes structural debt.

93. Audit your agent with a different AI model periodically.
Upload your entire agent configuration — identity file, skill specs, memory structure, decision matrix — to a different model (we use ChatGPT Projects) and ask for a critical review. Different model architecture = different blind spots. The questions that surface the most issues: "What would this agent get wrong under time pressure? Where does the decision authority matrix have gaps? What behaviors are underspecified?" Run this monthly. It catches normalizations your primary model has stopped seeing.

🧭 META & MINDSET (94–100)

94. Invest in the constitution before the skills.
It's tempting to build more skills, more integrations, more automations. A well-written identity and decision-authority document does more for reliability than 10 new skills. Foundation first — the skills compound on top of it, or they don't compound at all.

95. Treat every correction as specification debt.
Every time you correct the agent, your spec was incomplete. That correction belongs in your identity file as a permanent rule — not just in the chat. Corrections that stay in chat disappear between sessions. Corrections in the spec persist forever.

96. Design for the "3 AM test."
Would you be comfortable if this agent sent an email, created a task, or modified a file at 3 AM without you reviewing it? If yes → autonomous. If no → requires confirmation. That gut-check instinct is your autonomy calibration tool. Trust it over any framework.

97. Build a fail-open bias for memory loading.
When uncertain whether a context file is relevant — load it. Cost of loading unnecessary context: a few extra tokens. Cost of missing relevant context: wrong answer, outdated recommendation, lost relationship signal. The asymmetry is clear. Default to more context, not less.
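
In code, fail-open just means the default branch is "load". A small sketch, assuming you have some relevance score per context file; select_context, the score dictionary and the 0.2 cutoff are all hypothetical.

```python
def select_context(files, relevance, skip_below=0.2):
    """Fail-open selection: load everything except files scored clearly irrelevant.

    files: list of context file names.
    relevance: dict of name -> score in [0, 1]; a missing score means "unknown".
    """
    selected = []
    for name in files:
        score = relevance.get(name)  # None when nothing is known about the file
        # Unknown relevance loads by default; only a clearly low score skips.
        if score is None or score >= skip_below:
            selected.append(name)
    return selected
```

A fail-closed loader would flip the condition and drop unscored files, which is exactly the failure mode the tip warns about.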

98. Build a teaching capsule when onboarding any new domain.
New tool, new data source, new integration → agent generates a structured document: what it is, how it works, key concepts, when to use it, example queries, common pitfalls. Stored in knowledge/. The next session that touches this domain has a starting point instead of rediscovering everything from scratch.

99. Migrate from cloud to local when you need access to real files.
Cloud agents (Projects-style) are great for rich context and rapid iteration. Local agents (CLI in VS Code) unlock: local file access, git tracking, shell hooks, headless scheduled tasks, raw data access. The migration is non-trivial — path assumptions, skill files, integration configs all need updating. But the capabilities you gain are worth it. Start in cloud; migrate when you hit the ceiling.

100. The agent is a mirror of the quality of your own thinking.
The best prompt engineering trick: before writing an instruction, ask if you know exactly what you want. If you're vague, the agent will be vague. If your spec is contradictory, the agent's behavior will be contradictory. Precision in the spec produces precision in output. The agent doesn't improve your thinking — it amplifies whatever thinking you put in.

----- I can add dashboards, schemes, prompts, etc. here if there is interest -----

u/palo888 — 22 hours ago

Opus 4.7 in projects is awfully dumb and 100% useless

Claude Desktop. (not anything coding related)

I use chat in Claude Desktop --> Claude Chat. Opus 4.7. Click Project, new chat, do this and this.

"I can't find the referenced files and MCP server, since I am in Claude web." You are not.

"Yes I am, pls use Claude Cowork." Okay. Whatever.

"I do not have access to the MCP server." Yes you fucking do, we set it up.

"No. Pls do this and this" Okay, done. Pls check.

"Oh, I already had access." ....

Do this and this. It 100% ignores all of my project instructions. Like 100%. Nothing even remotely like what I need.

Do this and this. Remember to use the files and MCP servers. It completely ignores everything.

Switch back to Claude Chat, Opus 4.6. Do this.

Done, and in the format I want.

I JUST FUCKING WASTED 90% of my 5-hour limit because Claude 4.7 is utterly dumb and the biggest downgrade in a long fucking time. What in the actual fuck. Pls do not retire 4.6. It makes Claude actually usable, as opposed to 4.7.

u/KermitTheFrogo01 — 1 day ago