
r/ClaudeCode

Microservices versus monoliths: Did everyone just lose their minds in the last 6 months?
A small disclaimer: I don't have any objective data on whether this is happening in real companies and projects. But from browsing the interwebs (especially places like LinkedIn, where people just have to get their daily brags out regardless of the substance), this spring has made me question my own and everyone else's sanity, even though I'm not a hugely experienced dev, and even less someone with authority over how software architecture should be designed.
When I graduated from university and got my first job, I was dropped into a team that had been building and maintaining a product built on multiple microservices for almost a decade. This was over a decade ago, so it's fairly safe to say that building software as a gigantic monolith went out of fashion a long time ago, and probably for a good reason.
And then 2026 happens and "experts" (and even some companies like Anthropic themselves, no?) are saying without any irony that monoliths are back because it's easier for AI agents to deal with that. Single repo containing everything possible because "it's the best way for the AI to work with it".
What. The. Hell.
It's literally never been easier to conduct multi-repo code reviews or end-to-end sanity checks and audits. Instead of using Claude's pre-prompted GitHub job for code reviews, how about spending 5 minutes prompting your favorite agent to write a skill that you run locally in the parent directory of all your repositories, so it can see all the changes rather than only the repository-specific ones? The same goes for making changes: your AI can read multiple repositories just as well as it can read one, so why in the world would you want to store everything in one?
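For what it's worth, the "run it from the parent directory" idea doesn't need much. Here is a minimal Python sketch of the kind of helper such a skill could call. The function names and the `origin/main` base branch are my own assumptions, not anything Claude Code ships:

```python
import subprocess
from pathlib import Path


def find_repos(parent: Path) -> list[Path]:
    """Return the immediate child directories of `parent` that contain a .git entry."""
    return sorted(p for p in parent.iterdir() if p.is_dir() and (p / ".git").exists())


def collect_diffs(parent: Path, base: str = "origin/main") -> dict[str, str]:
    """Map repo name -> `git diff` output against `base` for every repo under `parent`.

    `base` is an assumption; use whatever branch your reviews target.
    Repos without changes (or without that base ref) are skipped.
    """
    diffs: dict[str, str] = {}
    for repo in find_repos(parent):
        result = subprocess.run(
            ["git", "-C", str(repo), "diff", "--stat", "--patch", base],
            capture_output=True,
            text=True,
            check=False,  # a missing base ref should not abort the whole sweep
        )
        if result.returncode == 0 and result.stdout.strip():
            diffs[repo.name] = result.stdout
    return diffs
```

Feed the combined output to the agent in one go and it reviews the cross-repo change as a whole, without anything having to live in a single repository.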
Please tell me that it's just a bubble of guys like Garry Tan, who brag about committing 2 million lines of code per week, being loud on the internet about it, and that nobody is actually going back to doing things like in the 90s just because they outsourced all critical thinking in order to please their AI assistant?
make no mistakes jarvis
love it or hate it, it's the truth :0
before ijustvibecodedthis.com there were J.A.R.V.I.S instruction manuals phahahhaa
tracked every api call across two max 20 accounts. the newer one gets at least 50% more quota on weekly limits.
i run two claude max 20 accounts ($200/mo each) on the same machine with the same workload. one is ~5 months old, the other is brand new in its 1st month.
with the recent weekly limit reset, i had a lot of credits to burn through. so i got to work. after switching back to my old account though, i noticed the weekly limit was draining much faster than the new account.
good thing i was already logging every trace file, and polling anthropic's usage endpoint every 5 minutes. so i had claude check the data. looking at the api equivalent cost using anthropic's published token pricing...
- new account: $37.01 of compute per percentage point of the 7d limit
- old account: $22.10 per percentage point
- ratio: 1.67x
literally the same machine. same agents. same model distribution. cost per session was nearly identical ($1.84 vs $1.73). and yet the budget just seems to stretch further on the new account. even if i account for ~10% of my personal claude.ai usage bleeding into the old account's numbers, the gap is still 1.5x.
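for anyone who wants to check the arithmetic, here is a small python sketch using only the numbers above. the one assumption is how the ~10% adjustment is applied: i treat it as 10% of the old account's limit drain coming from claude.ai usage that never shows up in the traces.

```python
new_per_point = 37.01  # $ of api-equivalent compute per 1% of the 7d limit, new account
old_per_point = 22.10  # same measurement, old account

# how much further the new account's budget stretches
ratio = new_per_point / old_per_point

# the same gap seen from the old account's side
old_shortfall = 1 - old_per_point / new_per_point

# assumption: 10% of the old account's drain is untracked claude.ai usage,
# so the traced $22.10 really corresponds to 0.9 of a percentage point
adjusted_ratio = new_per_point / (old_per_point / 0.9)

print(f"ratio: {ratio:.2f}x")                             # ratio: 1.67x
print(f"old account gets {old_shortfall:.0%} less")        # old account gets 40% less
print(f"adjusted ratio: {adjusted_ratio:.2f}x")            # adjusted ratio: 1.51x
```

so a 1.67x gap means the new account gets about 67% more per point, or equivalently the old account gets about 40% less.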
anthropic is quietly giving new accounts more headroom than loyal customers paying the same price.
not trying to rally the pitchforks, but if you've been on max for a while and your limits feel tighter than when you signed up, the data suggests you're right. something people should know about before they commit to a $200/mo subscription.
sample size: 2 accounts, 2561 trace files, 143k messages, 404 usage polls over 4 days.
edit: formatting + prose
Paid $118 for Claude Max, ignored by support for days. So I served a formal legal notice to Anthropic’s new India office.
Hi everyone,
Like many of you here, my firm relies on AI workflows. On May 11, we paid $118 for the Claude Max subscription. The payment cleared, I have the invoice and the receipt, but the account is still firmly locked on the Free tier.
I spent days stuck in the endless loop with their "Fin AI" bot. I opened multiple tickets. Complete radio silence.
I started digging and realized this isn't an isolated glitch - Anthropic’s billing and provisioning pipeline seems fundamentally broken right now. (so many complaints on this sub alone). They are actively taking payments worldwide while knowing their system isn't provisioning accounts, and they are hiding behind a bot instead of staffing human support.
Because Anthropic recently incorporated a physical entity here in India and collected Indian GST on the invoice, they are fully subject to local consumer protection laws.
We got tired of waiting. We drafted a formal statutory legal notice under the Consumer Protection Act, 2019, citing "Deficiency of Service" and "Unfair Trade Practice." We demanded either an immediate activation with a full 30-day reset or a 100% refund.
I’m sharing this because we shouldn't normalize SaaS companies taking premium payments and providing zero human support when their automated systems fail.
Has anyone actually managed to bypass the bot and get a human to fix their account this week? Or did you all just issue chargebacks with your banks?
Microsoft economist's hot take: Let it burn first
claude code is coding in ancient hieroglyphics
1:44 AM and claude decided to take me to egypt.
they're not like the other startups, they're "AI-native"
Claude Accidentally Proposed a Swastika for a Logo
I was having Claude help me with designs for an EDDM mailing campaign for my small business, and this happened 😂😂
In all seriousness though, Claude has revolutionized the way I operate as a small business owner. Very satisfied customer here.
🏢 Andrej Karpathy Joins Anthropic - Returning to R&D and Pre-training
Andrej Karpathy, co-founder of OpenAI and former Director of AI at Tesla, announced on Monday that he is joining Anthropic. After focusing on AI education for the past two years via his startup Eureka Labs, Karpathy will now work within Anthropic’s pre-training unit under the leadership of Nick Joseph.
Karpathy’s career has been central to major AI milestones, including a tenure at OpenAI (2015-2017) and leading Tesla’s Autopilot team until 2022. In January 2026, he famously identified a "phase shift" in software engineering, coining the term "vibe-coding" to describe the transition to agent-led development. He noted that AI coding agents crossed a critical coherence threshold in December 2025.
This move follows a series of high-profile transitions from OpenAI to Anthropic, including co-founder John Schulman in August 2024. Karpathy stated that the next few years at the frontier of Large Language Models (LLMs) will be "especially significant," citing this as the primary reason for his return to active research and development.
Do yall agree?
Vibe coding is basically the chaotic good route to actually understanding the stack.
You also accidentally learn:
Why your code works on your machine but nowhere else.
That “it works in development” is a personality trait.
How to read 47 lines of cryptic error logs like it’s ancient scripture.
The difference between “should work” and “actually works in prod.”
That one random package is secretly carrying your entire app.
Vibe coding over forcing yourself to read docs for 6 hours straight.
The knowledge just sticks when you’re deep in the trenches at 2am.
Who else got baptized by fire this way?
Verbose specs
Just saying. Some of you are writing (generating) 10,000 word specs with a fricken novel describing the CSS of a button. The spec is longer than the actual code.
Meanwhile the whole sub is full of people complaining about memory, context windows, and the price of tokens.
Don't think I can't see the connection!
EDIT: A few of you have asked if I'd run this on your repo. I'm doing 5 free in May to refine the methodology (all run locally - I won't see your code). If you're debating models, harness, reasoning levels, or AGENTS.md and SKILL.md edits, DM me with the decision you're trying to make, and we can go from there! Especially interested in organizations doing evaluations (as I am in one, and run into this problem frequently at work).
TLDR; OpenAI cooked with GPT-5.5
Opus 4.7 writes smaller patches. GPT-5.5 writes patches that more often survive review. Which one you want depends on whether "small" means disciplined or incomplete in your repo.
I ran both models, plus GPT-5.4, on 56 real coding tasks from two open-source repos: 27 tasks from Zod and 29 from graphql-go-tools. (These codebases were selected arbitrarily and may not represent your experience, which is exactly why running your own benchmarks is important!) Each model ran in its native agent harness at default settings: Anthropic models in Claude Code, OpenAI models in OpenAI Codex CLI.
The result was not "one model wins everything." GPT-5.5 was the best shipping default across these runs. By "shipping," I mean the model I would most often trust to produce a patch that passes tests, matches the intended human change, and survives code review. Opus 4.7 was still doing something valuable: it wrote much smaller patches.
On Zod, that looked like a real tradeoff. On graphql-go-tools, it looked more like under-implementation.
GPT-5.5 ships more often. Opus 4.7 ships smaller. Which one wins on your repo depends on whether your bottleneck is review or footprint.
That distinction is why repo-specific evals matter. Public benchmarks flatten model behavior into one number aggregated at massive scale. Real code turns it into a workflow decision on your specific codebase and standards.
I used Stet, an evaluation framework I am building for real-repo coding-agent benchmarks, to grade more than test pass/fail: behavioral equivalence to the human patch, code-review acceptability, footprint risk, and craft/discipline rubrics. This post is not a claim about all coding tasks. It is a concrete look at how three frontier models behaved on two real codebases.
| Model | Harness | Reasoning Level |
|---|---|---|
| Opus 4.7 | Claude Code | high |
| GPT-5.4 | Codex CLI | high |
| GPT-5.5 | Codex CLI | high |
The short version
Across 56 scored tasks:
| Metric | Opus 4.7 | GPT-5.4 | GPT-5.5 |
|---|---|---|---|
| Tests pass | 33/56 | 31/56 | 38/56 |
| Equivalent to human patch | 19/56 | 35/56 | 40/56 |
| Clean pass: tests + review | 10/56 | 11/56 | 28/56 |
| Mean footprint risk, lower is better | 0.20 | 0.34 | 0.32 |
| Mean time/task | 11m18s | 8m24s | 6m56s |
| Estimated cost/task | $3.43 | $2.39 | $2.86 |
GPT-5.5 is the quality leader. It passes the most tests, matches the human patch most often, and clears the reviewer about three times as often as Opus.
Opus is the footprint leader. Its patches are smaller and lower-risk by Stet's footprint model. But a small patch is only good when it is complete. The recurring Opus failure mode is passing the visible tests while missing companion work the human PR included.
GPT-5.5 is also the efficiency leader on tokens and wall-clock. It used fewer input tokens, fewer output tokens, and less summed agent time than either competitor. GPT-5.4 is still the cost leader because its pricing is lower, but the cost advantage did not offset the clean-pass gap in these runs.
The repo split is where the result gets interesting:
| Repo | Model | Tests | Equiv yes | Review pass | Clean pass |
|---|---|---|---|---|---|
| Zod, 27 scored tasks | Opus 4.7 | 12 | 11 | 6 | 5 |
| Zod, 27 scored tasks | GPT-5.4 | 9 | 18 | 10 | 5 |
| Zod, 27 scored tasks | GPT-5.5 | 12 | 18 | 14 | 10 |
| graphql-go-tools, 29 tasks | Opus 4.7 | 21 | 8 | 5 | 5 |
| graphql-go-tools, 29 tasks | GPT-5.4 | 22 | 17 | 6 | 6 |
| graphql-go-tools, 29 tasks | GPT-5.5 | 26 | 22 | 19 | 18 |
On Zod, GPT-5.5 and Opus tie on tests. GPT-5.5 wins on reviewer judgment. Opus wins on diff size.
On graphql-go-tools, GPT-5.5 wins outright. It passes more tests, produces far more clean passes, and is closer to the human patch. Opus still writes the smallest patches, but the small-patch strategy misses too much.
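If you want to sanity-check that the repo split adds up to the 56-task totals, here is a quick Python sketch. The counts are copied from the tables in this post; the variable and function names are just mine.

```python
# Per-repo counts from the repo-split table: (tests, equiv yes, review pass, clean pass)
zod = {
    "Opus 4.7": (12, 11, 6, 5),
    "GPT-5.4": (9, 18, 10, 5),
    "GPT-5.5": (12, 18, 14, 10),
}
graphql = {
    "Opus 4.7": (21, 8, 5, 5),
    "GPT-5.4": (22, 17, 6, 6),
    "GPT-5.5": (26, 22, 19, 18),
}
TOTAL_TASKS = 27 + 29  # Zod + graphql-go-tools


def combined(model: str) -> tuple[int, ...]:
    """Add the Zod and graphql-go-tools counts for one model."""
    return tuple(z + g for z, g in zip(zod[model], graphql[model]))


for model in zod:
    tests, equiv, review, clean = combined(model)
    print(
        f"{model}: tests {tests}/{TOTAL_TASKS}, equiv {equiv}/{TOTAL_TASKS}, "
        f"review {review}/{TOTAL_TASKS}, clean {clean}/{TOTAL_TASKS}"
    )
```

The sums match the aggregate rows: 33/31/38 on tests, 19/35/40 on equivalence, 11/16/33 on review pass, and 10/11/28 on clean pass.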
Full scorecard
| Metric | Opus 4.7 | GPT-5.4 | GPT-5.5 |
|---|---|---|---|
| Code-review pass | 11/56 | 16/56 | 33/56 |
| Code-review avg: correctness + bug safety | 2.33 | 2.59 | 3.08 |
| - Correctness | 2.11 | 2.60 | 3.16 |
| - Introduced-bug safety | 2.55 | 2.56 | 3.04 |
| - Maintainability, GraphQL only | 2.07 | 2.55 | 3.03 |
| Custom grader avg, 8 rubrics | 2.33 | 2.40 | 2.62 |
| Craft score, 0-4 | 2.41 | 2.54 | 2.78 |
| - Clarity / coherence / robustness | 2.56 / 1.95 / 1.92 | 2.75 / 2.18 / 2.43 | 2.91 / 2.51 / 2.69 |
| Discipline score, 0-4 | 2.20 | 2.16 | 2.36 |
| - Scope discipline / diff minimality | 2.39 / 2.42 | 2.18 / 2.28 | 2.45 / 2.46 |
| Total input tokens | 239.1M | 222.3M | 201.8M |
| Total output tokens | 1.29M | 1.09M | 0.72M |
The quality-score rows are there to avoid treating "more tests passed" as the whole story. Code review is one grader: correctness, introduced-bug risk, and maintainability where available. The custom grader average is separate: eight additive rubrics split into five craft dimensions and three discipline dimensions. Across both layers, GPT-5.5 is not merely preferred in the abstract. It is rated higher on correctness, introduced-bug safety, GraphQL maintainability, coherence, robustness, scope discipline, and diff minimality relative to the requested task. Opus still wins the mechanical footprint row, which is the useful tension: smaller diffs, but not consistently more disciplined diffs.
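As a rough check on how the rows relate, the custom grader average can be reproduced from the craft and discipline scores. This is a sketch that assumes those two scores are plain means over their five and three rubrics; the function name is hypothetical.

```python
# (craft score, discipline score) per model, copied from the scorecard
scores = {
    "Opus 4.7": (2.41, 2.20),
    "GPT-5.4": (2.54, 2.16),
    "GPT-5.5": (2.78, 2.36),
}


def custom_grader_avg(craft: float, discipline: float,
                      n_craft: int = 5, n_discipline: int = 3) -> float:
    """Weight each group score by its rubric count to recover the 8-rubric mean."""
    return (n_craft * craft + n_discipline * discipline) / (n_craft + n_discipline)


for model, (craft, discipline) in scores.items():
    print(f"{model}: {custom_grader_avg(craft, discipline):.2f}")
```

Under that assumption the weighted means come out at 2.33, 2.40, and 2.62, matching the "Custom grader avg, 8 rubrics" row.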
How the benchmark works
Each task is derived from a real merged commit. The model gets a frozen repo snapshot, a prompt describing the change, and one attempt to produce a patch — running in its native shipped agent harness with no Stet-side scaffolding: Opus 4.7 in Claude Code (claude -p); GPT-5.5 and GPT-5.4 in OpenAI Codex CLI (codex exec); both at default settings. Stet applies the patch and runs the task's tests in an isolated container.
Then Stet grades the result beyond pass/fail:
- Tests: did the patch satisfy the executable acceptance tests?
- Equivalence: does the candidate patch accomplish the same behavioral change as the original human patch?
- Code review: would a reviewer accept the patch, considering correctness, introduced-bug risk, maintainability, and edge cases?
- Footprint risk: how much review and regression surface did the patch create?
- Craft/discipline rubrics: clarity, simplicity, coherence, intentionality, robustness, instruction adherence, scope discipline, and diff minimality.
Every model ran once per task with a single seed. The judge model for equivalence and rubrics was GPT-5.4, run with identical rubric versions across all three arms. Each patch was scored independently — the judge sees the patch and the task, not the arm label or the model that produced it. There is no dual-rater calibration, so treat absolute scores as directional; the cross-arm deltas are the thing to trust.
Tests are signal, not the finish line
The most useful row in the table is not tests. It is clean pass: tests pass and the code-review grader accepts the patch.
On Zod, Opus and GPT-5.5 both passed 12 of 27 scored tasks. If you stop there, the models look tied. But GPT-5.5 produced 10 clean passes; Opus produced 5.
On graphql-go-tools, the same pattern was amplified. GPT-5.5 passed tests on 26 of 29 tasks and produced 18 clean passes. Opus passed tests on 21 tasks but produced only 5 clean passes.
That is the gap you feel in code review. The tests say "this patch probably works." The reviewer asks "is this the patch we want to maintain?"
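One way to put a number on that gap is the share of test-passing patches that also clear review. A small Python sketch, using the per-repo counts from the table above and relying on the definition that a clean pass is always also a test pass:

```python
# (tasks with passing tests, clean passes) per repo, from the repo-split table
zod = {"Opus 4.7": (12, 5), "GPT-5.4": (9, 5), "GPT-5.5": (12, 10)}
graphql = {"Opus 4.7": (21, 5), "GPT-5.4": (22, 6), "GPT-5.5": (26, 18)}


def clean_share(tests_passed: int, clean: int) -> float:
    """Fraction of test-passing patches that the reviewer also accepted."""
    return clean / tests_passed


for name, repo in (("Zod", zod), ("graphql-go-tools", graphql)):
    for model, (tests_passed, clean) in repo.items():
        print(f"{name} / {model}: {clean_share(tests_passed, clean):.0%}")
```

On Zod that works out to roughly 42% for Opus versus 83% for GPT-5.5, and on graphql-go-tools roughly 24% versus 69%.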
One GraphQL task shows the difference. PR #1001 changed an HTTP datasource OnFinished hook so consumers could inspect request and response metadata. All three models passed tests and were judged equivalent. Only GPT-5.5 cleared code review. The other two got warnings around API shape, raw HTTP object exposure, and robustness at the hook boundary.
That is not a benchmark trick. It reflects normal engineering culture where code is reviewed: three patches can satisfy the same test and still differ materially in review quality. You only want to merge code that is high-quality and maintainable; technically working is not enough.
What the reviewer saw
The code review and craft/discipline rows explain why the result is not reducible to "GPT-5.5 changes more files." Two patch autopsies make the numbers less abstract.
Zod async codecs and defaults. The task was to make codec pipelines work with async transforms, prevent defaults from becoming undefined, and generate stub package manifests for the build. All three models failed tests. If you stop at the test row, the task tells you nothing.
The reviewer found a real ordering underneath. Opus changed 8 files and missed central semantics: defaults could still allow undefined, core codec definitions remained synchronous, generated stubs were not published, and prefault() was tightened even though the request was about .default(). GPT-5.4 got closer with an 11-file patch and was judged behaviorally equivalent, but it still over-tightened adjacent API by restricting prefault. GPT-5.5 also failed tests, but it was judged equivalent and scored better on correctness and introduced-bug risk because it covered the schema/build behavior more cleanly: codec/default tests, version metadata, stub-manifest scripts, and the relevant packages/zod/src/v4/*/schemas.ts surfaces.
That is a different kind of signal from pass/fail. It says GPT-5.5 was not merely getting luckier tests; even on a miss, it more often moved the right pieces.
GraphQL Apollo-compatible validation. PR #1169 aligned field-selection validation errors with GraphQL spec and Apollo Router conventions. All three models produced patches. All three passed tests. Only GPT-5.5 cleared equivalence and review.
Opus touched 11 files and passed tests, but missed enum and wrapped-scalar leaf validation, pointed some leaf-selection locations at the field instead of the selection set, left an inline-fragment message non-spec-compliant, and did not apply validation status uniformly. GPT-5.4 touched 12 files and also passed tests, but broadened behavior in the wrong places: unconditional validation metadata, incomplete enum/wrapped scalar handling, broad request-error conversion, and stale compatibility API.
GPT-5.5 touched fewer files than either one, 10 total and 6 non-test, while still adding more targeted behavior: aligned field-selection messages, requested locations, and centralized Apollo validation metadata. This is the clean reviewer example: tests saw three passes; semantic grading saw one patch that actually matched the convention the PR was trying to establish.
This is what the score rows are trying to summarize. GPT-5.5's biggest review lead is correctness: 3.16 versus 2.60 for GPT-5.4 and 2.11 for Opus. The custom graders say the same thing from another angle: GPT-5.5 leads coherence and robustness because its patches more often carry the change through the repo's existing surfaces instead of stopping at the first passing path.
The discipline row is the one I would not overclaim. GPT-5.5 leads, but narrowly: 2.36 versus 2.20 for Opus and 2.16 for GPT-5.4. Opus wins raw footprint. GPT-5.5 narrowly wins task-relative discipline. The grader is separating "small" from "appropriately scoped." A patch can be compact and still undisciplined if it stops before the task is done.
What Opus is doing
Opus 4.7 is cautious. It writes smaller patches, touches fewer files, and has the lowest footprint risk in both repos.
On Zod, that caution is often attractive. Zod has many contained tasks where the correct move is a precise source edit, a type change, and maybe a small test update. Opus tied GPT-5.5 on tests while keeping the patch footprint lower.
But Opus's restraint has a recurring failure mode: it implements the headline behavior and stops before the companion work is done.
Zod made this easy to see. Zod has parallel Node and Deno trees. The tests exercise the main src/ path, so a patch can pass while leaving Deno mirrors stale. On several Opus test-pass-but-not-equivalent tasks, that is exactly what happened. A CIDR validation change passed tests after Opus touched four files. GPT-5.5 touched eleven, because it updated the parallel distribution surface too. The judge marked Opus non-equivalent because the human patch did the companion work.
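This kind of miss is cheap to catch mechanically. Below is a minimal Python sketch of a stale-mirror check over a patch's changed-file list. The `src` and `deno/lib` roots are placeholders I picked for illustration, not a claim about Zod's actual layout, and the check assumes every source file has a mirror at the same relative path.

```python
from pathlib import PurePosixPath


def missing_mirror_updates(changed: list[str],
                           src_root: str = "src",
                           mirror_root: str = "deno/lib") -> list[str]:
    """Return mirror paths whose source counterpart changed but which the patch left untouched.

    Both roots are hypothetical defaults; point them at your repo's real trees.
    """
    changed_set = {PurePosixPath(p) for p in changed}
    missing = []
    for path in sorted(changed_set):
        try:
            rel = path.relative_to(src_root)
        except ValueError:
            continue  # file is not under the source tree
        mirror = PurePosixPath(mirror_root) / rel
        if mirror not in changed_set:
            missing.append(str(mirror))
    return missing


patch_files = ["src/types.ts", "deno/lib/types.ts", "src/helpers/util.ts", "README.md"]
print(missing_mirror_updates(patch_files))  # ['deno/lib/helpers/util.ts']
```

A check like this will not tell you whether the mirrored edit is correct, only that a companion surface was skipped, which is exactly the pattern the tests did not see.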
The same behavior looked worse on graphql-go-tools. That repo is a Go federation engine with planner, datasource, hook, validation, and runtime paths that need to line up. A minimal patch is not enough if the real change spans several engine surfaces.
On PR #1155, the task covered repeated scalar fields in a gRPC datasource, request building, response marshaling, null and invalid responses, error status information, disabled datasources, and dynamically-created clients. Opus produced no patch. GPT-5.5 passed tests, matched the human patch, and cleared review.
That is the key distinction: Opus's small patches can be discipline on local tasks and under-implementation on integration-heavy tasks.
What changed from GPT-5.4 to GPT-5.5
GPT-5.5 is not just GPT-5.4 with higher pass rates. The failure modes shift.
GPT-5.4 often sees the right general approach but fails in execution. On Zod it had 18 equivalence yes judgments, matching GPT-5.5, but only 9 test passes. The equivalence grader recognized the intended behavior; executable validation still failed.
GPT-5.5 closes that gap. It keeps more of the broad integration behavior while producing fewer broken patches.
Three Zod examples are useful.
First, a schema-to-TypeScript generator. The task asked for a recursive visitor over Zod schema definitions. Opus and GPT-5.5 both recognized it as an implementation task and built the visitor. GPT-5.4 produced repository-instruction files instead of the feature. That is not a subtle algorithmic miss. It misclassified the work.
Second, a recursive parser fix. Both GPT models reached for visit-count tracking. GPT-5.4 added an inProgress sentinel and reset logic. GPT-5.5 kept the count-and-cache-error behavior and removed the extra state. Same broad idea, fewer moving parts, passing tests.
Third, CIDR validation. GPT-5.4 and GPT-5.5 had similar core algorithms: split on /, validate the address, validate the prefix. GPT-5.5 updated the Deno mirrors. GPT-5.4 did not. This is not a reasoning leap. It is repo hygiene.
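For readers who have not written one, the "split on /, validate the address, validate the prefix" shape looks roughly like this. It is a Python sketch of the general algorithm using the standard `ipaddress` module, not the Zod implementation or either model's patch.

```python
import ipaddress


def is_valid_cidr(value: str) -> bool:
    """Validate CIDR notation: <address>/<prefix length>."""
    address, sep, prefix = value.partition("/")  # split on the first "/"
    if not sep or not (prefix.isascii() and prefix.isdigit()):
        return False  # no slash, or the prefix is not a plain number
    try:
        ip = ipaddress.ip_address(address)  # accepts IPv4 and IPv6
    except ValueError:
        return False
    max_prefix = 32 if ip.version == 4 else 128
    return int(prefix) <= max_prefix


print(is_valid_cidr("192.168.0.0/24"))  # True
print(is_valid_cidr("192.168.0.0/33"))  # False
print(is_valid_cidr("::1/128"))         # True
print(is_valid_cidr("10.0.0.1"))        # False
```

The algorithm really is that small, which is why the difference between the two GPT models on this task came down to which files they updated.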
On graphql-go-tools, the separation is more operational. PR #1232 required deduplicating identical single fetches while rewriting dependency references that pointed at removed duplicates. A patch can look plausible and still leave fetch dependencies stale. GPT-5.5 was the only model to pass tests, match the human behavior, and clear review.
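To make the stale-dependency trap concrete, here is a toy Python sketch of the idea. The `Fetch` shape and the `signature` field are hypothetical stand-ins for whatever the engine uses to decide two fetches are identical; this is not graphql-go-tools code.

```python
from dataclasses import dataclass, field


@dataclass
class Fetch:
    id: int
    signature: str  # hypothetical: whatever makes two fetches "identical"
    depends_on: list[int] = field(default_factory=list)


def dedupe_fetches(fetches: list[Fetch]) -> list[Fetch]:
    """Drop duplicate fetches and repoint dependencies at the surviving copy."""
    canonical: dict[str, int] = {}  # signature -> id of the fetch that is kept
    replaced: dict[int, int] = {}   # removed id -> id that replaces it
    kept: list[Fetch] = []
    for f in fetches:
        if f.signature in canonical:
            replaced[f.id] = canonical[f.signature]
        else:
            canonical[f.signature] = f.id
            kept.append(f)

    result = []
    for f in kept:
        deps: list[int] = []
        for dep in f.depends_on:
            dep = replaced.get(dep, dep)  # rewrite references to removed duplicates
            if dep != f.id and dep not in deps:
                deps.append(dep)
        result.append(Fetch(f.id, f.signature, deps))
    return result
```

Skip the second loop and the output still looks plausible, but any fetch that depended on a removed duplicate now points at an id that no longer exists.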
The pattern is: GPT-5.5 does more of the boring integration work that turns a clever local fix into a shippable repo change.
The cost of doing more
GPT-5.5 writes larger patches than Opus.
On graphql-go-tools, average patch size was about 33 KB for GPT-5.5, 27 KB for GPT-5.4, and 19 KB for Opus. The footprint scores move accordingly: Opus 0.19, GPT-5.4 0.32, GPT-5.5 0.34.
That is not free. Bigger patches are harder to review, easier to conflict, and more likely to touch sensitive paths. If your workflow is dominated by auditability, Opus still has a real advantage.
But the craft rubric shows why raw size is not enough. On GraphQL, GPT-5.5 had the largest patches and still slightly led diff minimality relative to the task. The grader is not asking "who changed the fewest bytes?" It is asking "who changed the fewest bytes needed to solve the actual request?"
That distinction is the whole benchmark in miniature. A 5 KB patch that misses required surfaces is not more minimal than a 20 KB patch that finishes the job.
The cost story also changed between repos. On Zod, Opus and GPT-5.5 looked similar operationally: Opus used 53.0M input tokens and 359K output tokens; GPT-5.5 used 50.4M input and 290K output. Opus was faster on summed agent time, 1.99h versus 2.32h, and slightly cheaper, $45.53 versus $46.69.
GraphQL reversed that. Opus used 186.1M input tokens and 934K output tokens. GPT-5.5 used 151.4M input and 431K output. Opus took 8.56h of summed agent time; GPT-5.5 took 4.16h. That does not look like Opus sandbagging. It looks like Opus working longer, emitting more tokens, and still converging on smaller, less complete patches.
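These per-repo figures can be cross-checked against the summary tables. A short Python sketch, assuming the summed agent hours cover all 56 tasks; the helper name is mine.

```python
TASKS = 27 + 29  # Zod + graphql-go-tools


def mean_time_per_task(total_hours: float, tasks: int = TASKS) -> tuple[int, int]:
    """Convert summed agent hours into a (minutes, seconds) mean per task."""
    seconds = round(total_hours * 3600 / tasks)
    return divmod(seconds, 60)


# Summed agent time, Zod + GraphQL
print("Opus 4.7:", mean_time_per_task(1.99 + 8.56))  # (11, 18)
print("GPT-5.5:", mean_time_per_task(2.32 + 4.16))

# Token totals, in millions of input tokens and thousands of output tokens
print("Opus input M:", round(53.0 + 186.1, 1), "output K:", 359 + 934)
print("GPT-5.5 input M:", round(50.4 + 151.4, 1), "output K:", 290 + 431)
```

Opus lands exactly on the 11m18s in the short-version table, GPT-5.5 lands within a second of 6m56s (the hours are rounded to two decimals), and the token sums match the 239.1M / 1.29M and 201.8M / 0.72M rows in the full scorecard.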
The behavior metrics point the same way. On GraphQL, Opus averaged 3.17 explicit planning calls per task; GPT-5.5 averaged zero. Opus made 10.2 patch calls per task; GPT-5.5 made 9.9. Opus was not bailing early. The difference was exploration style: GPT-5.5 made about twice as many shell calls and more search calls, while Opus spent more of its budget in planning and patch rewrite churn. In this repo, broader repo inspection appears to have mattered more than deliberating over a narrower patch.
Model personalities, in one paragraph each
Opus 4.7 — under-reach. Conservative, precise, low-footprint. Strong when the task is local and the desired change has a narrow surface. Weak when the human patch includes companion surfaces the tests do not fully cover. Its failure mode is often "tests pass, but this is not the same change."
GPT-5.4 — right shape, wrong execution. Directionally capable but uneven. It often finds the intended shape, which is why its equivalence numbers are respectable, but it is more prone to stale mirrors, extra bookkeeping, unearned refactors, and patches that the judge likes more than the test suite does.
GPT-5.5 — broader, bigger footprint. More complete on integration surface. It is more likely to update the surrounding code, pass review, and convert intended behavior into passing code. Its risk is patch footprint: when it is wrong, it can be wrong over more files.
Why this matters
The practical question is not "which model is best?"
The practical question is:
For this repo, under this harness, on the kinds of tasks we actually ship, which model produces patches we trust?
The answer changed by repo.
Zod made GPT-5.5 versus Opus look like a tradeoff: same test pass count, GPT-5.5 better reviewer alignment, Opus smaller patches.
graphql-go-tools made the tradeoff less symmetrical: GPT-5.5 was simply more shippable on the measured tasks, while Opus's small-patch advantage came with too much missed integration work.
That is why Stet is built around real repo tasks instead of synthetic prompts. Your repo has its own mirror trees, codegen surfaces, test blind spots, hook conventions, planner invariants, and review standards. You also have your own AGENTS.md, skills, model and harness settings, etc. Those details decide whether a model's "personality" is an asset or a liability.
Caveats
Fifty-six scored tasks is still small. One task swing moves a repo-level rate by a few points. Every model ran once per task. Some close calls would flip on rerun.
The equivalence and rubric judge was GPT-5.4. That can introduce family bias. I do not think it explains the whole result: GPT-5.5 beats GPT-5.4 decisively, Opus still wins footprint, and many Opus equivalence losses are concrete missed files or missing companion surfaces.
Results are also harness-conditional. Claude Code and Codex CLI bring different system prompts, planning loops, and tool surfaces, and each model ran in the harness its vendor ships. Running Opus 4.7 inside Codex via API, or GPT-5.5 inside Claude Code, would change the picture. The numbers here describe these models in the harnesses real engineers actually use them in — not the models in isolation.
Takeaway
If I had to summarize the 56 scored tasks:
- GPT-5.5 is the best default shipping model across these two repos.
- Opus 4.7 is still the low-footprint model and can be preferable when narrow diffs matter most.
- GPT-5.4 is cheaper per task, but the savings are not enough to overcome the clean-pass gap here.
- Tests alone would have hidden the most important result.
- The same model ranking changed by repo, which is the point.
The interesting model eval is no longer "can the model solve a hard prompt?" It is "what kind of patch does this model tend to produce in my codebase, and does that match how my team ships software?"
I got tired of Claude Code silently dying at rate limits mid-task, so I built something
I got tired of Claude Code silently hitting rate limits, so I decided to build something to address the issue.
Imagine you’re 40 minutes into a refactor. Claude is running tools and making progress, then suddenly, everything stops. The session has reached its rate limit without any warning—no alert saying you’re at 95%, just a complete halt. The usage bars are visible in the UI, but the model itself remains unaware of them.
I discovered that Anthropic has a usage API, and Claude Code already possesses hooks to make it work. This led me to create agent-baton, which reads the usage API and installs hooks to make Claude aware of its limits.
Here are the three hooks you can initiate with one command (baton init):
- SessionStart: Fetches usage data and injects it so Claude knows from the first message how much has been used.
- UserPromptSubmit: Performs a time-to-live (TTL) aware check that avoids overwhelming the API. It uses smart caching—checking every 15 minutes when usage is low and once a minute when it's nearing the limit.
- PreToolUse: This is the crucial one; it checks usage mid-task to prevent the scenario where you “started at 93% and ran out of capacity mid-execution,” catching the problem within 1-2 tool calls.
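To illustrate the TTL-aware caching idea behind those hooks, here is a minimal Python sketch. It is not agent-baton's code: the class name, the injected `fetch` callable, and the 80% "near the limit" cutoff are all assumptions made for the example. Only the 15-minute and 1-minute intervals come from the description above.

```python
import time
from typing import Callable, Optional

LOW_USAGE_TTL = 15 * 60   # seconds between checks while usage is low
NEAR_LIMIT_TTL = 60       # seconds between checks when close to the limit
NEAR_LIMIT_PCT = 80.0     # hypothetical cutoff for "nearing the limit"


class UsageCache:
    """Cache a usage percentage and refresh it more often as it approaches the limit."""

    def __init__(self, fetch: Callable[[], float],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch   # whatever actually calls the usage API
        self._clock = clock   # injectable so the logic can be tested without waiting
        self._value: Optional[float] = None
        self._fetched_at = 0.0

    def _ttl(self) -> float:
        near_limit = self._value is not None and self._value >= NEAR_LIMIT_PCT
        return NEAR_LIMIT_TTL if near_limit else LOW_USAGE_TTL

    def get(self) -> float:
        now = self._clock()
        if self._value is None or now - self._fetched_at >= self._ttl():
            self._value = self._fetch()
            self._fetched_at = now
        return self._value
```

A hook would call `get()` on every prompt or tool use; most calls return the cached value, and the API is only hit when the current TTL has expired.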
When the warning threshold is reached, it prompts an interactive question using Claude Code's built-in AskUserQuestion tool:
"Claude 5-hour usage is at 91% — you're in the warning zone."
Options include:
- Continue this task
- Write a handoff document
- Switch to lightweight mode
It also handles full agent handoffs by writing a structured markdown handoff and passing work to Cursor, Codex, or Gemini.
You can install it with the following command:
npm install -g @codeprakhar25/agent-baton && baton init
For more details, visit the GitHub repository.
If you’re bleeding tokens on data grids, here is a Claude Skill that 10x’d my dev speed and cut my token usage by 85%!
Hello everyone,
Just wanted to share LyteNyte Grid AI Skills. If you use Claude Code for your frontend UI and need a data grid, this will 100% help you save a ton of time and drastically reduce token usage!
Like me, you have probably learned that prompting your way to a data grid that works usually ends in a mess and broken edge cases. There are many good reasons for this, but basically, “that ish gets complex.”
LyteNyte Grid AI Skills is free and open source. It comes with 20 highly detailed reference files that cover virtually every aspect of the data grid, from installation to complex implementations.
If you're unfamiliar with LyteNyte Grid, it’s a 40 KB, lightning-fast, zero-dependency data grid with over 150 features (shameless marketing pitch, apologies!).
Anyways, the reason Skills is so unbelievably effective with LyteNyte Grid is that, unlike other grids, LyteNyte Grid has a declarative API and a 100% stateless, fully prop-driven architecture.
At the risk of getting overly technical, here is why this architecture suddenly makes Claude Code effective at building grid implementations for your app:
- Native React Context: Claude inherently understands React. LyteNyte is built in React for React (no wrappers), keeping Claude's output pure.
- No Translation Layers: Because it’s fully prop-driven, Claude doesn't have to guess or write messy mapping code.
- Simpler Prompts: It relies on familiar React patterns, allowing Claude to hit zero-shot accuracy with much shorter prompts.
- A11y Built-in: Claude no longer hallucinates custom screen-reader properties or aria-tags to make things work.
Honestly, we have been blown away by the results. I wanted to share this with the community and get your honest feedback. As I said, it’s completely free and open source.
If you find this helpful and like what we’re building, GitHub stars help. Feature suggestions and code contributions are always welcome.
Paid $200 for Max Plan, account stuck on Free, and the Support Bot is in an infinite loop. (Is there a human at Anthropic?)
Nobody reads the README anymore. Make Claude draw you the map instead.
You ask the AI to plan something with caching, API, frontend, rate limiting.
What you get back is 8 sections, 200 lines, a wall of markdown. You scroll through once, miss the dependency between two layers, and you're reading the whole thing again from the top.
The plan isn't text. It's a graph.
The same plan as one HTML file changes the work. Click the API node, see what hits it. Click rate limiting, see which routes it protects. Click caching, see what invalidates it. Every layer drillable, every connection visible.
> caching keys + invalidation rules, all clickable
> API endpoints + schemas + which UI components consume them
> rate limit rules + the routes they apply to + the storage backend
> the whole thing in one file you open in any browser
HTML for the human who has to understand the plan. Markdown for the agent who has to execute it.
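If "the plan is a graph" sounds abstract, here is a tiny Python sketch of the underlying structure. The node names and edges are made up for illustration; the point is that "click the API node, see what hits it" is just a reverse edge lookup.

```python
# Hypothetical plan: each node lists the nodes it calls or depends on
plan = {
    "frontend": ["api"],
    "api": ["cache", "rate_limit"],
    "cache": [],
    "rate_limit": ["storage"],
    "storage": [],
}


def dependents(graph: dict[str, list[str]], target: str) -> list[str]:
    """Nodes with a direct edge into `target` ("what hits it")."""
    return sorted(node for node, deps in graph.items() if target in deps)


def reachable(graph: dict[str, list[str]], start: str) -> set[str]:
    """Everything `start` depends on, directly or transitively."""
    seen: set[str] = set()
    stack = list(graph[start])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return seen


print(dependents(plan, "api"))          # ['frontend']
print(dependents(plan, "rate_limit"))   # ['api']
```

A wall of markdown makes you rebuild these lookups in your head; a clickable view just runs them for you.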
Andrej has joined Anthropic
I think this is good for the future.
He is a good man, an honest guy, and cares deeply about our future.
designmd.sh — a public registry for DESIGN.md files for coding agents
designmd.sh is basically skills.sh for design systems.
If your DESIGN.md files are public on GitHub and you want developers, designers, and AI builders to discover them more easily, you can now showcase and index them.
This felt like a Silicon Valley sketch
Tried /goal for the first time:
Fix this type error so it works with Node 24...
Claude Code fixed it by adding a setting that ignores the error entirely.
My prompt was weak, the solution was weaker, and somehow everybody involved felt confident. This felt like a Silicon Valley sketch where AI fixed all code by removing it.