u/Wonderful-Agency-210

Codex ships ~15k tokens of overhead per request. Claude Code ships 27k. Pi ships 2.6k. Here's the harness tax in each one of them
▲ 22 r/codex

I've been noticing my codex bills scale way faster than the actual work I'm doing. Not a huge deal on small tasks, but on longer coding sessions the math starts feeling off. So I decided to actually measure what's happening under the hood.

Setup

I routed three coding agents through a gateway that logs every raw request going to the model. Same model tier where possible, same two-message task for all three:

  • Message 1: "hey"
  • Message 2: "write a simple python script to check fibonacci series and save on desktop as agent.py"

The three agents:

  • Pi (the minimal agent behind OpenClaw, 4 tools: read, write, edit, shell)
  • OpenAI Codex (10 tools)
  • Claude Code (28 tools)

Then I logged every input token for the full session until each agent marked the task done.
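To make the methodology concrete, here's a rough sketch of the per-request estimate a logging gateway enables. The payload shape and the ~4-chars-per-token heuristic are my assumptions for illustration; the actual measurements below come from the raw requests, not this heuristic.

```python
import json

def estimate_input_tokens(payload: dict) -> int:
    """Rough input-token estimate for one chat request.

    Uses the common ~4 chars/token heuristic; a real measurement would
    run the provider's tokenizer over the serialized request instead.
    """
    # Everything here is billed as input on every turn: system prompt,
    # tool definitions, and the full conversation history.
    blob = json.dumps(payload.get("messages", [])) + json.dumps(payload.get("tools", []))
    return len(blob) // 4

# Hypothetical payload shaped like what a heavy harness sends per turn.
payload = {
    "messages": [
        {"role": "system", "content": "You are a coding agent. " * 200},
        {"role": "user", "content": "hey"},
    ],
    "tools": [
        {"name": f"tool_{i}", "description": "long schema text " * 30}
        for i in range(28)
    ],
}

lean = {**payload, "tools": payload["tools"][:4]}  # a Pi-style 4-tool harness
print(estimate_input_tokens(payload), estimate_input_tokens(lean))
```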

Results

Per-request input token overhead (what gets sent before the model does any useful work):

  • Pi: ~2,600 tokens
  • Codex: ~15,000 tokens
  • Claude Code: ~27,000 tokens

Full session totals across the 3-4 turn conversation:

  • Pi: 8,650 tokens
  • Codex: 46,725 tokens
  • Claude Code: 83,487 tokens

Same task. Same output. 9.6x spread between the leanest and heaviest.

https://preview.redd.it/pzbmru8w5rvg1.png?width=1200&format=png&auto=webp&s=a18a23ed3e6324c003c91cf4eb4c8606ef71afa4

What's in all that overhead?

Tool definitions, system prompt, memory instructions, behavioral routing, and the full conversation history. All of it, on every single turn. Claude Code ships 28 tool definitions (Agent, Bash, Edit, Read, Write, Grep, Glob, WebFetch, WebSearch, CronCreate, CronDelete, TaskCreate, TaskGet, ScheduleWakeup, and a bunch more). None of them were called during the fibonacci task. They shipped on every request anyway.

Also worth noting: the conversation history isn't just your messages. It includes the model's previous responses, which are already inflated by verbose tool-call formatting. So the payload grows faster than your actual conversation does.
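A toy model makes the growth pattern obvious: every turn re-sends the fixed overhead plus the entire history so far, so earlier turns get billed again and again. Numbers below are illustrative, not the measured figures.

```python
def session_input_tokens(overhead: int, turns: int, avg_turn_tokens: int) -> int:
    """Total billed input tokens across a session.

    Each turn re-sends the fixed harness overhead plus the whole history,
    so history tokens are billed again on every later turn and the total
    grows roughly quadratically with turn count.
    """
    total, history = 0, 0
    for _ in range(turns):
        total += overhead + history
        history += avg_turn_tokens  # user message + model reply join the history
    return total

# Illustrative only; not the measured figures from the benchmark.
print(session_input_tokens(overhead=27_000, turns=3, avg_turn_tokens=1_500))  # → 85500
```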

Why this matters beyond cost

The obvious angle is dollars. At Claude Code's rate, a typical 30-50 turn coding session burns through 1M+ input tokens, and roughly half is framework plumbing.

But there's a less obvious angle: attention.

A 200k context window carrying 28k of harness overhead isn't really a 200k window. It's a ~172k window with worse attention distribution. Every token in that overhead competes for the model's attention against your code, your files, and your actual task. On a complex refactor where the model is trying to hold three source files and a test suite across twenty turns, 28k tokens of framework plumbing aren't sitting quietly. They're noise.
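The same arithmetic as a two-liner, using the window sizes quoted above:

```python
def effective_window(context_window: int, harness_tokens: int):
    """Usable context after fixed harness overhead, plus the share lost to it."""
    return context_window - harness_tokens, harness_tokens / context_window

usable, lost = effective_window(200_000, 28_000)
print(usable, f"{lost:.0%}")  # → 172000 14%  (14% of the window spent on plumbing)
```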

The staleness problem

This is the part I find most interesting. Anthropic's own harness team has been stripping layers out over the last three model generations.

Their Sonnet 4.5 harness needed context resets because the model would start wrapping up prematurely as the window filled. With Opus 4.5, resets became unnecessary and they removed them. With Opus 4.6, they stripped out sprint decomposition entirely and it still worked better.

Three model generations, three layers of harness removed. Load-bearing in January, dead weight by March.

Harnesses encode assumptions about what the model can't do. Those assumptions expire faster than most teams refactor for them.

Is codex just bad then?

No, and I want to be careful here. This was a narrow benchmark: one trivial task, one short session. The deep tooling in Claude Code/codex probably earns its overhead back on complex, long-running tasks that genuinely exercise those 28 tools — multi-file refactors, scheduled work, cron-based automations, agent spawning, etc.

But for most of what people actually use coding agents for (write this script, fix this function, explain this file), you're paying 10x in tokens for plumbing the task doesn't need.

Here's the full writeup: https://portkey.sh/SnEj9sp

u/Wonderful-Agency-210 — 6 days ago

I just finished setting up my OpenClaw. What's Hermes?

Hey community,

I just finished setting up OpenClaw on my M1 MacBook Air with the gpt-5-mini model. And now everyone's yelling about the Hermes agent.

do i need to switch? what's the difference?

u/Wonderful-Agency-210 — 6 days ago
▲ 6 r/LLM_Gateways+1 crossposts

LLM Pricing is 100x Harder Than You Think: We open-sourced our pricing database (3,500+ models, free API)

hey community,

i saw a thread here a couple months ago asking this exact question and it resonated hard.

https://preview.redd.it/umrpmntiejvg1.png?width=1710&format=png&auto=webp&s=5004a95eba8d3dbb7fa343095ff0f85b02965244

I've been building LLM cost infrastructure for Portkey's gateway for the last 3 years and the answer is: it's not solved because the problem is way more complex than it looks.

https://preview.redd.it/6x1efm45fjvg1.png?width=1200&format=png&auto=webp&s=c8708edc728b9019eaa3a9cbd19eef520832dc36

the naive formula (cost = tokens × rate) breaks in at least 6 ways:

  1. thinking tokens — reasoning models consume tokens for internal reasoning that never appear in the response. you still pay. if you only count visible output, you undercount agentic workloads by 30-40%.
  2. cache asymmetry — anthropic charges 25% more for cache writes ($3.75/M vs $3.00/M). openai charges nothing for writes. reads are discounted differently. a single "cache discount" multiplier is wrong for at least one provider.
  3. context thresholds — cross 128K tokens and per-token cost can double. nothing in the API response tells you which tier you hit.
  4. same model, different prices — kimi k2.5: $0.5/$2.8 on together, $0.6/$3.0 on fireworks. bedrock prepends regional prefixes, azure returns deployment names. you need extra logic just to resolve the model ID.
  5. non-token billing — images bill by resolution, video by second, audio has separate i/o rates, embeddings are input-only. each maps to a completely different pricing structure.
  6. new dimensions — started with 2 billing dimensions (input/output tokens). now 20+. web search, grounding, code execution each have their own cost model.
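Points 1 and 2 alone already break the naive formula. Here's a minimal sketch of a cost function that accounts for them; all rates and the `Rates`/`request_cost` names are illustrative, not authoritative pricing:

```python
from dataclasses import dataclass

@dataclass
class Rates:
    """Per-million-token rates in USD. Illustrative numbers only."""
    input: float
    output: float
    cache_write: float  # some providers bill cache writes ABOVE the input rate
    cache_read: float

def request_cost(rates: Rates, input_tokens: int, output_tokens: int,
                 reasoning_tokens: int = 0, cache_write_tokens: int = 0,
                 cache_read_tokens: int = 0) -> float:
    """Cost model covering points 1-2: reasoning tokens bill at the output
    rate even though they never appear in the response, and cached tokens
    bill at their own read/write rates instead of the plain input rate."""
    uncached = input_tokens - cache_write_tokens - cache_read_tokens
    return (
        uncached * rates.input
        + cache_write_tokens * rates.cache_write
        + cache_read_tokens * rates.cache_read
        + (output_tokens + reasoning_tokens) * rates.output
    ) / 1_000_000

# Anthropic-style cache asymmetry from point 2: writes cost more than input.
r = Rates(input=3.00, output=15.00, cache_write=3.75, cache_read=0.30)
print(request_cost(r, input_tokens=30_000, output_tokens=1_000,
                   cache_write_tokens=10_000, cache_read_tokens=15_000))  # → 0.072
```

Note how a flat "cache discount" multiplier can't reproduce this: the write rate is above the input rate while the read rate is far below it.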

and we open-sourced the pricing database we use in production:

  • github+ free API: github.com/portkey-ai/models
  • 3,500+ models, 50+ providers
  • updated daily via an automated agent (claude agent SDK + skill files)
  • MIT license

if you're maintaining a pricing JSON somewhere in your repo, this might help

u/Wonderful-Agency-210 — 7 days ago