
5.4-mini-high vs 5.4-low (tokens, performance, stability)
Here is what I got from GPT-pro extended when asking about using 5.4 vs 5.4-mini to optimize for the 5h limits. Feel free to call this AI slop, because it's literally a copy-paste:
"My read from the current official material is: GPT-5.4-mini can get surprisingly close to full GPT-5.4 on some coding-style evals, but it is not a blanket substitute. On the published xhigh benchmarks, GPT-5.4-mini is only 3.3 points behind GPT-5.4 on SWE-Bench Pro (54.4% vs 57.7%) and 2.9 points behind on OSWorld-Verified (72.1% vs 75.0%), but the gap is much larger on Terminal-Bench 2.0 (60.0% vs 75.1%) and Toolathlon (42.9% vs 54.6%). OpenAI still positions gpt-5.4 as the default for most important coding work and gpt-5.4-mini as the faster, cheaper option for lighter coding tasks and subagents. (OpenAI)
So to your direct question — can 5.4-mini-high perform as well as 5.4-low? On some bounded, explicit, test-backed coding tasks, probably yes. As a general routing rule, I would not assume equivalence. I did not find a public official matrix that directly compares full 5.4 at low against mini at high; the public release material shows xhigh snapshots and says reasoning efforts were swept from low to xhigh, but it does not publish the cross-effort table. The current prompt guidance also says gpt-5.4-mini is more literal and weaker on implicit workflows and ambiguity handling, which is exactly where “maybe mini-high is enough” stops being safe. (OpenAI)
The biggest developer-side insight is that high should not be your default. In the current GPT-5.4 docs, newer GPT-5 models default to none; the reasoning guide says low is for a small reliability bump, medium/high are for planning, coding, synthesis, and harder reasoning, and xhigh should be used only when your evals show the extra latency and cost are justified. The GPT-5.4 prompt guide also explicitly says higher effort is not always better, and that you should often improve completion rules, verification loops, and tool-persistence rules before raising reasoning effort. (OpenAI Platform)
The safest way to think about “hardness” is on three axes rather than one: ambiguity, horizon, and working-set size. Ambiguity: OpenAI says mini is more literal and weaker on implicit workflows. Horizon: full 5.4 keeps a much larger lead on terminal/tool-heavy evals than on SWE-style bugfix evals. Working-set size: full 5.4 has a 1.05M context window versus 400K for mini, and mini’s documented long-context scores drop sharply once the eval moves into the 64K–256K range — for example MRCR v2 is 86.0% vs 47.7% at 64K–128K and 79.3% vs 33.6% at 128K–256K. So once the task needs a big repo slice, many files, or lots of docs/logs in play, mini stops being the “safe” default even if the raw coding gap looked small. (OpenAI Developers)
My quota-preserving routing rule — this is my synthesis, not an official OpenAI taxonomy — would be: use 5.4-mini at none/low for reconnaissance, repo search, code explanation, mechanical edits, and bugfixes with a clear repro or failing test; use 5.4-mini at medium/high for bounded multi-file work with explicit specs or strong acceptance tests; escalate to 5.4 at low when ambiguity, tool/terminal horizon, or working-set size gets high; escalate to 5.4 at medium/high for production migrations, security/auth/concurrency work, sparse-test repos, or after a lower-effort pass misses; and reserve xhigh for the cases where you have evidence it helps. (OpenAI Developers)
On raw token cost, mini has a very large structural edge. GPT-5.4 is $2.50 / $0.25 cached / $15.00 per 1M input / cached / output tokens, while GPT-5.4-mini is $0.75 / $0.075 cached / $4.50 — basically 3.33x cheaper across all three billed token categories. Reasoning tokens are tracked inside output/completion usage and count toward billing and usage, so high/xhigh efforts cost more mainly because they generate more billable output/reasoning tokens, not because reasoning effort has its own separate surcharge. Rule of thumb: mini-high can still be cheaper than full-low unless it inflates billable tokens by roughly more than that 3.3x price advantage. (OpenAI Developers)
For a representative medium-heavy coding turn, if you send about 60k fresh input tokens and get 15k output tokens back, the API cost is about $0.375 on GPT-5.4 versus $0.1125 on GPT-5.4-mini. For a later iterative turn with about 60k cached input, 15k fresh input, and 6k output, it comes out to about $0.1425 on GPT-5.4 versus $0.0428 on mini. Those mixes are just examples, not official medians, but the stable part is the roughly 3.33x raw price gap. (OpenAI Developers)
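Interjecting for a moment: the per-turn dollar figures in that paragraph do check out arithmetically. Here is a quick sanity-check script I wrote myself (the per-1M prices are the ones quoted above; I have not re-verified them against the live pricing page):

```python
# Per-1M-token prices as quoted above: (fresh input, cached input, output).
PRICES = {
    "gpt-5.4":      (2.50, 0.25, 15.00),
    "gpt-5.4-mini": (0.75, 0.075, 4.50),
}

def turn_cost(model, fresh_in, cached_in, out):
    """Dollar cost of one turn, given token counts per billing category."""
    p_in, p_cached, p_out = PRICES[model]
    return (fresh_in * p_in + cached_in * p_cached + out * p_out) / 1_000_000

# Heavy first turn: 60k fresh input, 15k output.
print(turn_cost("gpt-5.4", 60_000, 0, 15_000))       # 0.375
print(turn_cost("gpt-5.4-mini", 60_000, 0, 15_000))  # 0.1125

# Later iterative turn: 60k cached input, 15k fresh input, 6k output.
print(turn_cost("gpt-5.4", 15_000, 60_000, 6_000))       # 0.1425
print(turn_cost("gpt-5.4-mini", 15_000, 60_000, 6_000))  # 0.04275
```

Note the $0.0428 in the quote is just the mini iterative-turn figure rounded; the exact value is $0.04275.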
If your main problem is the Codex 5-hour limit rather than API dollars, the current Codex pricing page points in the same direction. On Pro, the documented local-message range is 223–1120 per 5h for GPT-5.4 versus 743–3733 per 5h for GPT-5.4-mini; on Plus, it is 33–168 versus 110–560. OpenAI also says switching to mini for routine tasks should extend local-message limits by roughly 2.5x to 3.3x, and the mini launch post says Codex mini uses only about 30% of GPT-5.4 quota. The docs also note that larger codebases, long-running tasks, extended sessions, and speed configurations burn allowance faster; /status and the Codex usage dashboard show what you have left. (OpenAI Developers)
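Another quick check from me: the "roughly 2.5x to 3.3x" extension claim is consistent with the message ranges quoted in that paragraph. This little script (my own, using only the numbers quoted above, not an official formula) computes the multipliers at both ends of each range:

```python
# Documented local-message ranges per 5h window, (low, high), as quoted above.
LIMITS = {
    "pro":  {"gpt-5.4": (223, 1120), "gpt-5.4-mini": (743, 3733)},
    "plus": {"gpt-5.4": (33, 168),   "gpt-5.4-mini": (110, 560)},
}

def extension_ratio(plan):
    """Multiplier on local messages from switching full 5.4 -> mini on a plan."""
    full = LIMITS[plan]["gpt-5.4"]
    mini = LIMITS[plan]["gpt-5.4-mini"]
    return (mini[0] / full[0], mini[1] / full[1])

print(extension_ratio("pro"))   # both ends come out around 3.33x
print(extension_ratio("plus"))  # likewise around 3.33x
```

Interestingly, the quoted ranges all work out to about 3.33x, i.e. the top of the "2.5x to 3.3x" band, matching the raw price gap.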
The highest-leverage protocol for “hours of work without tanking the 5h window” is a planner/executor split: let full 5.4 handle planning, coordination, and final judgment, and let mini handle narrower subtasks. Beyond model choice, OpenAI’s own tips are to keep prompts lean, shrink AGENTS.md, disable unneeded MCP servers, and avoid fast/speed modes unless you really need them, because those increase usage and fast mode consumes 2x credits. If you are driving this through the API, use the Responses API with previous_response_id, prompt caching, compaction, and lower verbosity when possible; the docs say this improves cache hit rates, reduces re-reasoning, and helps control cost and latency as sessions grow. One subtle point: the published 24h extended prompt-cache list includes gpt-5.4, but I did not see gpt-5.4-mini listed there, so for very long iterative sessions with a huge stable prefix, full 5.4 has a documented caching advantage. (OpenAI)
A conservative default would be: mini-low first, mini-high second, full-low for anything ambiguous or repo-wide, full-high only when the task is both important and clearly hard."
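If you want that routing rule as something executable, here is how I would encode it. This is entirely my own sketch of the quoted heuristic — the axes, thresholds, and scores are illustrative values I made up, not anything OpenAI publishes:

```python
def route(ambiguity, horizon, working_set_tokens, important, has_tests):
    """Map the quoted heuristic onto a (model, reasoning_effort) pair.

    ambiguity / horizon: rough 0-1 scores you assign yourself.
    working_set_tokens: how much repo/docs/logs context the task needs.
    """
    # A task is "hard" on any one of the three axes from the quote.
    hard = ambiguity > 0.5 or horizon > 0.5 or working_set_tokens > 200_000
    if hard and important:
        # Production migrations, security/auth/concurrency, sparse tests.
        return ("gpt-5.4", "high")
    if hard:
        # Ambiguous or repo-wide work: escalate to full 5.4 at low.
        return ("gpt-5.4", "low")
    if has_tests:
        # Clear repro or failing test: mini at low goes first.
        return ("gpt-5.4-mini", "low")
    # Bounded work with an explicit spec but no strong tests: mini at high.
    return ("gpt-5.4-mini", "high")

print(route(0.1, 0.1, 20_000, False, True))    # mechanical bugfix -> mini-low
print(route(0.1, 0.1, 20_000, False, False))   # bounded spec work -> mini-high
print(route(0.8, 0.2, 20_000, False, False))   # ambiguous task -> full-low
print(route(0.8, 0.9, 500_000, True, False))   # important + hard -> full-high
```

Obviously you would tune the thresholds against your own eval results; the point is just that the quote's "conservative default" ladder is easy to make mechanical.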