IME tok/s is only useful with the additional context of TTFT and total latency. At this point a given closed model does not exist in a vacuum but rather in a wider architecture that affects the actual performance profile for an API consumer.
This isn't usually an issue comparing models within the same provider, but it does mean cross-provider comparison using only tok/s is not apples-to-apples in terms of real-world performance.
Exactly. Really frustrating that they don't advertise TTFT etc., and that it's really hard to find any info in that regard for newer models.
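Lacking published numbers, a rough way to get them yourself is to stream a completion and time the first content chunk. A minimal sketch with the OpenAI Python SDK; the model ID and prompt are placeholders, and chunk count is only a crude proxy for tokens:

```python
import time
from openai import OpenAI

client = OpenAI()

def measure(model: str, prompt: str):
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,  # placeholder model ID
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    ttft, chunks = None, 0
    for chunk in stream:
        delta = chunk.choices[0].delta.content if chunk.choices else None
        if delta:
            if ttft is None:
                ttft = time.perf_counter() - start  # time to first visible token
            chunks += 1
    total = time.perf_counter() - start
    # chunk count is a rough token proxy; it's good enough for comparing models
    return {"ttft_s": ttft, "total_s": total, "chunks": chunks}

print(measure("gpt-5.4-mini", "Summarize the plot of Hamlet in one paragraph."))
```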
For voice agents, gpt-4.1 and gpt-4.1-mini seem to be the best low-latency models when you need to handle bigger data and more complex asks.
But they are a year old, and trying to figure out whether these new models (instant, chat, realtime, mini, nano, wtf) are a good upgrade is very frustrating. AFAICT they aren't; the TTFT latencies are too high.
Yeah, this speed is excellent! I'm using GPT-5 mini for my "AI tour guide" (it simply summarizes Wikipedia articles for me on the fly, presented in my app based on geolocation), and it's always been a ~15 second wait for me before the streaming of a large article summary begins. With GPT-5.4 it's around 2-3 seconds, and the quality seems at least as good. This is a huge UX improvement; it really starts to feel more 'real time'.
Curious to hear why people pick GPT and Claude over Google (when sometimes you’d think they have a natural advantage on costs, resources and business model etc)?
In my workplace, it's availability. We have to use US-only models for government-compliance reasons, so we have access to Opus 4.6 and GPT 5.4, but only Gemini 2.5, which isn't in the same class as the first two.
I wish someone would also thoroughly measure prompt processing speeds across the major providers. Output speeds are useful too, but they're already the more commonly measured metric.
In my use case for small models I typically only generate a max of 100 tokens per API call, with the prompt processing taking up the majority of the wait time from the user perspective. I found OAI's models to be quite poor at this and made the switch to Anthropic's API just for this.
I've found Haiku to be pretty fast at PP, but would be willing to investigate using another provider if they offer faster speeds.
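One way to approximate prompt-processing speed yourself: send a large prompt, cap the output, and time how long the first token takes to arrive. A sketch against the Anthropic SDK, since that's what's being compared here; the model ID and prompt size are assumptions:

```python
import time
import anthropic

client = anthropic.Anthropic()
big_prompt = "Summarize this log:\n" + ("GET /healthz 200\n" * 5_000)  # large prefill

start = time.perf_counter()
with client.messages.stream(
    model="claude-haiku-4-5",  # placeholder model ID
    max_tokens=16,             # keep generation tiny so prefill dominates the timing
    messages=[{"role": "user", "content": big_prompt}],
) as stream:
    for _ in stream.text_stream:
        print(f"first token after {time.perf_counter() - start:.2f}s")
        break
```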
Token/sec is meaningless without the thinking level. If a model is fast but keeps rambling on instead of getting to the point, it can take much longer overall than a low-token/sec model with little or no thinking.
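To make that concrete, here's a sketch comparing wall-clock time at different reasoning efforts for the same prompt. The reasoning_effort values and the model ID are assumptions and may differ per model; the point is that the user feels wall clock, not tok/s:

```python
import time
from openai import OpenAI

client = OpenAI()
for effort in ("minimal", "high"):
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="gpt-5.4-mini",  # placeholder model ID
        messages=[{"role": "user", "content": "What's the capital of Australia?"}],
        reasoning_effort=effort,
    )
    print(effort, f"{time.perf_counter() - start:.1f}s wall clock,",
          resp.usage.completion_tokens, "completion tokens (reasoning included)")
```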
To me, mini releases matter much more than SOTA models and better reflect the real progress.
The frontier models have become so good that it's getting almost impossible to notice meaningful differences between them.
Meanwhile, when a smaller / less powerful model releases a new version, the jump in quality is often massive, to the point where we can now use them 100% of the time in many cases.
And since they're also getting dramatically cheaper, it's becoming increasingly compelling to actually run these models in real-life applications.
If you're doing something common then maybe there are no differences with SOTA, but I've noticed a few. GPT 5.4 isn't as good at UI work in Svelte. Gemini tends to go off and implement stuff even when I prompt it to just discuss, but it's pretty good at UI code. Claude tends to find out less about my codebase than GPT, and it abuses the any type in TypeScript.
They are cheaper than SOTA, but they're not getting dramatically cheaper; actually the opposite: GPT 5.4 mini is around ~3x more expensive than GPT 5 mini.
Similarly gemini 3.1 flash lite got more expensive than gemini 2.5 flash lite.
I use Gemini via its web app, which aggressively autoswitches to Flash over Pro, but I usually notice quickly because the answers are weird or the logic doesn't quite follow. I feel like, at least for 'daily driver' usage, small models are still a little disappointing. That said, they're getting very good for more automation-y work with simple, well-constrained tasks.
> And since they're also getting dramatically cheaper, it's becoming increasingly compelling to actually run these models in real-life applications.
They're not really cheaper than the SOTA open models on third-party inference platforms, and they're generally dumber. I suppose they're still worth it if you must minimize latency for any given level of smarts, but not really otherwise.
I quite like the GPT models when chatting with them (in fact, they're probably my favorites), but for agentic work I only had bad experiences with them.
They're incredibly slow (via the official API or OpenRouter), but most of all they seem not to understand the instructions I give them. I'm sure I'm _holding them wrong_, in the sense that I'm not tailoring my prompt for them, but most other models have no problem with the exact same prompt. Does anybody else have a similar experience?
According to their benchmarks, GPT 5.4 Nano > GPT-5-mini in most areas, but I'm noticing models are getting more expensive and not actually getting cheaper?
Why are we treating LLM evaluation like a vibe check rather than an engineering problem?
Most "Model X > Model Y" takes on HN these days (and everywhere) seem based on an hour of unscientific manual prompting. Are we actually running rigorous, version-controlled evals, or just making architectural decisions based on whether a model nailed a regex on the first try this morning?
The OSWorld numbers are kinda getting lost in the pricing discussion but imo that's the most interesting part. Mini at 72.1% vs 72.4% human baseline is basically noise, so why not just use mini by default unless you're hitting specific failure modes.
Also context bleed into nano subagents in multi-model pipelines — I've seen orchestrators that just forward the entire message history by default (or something like messages[-N:] without any real budgeting), so your "cheap" extraction step suddenly runs with 30-50K tokens of irrelevant context. And then what's even the point, you've eaten the latency/cost win and added truncation risk on top.
Has anyone actually measured where that cutoff is in practice? At what context size does nano stop being meaningfully cheaper/faster in real pipelines, not benchmarks?
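For reference, the fix is usually only a few lines: budget the forwarded history by tokens rather than by message count. A sketch using tiktoken; the encoding name and the 4K budget are assumptions:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # assumed encoding for newer models

def budget_messages(messages, max_tokens=4_000):
    """Keep system messages plus the most recent turns that fit the token budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(len(enc.encode(m["content"])) for m in system)
    kept = []
    for m in reversed(rest):  # walk from newest to oldest
        cost = len(enc.encode(m["content"]))
        if used + cost > max_tokens:
            break
        kept.append(m)
        used += cost
    return system + list(reversed(kept))
```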
Based on SWE-bench, it seems like 5.4 mini at high effort is roughly equal to GPT 5.4 at low effort in terms of accuracy and price, but the latency for mini is considerably higher: 254 seconds vs 171 seconds for GPT 5.4. Probably a good option to run at lower effort levels to keep costs down for simpler tasks. Long-context performance is also not great.
5.4 Mini's OSWorld score is a pleasant surprise. When SOTA scores were still ~30-40, models were too slow and inaccurate for realtime computer-use agents (RIP Operator/Agent). Curious if anyone's been using these in production.
One thing I really want to figure out is which model to use, and how, to process tons of PDFs very fast and very accurately: predicting invoice dates, accrual accounting, and other accounting-related purposes. So a decently smart model that is really good at PDF and image reading, while still being very fast.
I switched to Claude when I found ChatGPT would argue with just about anything I said, even when it was wrong. They have over-optimised anti-sycophancy. I want a model that simulates critical thinking, not one that repeats half-baked, often incomplete dogmas. The ChatGPT 5.x range is extraordinarily powerful but also extraordinarily frustrating to use for anything creative or productive that is original, in my opinion. Claude is basically able to think critically while being neither sycophantic nor argumentative most of the time, in my opinion, with appropriate user prompting. Recent ChatGPTs seem to fight me every step of the way when not doing boilerplate. I don't want to waste my time fighting a tool.
I've been struggling to find a reasonably priced model to use with my toy openclaw instance. Opus 4.6 felt kind of magical, but it's just too expensive and I'm not risking my Max subscription for it.
GPT 5.4 mini is the first alternative that is both affordable and decent. Pretty impressed. On a $20 codex plan I think I'm pretty set and the value is there for me.
As a big Codex user, with many smaller requests, this one is the highlight: "In Codex, GPT‑5.4 mini is available across the Codex app, CLI, IDE extension and web. It uses only 30% of the GPT‑5.4 quota, letting developers quickly handle simpler coding tasks in Codex for about one-third the cost." + Subagents support will be huge.
The Nano tier is the one I'm watching. For agent workflows where you're making dozens of LLM calls per task, the cost per call matters more than peak capability. Would be interesting to see benchmarks on function calling latency specifically — that's what matters for agents.
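That's straightforward to measure yourself: time a single tool-call round trip. A sketch; the tool schema and model ID are made up:

```python
import time
from openai import OpenAI

client = OpenAI()
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

start = time.perf_counter()
resp = client.chat.completions.create(
    model="gpt-5.4-nano",  # placeholder model ID
    messages=[{"role": "user", "content": "What's the weather in Oslo right now?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls, f"{time.perf_counter() - start:.2f}s")
```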
- Older GPT-5 Mini is about 55-60 tokens/s on API normally, 115-120 t/s when used with service_tier="priority" (2x cost).
- GPT-5.4 Mini averages about 180-190 t/s on API. Priority does nothing for it currently.
- GPT-5.4 Nano is at about 200 t/s.
To put this into perspective, Gemini 3 Flash is about 130 t/s on Gemini API and about 120 t/s on Vertex.
This is raw tokens/s for all models; it doesn't exclude reasoning tokens, but I ran the models with none/minimal effort where supported.
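For anyone wanting to reproduce numbers like these: raw tok/s here is just total completion tokens (reasoning included) over wall-clock time. A sketch; the model ID is a placeholder, and service_tier="priority" is the 2x-cost tier mentioned above:

```python
import time
from openai import OpenAI

client = OpenAI()
start = time.perf_counter()
resp = client.chat.completions.create(
    model="gpt-5-mini",  # placeholder model ID
    messages=[{"role": "user", "content": "Write a 500-word story about a lighthouse."}],
    service_tier="priority",
)
elapsed = time.perf_counter() - start
# usage.completion_tokens includes any reasoning tokens, matching the "raw" numbers above
print(f"{resp.usage.completion_tokens / elapsed:.0f} tok/s over {elapsed:.1f}s")
```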
And quick price comparisons:
- Claude: Opus 4.6 is $5/$25, Sonnet 4.6 is $3/$15, Haiku 4.5 is $1/$5
- GPT: 5.4 is $2.5/$15 ($5/$22.5 for >200K context), 5.4 Mini is $0.75/$4.5, 5.4 Nano is $0.2/$1.25
- Gemini: 3.1 Pro is $2/$12 ($3/$18 for >200K context), 3 Flash is $0.5/$3, 3.1 Flash Lite is $0.25/$1.5
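To turn those $/1M figures into per-call costs (standard-context tiers, no caching discounts), a quick sketch:

```python
# (input $/M tokens, output $/M tokens), from the comparison above
PRICES = {
    "opus-4.6":       (5.00, 25.00),
    "sonnet-4.6":     (3.00, 15.00),
    "gpt-5.4":        (2.50, 15.00),
    "gpt-5.4-mini":   (0.75, 4.50),
    "gemini-3-flash": (0.50, 3.00),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    inp, out = PRICES[model]
    return input_tokens / 1e6 * inp + output_tokens / 1e6 * out

# e.g. a 20K-token prompt with a 1K-token answer:
for m in ("gpt-5.4", "gpt-5.4-mini"):
    print(m, f"${call_cost(m, 20_000, 1_000):.4f}")  # ~$0.0650 vs ~$0.0195
```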
gpt-5.4 is now really good for tricky problems too. We only take opus-4.6 for the unsolvable ones, or if someone pays for it.
5.4 mini seems to be a lot more wild/unstable, but with this instability it gets the right answer more often.
https://aibenchy.com/compare/openai-gpt-5-4-mini-medium/open...
> 100% of the time in many cases
So, every single time, the new model works most of the time?
GPT 5 mini: Input $0.25 / Output $2.00
GPT 5 nano: Input $0.05 / Output $0.40
GPT 5.4 mini: Input $0.75 / Output $4.50
GPT 5.4 nano: Input $0.20 / Output $1.25
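From those numbers, the increases work out to roughly 2-4x:

```python
old = {"mini": (0.25, 2.00), "nano": (0.05, 0.40)}  # GPT-5 era, $/M in, $/M out
new = {"mini": (0.75, 4.50), "nano": (0.20, 1.25)}  # GPT-5.4 era
for tier in old:
    i = new[tier][0] / old[tier][0]
    o = new[tier][1] / old[tier][1]
    print(f"{tier}: input x{i:g}, output x{o:g}")
# mini: input x3, output x2.25
# nano: input x4, output x3.125
```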
Most "Model X > Model Y" takes on HN these days (and everywhere) seem based on an hour of unscientific manual prompting. Are we actually running rigorous, version-controlled evals, or just making architectural decisions based on whether a model nailed a regex on the first try this morning?
Also context bleed into nano subagents in multi-model pipelines — I've seen orchestrators that just forward the entire message history by default (or something like messages[-N:] without any real budgeting), so your "cheap" extraction step suddenly runs with 30-50K tokens of irrelevant context. And then what's even the point, you've eaten the latency/cost win and added truncation risk on top.
Has anyone actually measured where that cutoff is in practice? At what context size nano stops being meaningfully cheaper/faster in real pipelines, not benchmarks.
For many "simple" LLM tasks, GPT-5-mini was sufficient 99% of the time. Hopefully these models will do even more and closer to 100% accuracy.
The prices are up 2-4x compared to GPT-5-mini and nano. Were those models just loss leaders, or are these substantially larger/better?
Did GPT write them?
5.4 mini seems to struggle with consistency, and even with temperature 0 sometimes gives the correct response, sometimes a wrong one...
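A quick way to quantify that is to repeat the same prompt N times and count distinct answers (temperature 0 was never a hard determinism guarantee anyway). A sketch; the model ID and prompt are placeholders:

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()

def consistency(model: str, prompt: str, n: int = 10) -> Counter:
    answers = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model,  # placeholder model ID
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        answers.append((resp.choices[0].message.content or "").strip())
    return Counter(answers)

print(consistency("gpt-5.4-mini", "Is 1013 prime? Answer yes or no only."))
```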
[0]: https://aibenchy.com/compare/openai-gpt-5-4-medium/openai-gp...