IME tok/s is only useful with the additional context of TTFT and total latency. At this point a given closed model does not exist in a vacuum but rather in a wider architecture that affects the actual performance profile for an API consumer.
This isn't usually an issue comparing models within the same provider, but it does mean cross-provider comparison using only tok/s is not apples-to-apples in terms of real-world performance.
Exactly. Really frustrating that they don't advertise TTFT etc., and that it's really hard to find any info in that regard for newer models.
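Lacking published numbers, a rough way to get them yourself is to stream a completion and time the first content chunk. A minimal sketch with the OpenAI Python SDK; the model ID and prompt are placeholders, and chunk count is only a crude proxy for tokens:

```python
import time
from openai import OpenAI

client = OpenAI()

def measure(model: str, prompt: str):
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,  # placeholder model ID
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    ttft, chunks = None, 0
    for chunk in stream:
        delta = chunk.choices[0].delta.content if chunk.choices else None
        if delta:
            if ttft is None:
                ttft = time.perf_counter() - start  # time to first visible token
            chunks += 1
    total = time.perf_counter() - start
    # chunk count is a rough token proxy; it's good enough for comparing models
    return {"ttft_s": ttft, "total_s": total, "chunks": chunks}

print(measure("gpt-5.4-mini", "Summarize the plot of Hamlet in one paragraph."))
```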
For voice agents, gpt-4.1 and gpt-4.1-mini seem to be the best low-latency models when you need to handle bigger data and more complex asks.
But they are a year old, and trying to figure out whether these new models (instant, chat, realtime, mini, nano, wtf) are a good upgrade is very frustrating. AFAICT they aren't; the TTFT latencies are too high.
Yeah, this speed is excellent! I'm using GPT-5 mini for my "AI tour guide" (it simply summarizes Wikipedia articles for me on the fly, presented in my app based on geolocation), and it's always been a ~15 second wait for me before the streaming of a large article summary begins. With GPT-5.4 it's around 2-3 seconds, and the quality seems at least as good. This is a huge UX improvement; it really starts to feel more 'real time'.
Curious to hear why people pick GPT and Claude over Google (when sometimes you’d think they have a natural advantage on costs, resources and business model etc)?
In my workplace, it's availability. We have to use US-only models for government-compliance reasons, so we have access to Opus 4.6 and GPT 5.4, but only Gemini 2.5, which isn't in the same class as the first two.
I wish someone would also thoroughly measure prompt processing speeds across the major providers. Output speeds are useful too, but they're already the more commonly measured metric.
In my use case for small models I typically only generate a max of 100 tokens per API call, with the prompt processing taking up the majority of the wait time from the user perspective. I found OAI's models to be quite poor at this and made the switch to Anthropic's API just for this.
I've found Haiku to be pretty fast at PP, but would be willing to investigate using another provider if they offer faster speeds.
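One way to approximate prompt-processing speed yourself: send a large prompt, cap the output, and time how long the first token takes to arrive. A sketch against the Anthropic SDK, since that's what's being compared here; the model ID and prompt size are assumptions:

```python
import time
import anthropic

client = anthropic.Anthropic()
big_prompt = "Summarize this log:\n" + ("GET /healthz 200\n" * 5_000)  # large prefill

start = time.perf_counter()
with client.messages.stream(
    model="claude-haiku-4-5",  # placeholder model ID
    max_tokens=16,             # keep generation tiny so prefill dominates the timing
    messages=[{"role": "user", "content": big_prompt}],
) as stream:
    for _ in stream.text_stream:
        print(f"first token after {time.perf_counter() - start:.2f}s")
        break
```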
Token/sec is meaningless without the thinking level. If a model is fast but keeps rambling on instead of getting to the point, it can take much longer overall than a low-token/sec model with little or no thinking.
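To make that concrete, here's a sketch comparing wall-clock time at different reasoning efforts for the same prompt. The reasoning_effort values and the model ID are assumptions and may differ per model; the point is that the user feels wall clock, not tok/s:

```python
import time
from openai import OpenAI

client = OpenAI()
for effort in ("minimal", "high"):
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="gpt-5.4-mini",  # placeholder model ID
        messages=[{"role": "user", "content": "What's the capital of Australia?"}],
        reasoning_effort=effort,
    )
    print(effort, f"{time.perf_counter() - start:.1f}s wall clock,",
          resp.usage.completion_tokens, "completion tokens (reasoning included)")
```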
To me, mini releases matter much more than SOTA models and better reflect the real progress.
The frontier models have become so good that it's getting almost impossible to notice meaningful differences between them.
Meanwhile, when a smaller / less powerful model releases a new version, the jump in quality is often massive, to the point where we can now use them 100% of the time in many cases.
And since they're also getting dramatically cheaper, it's becoming increasingly compelling to actually run these models in real-life applications.
If you're doing something common then maybe there are no differences with SOTA, but I've noticed a few. GPT 5.4 isn't as good at UI work in Svelte. Gemini tends to go off and implement stuff even when I prompt it to just discuss, but it's pretty good at UI code. Claude tends to find out less about my codebase than GPT, and it abuses the any type in TypeScript.
They are cheaper than SOTA, but they're not getting dramatically cheaper; actually the opposite: GPT 5.4 mini is around ~3x more expensive than GPT 5 mini.
Similarly gemini 3.1 flash lite got more expensive than gemini 2.5 flash lite.
I use Gemini via its web app, which aggressively autoswitches to Flash over Pro, but I usually notice quickly because the answers are weird or the logic doesn't quite follow. I feel like, at least for 'daily driver' usage, small models are still a little disappointing. That said, they're getting very good for more automation-y work with simple, well-constrained tasks.
> And since they're also getting dramatically cheaper, it's becoming increasingly compelling to actually run these models in real-life applications.
They're not really cheaper than the SOTA open models on third-party inference platforms, and they're generally dumber. I suppose they're still worth it if you must minimize latency for any given level of smarts, but not really otherwise.
I quite like the GPT models when chatting with them (in fact, they're probably my favorites), but for agentic work I only had bad experiences with them.
They're incredibly slow (via the official API or OpenRouter), but most of all they seem not to understand the instructions I give them. I'm sure I'm _holding them wrong_, in the sense that I'm not tailoring my prompt for them, but most other models have no problem with the exact same prompt. Does anybody else have a similar experience?
According to their benchmarks, GPT 5.4 Nano > GPT-5-mini in most areas, but I'm noticing models are getting more expensive and not actually getting cheaper?
Why are we treating LLM evaluation like a vibe check rather than an engineering problem?
Most "Model X > Model Y" takes on HN these days (and everywhere) seem based on an hour of unscientific manual prompting. Are we actually running rigorous, version-controlled evals, or just making architectural decisions based on whether a model nailed a regex on the first try this morning?
The OSWorld numbers are kinda getting lost in the pricing discussion but imo that's the most interesting part. Mini at 72.1% vs 72.4% human baseline is basically noise, so why not just use mini by default unless you're hitting specific failure modes.
Also context bleed into nano subagents in multi-model pipelines — I've seen orchestrators that just forward the entire message history by default (or something like messages[-N:] without any real budgeting), so your "cheap" extraction step suddenly runs with 30-50K tokens of irrelevant context. And then what's even the point, you've eaten the latency/cost win and added truncation risk on top.
Has anyone actually measured where that cutoff is in practice? At what context size does nano stop being meaningfully cheaper/faster in real pipelines, not benchmarks?
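For reference, the fix is usually only a few lines: budget the forwarded history by tokens rather than by message count. A sketch using tiktoken; the encoding name and the 4K budget are assumptions:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # assumed encoding for newer models

def budget_messages(messages, max_tokens=4_000):
    """Keep system messages plus the most recent turns that fit the token budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(len(enc.encode(m["content"])) for m in system)
    kept = []
    for m in reversed(rest):  # walk from newest to oldest
        cost = len(enc.encode(m["content"]))
        if used + cost > max_tokens:
            break
        kept.append(m)
        used += cost
    return system + list(reversed(kept))
```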
Based on SWE-bench, it seems like 5.4 mini at high effort is roughly equal to GPT 5.4 at low effort in terms of accuracy and price, but the latency for mini is considerably higher: 254 seconds vs 171 seconds for GPT 5.4. Probably a good option to run at lower effort levels to keep costs down for simpler tasks. Long-context performance is also not great.
5.4 Mini's OSWorld score is a pleasant surprise. When SOTA scores were still ~30-40, models were too slow and inaccurate for realtime computer-use agents (RIP Operator/Agent). Curious if anyone's been using these in production.
One thing I really want to figure out is which model to use, and how, to process tons of PDFs very fast and very accurately: predicting invoice dates, accrual accounting, and other accounting-related purposes. So a decently smart model that is really good at PDF and image reading, while still being very fast.
I switched to Claude when I found ChatGPT would argue with just about anything I said, even when it was wrong. They have over-optimised anti-sycophancy. I want a model that simulates critical thinking, not one that repeats half-baked, often incomplete dogmas. The ChatGPT 5.x range is extraordinarily powerful but also extraordinarily frustrating to use for anything creative or productive that is original, in my opinion. Claude is basically able to think critically while being neither sycophantic nor argumentative most of the time, in my opinion, with appropriate user prompting. Recent ChatGPTs seem to fight me every step of the way when not doing boilerplate. I don't want to waste my time fighting a tool.
I've been struggling to find a reasonably priced model to use with my toy openclaw instance. Opus 4.6 felt kind of magical, but it's just too expensive and I'm not risking my Max subscription for it.
GPT 5.4 mini is the first alternative that is both affordable and decent. Pretty impressed. On a $20 codex plan I think I'm pretty set and the value is there for me.
As a big Codex user, with many smaller requests, this one is the highlight: "In Codex, GPT‑5.4 mini is available across the Codex app, CLI, IDE extension and web. It uses only 30% of the GPT‑5.4 quota, letting developers quickly handle simpler coding tasks in Codex for about one-third the cost." + Subagents support will be huge.
The Nano tier is the one I'm watching. For agent workflows where you're making dozens of LLM calls per task, the cost per call matters more than peak capability. Would be interesting to see benchmarks on function calling latency specifically — that's what matters for agents.
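That's straightforward to measure yourself: time a single tool-call round trip. A sketch; the tool schema and model ID are made up:

```python
import time
from openai import OpenAI

client = OpenAI()
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

start = time.perf_counter()
resp = client.chat.completions.create(
    model="gpt-5.4-nano",  # placeholder model ID
    messages=[{"role": "user", "content": "What's the weather in Oslo right now?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls, f"{time.perf_counter() - start:.2f}s")
```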
- Older GPT-5 Mini is about 55-60 tokens/s on API normally, 115-120 t/s when used with service_tier="priority" (2x cost).
- GPT-5.4 Mini averages about 180-190 t/s on API. Priority does nothing for it currently.
- GPT-5.4 Nano is at about 200 t/s.
To put this into perspective, Gemini 3 Flash is about 130 t/s on Gemini API and about 120 t/s on Vertex.
This is raw tokens/s for all models; it doesn't exclude reasoning tokens, but I ran the models with none/minimal effort where supported.
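For anyone wanting to reproduce numbers like these: raw tok/s here is just total completion tokens (reasoning included) over wall-clock time. A sketch; the model ID is a placeholder, and service_tier="priority" is the 2x-cost tier mentioned above:

```python
import time
from openai import OpenAI

client = OpenAI()
start = time.perf_counter()
resp = client.chat.completions.create(
    model="gpt-5-mini",  # placeholder model ID
    messages=[{"role": "user", "content": "Write a 500-word story about a lighthouse."}],
    service_tier="priority",
)
elapsed = time.perf_counter() - start
# usage.completion_tokens includes any reasoning tokens, matching the "raw" numbers above
print(f"{resp.usage.completion_tokens / elapsed:.0f} tok/s over {elapsed:.1f}s")
```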
And quick price comparisons:
- Claude: Opus 4.6 is $5/$25, Sonnet 4.6 is $3/$15, Haiku 4.5 is $1/$5
- GPT: 5.4 is $2.5/$15 ($5/$22.5 for >200K context), 5.4 Mini is $0.75/$4.5, 5.4 Nano is $0.2/$1.25
- Gemini: 3.1 Pro is $2/$12 ($3/$18 for >200K context), 3 Flash is $0.5/$3, 3.1 Flash Lite is $0.25/$1.5
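To turn those $/1M figures into per-call costs (standard-context tiers, no caching discounts), a quick sketch:

```python
# (input $/M tokens, output $/M tokens), from the comparison above
PRICES = {
    "opus-4.6":       (5.00, 25.00),
    "sonnet-4.6":     (3.00, 15.00),
    "gpt-5.4":        (2.50, 15.00),
    "gpt-5.4-mini":   (0.75, 4.50),
    "gemini-3-flash": (0.50, 3.00),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    inp, out = PRICES[model]
    return input_tokens / 1e6 * inp + output_tokens / 1e6 * out

# e.g. a 20K-token prompt with a 1K-token answer:
for m in ("gpt-5.4", "gpt-5.4-mini"):
    print(m, f"${call_cost(m, 20_000, 1_000):.4f}")  # ~$0.0650 vs ~$0.0195
```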
gpt-5.4 is now really good for tricky problems too. We only take opus-4.6 for the unsolvable ones, or if someone pays for it.
5.4 mini seems to be a lot more wild/unstable, but with this instability it gets the right answer more often.
https://aibenchy.com/compare/openai-gpt-5-4-mini-medium/open...
> 100% of the time in many cases
So, every single time, the new model works most of the time?
GPT 5 mini: Input $0.25 / Output $2.00
GPT 5 nano: Input $0.05 / Output $0.40
GPT 5.4 mini: Input $0.75 / Output $4.50
GPT 5.4 nano: Input $0.20 / Output $1.25
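From those numbers, the increases work out to roughly 2-4x:

```python
old = {"mini": (0.25, 2.00), "nano": (0.05, 0.40)}  # GPT-5 era, $/M in, $/M out
new = {"mini": (0.75, 4.50), "nano": (0.20, 1.25)}  # GPT-5.4 era
for tier in old:
    i = new[tier][0] / old[tier][0]
    o = new[tier][1] / old[tier][1]
    print(f"{tier}: input x{i:g}, output x{o:g}")
# mini: input x3, output x2.25
# nano: input x4, output x3.125
```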
Most "Model X > Model Y" takes on HN these days (and everywhere) seem based on an hour of unscientific manual prompting. Are we actually running rigorous, version-controlled evals, or just making architectural decisions based on whether a model nailed a regex on the first try this morning?
Also context bleed into nano subagents in multi-model pipelines — I've seen orchestrators that just forward the entire message history by default (or something like messages[-N:] without any real budgeting), so your "cheap" extraction step suddenly runs with 30-50K tokens of irrelevant context. And then what's even the point, you've eaten the latency/cost win and added truncation risk on top.
Has anyone actually measured where that cutoff is in practice? At what context size nano stops being meaningfully cheaper/faster in real pipelines, not benchmarks.
For many "simple" LLM tasks, GPT-5-mini was sufficient 99% of the time. Hopefully these models will do even more and closer to 100% accuracy.
The prices are up 2-4x compared to GPT-5-mini and nano. Were those models just loss leaders, or are these substantially larger/better?
Did GPT write them?
5.4 mini seems to struggle with consistency, and even with temperature 0 sometimes gives the correct response, sometimes a wrong one...
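A quick way to quantify that is to repeat the same prompt N times and count distinct answers (temperature 0 was never a hard determinism guarantee anyway). A sketch; the model ID and prompt are placeholders:

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()

def consistency(model: str, prompt: str, n: int = 10) -> Counter:
    answers = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model,  # placeholder model ID
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        answers.append((resp.choices[0].message.content or "").strip())
    return Counter(answers)

print(consistency("gpt-5.4-mini", "Is 1013 prime? Answer yes or no only."))
```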
[0]: https://aibenchy.com/compare/openai-gpt-5-4-medium/openai-gp...