Very good move. In my experience, for systems programming at least, GPT 5.4 xhigh is vastly superior to Claude Opus 4.6 at max effort. I ran many brutal tests, including reconstructing for QEMU the SCSI controller (no longer accessible) of a SVSY UNIX from the early '90s used on a 386. Side by side, always re-mirroring the source trees each time one made a breakthrough in the implementation. Well, GPT 5.4 single-handedly did it all, while Opus kept taking wrong paths. The same for my Redis bug tracking and development. But $200 is too much for many people (right now, at least: the reality is that if frontier LLMs are not democratized, we will end up paying the equivalent of house rent to a few providers). And while GPT 5.4 is much stronger, it is slower and less sharp when the task is simple, so many people went for Claude (also because of better marketing and ethical concerns, though my POV is different there: both companies sell LLM models with similar capabilities and similar internal IP protection and so forth, so in practical terms they look very similar to me). This will surely change things, and I bet many people will end up with a Claude 5x account plus a Codex 5x account.
GPT 5.4 is the surly physics PhD post-doc who slowly and angrily sits in a basement to write brilliant, undocumented, uncommented code that encapsulates a breakthrough algorithm.
Opus 4.6 is the L5 new hire SWE keen to prove their chops and quickly turn out totally reasonable code with putatively defensible reasons for doing it that way (that are sometimes tragically wrong) and then catch an after-work yoga class with you.
> and then catch an after-work yoga class with you.
That's cute, but do you mean something concrete with this? I.e., is there some non-coding prompting you use it for that you're referring to, or is it simply a throwaway line about L5 SWEs (at a FAANG)?
(FWIW, I find myself using ChatGPT rather than Claude for non-coding prompting, like random questions such as whether oil is fungible, for some reason.)
Thanks for confirming my impressions; it's been about 4 months since I arrived at the same conclusions. GPT models are just better at any kind of low-level work: reverse engineering, including understanding what decompiled code/assembly does and renaming that decompiled code (functions/types), any kind of C/C++, and much more reliable security research (Opus will find way more, but most of it turns out to be false positives). I've had GPT create non-trivial custom decompilers for me for binaries built with specific compilers (a much simpler task than what IDA Pro/Ghidra do, but still complex), and modify existing Java decompilers.
Regarding speed, I don't use xhigh that often, and surprisingly for me GPT 5.4 high is faster than Claude 4.6 Opus high (unless you enable fast mode for Opus).
Of course I still use Opus for frontend, for some small scripts, and for criticizing GPT's code style, especially in Python (getattr).
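As a hypothetical illustration (not from the thread) of the `getattr`-heavy Python style being criticized here, compared with plain attribute access:

```python
# A minimal config class standing in for the kind of object involved.
class Config:
    retries = 3

cfg = Config()

# Defensive style some models tend to emit: silent fallbacks everywhere,
# which hide typos and make the real data flow hard to see.
retries = getattr(cfg, "retries", None) or 3

# More idiomatic: access the attribute directly; a genuinely missing
# attribute then fails loudly with an AttributeError instead of being masked.
retries = cfg.retries
```

The second form is usually preferable in application code, since the fallback in the first form can silently swallow a misspelled attribute name.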
+1 to this, I've found GPT/Codex models consistently stronger in engineering tasks (such as debugging complex, cross-systems issues, concurrency problems, etc).
I use both OpenAI and Anthropic models, though for different purposes. What surprises me is how underrated GPT still feels (or, alternatively, how overhyped Anthropic models can be) given how capable it is in these scenarios. There also seems to be relatively little recognition of this in the broader community (like your recent YouTube video). My guess is that demand skews toward general codegen rather than the kind of deep debugging and systems work where these differences really show.
My non-scientific testing suggests that GPT models follow prompts literally. Every time I give one an example, it uses the example in a literal sense instead of using it to enhance its understanding of the ask. This is a good thing if I want it to follow instructions, but bad if I want it to be creative. I have to tell it that the examples I gave are just examples and are not to be used in the output. I feel comfortable using it when I have everything mapped out.
Claude, on the other hand, can be creative. It understands that examples are for reference purposes only. But there are times it decides to go off on a tangent on its own and not follow instructions closely. I find it useful for bouncing ideas around or testing something new.
The other thing I notice is Claude has slightly better UI design sensibilities even if you don’t give instructions. GPT on the other hand needs instructions otherwise every UI element will be so huge you need to double scroll to find buttons.
What I like most about gpt coding models is how predictable of a lever that thinking effort is.
xhigh will gather all the necessary context; low gathers the minimum necessary context.
That doesn't work as well for me with Opus. Even at max effort it'll overlook files necessary to understand the implementation. It's really annoying when you point that out and get hit with a "you're absolutely right".
Codex isn’t the greatest one shot horse in the race but, once you figure out how to harness it, it’s hard to go back to other models.
GPT5.4 with any effort level is scary when you combine it with tricks like symbolic recursion. I actually had to reduce the effort level to get the model to stop trying to one shot everything. I struggled to come up with BS test cases it couldn't dunk in some clever way. Turning down the reasoning effort made it explore the space better.
Yup, I've mentioned this in another thread: I got GPT 5.4 xhigh to improve the throughput of a very complex, non-typical CUDA kernel by 20x. This was through a combination of architecture changes followed by low-level optimizations, and it did the profiling all by itself. I was extremely impressed.
> right now, at least: the reality is that if frontier LLMs are not democratized, we will end paying like a house rent to a few providers
This part of your comment has slipped through but is very worrying for me. I _think_ we're passing the point now where programmers are accepting that LLMs writing code are the real deal. Lots of antagonism along the way, but the reality is these things are good, and getting better all the time.
What this means in reality, in my opinion, is that if you're an independent programmer, or smaller company trying to compete with others to earn a living, you're almost certainly going to have to use coding agents, which means your competitiveness in the market is going to be gated by the big model providers until we have more options. If you somehow get banned from a few of them, which seems like it can happen through no fault of your own, you're going to be seriously negatively impacted.
That's quite worrying having gatekeepers to our industry where it was previously in our own hands.
> I ran many brutal tests, including reconstructing for QEMU the SCSI controller (not longer accessible) of a SVSY UNIX of the early 90s used in a 386.
QEMU is one project that, for a variety of reasons, has said that for now it simply refuses any code written by an LLM. Was this just a test? Or just for your own use? Or do you think QEMU should accept that patch?
Really great to see this whole thread after so many questioning looks from people on why I use codex instead of Claude which generally doesn't work for me.
I never thought it was about particular usefulness for low level vs high level but it tracks with my general low level work.
1000%. I have been running claude's work through codex for about a week now and it's insane the number of mistakes it catches. Not really sure why I've been doing this, just interesting to watch I guess.
Not to mention a billion times more usage than you get with claude, dollar for dollar.
The $100/mo giving access to GPT Pro (with reduced usage) is a nice counter to the just teased Claude Mythos. But GPT 5.4 xhigh being able to perform that kind of low-level reconstruction task is very impressive already.
I completely agree with you on both the technical and ethical reasoning.
Thank you for speaking out. I think it's important that reputable engineers like you do so. The Claude gang gaslighting is unhinged right now. It would be none of my concern but I have to deal with it in the real world - my customers are susceptible to these memes. I'm sure others have to deal with similar IRL consequences, too.
It's interesting seeing all the ChatGPT users in this thread, knowing what we know about OpenAI. Either they don't care about what OpenAI does, don't know their reputation, or feel like their use is too insignificant to matter.
5.4, in my own testing, was almost always ahead of Opus 4.6 for reviews and planning. I'm on the Plus plan on OpenAI, so I couldn't test it that deeply. Could anyone with more experience on both chime in? Pros/cons compared to Opus? I'm invested in the Claude ecosystem, but the recent quality and session-limit decreases have me on the edge.
For my money, on the code side at least, GitHub Copilot on VSCode is still the most cost effective option, 10 bucks for 300 requests gets me all I need, especially when I use OpenAI models which are counted as 1x vs Opus which is 3x. I've stopped using all other tools like Claude Code etc.
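As a rough sketch of the request math this comment relies on (the $10/300-request figures and the 1x/3x multipliers are the commenter's claims, not verified here):

```python
# Commenter's stated plan terms (assumptions, taken from the comment above).
PLAN_PRICE = 10.0        # dollars per month
INCLUDED_REQUESTS = 300  # premium requests included

def effective_requests(multiplier: float) -> int:
    """How many calls you get when each call counts as `multiplier` requests."""
    return int(INCLUDED_REQUESTS / multiplier)

def cost_per_request(multiplier: float) -> float:
    """Effective dollar cost of one call at a given multiplier."""
    return PLAN_PRICE / effective_requests(multiplier)

print(effective_requests(1.0), round(cost_per_request(1.0), 4))  # 1x models
print(effective_requests(3.0), round(cost_per_request(3.0), 4))  # 3x models
```

Under these assumed numbers, a 1x model yields 300 calls a month while a 3x model yields only 100, i.e. triple the effective per-call cost.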
The title is misleading. The only thing they seem to have done was add a $100 plan identical to Claude's, which gives 5x usage of ChatGPT Plus. There is still a $200 plan that gives 20x usage.
That has me quite tempted. In general, I stay under the Plus limits, but I do watch my consumption. I could use /fast mode all of the time, with extra high reasoning, and use gpt-5.4-pro for especially complex tasks. It wasn't worth 10x the price to me before, but 5x is approachable.
GPT 5.4 high with fast mode in the Codex app is hands down the best way to do anything, coding or non-coding. If you haven't tried it, you're missing out. $100 well spent. Claude Code is too hyped up on HN.
They are actively exploiting Anthropic's compute shortages. On our team we're pushing for a more or less vanilla setup and for portability, since the best harness today might not be the best one in 6 months.
I wish these plans had a burst mode where I could set a default plan size and a max plan size, scale up automatically for a month if needed, and automatically drop back to my default plan at the next billing cycle.
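The wished-for policy is simple enough to sketch. This is a hypothetical illustration of the mechanism (no such feature exists in either provider's plans; tier numbers are made up):

```python
from dataclasses import dataclass

@dataclass
class BurstPlan:
    """Sketch of a 'burst mode' subscription: scale up within a cycle,
    snap back to the default tier when the next billing cycle starts."""
    default_tier: int  # e.g. 1 = base plan
    max_tier: int      # e.g. 5 = the 5x plan
    current_tier: int = 0

    def __post_init__(self) -> None:
        self.current_tier = self.default_tier

    def request_burst(self, needed_tier: int) -> int:
        # Scale up mid-cycle, never down, and never past the configured cap.
        self.current_tier = min(max(self.current_tier, needed_tier), self.max_tier)
        return self.current_tier

    def new_billing_cycle(self) -> int:
        # Automatic drop back to the default at the cycle boundary.
        self.current_tier = self.default_tier
        return self.current_tier
```

The key property is that bursting is monotone within a cycle (you never get silently downgraded mid-month) while the reset is tied to the billing boundary.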
It really feels like LLMs will mostly become tools for tech workers rather than the kind of civilization-level transformation sama has been peddling. Every single comment here seems to confirm the above.
For me it's not the price. It's the fact that they obviously read my prompts and may even use a derived version of my data for training. And since it has become very clear that sama lies most of the time, there's just no way I can trust this company in any way.
LE: Someone said this is how the tiers are now counted:
"Essentially, if old Plus is 1x, then the new limits are: Plus 0.3x, Pro $100 1.5x, Pro $200 6x (unchanged)."
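Taking the quoted multipliers at face value (they are hearsay from the comment, not official figures) and assuming the usual $20/$100/$200 prices, the usage-per-dollar comparison works out as:

```python
# (price in dollars, usage multiplier relative to old Plus = 1x);
# multipliers are the quoted hearsay above, prices are assumed.
tiers = {
    "Plus $20":  (20,  0.3),
    "Pro $100":  (100, 1.5),
    "Pro $200":  (200, 6.0),
}

for name, (price, multiplier) in tiers.items():
    print(f"{name}: {multiplier / price:.4f}x per dollar")
```

Under these numbers, Plus and the $100 Pro tier offer identical value per dollar (0.015x), while the $200 tier offers double that (0.03x), so the top tier would remain the best per-dollar deal.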
>Our existing $200 Pro tier still remains our highest usage option.
5x=$100 20x=$200
And that includes usage of the API with any agent without risking being banned. OpenAI is also very supportive of open source software.
I've been using GPT-5.4 with Swival (https://swival.dev) for a while, alongside local models, and it's absolutely fantastic.
https://snipboard.io/jmGKfM.jpg
It helps them cut the subsidization of tokens. Then they will release Pro x2, which could be the same as the old $200 but with fewer tokens.