Epoch confirms GPT-5.4 Pro solved a frontier math open problem (epoch.ai)

480 points by in-silico | 706 comments

[−] qnleigh 53d ago
I am kind of amazed at how many commenters respond to this result by confidently asserting that LLMs will never generate 'truly novel' ideas or problem solutions.

> AI is a remixer; it remixes all known ideas together. It won't come up with new ideas

> it's not because the model is figuring out something new

> LLMs will NEVER be able to do that, because it doesn't exist

It's not enough to say 'it will never be able to do X because it's not in the training data,' because we have countless counterexamples to this statement (e.g. 167,383 * 426,397 = 71,371,609,051, or the above announcement). You need to say why it can do some novel tasks but could never do others. And it should be clear why this post or others like it don't contradict your argument.
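
A quick sanity check of that arithmetic, in Python, which uses exact arbitrary-precision integers:

    # Verify the multiplication example quoted above.
    assert 167_383 * 426_397 == 71_371_609_051
    print(167_383 * 426_397)  # 71371609051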

If you have been making these kinds of arguments against LLMs and acknowledge that novelty lies on a continuum, I am really curious why you draw the line where you do. And most importantly, what evidence would change your mind?

[−] qnleigh 52d ago
I might as well answer my own question, because I do think there are some coherent arguments for fundamental LLM limitations:

1. LLMs are trained on human-quality data, so they will naturally learn to mimic our limitations. Their capabilities should saturate at human or maybe above-average human performance.

2. LLMs do not learn from experience. They might perform as well as most humans on certain tasks, but a human who works in a certain field/code base etc. for long enough will internalize the relevant information more deeply than an LLM.

However I'm increasingly doubtful that these arguments are actually correct. Here are some counterarguments:

1. It may be more efficient to just learn correct logical reasoning rather than to mimic every human foible. I stopped believing this argument when LLMs got a gold medal at the International Mathematical Olympiad.

2. LLMs alone may suffer from this limitation, but RL could change the story. People may find ways to add memory. Finally, it can't be ruled out that a very large, well-trained LLM could internalize new information as deeply as a human can. Maybe this is what's happening here:

https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...

[−] scoofy 52d ago
I studied philosophy, focusing on the analytic school and proto-computer science. LLMs are going to force many people to start developing a better understanding of what "Knowledge" and "Truth" are, especially the distinction between deductive and inductive knowledge.

Math is a perfect field for machine learning to thrive in because, theoretically, all the information ever needed is tied up in the axioms. In the empirical world, however, knowledge only moves at the speed of experimentation, which is an entirely different framework and much, much slower, even if there is some room to catch up on previously recorded experimental outcomes.

Having a focus in philosophy of language is something I genuinely never thought would be useful. It’s really been helpful with LLMs, but probably not in the way most people think. I’d say that curious folks should all be reading Quine, Wittgenstein’s Investigations, and probably Austin.

[−] jhanschoo 52d ago
I think we may have similar perspectives. Regarding empirical knowledge, consider the case where the knowledge concerns chaotic systems. Characterize chaotic systems, at a minimum, as systems where slightly inaccurate observations of the past and present, while useful for predicting the near future, see their errors grow very quickly when used to predict a more distant future state. For such systems, prediction is indeed difficult.

There is one domain of knowledge I think you have yet to mention: fundamentally computationally hard problems. The ones that come to mind that are nevertheless of practical benefit are physics, materials, and fluid simulations, but there exist problems that are provably computationally difficult. It seems to me that with these systems, the chaotic nature means that even with one infinitely precise observation of a deterministic system, accessing a future state is difficult, even though once accessed, memorizing it seems comparatively trivial.

[−] qnleigh 52d ago
Where can I read about how LLMs have changed epistemology? Is there a field of philosophy that tries to define and understand 'intelligence'? That sounds very interesting.
[−] scoofy 52d ago
There is already philosophy of mind, but it was pretty young when I was in grad school, which was really at the dawn of deep learning algorithms.

I’d say the two most important topics here are philosophy of language (understanding meaning) and philosophy of science (understanding knowledge).

I’ve already mentioned the language philosophers in an edit above, but in philosophy of science I’d add Popper as extremely important here. The concept of negative knowledge as the foundation of empirical understanding seems entirely lost on people. The Black Swan by Nassim Taleb is a very good casual read on the subject.

[−] radio879 52d ago
Also, we can do thought experiments, simulations in our heads, that are often nearly as good as doing them for real, though that has limitations and isn't perfect. But it does often work. Einstein supposedly used to doze off on purpose in an awkward position so that something would hit his leg, or something like that, nudging him half awake so he could remember his half-dreaming state, which is where he discovered some things.
[−] username135 51d ago
Any source on Einstein's behavior? I'd love to read more.
[−] thaumasiotes 52d ago

> Math is a perfect field for machine learning to thrive because theoretically, all the information ever needed is tied up in the axioms.

Not really; the normal way that math progresses, just like everything else, is that you get some interesting results, and then you develop the theoretical framework. We didn't receive the axioms; we developed them from the results that we use them to prove.

[−] hnfong 52d ago

> distinction between deductive and inductive knowledge

There's also intuitive knowledge btw.

Anyway, the recent developments in AI make a lot of very interesting things practically possible. For example, our society is going to want a way to reliably tell whether something is AI generated, and a failure to find one pretty much settles the empirical part of the Turing test question. Alternatively, if we actually find something in humans that AI can't reliably mimic, that would be a huge finding. With millions of people wondering whether posts on social media are AI generated, we have inadvertently conducted the largest-scale Turing test ever.

The fact that AI seems to be able to (digitally) do anything we ask for is also very interesting. If humans are not bogged down by the small details or cost of implementation concerns, and we can just say what we want and get what we wished for (digitally), what level of creativity can we reach?

Also once we get the robots to do things in the physical space...

[−] porphyra 52d ago
There are ways to go beyond the human-quality-data limitation. AI can be trained on data of better-than-average-human quality, because many problems have easily verifiable solutions. For example, in theory, reinforcement learning with an automatic grader on competitive programming problems can lead to an LLM that is better than humans at them.
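
A minimal sketch of that idea, with the model reduced to a toy bandit and the grader to an exact check (purely an illustration, not any lab's actual setup):

    import random

    # Toy RL loop with an automatic grader: the reward comes from
    # verification, not human labels, so it isn't capped at human quality.
    def grade(answer: int) -> float:
        return 1.0 if answer == 4 else 0.0  # verifiable domain: exact check

    candidates = [3, 4, 5]       # toy "answers" the policy can emit
    weights = [1.0, 1.0, 1.0]    # stand-in for model parameters

    for _ in range(1000):
        i = random.choices(range(len(candidates)), weights)[0]
        weights[i] += grade(candidates[i])   # reinforce graded successes

    print(max(zip(weights, candidates)))     # the graded answer (4) dominates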

It's also possible that there are emergent capabilities. Perhaps a little obtuse, but you could say that humans are trained on human-quality data too, and yet brilliant scientists and creative minds rise above the rest of us.

[−] xpl 50d ago
> Their capabilities should saturate at human or maybe above-average human performance

LLMs do have superhuman reasoning speed and superhuman dedication. Speed is something you can scale, and at some point quantity can turn into quality. Much of the frontier work done by humans is just dedication, luck, and remixing other people's ideas ("standing on the shoulders of giants"), isn't it? All of this is exactly what you can scale by having restless hordes of fast-thinking agents, even if each of those agents is intellectually "just above average human".

[−] aspenmartin 51d ago

> 1. LLMs are trained on human-quality data, so they will naturally learn to mimic our limitations. Their capabilities should saturate at human or maybe above-average human performance.

Why oh why is this such a commonly held belief? RL in verifiable domains being the way around this is the entire point. It’s the same idea behind a system like AlphaGo: human data is used only to get to a starting point for RL, and RL will then take you to superhuman performance. I’m so confused why people miss this. The burden of proof is on people who claim that we will hit some sort of performance wall, because I know of absolutely zero mechanisms for that to happen in verifiable domains.

[−] oofbey 52d ago
The idea that they don’t learn from experience might be true in some limited sense, but ignores the reality of how LLMs are used. If you look at any advanced agentic coding system the instructions say to write down intermediate findings in files and refer to them. The LLM doesn’t have to learn. The harness around it allows it to. It’s like complaining that an internal combustion engine doesn’t have wheels to push it around.
[−] Yizahi 53d ago
LLMs can generate anything by design. But an LLM can't understand what it is generating, so the output may be true, it may be wrong, it may be novel, or it may be a known thing. The model doesn't discern between these; it just looks for the best statistical fit.

The core of the issue lies in our human language and our human assumptions. We humans have implicitly assigned the phrases "truly novel" and "solving an unsolved math problem" a certain meaning in our heads. Some of us, at least, think that truly novel means something genuinely new and important, something significant. Like, I don't know, finding a high-temperature superconductor formula or creating a new drug, something which involves real intelligent thinking and not randomizing possible solutions until one lands. But formally there can be a truly novel way to pack the most computer cables into a drawer, a truly novel way to tie shoelaces, or indeed a truly novel way to solve some arbitrary math equation involving enormous numbers. These are formally novel things, but we never really needed any of them, and so we relegated these "issues" to the deepest backlog possible. Using LLMs we can scour for solutions to many such problems, but those solutions are not that impressive in the first place.

[−] zamalek 52d ago
LLMs are notoriously terrible at multiplying large numbers: https://claude.ai/share/538f7dca-1c4e-4b51-b887-8eaaf7e6c7d3

> Let me calculate that. 729,278,429 × 2,969,842,939 = 2,165,878,555,365,498,631

Real answer is: https://www.wolframalpha.com/input?i=729278429*2969842939

> 2 165 842 392 930 662 831

Your example seems short enough to not pose a problem.
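
Easy to check with exact integer arithmetic, e.g. Python's bignums:

    a, b = 729_278_429, 2_969_842_939
    print(a * b)  # 2165842392930662831, matching WolframAlpha
    print(a * b == 2_165_878_555_365_498_631)  # False: the model's answer is off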

[−] LatencyKills 53d ago
I've been working on a utility that lets me "see through" app windows on macOS [1] (I was a dev on Apple's Xcode team and have a strong understanding of how to do this efficiently using private APIs).

I wondered how Claude Code would approach the problem. I fully expected it to do something most human engineers would do: brute-force with ScreenCaptureKit.

It almost instantly figured out that it didn't have to "see through" anything and (correctly) dismissed ScreenCaptureKit due to the performance overhead.

This obviously isn't a "frontier" type problem, but I was impressed that it came up with a novel solution.

[1]: https://imgur.com/a/gWTGGYa

[−] bluecalm 53d ago

>>AI is a remixer; it remixes all known ideas together. It won't come up with new ideas

I always found this argument very weak. There isn't that much that's truly new anyway; creativity is often about mixing old ideas, and computers can do that faster than humans if they have a good framework. Especially in something as simple as math, with its limited set of formal rules and easily verified results, I find the belief that computers won't beat humans at it very naive.

[−] energy123 53d ago

> 167,383 * 426,397 = 71,371,609,051 ... You need to say why it can do some novel tasks but could never do others.

Model interpretability gives us the answers. The reason LLMs can (almost) do new multiplication tasks is that they saw many multiplication problems in their training data, and it was cheaper to learn compressed/abstract multiplication strategies and encode them as circuits in the network than to memorize the times tables up to some large N. This gives them the ability to approximate multiplication problems they haven't seen before.
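
As a loose analogy (a toy program, not an actual circuit extracted from any model), the compression argument looks like this: a hundred memorized single-digit facts plus one general place-value rule cover every multiplication, far cheaper than a lookup table of all products:

    # Toy analogy for "compressed strategy" vs. "memorized table":
    # 100 single-digit facts plus a place-value rule handle any product.
    DIGIT_TABLE = {(i, j): i * j for i in range(10) for j in range(10)}

    def multiply(a: int, b: int) -> int:
        total = 0
        for i, da in enumerate(reversed(str(a))):
            for j, db in enumerate(reversed(str(b))):
                total += DIGIT_TABLE[int(da), int(db)] * 10 ** (i + j)
        return total

    assert multiply(167_383, 426_397) == 71_371_609_051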

[−] SequoiaHope 53d ago
Most inventions are an interpolation of three existing ideas. These systems are very good at that.
[−] PUSH_AX 53d ago
The hardest part about any creativity is hiding your influences
[−] tornikeo 53d ago
Beliefs are not rooted in facts. Beliefs are a part of you, and people aren't all that happy to say "this LLM is better than me"
[−] jacquesm 53d ago

> e.g. 167,383 * 426,397 = 71,371,609,051

They may be wrong, but so are you.

[−] staticassertion 53d ago
I think "novel" is ill defined here, perhaps. LLMs do appear to be poor general reasoners[0], and it's unclear if they'll improve here.

It would be unintuitive for them to be good at this, given that we know exactly how they're implemented - by looking at text and then building a statistical model to predict the next token. From this, if we wanted to commit to LLMs having generalizable knowledge, we'd have to assume something like "general reasoning is an emergent property of statistical token generation", which I'm not totally against but I think that's something that warrants a good deal of evidence.

A single math problem being solved just isn't rising to that level of evidence for me. I think it is more on you to:

1. Provide a theory for how LLMs can do things that seemingly go beyond expectations based on their implementation (for example, saying that certain properties of reasoning are emergent or reduce to statistical constructs).

2. Provide evidence that supports your theory and ideally cannot be just as well accounted for by another theory.

I'm not sure if an LLM will never generate "novel" content because I'm not sure that "novel" is well defined. If novel means "new", of course they generate new content. If novel means "impressive", well I'm certainly impressed. If "novel" means "does not follow directly from what they were trained on", well I'm still skeptical of that. Even in this case, are we sure that the LLM wasn't trained on previous published works, potentially informal comments on some forum, etc, that could have steered it towards this? Are we sure that the gap was so large? Do we truly have countless counterexamples? Obviously this math problem being solved is not a rigorous study - the authors of this don't even have access to the training data, we'd need quite a bit more than this to form assumptions.

I'm willing to take a position here if you make a good case for it. I'm absolutely not opposed to the idea that general reasoning can reduce to statistical token generation; it just strikes me as unintuitive, so I'm going to need to hear something that compels me.

[0] https://jamesfodor.com/2025/06/22/line-goes-up-large-languag...

[−] qsera 53d ago
It's like not trusting someone who got the highest score on an exam by memorizing the whole textbook to actually do the corresponding job.

Not very hard to understand.

[−] aaroninsf 52d ago
Ximm's Law applies ITT: every critique of AI assumes to some degree that contemporary implementations will not, or cannot, be improved upon.

Especially the lemmas:

- any statement about AI which uses the word "never" to preclude some feature from future realization is false.

- contemporary implementations have almost always already been improved upon, but are unevenly distributed.

[−] acchow 52d ago

> asserting that LLMs will never generate 'truly novel' ideas or problem solutions

I don't think I've had one of these my entire life. Truly novel ideas are exceptionally rare:

- Darwin's On the Origin of Species
- Gödel's incompleteness theorems
- Buddhist detachment

Can't think of many.

[−] drfloyd51 52d ago
People rarely create things that are wholly new.

Most created things are remixes of existing things.

Hallucinations are “something new”. And like most new things, useless. But the truth is the entire conversation is a hallucination. We just happen to agree that most of it is useful.

[−] cyanydeez 53d ago
When I read through what they're doing, it sure doesn't sound like it's generating something new as people typically think of it. In the link, they provide a very well-defined problem and the model just loops through it.

I think you're arguing semantics.

[−] veltas 53d ago
Do we know for a fact that LLMs aren't now configured to pass simple arithmetic like this to a simple calculator tool, to add the illusion of actual insight?
[−] ekjhgkejhgk 53d ago
Yes! I call these the "it's just a stochastic parrot" crowd.

Ironically, they are the stochastic parrots, because they're confidently repeating something that they read somewhere and haven't examined critically.

[−] _doctor_love 51d ago
> You need to say why it can do some novel tasks but could never do others.

This is actually quite a tall order. Reasoning about AI and making sense of what the LLMs are doing, and learning to think about it as technology, is a very difficult and very tricky problem.

You get into all kinds of weird things about a person’s outlook on life: personal philosophy, understanding of ontology and cosmology, and then whatever other headcanon they happen to be carrying around about how they think life works.

I know that might sound kind of poetic, but I really believe it’s true.

I am a great fan of Dr Richard Hamming and he gave a wonderful series of lectures on the topic. The book Learning to Learn has the full set of his lectures transcribed (highly recommend this book!).

But don't take my word for it, listen to Dr Hamming say it himself: https://www.youtube.com/watch?v=aq_PLEQ9YzI

"The biggest problem is your ego. The second biggest problem is your religion."

[−] virgildotcodes 53d ago
I don't know why I am still perpetually shocked that the default assumption is that humans are somehow unique.

It's this pervasive belief that underlies so much discussion around what it means to be intelligent. The null hypothesis goes out the window.

People constantly make comments like "well it's just trying a bunch of stuff until something works" and it seems that they do not pause for a moment to consider whether or not that also applies to humans.

If they do, they apply it in only the most restrictive way imaginable, some 2 dimensional caricature of reality, rather than considering all the ways that humans try and fail in all things throughout their lifetimes in the process of learning and discovery.

There's still this seeming belief in magic and human exceptionalism, deeply held, even in communities that otherwise tend to revolve around the sciences and the empirical.

[−] Validark 53d ago
I have long said that I would remain an AI doubter until AI could print out the answers to hard problems or ones requiring tons of innovation. Assuming this is verified to be correct (not by AI), I just became a believer. I would like to see a few more AI inventions to know for sure, but wow, it really is a new and exciting world. I really hope we use this intelligence resource to make the world better.
[−] alberth 53d ago
For those, like me, who find the prompt itself of interest …

> A full transcript of the original conversation with GPT-5.4 Pro can be found here [0] and GPT-5.4 Pro’s write-up from the end of that transcript can be found here [1].

[0] https://epoch.ai/files/open-problems/gpt-5-4-pro-hypergraph-...

[1] https://epoch.ai/files/open-problems/hypergraph-ramsey-gpt-5...

[−] johnfn 53d ago
I like to imagine that the number of consumed tokens before a solution is found is a proxy for how difficult a problem is, and it looks like Opus 4.6 consumed around 250k tokens. That means that a tricky React refactor I did earlier today at work was about half as hard as an open problem in mathematics! :)
[−] svara 53d ago
The capabilities of AI are determined by the cost function it's trained on.

That's a self-evident thing to say, but it's worth repeating, because there's this odd implicit notion sometimes that you train on some cost function, and then, poof, "intelligence", as if that was a mysterious other thing. Really, intelligence is minimizing a complex cost function. The leadership of the big AI companies sometimes imply something else when they talk of "generalization". But there is no mechanism to generate a model with capabilities beyond what is useful to minimize a specific cost function.

You can view the progress of AI as progress in coming up with smarter cost functions: Cleaner, larger datasets, pretraining, RLHF, RLVR.

Notably, exciting early progress in AI came in places where simple cost functions generate rich behavior (Chess, Go).

The recent impressive advances in AI are similar. Mathematics and coding are extremely structured, and properties of a coding or maths result can be verified using automatic techniques. You can set up a RLVR "game" for maths and coding. It thus seems very likely to me that this is where the big advances are going to come from in the short term.

However, it does not follow that maths ability on par with expert mathematicians will lead to superiority over human cognitive ability broadly. A lot of what humans do has social rewards which are not verifiable, or involves genuine Knightian uncertainty where a reward function cannot be built without actually operating independently in the world.

To be clear, none of the above is supposed to talk down past or future progress in AI; I'm just trying to be more nuanced about where I believe progress can be fast and where it's bound to be slower.

[−] EternalFury 53d ago
I am thinking there’s a large category of problems that can be solved by resampling existing proofs. It’s the kind of brute-force expedition a machine can attempt relentlessly where humans would go mad trying. It probably doesn’t really advance the field, but it can turn conjectures into theorems.
[−] pugio 53d ago
I've never yet been "that guy" on HN but... the title seems misleading. The actual title is "A Ramsey-style Problem on Hypergraphs" and a more descriptive title would be "All latest frontier models can solve a frontier math open problem". (It wasn't just GPT 5.4)

Super cool, of course.

[−] qnleigh 53d ago
Their 'Open Problems' page, linked below, gives some interesting context. They list 15 open problems in total, categorized as 'moderately interesting,' 'solid result,' 'major advance,' or 'breakthrough.' The solved problem is listed as 'moderately interesting,' which is presumably the easiest category. But it's notable that the problem was selected and posted here before it was solved. I wonder how long until the other 3 problems in this category are solved.

https://epoch.ai/frontiermath/open-problems

[−] zurfer 53d ago
"In this scaffold, several other models were able to solve the problem as well: Opus 4.6 (max), Gemini 3.1 Pro, and GPT-5.4 (xhigh)."

I find that very surprising. This problem seemed out of reach 3 months ago, but now all three frontier labs' models are able to solve it.

Is everybody distilling each others models? Companies sell the same data and RL environment to all big labs? Anybody more involved can share some rumors? :P

I do believe that AI can solve hard problems, but the fact that progress is so uniformly distributed within a narrow domain makes me a bit suspicious that there is a hidden factor. Like, did some "data worker" solve a problem like this, and it's now in the training data?

[−] didibus 52d ago
Someone has to explain to me exactly what is being implied here. Looking at the prompt:

    USER:
    don't search the internet. 
    This is a test to see how well you can craft non-trivial, novel and creative solutions given a "combinatorics" math problem. Provide a full solution to the problem.
Why not search the internet? Is this an open problem or not? If the solution can be found online, then it's an already-solved problem, no?

    USER:
    Take a look at this paper, which introduces the k_n construction: https://arxiv.org/abs/1908.10914
    Note that it's conjectured that we can do even better with the constant here. How far up can you push the constant?
How much does that paper help? It kind of seems like a pretty big hint.

And it sounds like the USER already knows the answer, from the way the prompts steer the model, so I'm really confused about what we mean by "open problem". I at first assumed a never-before-solved problem, but now I'm not sure.

[−] 6thbit 53d ago

> Subsequent to this solve, we finished developing our general scaffold for testing models on FrontierMath: Open Problems. In this scaffold, several other models were able to solve the problem as well: Opus 4.6 (max), Gemini 3.1 Pro, and GPT-5.4 (xhigh).

Interesting. What's that “scaffold”? A sort of unit-test framework for proofs?

[−] tombert 53d ago
I was trying to get Claude and Codex to write a proof of the Collatz conjecture in Isabelle, but annoyingly they didn't solve it, and I don't feel like I'm any closer than I was when I started. AI is useless!

In all seriousness, this is pretty cool. I suspect that there are a lot of theoretical math results that haven't been proven simply because of the "size" of the proof. An AI feedback loop into something like Isabelle or Lean does seem like it could end up opening up a lot of proofs.
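
A rough sketch of what such a feedback loop might look like (the model call is a stub; the one real dependency assumed is the Lean CLI, where `lean File.lean` exits 0 iff the file type-checks):

    import subprocess

    def generate_proof(feedback: str) -> str:
        # Stand-in for an LLM call; returns a trivially checkable file here.
        return "theorem two_eq_two : 2 = 2 := rfl\n"

    def proof_checks(src: str) -> bool:
        # Write the candidate and let the proof assistant's kernel judge it.
        with open("Candidate.lean", "w") as f:
            f.write(src)
        return subprocess.run(["lean", "Candidate.lean"]).returncode == 0

    feedback = ""
    for _ in range(10):
        candidate = generate_proof(feedback)
        if proof_checks(candidate):
            print("kernel accepted the proof")
            break
        feedback += "\nlast attempt was rejected; revise and retry"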

[−] pinkmuffinere 53d ago
As someone with only passing exposure to serious math, this section was by far the most interesting to me:

> The author assessed the problem as follows.

> [number of mathematicians familiar, number trying, how long an expert would take, how notable, etc]

How reliably can we know these things a priori? Are these mostly guesses? I don't mean to diminish the value of guesses; I'm curious how reliable these kinds of guesses are.

[−] trapatsas 53d ago
I feel like this single image perfectly sums up the entire thread here: https://trapatsas.eu/sites/llm-predictions/
[−] sigbottle 53d ago
I feel like, reading some of these comments, some people need to go and read the history of ideas and philosophy (which is easier today than ever before with the help of LLMs!)

It's like I'm reading 17th-18th century debates rehashing the same arguments between rationalists and empiricists, lol. Maybe we're due for a 21st century Kant.