I've always said this, but AI will win a Fields Medal before being able to manage a McDonald's.
Math seems difficult to us because it's like using a hammer (the brain) to drive in a screw (math).

LLMs are discovering a lot of new math because they are great at low-depth, high-breadth situations.
I predict that in the future people will ditch LLMs in favor of AlphaGo style RL done on Lean syntax trees. These should be able to think on much larger timescales.
Any professional mathematician will tell you that their arsenal is ~ 10 tricks. If we can codify those tricks as latent vectors it's GG

Ergo these tricks are latent vectors in our brain. We use analogies like geometry in order to use Algebraic Geometry to solve problems in Number Theory.

An AI trained on Lean syntax trees might develop its own weird versions of intuition that might actually properly contain ours.

If this sounds far-fetched, look at Chess. I wonder if anyone has dug into Stockfish using mechanistic interpretability.

https://arxiv.org/abs/2504.13837

[1] https://www.vice.com/en/article/a-human-amateur-beat-a-top-g...
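To make that concrete, here is a rough sketch of what "AlphaGo-style RL on Lean syntax trees" could reduce to at inference time: a policy/value net guiding search over tactic states. Everything below is hypothetical scaffolding -- run_tactic and policy_value are toy stubs, not a real Lean API (a real bridge would go through something like LeanDojo):

    import heapq

    TACTICS = ["intro h", "simp", "ring", "linarith", "exact h"]  # toy action set

    def run_tactic(state, tactic):
        """Stub for a Lean bridge: new goal state, 'done' on success, None on failure."""
        if state == "|- P -> P" and tactic == "intro h":
            return "h : P |- P"
        if state == "h : P |- P" and tactic == "exact h":
            return "done"
        return None

    def policy_value(state):
        """Stub for the learned net: uniform tactic priors, flat value estimate."""
        return {t: 1.0 / len(TACTICS) for t in TACTICS}, 0.5

    def best_first_search(root, budget=100):
        # AlphaZero-style guidance collapsed to a priority queue for clarity:
        # always expand the state with the highest (prior * value) score.
        frontier = [(-1.0, root, [])]
        while frontier and budget > 0:
            budget -= 1
            _, state, path = heapq.heappop(frontier)
            priors, value = policy_value(state)
            for tactic, prior in priors.items():
                nxt = run_tactic(state, tactic)
                if nxt == "done":
                    return path + [tactic]
                if nxt is not None:
                    heapq.heappush(frontier, (-prior * value, nxt, path + [tactic]))
        return None

    print(best_first_search("|- P -> P"))  # -> ['intro h', 'exact h']

The open question is whether the learned priors would end up encoding the ~10 human tricks or something stranger.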
Some DeepMind researchers used mechanistic interpretability techniques to find concepts in AlphaZero and teach them to human chess Grandmasters: https://www.pnas.org/doi/10.1073/pnas.2406675122
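For anyone who wants the one-line version of the technique: the workhorse in that line of work is the linear concept probe -- fit a direction in the network's activations that predicts a human-legible label, then study (or teach) that direction. A toy illustration on synthetic data, not their actual setup:

    import numpy as np

    rng = np.random.default_rng(0)
    acts = rng.normal(size=(500, 64))        # stand-in for a layer's activations
    concept_dir = rng.normal(size=64)        # "ground truth" concept direction
    labels = (acts @ concept_dir > 0).astype(float)  # stand-in concept labels

    # Least-squares probe: the recovered weights should align with concept_dir.
    w, *_ = np.linalg.lstsq(acts, labels - 0.5, rcond=None)
    cos = w @ concept_dir / (np.linalg.norm(w) * np.linalg.norm(concept_dir))
    print(f"cosine(probe, concept) ~ {cos:.2f}")  # high on this synthetic setup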
This argument, that LLMs can develop new crazy strategies using RLVR on math problems (like what happened with Chess), turns out to be false without a serious paradigm shift. Essentially, the search space is far too large, and the model will need help to explore better, probably with human feedback.
Yes but "the search space is too large" is something that has been said about innumerable AI-problems that were then solved. So it's not unreasonable that one doubts the merit of the statement when it's said for the umpteenth time.
I should have been more specific then. The problem isn't that the search space is too large to explore. The problem is that the search space is so large that the training procedure actively prefers to restrict the search space to maximise short term rewards, regardless of hyperparameter selection. There is a tradeoff here that could be ignored in the case of chess, but not for general math problems.
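A toy version of that tradeoff (my illustration, not from any of the papers discussed): a two-armed bandit where one arm is a cheap trick that pays a little every time and the other is a deep proof line that pays big but rarely. Plain REINFORCE with no entropy bonus typically collapses onto the cheap arm, pruning the rare high-reward line out of the search space:

    import math, random

    random.seed(0)
    logits = [0.0, 0.0]   # arm 0: easy trick, arm 1: deep proof line
    LR = 0.5

    def softmax(z):
        m = max(z)
        e = [math.exp(v - m) for v in z]
        return [v / sum(e) for v in e]

    for step in range(2000):
        p = softmax(logits)
        arm = 0 if random.random() < p[0] else 1
        # Easy arm: small reward every time. Hard arm: big reward, 1% of the time.
        r = 0.1 if arm == 0 else (1.0 if random.random() < 0.01 else 0.0)
        for i in range(2):  # REINFORCE update, no baseline, no entropy bonus
            logits[i] += LR * r * ((1.0 if i == arm else 0.0) - p[i])

    p = softmax(logits)
    entropy = -sum(q * math.log(q + 1e-12) for q in p)
    print(f"P(deep line) = {p[1]:.4f}, policy entropy = {entropy:.4f}")

Hyperparameters mostly move when the collapse happens, not whether, which is the sense in which it isn't fixable by tuning alone.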
This is far from unsolvable. It just means that the "apply RL like AlphaGo" attitude is laughably naive. We need at least one more trick.

As you said, brute forcing the search space as the starting procedure would take way too long for the AI to build intuition. But if we could give it a million or so lemmas of human math, that would be a great starting point.
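One concrete shape that "starting point" could take (my framing, not the commenter's): premise selection over the human corpus, i.e. rank known lemmas against the current goal so search begins near human math instead of from nothing. Bag-of-words purely for illustration; a real system would embed the formal statements of a mathlib-scale corpus:

    from collections import Counter

    corpus = {
        "Nat.add_comm":  "a + b = b + a",
        "Nat.mul_comm":  "a * b = b * a",
        "Nat.add_assoc": "a + b + c = a + (b + c)",
    }

    def score(goal, stmt):
        g, s = Counter(goal.split()), Counter(stmt.split())
        overlap = sum((g & s).values())     # multiset token overlap
        return overlap / len(stmt.split())  # normalized by statement length

    goal = "x + y = y + x"
    ranked = sorted(corpus, key=lambda name: -score(goal, corpus[name]))
    print(ranked[0])  # -> Nat.add_comm, the best-overlapping premise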
I agree that LLMs are a bad fit for mathematical reasoning, but it's very hard for me to buy that humans are a better fit than a computational approach. Search will always beat our intuition.
Yes and no. I think we have vastly underestimated the extent of the search space for math problems. I also think we underestimate the degree to which our worldview influences the directions in which we attempt proofs. Problems are derived from constructions that we can relate to, often physically. Consequently, the technique in the solution often involves a construction that is similarly physical in its form. I think measure theory is a prime example of this, and it effectively unlocked solutions to a lot of long-standing statistical problems.
That linked article says it's about RLVR but then conflates other RL with it, and it doesn't address much of the core thinking in the paper it was partially responding to, published a month earlier [0], which laid out findings and theory reasonably well, including work that runs counter to the main criticism in the article you cited, i.e., performance at or above base models only being observed at low k.
That said, reachability and novel strategies are somewhat overlapping areas of consideration, and I don't see many ways in which RL in general, as mainly practiced, improves upon models' reachability. And even when it isn't clipping weights it's just too much of a black box approach.
But none of this takes away from the question of raw model capability on novel strategies; it only bears on what RL contributes.

[0] https://arxiv.org/pdf/2506.14245
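For readers outside this sub-debate, the k there is pass@k, and the standard unbiased estimator from n samples with c correct (Chen et al. 2021) is pass@k = 1 - C(n-c, k) / C(n, k). The numbers below are made up (not from either paper) and just show the shape of the claim: RL concentrates mass on problems it already solves and wins at small k, while the base model's broader coverage catches up once k is large:

    from math import comb

    def pass_at_k(n, c, k):
        # Probability at least one of k draws (from n samples, c correct) passes.
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    n = 200
    base = [3, 2]   # correct samples per problem: diffuse coverage
    rl = [40, 0]    # concentrated: problem 1 nearly mastered, problem 2 never hit
    for k in (1, 16, 128):
        avg = lambda cs: sum(pass_at_k(n, c, k) for c in cs) / len(cs)
        print(f"k={k:3d}  base={avg(base):.3f}  rl={avg(rl):.3f}")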
Why must it involve understanding? I feel like you’re operating under the assumption that functionalism is the “correct” philosophical framework without considering alternative views.
Even that is probably too much. It has no understanding of what "chess" is, or what a chess board is, or even what a game is. And yet it crushes every human with ease. It's pretty nuts haha.
As a professional mathematician, I would say that a good proof requires a very good representation of the problem, and then pulling out the tricks. The latter part is easy to get working with LLMs; they can do it already. It's the former part that still needs humans, and I'm perfectly fine with that.
> I've always said this, but AI will win a Fields Medal before being able to manage a McDonald's.
I love this and have a corollary saying: the last job to be automated will be QA.
This wave of technology has triggered more discussion about the types of knowledge work that exist than any other, and I think we will be sharper for it.
Are they actually producing new math? In the most recent ACM issue there was an article about testing AI against a math benchmark privately built by mathematicians, and what they found is that even though AI can solve some problems, it has never truly come up with something novel in mathematics; it is just good at drawing connections between existing research and putting a spin on it.
> I predict that in the future people will ditch LLMs in favor of AlphaGo style RL done on Lean syntax trees. These should be able to think on much larger timescales.
This is certainly my hope.
In my spare time, I'm slowly, very slowly, inching towards a prototype of something that could work like that.
I think this is mostly about existing legislation, not about technology.
In any other context than when your paycheck depends on it, you would probably not be following orders from a random manager. If your paycheck depended on following the instructions of an AI robot, the world might start to look pretty scary real soon.
Like so many things, the evolution of AI math will, I think, follow trajectories hinted at in the 90s by the all-time great sci-fi author Greg Egan. The nature of math won't change -- but the why of it definitely will. In Diaspora, Egan imagined a future AI civilization where "math discovery" -- by then perhaps accurately described as "mechanistic math discovery" -- is treated by society as a kind of salt mine: you can dig for arbitrarily long amounts of time and keep finding new nuggets. The nuggets themselves have a kind of pure value as mathematical objects, even if they might not have any knowable value outside the mines. Some personalities valued the nuggets for their own sake; others didn't, but recognized that nuggets with broader appeal were occasionally found in the mine.
Research institutes like those founded by Terence Tao in our current present feel like they will align to this future almost perfectly on a long enough timeline. Though on a shorter timeline, this area of research is almost certain to provide a ton of useful ways to advance our current AI systems: we are still at a point where literally anything that can generate new information that is "accurate" in some way -- like our current theorem-prover engines -- is an enormously valuable part of our still manually curated training loops.
Interesting but not surprising to me. Once a field expert guides the models, they will most likely reach a solution. The models are good at doing the tedious work for experts. On hard or complicated questions, the models often have blind spots.
There are people who think knowledge discovery is just a matter of parroting past behavior and trying things at random until something sticks. I don’t.
In the paper, they give part of their system prompt:
> * After EVERY exploreXX.py run, IMMEDIATELY update this file [plan.md] before doing anything else. * No exceptions. Do not start the next exploration until the previous one is documented here.
Is this known to improve performance for advanced problem solving? If so, why this specific prompt?
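I haven't seen an ablation isolating that exact rule, but a plausible mechanism is context hygiene: it forces a run -> summarize -> re-plan loop, so each exploration gets compressed into plan.md and later steps condition on distilled notes rather than the raw transcript. You could even enforce the rule harness-side; a sketch, with agent_document_run as a placeholder for the model writing its notes:

    import os, subprocess, sys

    PLAN = "plan.md"
    open(PLAN, "a").close()  # make sure the plan file exists

    def agent_document_run(script):
        # Placeholder: in the real system the LLM appends its findings here.
        with open(PLAN, "a") as f:
            f.write(f"\n## {script}\n- (model-written summary of the run)\n")

    scripts = sorted(f for f in os.listdir(".")
                     if f.startswith("explore") and f.endswith(".py"))
    for script in scripts:
        subprocess.run([sys.executable, script], check=True)
        size_before = os.path.getsize(PLAN)
        agent_document_run(script)
        if os.path.getsize(PLAN) <= size_before:  # "No exceptions."
            raise RuntimeError(f"{PLAN} not updated after {script}")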
Ramanujan is a good analogy for this situation. Theories could be right or wrong until there's a proof. Same with anything AI produces: there's always a "told you so" baked in with its response.
When I was younger, I remember a point of demarcation for me was learning the 4chan adage "trolls trolling trolls" and approaching all internet interactions with skepticism. And while I have been sure for a while that Reddit has succumbed to the "dead internet," this thread is another such moment for me: I can no longer recognize who is a bot and who has honest intentions.
> Any professional mathematician will tell you that their arsenal is ~ 10 tricks. If we can codify those tricks as latent vectors it's GG
And if we can train the systems to discover new tricks, whoa Nelly.
> AI will win a Fields Medal before being able to manage a McDonald's
Of course, because it takes multi-modal intelligence to manage a McDonald's. I.e., it requires human intelligence.
> I predict that in the future people will ditch LLMs in favor of AlphaGo style RL
Same for coding as well. LLMs might be the interface we use with other forms of AI, though.
How long will it take before they rob a bank?
If they do either of those things, will the results have been intentional from the simian's POV?