Really fascinating how this works; it's basically context-aware decoding. From the paper:
> Code interleaves fork positions, where several continuations are genuinely plausible and may correspond to different solution approaches, with lock positions, where syntax and semantics leave little ambiguity but a low-probability distractor tail still remains… The best global decoding setting is therefore necessarily a compromise; we call this tension the precision-exploration conflict.
In other words, just like us, the model needs to shift from "exploration" in "fork" mode (divergent thinking to produce a creative solution) to "precision" in "lock" mode (producing syntactically correct code).
What this paper shows is that their simple technique (SSD) can improve the ranking of optimal tokens in both lock and fork positions, meaning the model is more likely to explore when it should be exploring, and more likely to be precise when it needs to be.
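To make the fork/lock distinction concrete, here's a toy sketch (the entropy heuristic and threshold are my own illustration, not the paper's mechanism): classify each position by the entropy of the next-token distribution.

```python
import numpy as np

def classify_position(logits, entropy_threshold=1.0):
    """Toy heuristic: low next-token entropy ~ 'lock' (one continuation dominates),
    high entropy ~ 'fork' (several continuations are genuinely plausible).
    The threshold is arbitrary, chosen only for this illustration."""
    logits = np.asarray(logits, dtype=np.float64)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    entropy = -(p * np.log(p + 1e-12)).sum()   # in nats
    return "lock" if entropy < entropy_threshold else "fork"

# A closing paren after `print("hi"` is nearly forced (lock); the first token
# of a new function body is wide open (fork).
print(classify_position([9.0, 1.0, 0.5, 0.2]))   # -> lock
print(classify_position([2.0, 1.9, 1.8, 1.7]))   # -> fork
```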
I love that we're still learning the emergent properties of LLMs!
> I love that we're still learning the emergent properties of LLMs!
TBH, this is (very much my opinion btw) the least surprising thing. LLMs (and especially their emergent properties) are still black boxes. Humans have been studying the human brain for millennia, and we are barely better at predicting how humans work (or, e.g., to what extent free will is a thing). Hell, the emergent properties of traffic were not understood or given proper attention, even though a researcher, as a driver, knows what a driver does. Right now, on the front page, is this post:
> 14. Claude Code Found a Linux Vulnerability Hidden for 23 Years (mtlynch.io)
So it's pretty cool we're learning new things about LLMs, sure, but it's barely surprising that we're still learning it.
(Sorry, mini grumpy man rant over. I just wish we knew more of the world but I know that's not realistic.)
I've always thought that it is kinda weird that we spend exactly the same amount of compute to calculate both "fork" tokens and "lock" tokens.
I think that with grammar-aware sampling / constrained decoding [0][1] it is possible to sometimes skip calling the model altogether, when the grammar allows only one token, and just insert that token. But I don't think any of the current, widely used model/harness combinations use it, and it only skips inference in rare edge cases.
I wonder if there is a more general solution that can make models spend more compute on making important choices, while making generation of the "obvious" tokens cheaper and faster.
[0] https://github.com/ggml-org/llama.cpp/blob/master/grammars/R...
[1] https://developers.redhat.com/articles/2025/06/03/structured...
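A minimal sketch of that skip-the-forward-pass idea; `allowed_next_tokens` (a grammar oracle) and `model_logits` are hypothetical stand-ins, not a real constrained-decoding API.

```python
def constrained_decode(prompt_ids, allowed_next_tokens, model_logits, eos_id, max_new=256):
    """allowed_next_tokens(ids) -> set of token ids the grammar permits next;
    model_logits(ids) -> sequence of logits over the vocabulary (hypothetical stubs)."""
    ids = list(prompt_ids)
    for _ in range(max_new):
        allowed = allowed_next_tokens(ids)
        if len(allowed) == 1:
            tok = next(iter(allowed))           # grammar forces the token: skip the model call
        else:
            logits = model_logits(ids)          # normal inference at ambiguous positions
            tok = max(allowed, key=lambda t: logits[t])   # greedy over the allowed set; could sample
        ids.append(tok)
        if tok == eos_id:
            break
    return ids
```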
Another example of the mindf@#$ these systems are: I was doing some fine-tuning on a small model to take data fields and make a sentence out of them. I was running into mode collapse (basically, when the model over-simplifies and always outputs the same thing).
I got unstuck by randomizing the field order for each row?!? That was at training time, and now I'm thinking I should do the same at inference time...
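For what it's worth, the trick looks roughly like this (illustrative serializer, not the actual pipeline):

```python
import random

def row_to_text(row: dict) -> str:
    """Serialize a data row as 'field: value' lines in a fresh random order for
    every example, so the model can't latch onto one fixed field template."""
    fields = list(row.items())
    random.shuffle(fields)
    return "\n".join(f"{k}: {v}" for k, v in fields)

print(row_to_text({"name": "Ada", "city": "London", "age": 36}))
```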
Seems like this is true not just for code but for all generated content? For code it's more well-defined, but the fork/lock mechanism applies to a lot more problem domains.
“In other words, just like us, the model needs to shift from "exploration" in "fork" mode (divergent thinking to produce a creative solution) to "precision" in "lock" mode (producing syntactically correct code).”
I’d be very cautious of the phrase 'just like us'. Not only can anthropomorphism be misleading and make us see things where none exist, it can also befuddle us, especially when we don’t know much about ourselves.
Apparently a key part of this is not just to use the combination of high temperature (to boost fork diversity) and top-k (to truncate unwanted diversity at lock positions) sampling, but rather to use these settings to first generate a fine-tuning dataset and then train on that. The fine-tuning lets the model adapt its weights to the new skewed distribution, which sounds a bit like an annealing process.
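For concreteness, a rough sketch of that two-phase recipe; the model name is a stand-in and the sampling values are the ones mentioned elsewhere in this thread, not necessarily the paper's exact settings.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Coder-1.5B-Instruct"   # placeholder; the paper works with larger models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def sample_solutions(prompts, n_per_prompt=4):
    """Phase 1: sample raw, unverified solutions with high temperature plus truncation."""
    rows = []
    for p in prompts:
        inputs = tok(p, return_tensors="pt")
        out = model.generate(
            **inputs,
            do_sample=True,
            temperature=1.6, top_k=20, top_p=0.8,   # exploratory but truncated sampling
            num_return_sequences=n_per_prompt,
            max_new_tokens=512,
        )
        rows += [{"prompt": p, "completion": tok.decode(o, skip_special_tokens=True)} for o in out]
    return rows

# Phase 2: ordinary supervised fine-tuning (plain cross-entropy on the sampled text),
# e.g. with any standard SFT script; nothing verification- or reward-based.
```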
It does raise some questions:
1) Is this always a win for coding? The top-k truncation is also going to limit "fork" diversity. Maybe there is a better way to reshape the output probability distribution that sharpens the cutoff where it is already sharp (locks), without affecting it so much where it is more gradual (forks)?
2) Wouldn't this also benefit generation for other non-coding domains, which are generally also going to contain both "fork" and "lock" positions?
One relevant thing is that these forks are unnaturally narrow in all models, and rather resemble locks (not quite but close). From multiple possible continuations models tend to prefer just a couple, i.e. the model is a lot less random than it should be. That's why you're seeing annoying slop in writing and instantly recognizable color schemes in vibecoded sites. Lack of diversity probably limits the usefulness of this method as well.
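One crude way to see that narrowness: sample the same prompt N times and count how many distinct continuations you actually get (a toy check, not a rigorous diversity metric).

```python
from collections import Counter

def sample_diversity(samples: list[str]) -> float:
    """Fraction of distinct completions among N samples of the same prompt.
    A model that keeps collapsing onto one or two continuations stays low."""
    return len(Counter(samples)) / len(samples)

# 10 samples, only 2 distinct continuations -> 0.2
print(sample_diversity(["for i in range(n):"] * 8 + ["while i < n:"] * 2))
```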
>I love that we're still learning the emergent properties of LLMs!
Could we not get the same with EAFT? Maybe that’s what it’s doing but definitely not the first to think “let’s lock in high probability solutions”
In Nemotron, the high-perplexity solutions are selected for RL; in VLM training, a few people are looking at the entropy distributions of the training set; etc.
After TurboQuant and Gemma 4, I came across the following video [0] running Gemma on a local machine at 50 tokens/second.
That already looks like Sonnet 3.x and 4 level capabilities to me, where the model in question (Gemma 4) sets up a whole Python project with a UI and installs Python libraries using uv, etc.
Add this Simple Self-Distillation to the picture, and by 2028 I see cheaper coding-model providers with much more generous usage limits, and power users mostly running their own models anyway.
Anyone using these models as "non-deterministic transpilers" from natural language to code (experienced engineers who can write code themselves) would probably not be paying any AI providers.
[0] https://www.youtube.com/watch?v=-_hC-C_Drcw
It seems that self-distillation is the way to go for LLMs.
Self-distillation was already shown to be very efficient and effective back in January this year by an MIT and ETH team in their Self-Distillation Fine-Tuning (SDFT) LLM system [1][2].
That work is also this paper's closest competitor, named On-Policy Self-Distillation in the comparison table.
I hope they keep the original work's real name, Self-Distillation Fine-Tuning (SDFT). Imagine a later paper citing this very paper as "cross-entropy self-distillation" instead of its own given name, Simple Self-Distillation (SSD). Although, admittedly, it's a lousy name that collides with the common SSD nomenclature for solid-state drives, as others have rightly pointed out.
I think they should have given proper credit to this earlier seminal work on SDFT, but apparently they just list it as one of the systems in their benchmark without explaining much of the connection and lineage, which is a big thing in research publication.
[1] Self-Distillation Enables Continual Learning: https://arxiv.org/abs/2601.19897
[2] Self-Distillation Enables Continual Learning: https://self-distillation.github.io/SDFT.html
Incredible, will translate to better coding models in the near future.
We really need to develop better tools to understand what's happening inside these NNs. Working with high-D spaces is not something we're good at, and we're basically throwing stuff at it and seeing if it sticks.
Their explanation for why their idea (SSD) might work - precision-exploration conflict hypothesis - is something adaptive decoding also tries to solve.
Haven't read the paper yet, but it is interesting how seemingly simple many breakthroughs in ML are. Even transformers are like that. Maybe it's hindsight bias.
I suppose we just don't have a deeper underlying theory to lean on and help us 'design' anything.
Maybe not the thing I should be focusing on, but I was surprised this paper came from Apple. I was under the impression that Apple's AI/LLM research was far behind the curve. I get that research is a rising-tide-lifts-all-boats situation; I just thought I had seen lots of negative news about Apple's progress on this front, and heuristically haven't seen many (any?) Apple research papers make it to the front page of Hacker News. Wondering if anyone more familiar with Apple AI research could comment on this?
> Our method, simple self-distillation (SSD), is embarrassingly simple: sample solutions from the base model with specified temperature and truncation, then fine-tune on those raw, unverified samples via standard cross-entropy loss.
So you prompt the base model for an answer and then rerun the prompt with the answer from the first run?
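As I read the quote, the sampled text simply becomes SFT data, and "standard cross-entropy loss" is ordinary next-token prediction on it. A minimal sketch, assuming a Hugging Face-style causal LM whose forward pass returns .logits:

```python
import torch.nn.functional as F

def sft_loss(model, input_ids):
    """input_ids: (batch, seq_len) token ids of prompt + sampled completion.
    A real SFT script would usually mask the prompt tokens and only score the completion."""
    logits = model(input_ids).logits              # (batch, seq_len, vocab)
    shift_logits = logits[:, :-1, :]              # predict token t+1 from positions <= t
    shift_labels = input_ids[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```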
This is the "Factors" Bonanza in finance all over again. You get a generally useful model, then you over-fit it to some criteria and announce advancement in the field, then it performs worse in real life. New infinite academic article glitch just dropped boys!
So... it's like a golfer who hits thousands of balls into an open field without ever once aiming for a hole. The relentless repetition flawlessly locks in their foundational muscle memory and basic swing mechanics, so when they finally step up to a real course, they don't have to waste a single thought on how to hold the club. Their basic swing is completely automatic - they can confidently take the creative, high-risk shot required to actually sink a hole-in-one.
It’s an interesting claim, and the reported benchmark gains are large, but it is still an April 1, 2026 arXiv preprint, so I’d treat it as promising rather than settled.
> sample solutions from the model with certain temperature and truncation configurations, then fine-tune on those samples with standard supervised fine-tuning
It’s all moonspeak to me. I tried reading other comments that explain this and they all sounded different or contradictory. I’ve studied ML as a hobby years ago but this was before the LLM explosion. Guess I need to start over again?
I’d like to understand AI research better and I recall some posts a while back where someone collected all the key papers that one should read, but I don’t remember enough to be able to find it. Does anyone know what I’m talking about and could link me to that post?
One sentence summary: We fine-tuned a general-purpose model to produce valid benchmark code results and it got better at producing benchmark code results; we didn't bother to evaluate it on anything the model used to be good at.
This was a really interesting paper, but there's one massive thing they didn't try: inference-time temperature changes based on the fork/lock distinction.
Maybe I'll try that myself, because it feels like it could be a great source of improvements. It would be really useful to see adaptive per-token sampling as an additional decode-only baseline.
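A decode-only baseline along those lines could be as simple as switching temperature on per-token entropy; the rule and numbers below are entirely hypothetical, not from the paper.

```python
import numpy as np

def adaptive_sample(logits, low_t=0.2, high_t=1.2, entropy_split=1.0):
    """Re-sample with a low temperature at 'lock'-like positions (low entropy)
    and a higher one at 'fork'-like positions (high entropy)."""
    logits = np.asarray(logits, dtype=np.float64)
    p = np.exp(logits - logits.max()); p /= p.sum()
    entropy = -(p * np.log(p + 1e-12)).sum()
    t = low_t if entropy < entropy_split else high_t
    q = np.exp(logits / t - (logits / t).max()); q /= q.sum()
    return int(np.random.choice(len(q), p=q))
```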
I'm working on a tool to determine which portions of an LLM process can be optimized, how to measure that optimization, and whether it's optimizable at all. The shaping pattern they talk about here is directly relevant: it makes a whole lot more processes potentially optimizable by looking at the pattern itself rather than just whether the metrics go up or down.
Can anyone help clarify these doubts? I didn't see any information about how different the test/benchmark set is from the training set. It feels like an important gap to leave unfilled in an ML paper. What if there is overlap between the problems in the test set and the training set? What is the decontamination strategy when going from LCBv5 to LCBv6?
How is this not equivalent to training the model on the test data set? Yes it performs better at generating code for the target problems, but seemingly by becoming more tuned to the specific context of those problems (“context aware”), which suggests to me it would not generalise to real-world usage?
If you sample from the base model with T=1.6, top_k=20, top_p=0.8, i.e., the decode settings used for the distillation's ground truth, does it match the SSD'd model + some decoding, performance-wise?
Their sweep is missing this and only covers "standard" decoding settings.
"SSD improves Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench v6"
I know virtually nothing about this area but my naive take is that something that means it still only passes tests around half the time doesn't seem like a particularly big jump forwards. What am I missing?
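For reference, pass@1 is just the fraction of single sampled solutions that pass all of a problem's hidden tests, so 42.4% -> 55.3% is roughly a 30% relative improvement on unseen problems. The standard unbiased pass@k estimator (from the Codex paper) looks like this:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n samples per problem, c of them correct; probability at least one of k passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=4, k=1))   # 0.4, i.e. 40% pass@1 for this problem
```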
This is the natural conclusion of what was really claimed about model collapse, and indeed natural evolution. Making an imperfect copy while invoking a selection mechanism is evolution.
Some of the claims about models training on their own data, in their enthusiasm to frame it as a failure, went further to suggest that it magnified biases. I had my doubts about those conclusions. If it were true, it would be a much greater breakthrough, because the ability to magnify a property represents a way to measure a weak version of that property. That would mean they had found a way to provide a training signal to avoid bias. It would be great if that's what they did, but I suspect there would have been more news about it.
Perhaps this paper will put to rest the notion that AI output is useless as training data. It has only ever been the case that it was useless as an indiscriminate source of data.
I'm excited for the long tail of techniques like this that are going to be discovered over the next several decades that's going to make this technology eventually run on a toaster!
Most codebases don't have traces to train on. If you use rlm-workflow you will build up rich traceability in the form of requirements, plans, and implementation artifacts, along with worktree diffs. With these, you can then use self-distillation on models or use autoagent to improve your harness. https://github.com/doubleuuser/rlm-workflow
I've been doing something even better than this for years using only Mistral 7B.
My locally running Mistral 7B is 100x better at modern JavaScript than any model on the market, mainly just from RAG on my own code samples.
That's basically what they are describing with "post-training"; the TL;DR is that code, especially of a certain style, is vastly simpler than written language.
You really don't need a huge model or data centers, etc.; you just need a small but good model like Mistral 7B and literally a few good samples.
But you guys keep doing you lol. A bunch of non-devs trying to solve code is pretty funny to watch.
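A stripped-down sketch of that retrieve-then-prompt setup; the overlap scoring and snippets are placeholders, not the setup described above.

```python
def tokens(text: str) -> set[str]:
    return set(text.lower().split())

def retrieve(query: str, snippets: list[str], k: int = 3) -> list[str]:
    """Rank stored code samples by crude token overlap with the query and
    return the top-k to prepend to the prompt as style/context examples."""
    q = tokens(query)
    return sorted(snippets, key=lambda s: len(q & tokens(s)), reverse=True)[:k]

snippets = ["const items = await fetchItems();", "export function debounce(fn, ms) {}"]
context = "\n".join(retrieve("write a debounce helper in modern JavaScript", snippets))
prompt = f"Here are examples of my house style:\n{context}\n\nTask: write a debounce helper."
print(prompt)
```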
Another potentially usable trick is the following: based on the observation that a longer token budget improves model performance, one could generate solutions using a lot of thinking budget, then ask the LLM to turn the trace into a more compact one, and later SFT on that. That said, I have the feeling the result of the paper will likely be hard to apply in practice without affecting other capabilities, and/or not superior to other techniques that provide similar improvements in sampling.
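A sketch of that compress-then-SFT pipeline, with a hypothetical llm(prompt, max_tokens) callable standing in for whatever model or API is used; none of this is from the paper.

```python
from typing import Callable

def build_compact_sft_rows(problems: list[str], llm: Callable[[str, int], str]) -> list[dict]:
    rows = []
    for p in problems:
        long_trace = llm(f"Solve step by step:\n{p}", 8192)       # generous thinking budget
        compact = llm(
            "Rewrite this reasoning trace so it keeps every essential step "
            f"but is as short as possible:\n{long_trace}",
            1024,                                                 # much tighter budget for the rewrite
        )
        rows.append({"prompt": p, "completion": compact})         # then SFT on these rows
    return rows
```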
Very cool. An evolutionary biologist would say: Welcome to the party!
Mutation rate modulation is the AI engineers’ heat. And selection does the trimming of the outliers.
Some more serious biomorphic thinking and we may get to the next big insight courtesy of 3+ billion years of evolution -- evolution that enabled a great ape species to write a paper like this and build LLMs like Gemma 4 that totally rock on a 3.5 pound MacBook Pro M5 Max with 128 GB of RAM.
>I love that we're still learning the emergent properties of LLMs!
There are tons of low-hanging fruits there.
> In other words, just like us
I think you are implying a reverse causation. They used a metaphor from us.
https://ai.meta.com/research/publications/adaptive-decoding-...
(Not fine tuning, but interesting none the less. If a model can so easily find a more elegant solution, why didn't it pick that in the first place?)
It's the first thing anyone would think of (like a self-hosted compiler) but everything I've read said "it doesn't work."
This feels eerily similar to sleep consolidation or synaptic pruning