Tree Search Distillation for Language Models Using PPO (ayushtambde.com)

by at2005 10 comments 87 points

[−] supermdguy 63d ago

> One might note that MCTS uses more inference compute on a per-sample basis than GRPO: of course it performs better

This part confused me; it sounded like they were only doing the MCTS at train time and then using GRPO to distill the MCTS policy into the model weights. So wouldn't the model still have the same inference cost?

[−] at2005 62d ago
Ah, I meant that MCTS uses more inference-time compute (over GRPO) to produce a training sample.
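Rough picture of what I mean, with toy stand-ins (a counting reward and best-of-N standing in for the actual tree search; none of this is the post's code):

    import random

    # Toy stand-ins: a string "policy", a counting reward, and best-of-N in place
    # of the tree search, just to show where the extra compute sits.

    def policy_sample(prompt):
        return prompt + " " + "".join(random.choice("abc") for _ in range(6))

    def score(completion):
        return completion.count("a")     # toy reward

    def search(prompt, rollouts=16):
        # Train-time only: spend `rollouts` extra model calls per sample to get
        # a better completion than a single draw (the post uses MCTS here).
        return max((policy_sample(prompt) for _ in range(rollouts)), key=score)

    def make_training_batch(prompts, n_samples=4):
        # These (prompt, completion, reward) triples feed a GRPO-style update
        # (omitted) that bakes the search behaviour into the weights.
        batch = []
        for p in prompts:
            for _ in range(n_samples):
                c = search(p)
                batch.append((p, c, score(c)))
        return batch

    # Deployment just calls policy_sample(prompt), so per-query inference cost is
    # unchanged; the search cost was paid once, while generating training data.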
[−] qumpis 62d ago
I may never understand what harness means - it's used in so many contexts
[−] blamestross 62d ago
It's a thing that isn't part of the "subject", used with the subject, to manipulate the state of the subject to be closer to what we want.
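Toy example of what I mean (nothing to do with the article specifically):

    def subject(x):
        # The thing we actually care about.
        return x * 2

    def harness(inputs, expected):
        # The harness sits outside the subject: it sets up inputs, drives the
        # subject, and inspects/checks the resulting state from the outside.
        for x, want in zip(inputs, expected):
            got = subject(x)
            print(f"subject({x}) = {got}  {'ok' if got == want else 'MISMATCH'}")

    harness([1, 2, 3], [2, 4, 6])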
[−] natufunu 62d ago
Great post! I wonder why MCTS is not more popular as a test-time compute harness. Did you compare performance of MCTS (without distillation) against other methods (e.g. best-of-N) with the same compute budget?
[−] at2005 61d ago
I didn't compare with the harness (focused on distillation) but the original ToT paper has a section on it: https://arxiv.org/abs/2305.10601
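Roughly, though, the difference in how a fixed budget gets spent looks like this (toy sequences and a crude greedy tree, not MCTS proper and not a real benchmark):

    import random

    # Two ways to spend the same rollout budget at test time. Sequences are
    # strings over {a, b}; reward = number of 'a's.

    LENGTH, BUDGET = 8, 32

    def extend(prefix):
        return prefix + random.choice("ab")   # one "token" of model compute

    def score(seq):
        return seq.count("a")

    def best_of_n(budget=BUDGET):
        # Spend the whole budget on independent complete samples, keep the best.
        samples = []
        for _ in range(budget // LENGTH):
            s = ""
            for _ in range(LENGTH):
                s = extend(s)
            samples.append(s)
        return max(samples, key=score)

    def greedy_tree(budget=BUDGET):
        # Crude tree-style allocation: at each step, spend a few expansions on
        # the current prefix and keep the best child (a real MCTS would also
        # back up values and revisit earlier branches).
        per_step = budget // LENGTH
        prefix = ""
        for _ in range(LENGTH):
            children = [extend(prefix) for _ in range(per_step)]
            prefix = max(children, key=score)
        return prefix

    print(best_of_n(), greedy_tree())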
[−] richardvsu 62d ago
Why is almost every RL paper done on Qwen-2.5? That decreases its credibility.
[−] algo_trader 62d ago
great write-up (and effort!! ;))

what are your thoughts on MCTS for coding?

this can/must be paired with a smart execution harness to optimise rollout and rollback of execution paths and system state (rough sketch of what I mean below).

does this change the calculus for optimal post-training?
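roughly what I have in mind for that harness (toy in-process sketch; a real one would checkpoint a container or filesystem overlay, not a Python dict):

    import copy

    class ExecutionHarness:
        # Toy sketch: snapshot state before exploring an execution path,
        # roll back if that path fails, so the search can try a sibling branch.

        def __init__(self, state):
            self.state = state
            self._snapshots = []

        def snapshot(self):
            # Cheap here (deep copy of a dict); a real harness would checkpoint
            # a container, VM, or filesystem before running generated code.
            self._snapshots.append(copy.deepcopy(self.state))

        def rollback(self):
            self.state = self._snapshots.pop()

        def try_branch(self, action):
            self.snapshot()
            try:
                action(self.state)        # candidate execution path from the search
                self._snapshots.pop()     # commit: keep the new state
                return True
            except Exception:
                self.rollback()           # undo side effects before the next branch
                return False

    h = ExecutionHarness({"files": {}})
    print(h.try_branch(lambda s: s["files"].update({"main.py": "print('hi')"})))  # True
    print(h.try_branch(lambda s: 1 / 0))   # False, state rolled back
    print(h.state)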
