Improving Composer through real-time RL (cursor.com)

by ingve 34 comments 98 points

[−] CitrusFruits 49d ago
I've been wondering how they've been able to be so generous with Composer usage with it still making business sense. Seems like this is the answer: presumably they think they'll soon have a competitive advantage not just in the UX space but in the model space as well. It's a great strategy, but I do wonder if the moat will be big enough given how fast things are moving and how competitive the model landscape is.
[−] pillsburycat 49d ago
Important disclaimer for anyone using Cursor: make sure to disable "data sharing" in your account settings, as it is enabled by default and old accounts are automatically opted into it.
[−] vicchenai 49d ago
The RL loop here is clever, but I wonder how the reward signal degrades over time. If you're optimizing for user acceptance of suggestions, you're inevitably training on a mix of "this was actually correct" and "I accepted because editing the suggestion was more work than accepting it." That second case creates a subtle bias toward suggestions that are close-enough-to-not-bother-fixing rather than actually correct.

Also curious whether they see different convergence patterns across languages. My gut says something like Python, where there's more stylistic variation, would make it harder to get a clean reward signal than something like Rust, where there are fewer idiomatic ways to do things.
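The acceptance-bias point above can be made concrete with a toy reward function. This is purely illustrative: `reward` and its edit-similarity discount are invented here, not Cursor's actual signal.

```python
from difflib import SequenceMatcher

def reward(suggestion: str, final_code: str, accepted: bool) -> float:
    """Toy reward for a code suggestion, discounted by how much the
    user edited it after accepting. Invented for illustration only."""
    if not accepted:
        return 0.0
    # Similarity of 1.0 means the suggestion survived untouched;
    # "accepted but heavily rewritten" earns proportionally less.
    return SequenceMatcher(None, suggestion, final_code).ratio()

# A clean accept scores full reward; an accepted-then-rewritten
# suggestion scores strictly less, which is exactly the signal
# a plain accept/reject reward would conflate.
assert reward("x = foo()", "x = foo()", True) == 1.0
assert reward("x = foo()", "y = bar(z)", True) < 1.0
```

Without the similarity discount, both cases would look identical to the trainer, which is the close-enough-to-not-bother-fixing bias described above.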

[−] kgeist 49d ago

>We used a Kimi base, with midtraining and RL on top. Going forward, we'll include the base used in our blog posts, that was a miss. Also, the license is through Fireworks. [0]

And still no mention of Kimi in a new blog post :)

Also, apparently the inference provider they use, Fireworks AI, already has a built-in API for RL tuning Kimi [1], so I wonder which parts are Cursor's own effort and where Fireworks AI actually deserves credit, especially since they repeatedly brag about being able to create a new checkpoint every 5 hours, which would be largely thanks to Fireworks AI's API and training infrastructure.

I mean, I'm genuinely curious how much effort it would actually take me to go from "here, lots of user data" to "the model gains +1% on benchmarks" to produce my own finetune, assuming I already use a good existing foundational model, my inference provider already handles all the tuning infrastructure/logic, and I already have a lot of usage logs.

[0] https://news.ycombinator.com/item?id=47459529

[1] https://fireworks.ai/blog/kimi-k2p5

[−] crazylogger 49d ago
This feels so wrong. The LLM should play the role of a very general (but empty and un-opinionated) brain - you don't want to perform a coding-specific lobotomy on someone every day. The proper target of their RL should have been their harness. That determines the agent's trajectory as much as the base model does.

I also wonder: since they're doing constant RL on model weights against today's Cursor design, does that mean they can never change their system prompt and other parts of the harness?

1) Comparisons between past trajectory data would be meaningless if the trajectories were collected under different instructions.

2) Performance will be terrible the next time they change their tool design, since the model is now "opinionated" based on how a previous version of Cursor was designed.

Anthropic is more sensible with their “constitution” approach to safety. The behaviors (and ultimately the values) you want your model to follow should be a document, not a lobotomy.

[−] hmartin 49d ago
Step 1: Take an open source model with zero acknowledgement.

Step 2: Build on someone else's infrastructure innovations with zero acknowledgement.

Step 3: Write a blog post with "unprecedented" and "100x" and "trillions" in the first paragraph.

Seriously, this seems like cool work and I enjoyed the post. But my basic trust in them has completely tanked.

[−] janalsncm 49d ago
Back in my day we called this real-time training from implicit user feedback.

The engineering challenge here is an order of magnitude bigger though. An LLM is orders of magnitude bigger than a recommender system model. Kudos.
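For anyone who hasn't seen the classic setup being referenced: a recommender updated in real time from implicit feedback can be sketched in a few lines. This is a toy illustration with invented names, not any production system.

```python
import math

class OnlineRanker:
    """Minimal recommender-style model updated online from implicit
    feedback (clicked / not clicked). Toy illustration only."""

    def __init__(self, n_features: int, lr: float = 0.1):
        self.w = [0.0] * n_features
        self.lr = lr

    def predict(self, x: list[float]) -> float:
        # Logistic score: estimated probability the user engages.
        z = sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x: list[float], clicked: bool) -> None:
        # One SGD step on logistic loss per feedback event,
        # applied immediately as the signal arrives.
        err = (1.0 if clicked else 0.0) - self.predict(x)
        self.w = [wi + self.lr * err * xi for wi, xi in zip(self.w, x)]

model = OnlineRanker(n_features=2)
for _ in range(50):
    model.update([1.0, 0.0], clicked=True)   # users engage with items like this
    model.update([0.0, 1.0], clicked=False)  # and skip items like this
```

The LLM version replaces a weight vector you can update per event with a multi-billion-parameter model and a 5-hour checkpoint cadence, which is where the order-of-magnitude engineering gap shows up.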

[−] htrp 49d ago
If the model "improves" every 5 hours, how do you have any guarantee of model consistency across long coding sessions?
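One conventional answer, which the post doesn't confirm Cursor uses, is to pin each session to the checkpoint it started on. Everything below is a hypothetical sketch with invented names.

```python
# Hypothetical sketch: pin each coding session to one model checkpoint,
# even as newer checkpoints ship every few hours. All names invented.
sessions: dict[str, str] = {}
latest_checkpoint = "ckpt-001"

def checkpoint_for(session_id: str) -> str:
    # The first request in a session locks in the then-current
    # checkpoint; later requests reuse it, so model behavior stays
    # consistent for the whole session.
    return sessions.setdefault(session_id, latest_checkpoint)

assert checkpoint_for("s1") == "ckpt-001"
latest_checkpoint = "ckpt-002"            # a new checkpoint ships
assert checkpoint_for("s1") == "ckpt-001" # old session stays pinned
assert checkpoint_for("s2") == "ckpt-002" # new sessions get the new one
```

The trade-off is that long-lived sessions never benefit from the improvements until they restart.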
[−] fzysingularity 49d ago
Real-time or continuous learning is great on paper, but getting it to work without extremely expensive regression testing, and without catastrophic forgetting, is a real challenge.

Credit to the team for taking this on, but I’d be skeptical of announcements like this without at least 3–6 months of proven production deployments. Definitely curious how this plays out.

[−] DeathArrow 49d ago
What training do they claim to have done, given that Composer 2 is just Kimi K2.5? Do they have a collaboration with the Kimi team to help with training?

I use Cursor heavily and pushed for its adoption at work a year ago, but with every passing day I like and trust it less, and I'm beginning to think about alternatives.

[−] polishdude20 49d ago
I'd love to see some data on how much it has improved via this process in the last week.
[−] amazingamazing 49d ago
Seems expensive. Distillation is inherently impossible to defend against: sit back and let your competitors do the hard work. They'll whine and say it's illegal, but they shouldn't complain; they will reap what they sowed.
[−] meric_ 49d ago
What is with these comments? Did anyone even read the article? Composer 2 and Kimi are never mentioned... because the article is about Composer 1.5.

I mean, sure, the techniques are probably the same in 2, but it's not like they're exactly advertising Composer 2 here lol
