NanoGPT Slowrun: 10x Data Efficiency with Infinite Compute

[−] NooneAtAll3 57d ago

I thought "data efficiency" meant same quality with less parameters

instead it's more parameters with less training data... but I don't really see any quality control?

[−] naasking 57d ago

Seems like they're relying on the loss as a measure, at least for now.

[−] littlestymaar 58d ago

> Data efficiency matters because compute grows much faster than data [2] (referencing a paper from 2022)

I'm not convinced this is particularly true in today's world, if you have more compute, you can simply generate more, and higher quality, artificial data. That's what all labs have been doing since at least 2023.

Also, the post references the Chinchilla-optimal training as a comparison baseline, but everyone has moved far beyond Chinchilla scaling, small models are routinely trained on 10-400 times more data than (1-40T tokens) than the Chinchilla-optimal number, so the entire industry went the complete opposite of what they are proposing.

That doesn't mean the techniques presented here are useless or anything (I'm not qualified to judge) but you should take the introduction with a grain of salt.

[−] ACCount37 58d ago

There's "cheap" bulk data - simple synthetics, unfiltered scrapes. Used for pre-training, especially early pre-training. And then there's "expensive" data. Human domain expert solutions, made by people you hire for $100 an hour. Used for SFT.

For "expensive" data, it makes a lot of sense to use every trick in the book to squeeze that data for all its worth.

[−] akshayvegesna 58d ago

You seem to be making two points: - synthetic data is a valuable direction to pursue when you have compute - chinchilla scaling laws have some flaws for small models Both of these are side points to the core purpose of the Slowrun.

The main point is the 100M tokens we train on push people to come up with novel ideas to improve pretraining, outside of facile synthetic data generation. I think we should continue to push on synthetic data, but why not come up with some new ideas too? You cannot use synthetic data for everything (see sdpmas's point)

[−] sdpmas 58d ago

> you can simply generate more, and higher quality, artificial data

this is simply not true. and it's very clear if you look at continual learning, robotics, biology, etc. each has enough economic incentives to spend 1000x compute if that led to much better results, but we just don't know how to do that.

good point on chinchilla, but our models are still absurdly large no matter what standards you compare them to.

[−] littlestymaar 58d ago

> this is simply not true. and it's very clear if you look at continual learning, robotics, biology, etc. each has enough economic incentives to spend 1000x compute if that led to much better results, but we just don't know how to do that

I'm (and so is the post itself) talking about LLMs in particular, and this is indeed true for LLM.

[−] ColonelPhantom 57d ago

If generating synthetic data is such a great way to improve performance, why would it not be applied to the slowrun? Especially for the unlimited compute track, you should have plenty of time to generate as much synthetic data as your heart desires.

Intuitively, I would expect the synthetic data to mostly just "regurgitate" the existing data, and not add much. But I could be wrong of course, and perhaps doing reinforcement learning somewhere could solve that issue as well (though I don't know if there is much hidden in FineWeb that you could RL on; at best you can do self-verification probably?)

[−] andai 58d ago

What's the human baseline? How many cats does a human need to see to learn what a cat is, vs an AI?

Maybe not quite a fair comparison since my human brain has been "learning" for half a billion years before I was born.

I wonder if there's an equivalent of that for AI. Evolving the architectures?

[−] nsnzjznzbx 58d ago

We will get to the point where you can quickly bootstrap i.e. an LLM can train a better LLM in a loop, leave it and it can really learn. Like learn learn.

"Train yourself to solve this problem see OBJECTIVE.md"

[−] pastescreenshot 57d ago

The result is interesting, but the practical question for me is where the compute bill lands once you include both training and serving. If a fixed-data regime pushes you toward ensembles plus chain distillation, is the endgame “serve the ensemble”, or do you expect most of the gain can be compressed back into a single deployable model later? That seems like the difference between a neat scaling result and a generally usable recipe.

[−] naasking 57d ago

Great project. On the matter of data efficiency and regularization, I'd love to see someone try scaling GrokAlign!

[−] abeppu 58d ago

In their little algorithm box on Chain Distillation, they have at step 2b some expression that involves multiplying and dividing by T, and then they say "where α = 0.5, T = 1.0".

I think someone during the copy-editing process told them this needed to look more complicated?

[−] phr4ts 57d ago

The brain does optimization during sleep. Is that something llms can benefit from?

[−] yorwba 58d ago

Related: Discussion on the initial NanoGPT Slowrun announcement: https://news.ycombinator.com/item?id=47251259 (185 points 15 days ago, 39 comments)

[−] webagent255 57d ago

[dead]

[−] myylogic 58d ago

[dead]

[−] aledevv 58d ago

[dead]

[−] QubridAI 57d ago

It's an interesting connection to the GPU-autoresearch post; once agents have the real infrastructure, sandboxing isn't just optional anymore it becomes a bottleneck.

[−] AliEveryHour16 58d ago

[dead]

[−] 1425curlz80 58d ago

[dead]

NanoGPT Slowrun: 10x Data Efficiency with Infinite Compute (qlabs.sh)

46 comments