MegaTrain: Full Precision Training of 100B+ Parameter LLMs on a Single GPU (arxiv.org)

by chrsw 57 comments 326 points

[−] internetguy 37d ago

> MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute engines. For each layer, we stream parameters in and compute gradients out, minimizing persistent device state

This is pretty awesome. The only compute I have at home is an RTX 3080 with 10 GB of VRAM, so I struggle with training larger models (>40M, 50M params). I get OOM errors and have to optimize a lot.

I have a lot more CPU RAM in my PC, and this would likely increase the size of models I can train locally.
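To make the quoted idea concrete, here is a toy sketch of layer-by-layer streaming. It is not MegaTrain's implementation: each "layer" is a single scalar weight (y = w * x), and `HostStore`, `load`, `release`, and `train_step` are hypothetical names standing in for CPU RAM, host-to-GPU transfers, and one training step. The only point is that the "device" never holds more than one layer at a time.

```python
# Toy illustration only: each "layer" is one scalar weight, y = w * x.
# HostStore is a hypothetical stand-in for CPU RAM; load/release stand in
# for copying a layer to the GPU and evicting it again.

class HostStore:
    """Holds every parameter in 'host memory' and tracks device residency."""

    def __init__(self, weights):
        self.weights = list(weights)
        self.resident = 0        # layers currently "on the device"
        self.peak_resident = 0   # high-water mark of resident layers

    def load(self, i):
        self.resident += 1
        self.peak_resident = max(self.peak_resident, self.resident)
        return self.weights[i]

    def release(self, i):
        self.resident -= 1


def train_step(store, x, target, lr=0.01):
    n = len(store.weights)

    # Forward pass: stream one layer in, compute, evict. Keep activations.
    acts = [x]
    for i in range(n):
        w = store.load(i)
        acts.append(w * acts[-1])
        store.release(i)

    loss = 0.5 * (acts[-1] - target) ** 2
    grad_out = acts[-1] - target  # dLoss/d(output)

    # Backward pass: stream layers in reverse, apply the update on the host.
    grads = [0.0] * n
    for i in reversed(range(n)):
        w = store.load(i)
        grads[i] = grad_out * acts[i]          # dLoss/dw_i
        grad_out = grad_out * w                # dLoss/d(input of layer i)
        store.weights[i] = w - lr * grads[i]   # plain SGD, done "on the host"
        store.release(i)

    return loss, grads
```

Calling `train_step(HostStore([2.0, 3.0]), 1.0, 5.0, lr=0.1)` runs one step, and `peak_resident` stays at 1 however many layers the store holds. In a real setup the activations kept in `acts` still cost device memory, which this toy ignores.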

[−] weitendorf 37d ago
To make the most of these architectures, I think the key is essentially moving more of the knowledge/capabilities out of the "weights" and into the complementary parts of the system, in a way that's proportionate to the capabilities of the hardware.

In the past couple of months there's been a kind of explosion in small models that occupy a niche in this kind of AI-transcoding space. What I'm hoping we're right on the cusp of achieving is a similar explosion in what I'd call tool-adaptation, where an LLM paired with some mostly-fixed suite of tools and problem cases can trade off some generality for a specialized (potentially hyper-specialized to the company or user) role.

The thing about more transcoding-related tasks is that they generally stay in sync with what the user of the device is actively doing, which will also typically be closely aligned with the capabilities of the user's hardware and what they want to do with their computer. Most people aren't being intentional about this kind of stuff right now, partly out of habit I think, because only now that personal computers can be steered/programmed somewhat autonomously does it make sense to think of them as "stranded hardware".

I'm wondering if, with the right approach to MoE on local devices (which local LLMs are heading towards), we could basically amortize the expensive hit from loading weights in and out of VRAM through some kind of extreme batch use case that users still find useful enough to be worth the latency. LoRA is already really useful for this, but obviously sometimes you need more expertise/specialization than just a few layers' difference. Experimenting with this right now. It's the same basic principle as in the paper, except less of a technical optimization and more of a workload optimization. Also, it's literally the beginning of machine culture, so that's kind of cool.

[−] HarHarVeryFunny 36d ago

> To make the most of these architectures I think the key is essentially moving more of the knowledge/capabilities out of the "weights" and into the complementary parts of the system in a way that's proportionate to the capabilities of the hardware

I think that's only possible to a limited extent. Learnt skills (RL in the context of an LLM?) need to be in the weights of the model, since this reflects the model's "personalized" learning of the behavioral feedback loop. Declarative knowledge (facts) can be loaded at runtime (RAG).

[−] orbisvicis 37d ago
That's interesting. So you want to train language, linguistic reasoning, and tool use, but otherwise strip out all knowledge in favor of a massive context? Just grade the model on how well it can access local information, and perhaps also run tools?
[−] spacebacon 37d ago
You are on the right track. Check out the Semiotic-Reflexive Transformer (SRT) here.

https://open.substack.com/pub/sublius/p/the-semiotic-reflexi...

[−] hirako2000 37d ago
The claims of the article assume far more compute and far more VRAM. While the trick enables less back and forth, it doesn't eliminate it.

I doubt you meant 50M. Rather 50B?

You can only give it a try, but don't get your hopes up for a large context. Even if their technique works, I would guess an 8k context would still OOM. 2048, maybe.

I'm extrapolating from my own experiment, which didn't use this paper's trick of leveraging system memory.

[−] giancarlostoro 37d ago

> This is pretty awesome. The only compute I have at home is an RTX 3080 with 10 GB of VRAM, so I struggle with training larger models (>40M, 50M params). I get OOM errors and have to optimize a lot.

I'm on the same GPU, and it's intimidating enough that I wonder whether I even want to bother training anything at all. Do you mind sharing what kind of training you've done with that GPU? :)

[−] cyanydeez 37d ago
Anything that can run on an AMD395+ w/128GB or whatever the Apple equivalent is would break things wide open. Training a model on my frameworks of choice or our business info would be awesome.
[−] logicallee 37d ago
Could I ask what you train your models to do? How do you generate the training data for it?
[−] kouteiheika 37d ago
This isn't really anything new; I've been doing something like this for quite a while, I just haven't bothered writing a paper. (: Probably anyone who would seriously tackle the problem of "how do I train a huge model on a tiny amount of VRAM?" would come up with something similar.

However, most people in the field don't, because the actual practical utility of training huge models on a single GPU is quite low. (E.g., they got 341 tok/s for a 14B model on a single 3090, while with my method I was getting ~1k tok/s on a single 4090; that's still very slow.)

Also, there are more tricks one can use to speed up training/lower VRAM usage which they're not using. For example, you don't need any gradient offloading (you can just accumulate the gradients directly into the optimizer's states if you modify your optimizer), you can use Muon instead of Adam (which needs only half the VRAM of Adam), you can use quantization (both for parameters and for the optimizer states; e.g. I found Muon quantized into 4-bit working relatively well), etc.
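A back-of-the-envelope sketch of the optimizer-state part of that claim, under stated assumptions: Adam keeps two buffers per parameter (first and second moment), Muon keeps one momentum buffer, and the 4-bit case ignores quantization scales and any parameters that would still be handled by Adam in practice. `optimizer_state_gib` is a hypothetical helper, and the 14B figure is just the model size mentioned above. Weights, gradients, and activations are not counted.

```python
GIB = 2**30

def optimizer_state_gib(n_params, buffers_per_param, bytes_per_value):
    """Memory for optimizer state only (no weights, gradients, or activations)."""
    return n_params * buffers_per_param * bytes_per_value / GIB

N = 14_000_000_000  # 14B parameters, the model size quoted above

adam_fp32 = optimizer_state_gib(N, 2, 4)    # Adam: first + second moment, fp32
muon_fp32 = optimizer_state_gib(N, 1, 4)    # Muon: single momentum buffer, fp32
muon_4bit = optimizer_state_gib(N, 1, 0.5)  # assumed 4-bit state, scales ignored

print(f"Adam fp32 state:  {adam_fp32:6.1f} GiB")
print(f"Muon fp32 state:  {muon_fp32:6.1f} GiB")
print(f"Muon 4-bit state: {muon_4bit:6.1f} GiB")
```

Under these assumptions the state drops from roughly 104 GiB to roughly 52 GiB, and to about 6.5 GiB with 4-bit quantization.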

[−] bilekas 37d ago

> H200 GPU with 1.5TB host memory,

While yes, it's one GPU, it's not exactly a slim one.

[−] ilaksh 37d ago
How long would it actually take to train a 120B model on an H200? What if you have 8?
[−] drob518 37d ago
I’m curious how this technique works, or not, with unified memory architectures such as Apple’s M series. It seems like it’s relying on using overlapping processes to help speed things up, but I would assume that having everything unified in main memory such that you don’t have to transfer everything back and forth to the GPU would also have some advantages. Can someone wiser explain this to me?
[−] ur-whale 37d ago
Why is it that no one ever talks about the one thing no one can get their hands on except the big labs?

I'm talking about the training set.

Sure there are some open sets out there.

But my guess is they are nowhere near what OpenAI, Google and Anthropic are actually using.

Happy to be proven wrong.

[−] magicalhippo 36d ago
Having just started to dabble with training LLMs, it seems training a model if you have a training and validation data set is fairly trivial. Creating a good and sufficiently large training and validation data set seems to be the hard part.

Sourcing, cleaning, curating, labeling, generating and quality controlling training data is hard and a lot of work, or at least it has been for the projects I've dabbled with.

[−] WithinReason 37d ago
I was wondering how well this would work :) You can definitely push this further, the question is: how well can the gradients and updates compress?
[−] atlgator 37d ago
The GPU is no longer the brain, it's the hand. The brain is your RAM. Suddenly that 256GB DDR5 build your wife questioned is 'research infrastructure.'
[−] 1aurent29 37d ago
Sounds very similar to https://docs.pytorch.org/docs/stable/distributed.fsdp.fully_... I wonder how much of this could be replicated using only this PyTorch primitive.
[−] olliepro 37d ago
This would likely only get used for small finetuning jobs. It’s too slow for the scale of pretraining.
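A rough sense of scale, using the 341 tok/s figure quoted elsewhere in the thread for a 14B model on a single 3090. The token budgets below are illustrative assumptions, not numbers from the paper, and `days_to_process` is a hypothetical helper.

```python
SECONDS_PER_DAY = 86_400

def days_to_process(n_tokens, tokens_per_second):
    """Wall-clock days needed to push n_tokens through training at a fixed rate."""
    return n_tokens / tokens_per_second / SECONDS_PER_DAY

# Assumed token budgets, for illustration only.
budgets = {
    "small finetune (10M tokens)": 10_000_000,
    "large finetune (1B tokens)": 1_000_000_000,
    "pretraining (1T tokens)": 1_000_000_000_000,
}

for label, n_tokens in budgets.items():
    days = days_to_process(n_tokens, 341)
    print(f"{label}: {days:,.1f} days at 341 tok/s")
```

At that rate the 10M-token job finishes in well under a day, while the 1T-token budget would take decades.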
[−] l1n 37d ago
Seems similar to Microsoft DeepSpeed.
[−] mhamd5432 35d ago
interesting approach but for inference localops.tech has a simpler compatibility checker - just punch in your gpu and see what actually fits
[−] samarth0211 36d ago
This is a fantastic step toward democratizing large model training. Making 100B+ parameter training accessible on a single GPU could open the door to a lot more independent research. Really impressive work!
[−] ngold 37d ago
I'm most likely wrong but large language models are literally just stealing....everything