Why is there a new kernel driver here at all? It appears that all it does is allocate system RAM (“DDR4”) and export it as a dmabuf for import into CUDA as mapped external memory. Then a userspace shim hijacks APIs to use that when GPU memory is full. CUDA already supports allocating mapped system memory, so AFAICT this could be implemented in the userspace shim with no new kernel driver.
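To be concrete, the userspace-only path I mean is just standard zero-copy mapped host memory; a minimal sketch (not this project's code) looks roughly like:

    // Sketch: allocate page-locked system RAM and map it into the GPU's
    // address space -- no custom kernel driver involved.
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
        size_t bytes = 8ULL << 30;            // e.g. 8 GB that won't fit in VRAM
        void *host_ptr = NULL, *dev_ptr = NULL;

        cudaSetDeviceFlags(cudaDeviceMapHost);            // allow mapped host memory
        cudaHostAlloc(&host_ptr, bytes, cudaHostAllocMapped);
        cudaHostGetDevicePointer(&dev_ptr, host_ptr, 0);  // GPU-visible alias

        // dev_ptr can be passed to kernels; accesses go over the PCIe bus.
        printf("host %p -> device %p\n", host_ptr, dev_ptr);
        cudaFreeHost(host_ptr);
        return 0;
    }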
Also, as other commenters have mentioned, redirecting allocations to managed memory would enable similar oversubscription.
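In other words (a sketch, not the project's code), something in the spirit of:

    // Sketch: managed memory can already be oversubscribed beyond VRAM;
    // the driver pages it between system RAM and the GPU on demand.
    #include <cuda_runtime.h>

    int main(void) {
        size_t bytes = 64ULL << 30;   // far more than the GPU's VRAM
        void *ptr = NULL;

        cudaMallocManaged(&ptr, bytes, cudaMemAttachGlobal);
        // Optional hint: prefer keeping the pages in system RAM.
        cudaMemAdvise(ptr, bytes, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);

        // Kernels can touch the whole allocation; pages migrate/fault as needed.
        cudaFree(ptr);
        return 0;
    }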
And the hijack approach only makes sense if the goal is to get this behavior with zero app changes; the same result could be achieved with minor app changes (e.g. PyTorch has a pluggable allocator interface). App changes would also allow intentionally placing specific allocations.
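For example (a sketch with made-up names, not a drop-in), the native half of a PyTorch pluggable allocator that tries VRAM first and falls back to managed memory could look something like:

    // Sketch: try cudaMalloc, fall back to managed memory on OOM.
    // Build into a shared library, then load it from Python with something like
    //   torch.cuda.memory.CUDAPluggableAllocator("./alloc.so", "my_malloc", "my_free")
    //   and torch.cuda.memory.change_current_allocator(...).
    #include <cuda_runtime.h>
    #include <sys/types.h>

    extern "C" {

    void* my_malloc(ssize_t size, int device, cudaStream_t stream) {
        void* ptr = nullptr;
        if (cudaMalloc(&ptr, size) == cudaSuccess)   // normal VRAM allocation
            return ptr;
        cudaGetLastError();                          // clear the OOM error
        cudaMallocManaged(&ptr, size);               // spill into system RAM
        return ptr;
    }

    void my_free(void* ptr, ssize_t size, int device, cudaStream_t stream) {
        cudaFree(ptr);                               // valid for both paths
    }

    }  // extern "C"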
My impression is that this is vibe-coded from beginning to end, starting from a design that only makes sense if you are hallucinating.
One downside is that your kernel isn't going to reserve that memory away from userland. You will still see all the memory at the system level as "free". As the GPU driver starts using it, other apps/the OS will try to use the "free" memory, not knowing how much of it is in use (it may show up as "cache", or not at all). Then the OOM killer starts firing or programs start crashing, and at some point the OS tips over or the GPU driver crashes. You can add loads of swap as a compromise and it works okay, if a bit slow.
In any case, loading a gigantic model just to use system RAM is absurdly slow (due to mem bandwidth), like 1-5 t/s, so it's not practical. It'd take a whole day to process one 86k token request. Just pay a cloud provider $0.01 to do it in 10 seconds.
The point is not how fast it is now. The point is that this opens new possibilities that can be built on. Potentially models trained with slightly different architectures to optimize for this use case. Possibly others come along to improve this path. Possibly HW manufacturers make a few small adjustments that remove bottlenecks. Who knows, the next person may combine CPU compute with this memory sharing to get another token a second. Then the next person does predictive loading into memory to keep that bandwidth 100% maxed and usable. Then the next person does, and the next. Before you know it there is a real thing there that never existed.
This is a great project. I love the possibilities it hints at. Thanks for building it!
It’s architecturally not a good approach. System RAM is much slower, so you should only put data on it that doesn’t need to be accessed often. That knowledge lives at the application layer. Adding a CUDA shim makes system RAM appear like VRAM, which gets things to run, but it will never run very well.
The benchmarks at the bottom mention memory tiering and manually controlling where things go, but if your application already does that, then you probably don’t also need a CUDA shim. The application should control the VRAM to system memory transfers with boring normal code.
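To spell out what I mean by "boring normal code": the application keeps its cold data in pinned system RAM and copies it into VRAM right before it is needed, e.g. (a sketch, buffer names made up):

    // Sketch: app-controlled tiering -- hot buffer in VRAM, cold buffer in
    // pinned system RAM, with an explicit copy when the cold data is needed.
    #include <cuda_runtime.h>

    int main(void) {
        size_t bytes = 1ULL << 30;
        void *cold_host = NULL, *hot_dev = NULL;

        cudaHostAlloc(&cold_host, bytes, cudaHostAllocDefault);  // cold tier
        cudaMalloc(&hot_dev, bytes);                             // hot tier

        // ... fill cold_host on the CPU ...

        // The app, not a shim, decides when this transfer happens.
        cudaMemcpy(hot_dev, cold_host, bytes, cudaMemcpyHostToDevice);
        // ... launch kernels that read hot_dev ...

        cudaFree(hot_dev);
        cudaFreeHost(cold_host);
        return 0;
    }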
Not true for unified-memory systems. And for Strix Halo you need to dedicate a fixed amount, which is annoying.
You’re basically stating that swapping is also a bad idea.
And to take it further, any memory or storage is a bad idea because there's L1 cache/SRAM, which is faster than the rest.
> It’s architecturally not a good approach.

Yes, with current LLMs and current hardware and current supporting software this is a true statement. My point wasn't that this approach suddenly changes that; it was that it makes it easier to explore alternatives that might change that. Let's imagine some possibilities:
- Models that use a lot of weight reuse: If you strategically reuse layers 3-4x, that could give a lot of time for async loading of future weights.
- Models that select experts for several layers at a time: Same thing; while crunching on the current layer you have teed-up future layers that can be transferring in.
- HW makers start improving memory bandwidth: This is already happening, right? AMD and Apple are pushing unified memory architectures with much higher bandwidth, but still not quite there compared to GPUs. This could lead to a hybrid approach that makes those machines much more competitive. Similarly, HW makers could bring back technologies that died on the vine that could help; things like Intel's Optane come to mind. Start making mass storage as fast as system memory is now and the equation may change.
These are quick dart throws that probably have obvious holes in them, but the point is that platforms like this help us explore paths that appeared dead-end until the one change that makes them viable and lets them take over. It may not happen. It may be a dead end. But by that logic we would never go out on a limb and try something new. We need people and tech that challenge assumptions and make it easy to try out ideas, to keep the tech ecosystem evolving. This does that. Even if this particular project doesn't succeed, it is a great thing to do, if for no other reason than that it likely just spurred a bunch of people to try their own crazy hacks for LLM inference. Maybe it even enabled a use case with GPUs that nobody realized existed and has nothing to do with LLMs.
With discrete GPUs, using system RAM is slow not due to mem bandwidth, but due to PCIe bandwidth, which is the bottleneck.
For example, 16x PCIe 4.0: 256 Gb/s, 16x PCIe 5.0: 512 Gb/s, while 2x DDR5-6400 DIMMs: 819 Gb/s. The actual throughput is lower for both PCIe and DDR5, due to communication overhead.
On server/workstation motherboards which may have 4, 8 or 12 DIMMs instead of 2, the ratio between memory bandwidth and PCIe bandwidth becomes proportionally higher, so the memory throughput achievable by the GPU becomes a very small fraction of the system memory bandwidth.
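If you want to see what your own box actually achieves, a quick pinned-memory copy benchmark (a sketch) makes the gap concrete:

    // Sketch: time a 1 GB pinned host->device copy to measure effective
    // PCIe throughput; compare against your DIMMs' theoretical bandwidth.
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
        size_t bytes = 1ULL << 30;
        void *host = NULL, *dev = NULL;
        cudaEvent_t start, stop;
        float ms = 0.0f;

        cudaHostAlloc(&host, bytes, cudaHostAllocDefault);  // pinned system RAM
        cudaMalloc(&dev, bytes);
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms, start, stop);

        printf("H2D: %.1f GB/s\n", (bytes / 1e9) / (ms / 1e3));

        cudaFree(dev);
        cudaFreeHost(host);
        return 0;
    }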
> In any case, loading a gigantic model just to use system RAM is absurdly slow (due to mem bandwidth), like 1-5 t/s, so it's not practical. It'd take a whole day to process one 86k token request.
So don't use it for large requests. Ideal for when you just want to categorise things, for example, "does this task need a shell" or "bucket this email into one of help request, bill due or personal comms".
This is really interesting engineering, but I agree with the other commenters that the benchmarking makes it hard to understand how much each factor contributes.
The ExLlamaV3 EXL3 2bpw (8 GB, full VRAM) row is an order of magnitude faster than the baseline - but the baseline seems to be the 32GB model running with the KV cache shared to system memory only (I think?)
But if an 8GB model gives sufficient quality, then it seems like that would have worked without the shared memory thing?
I think the useful apples-to-apples benchmark is currently the Ollama + GreenBoost shim (baseline) (2-5 tps) vs ExLlamaV3 + GreenBoost cache (8–20 tps) comparison.
It would be really useful to see this compared with llama.cpp's existing CPU/memory offload. There is a note at the start ("Offload layers to CPU — works, but drops token/s by 5–10× because CPU RAM has no CUDA coherence") - but it is unclear whether that 5-10x token speed drop is relative to running a model completely on the GPU or relative to the GreenBoost approach.
I think it is vs GPU, in which case it seems likely the performance is similar to what greenboost is giving but probably much more stable.
How does this differ from anything llama.cpp offers, regarding offloading layers? The repo consistently refers to "DDR4". Is there a reason DDR5 won't work with this?
> The best strategy is to shrink the model until it fits — either with EXL3 quantization or ModelOpt PTQ — and use GreenBoost's DDR4 pool for KV cache only.
Does this make sense? I'd have thought the KV cache is guaranteed to be used 100% of the time, while, say, in a MoE the same can't be said of the weights.
Though I suppose if you're shooting for huge context then having that allocation go into RAM makes sense, especially when it's allocated but not used yet.
The physical bottleneck to system memory remains. Therefore, I assume that better results are achieved by manually adjusting which layers are offloaded.
I would prefer to use system memory to cache different models, focusing on things like embedding, rerankers, and TTS. This is sufficient to run a more complex RAG locally, for example, via Mem0, and then use a larger LLM via the cloud.
Doesn't Windows already do this by default? I can already run models bigger than my GPU VRAM and it will start using up to 50% of my system RAM as "shared memory". This is on a Desktop PC without a shared memory architecture.
Could be a very useful way to do some overnight tasks using spare RAM. Possibly things like LLM-based categorisation, labelling, data cleansing. That's what comes to mind for me anyway.
Nobody mentioning how this project is vibecoded slop?
> The code is really bad, with completely unneeded parts. The LLM (Qwen 2.5 7B) has hardcoded the i9 14700KF topology, and has variables related to it that are never used... It's even funnier that the show hardware function always prints the same string. There are even random pip log files. Why did this slop get coverage here?
This is awesome! Normally (generally speaking), offloading layers to CPU RAM means that the compute for those layers occurs on the CPU instead of the GPU. The CPU is orders of magnitude slower than the GPU.
With this approach the compute occurs on the GPU, with the tradeoff that layers in RAM have to be moved back-and-forth through PCI-DMA. It seems to me that this should offer a speedup vs compute split between GPU and CPU. The amount of speedup will depend on how many layers would have been on CPU compute, minus the reduction due to moving those layers between RAM and the GPU.
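How much of that transfer cost you actually pay depends on whether the copies can hide behind compute. The usual trick is double-buffering the layer weights so the copy of layer i+1 overlaps with the compute of layer i; a sketch (host_weights and launch_layer are hypothetical stand-ins for "weights pinned in system RAM" and "run one layer"):

    // Sketch: two device buffers, separate copy and compute streams, events
    // to order them, so PCIe transfers overlap with kernel execution.
    #include <cuda_runtime.h>

    #define N_LAYERS 32

    extern void *host_weights[N_LAYERS];   // hypothetical: pinned RAM, one blob per layer
    extern void launch_layer(int layer, const void *dev_weights, cudaStream_t s);

    void run_layers(size_t layer_bytes) {
        void *dev_buf[2];
        cudaStream_t copy, compute;
        cudaEvent_t ready[2], done[2];

        cudaStreamCreate(&copy);
        cudaStreamCreate(&compute);
        for (int k = 0; k < 2; k++) {
            cudaMalloc(&dev_buf[k], layer_bytes);
            cudaEventCreate(&ready[k]);
            cudaEventCreate(&done[k]);
            cudaEventRecord(done[k], compute);   // buffer k starts out free
        }

        for (int i = 0; i < N_LAYERS; i++) {
            int cur = i & 1;
            // Don't overwrite this buffer until the layer that last used it finished.
            cudaStreamWaitEvent(copy, done[cur], 0);
            cudaMemcpyAsync(dev_buf[cur], host_weights[i], layer_bytes,
                            cudaMemcpyHostToDevice, copy);
            cudaEventRecord(ready[cur], copy);

            // Compute waits only for its own weights; the host immediately queues
            // the next iteration's copy, which then overlaps with this kernel.
            cudaStreamWaitEvent(compute, ready[cur], 0);
            launch_layer(i, dev_buf[cur], compute);
            cudaEventRecord(done[cur], compute);
        }
        cudaStreamSynchronize(compute);
    }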
What's slower? Compute on the CPU or moving data from RAM to GPU through PCI-DMA?
This has been fun: we can task our nemotron-3-super model to run overnight when our desktops are idle. 4070s and 96GB of RAM work fine. Slow, but it does its job.
"A watchdog kernel thread monitors RAM and NVMe pressure and signals userspace before things get dangerous." - which kind of danger this type of solution can have?
Or, as you said, it makes everything backwards compatible that isn't being regularly updated.
"I turned a $95 AMD APU into a 16GB VRAM GPU and it can run stable diffusion!"
https://en.wikipedia.org/wiki/TurboCache
(Not the same thing 1:1, but worth the joke anyway)
(Still cool, still would benefit from better benchmarks)
(Feels especially deceptive when there is another top story right now with the headline “nvidia nemoclaw”, which is an official project)