Why is there a new kernel driver here at all? It appears that all it does is allocate system RAM (“DDR4”) and export it as a dmabuf for import into CUDA as mapped external memory. Then a userspace shim hijacks APIs to use that when GPU memory is full. CUDA already supports allocating mapped system memory, so AFAICT this could be implemented in the userspace shim with no new kernel driver.
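To be concrete, the userspace-only path I mean is just standard zero-copy mapped host memory; a minimal sketch (not this project's code) looks roughly like:

    // Sketch: allocate page-locked system RAM and map it into the GPU's
    // address space -- no custom kernel driver involved.
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
        size_t bytes = 8ULL << 30;            // e.g. 8 GB that won't fit in VRAM
        void *host_ptr = NULL, *dev_ptr = NULL;

        cudaSetDeviceFlags(cudaDeviceMapHost);            // allow mapped host memory
        cudaHostAlloc(&host_ptr, bytes, cudaHostAllocMapped);
        cudaHostGetDevicePointer(&dev_ptr, host_ptr, 0);  // GPU-visible alias

        // dev_ptr can be passed to kernels; accesses go over the PCIe bus.
        printf("host %p -> device %p\n", host_ptr, dev_ptr);
        cudaFreeHost(host_ptr);
        return 0;
    }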
Also, as other commenters have mentioned, redirecting allocations to managed memory would enable similar oversubscription.
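In other words (a sketch, not the project's code), something in the spirit of:

    // Sketch: managed memory can already be oversubscribed beyond VRAM;
    // the driver pages it between system RAM and the GPU on demand.
    #include <cuda_runtime.h>

    int main(void) {
        size_t bytes = 64ULL << 30;   // far more than the GPU's VRAM
        void *ptr = NULL;

        cudaMallocManaged(&ptr, bytes, cudaMemAttachGlobal);
        // Optional hint: prefer keeping the pages in system RAM.
        cudaMemAdvise(ptr, bytes, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);

        // Kernels can touch the whole allocation; pages migrate/fault as needed.
        cudaFree(ptr);
        return 0;
    }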
And the hijack approach only makes sense if the goal is to get this behavior with zero app changes; the same result could be achieved with minor app changes (e.g. PyTorch has a pluggable allocator interface). App changes would also allow intentionally placing specific allocations.
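For example (a sketch with made-up names, not a drop-in), the native half of a PyTorch pluggable allocator that tries VRAM first and falls back to managed memory could look something like:

    // Sketch: try cudaMalloc, fall back to managed memory on OOM.
    // Build into a shared library, then load it from Python with something like
    //   torch.cuda.memory.CUDAPluggableAllocator("./alloc.so", "my_malloc", "my_free")
    //   and torch.cuda.memory.change_current_allocator(...).
    #include <cuda_runtime.h>
    #include <sys/types.h>

    extern "C" {

    void* my_malloc(ssize_t size, int device, cudaStream_t stream) {
        void* ptr = nullptr;
        if (cudaMalloc(&ptr, size) == cudaSuccess)   // normal VRAM allocation
            return ptr;
        cudaGetLastError();                          // clear the OOM error
        cudaMallocManaged(&ptr, size);               // spill into system RAM
        return ptr;
    }

    void my_free(void* ptr, ssize_t size, int device, cudaStream_t stream) {
        cudaFree(ptr);                               // valid for both paths
    }

    }  // extern "C"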
My impression is that this is vibe-coded from beginning to end, starting from a design that only makes sense if you are hallucinating.
One downside is that your kernel isn't going to reserve that memory away from userland. You will still see all the memory at the system level as "free". As the GPU driver starts using it, other apps/the OS will try to use the "free" memory, not knowing how much of it is in use (it may show up as "cache", or not at all). Then the OOM killer starts firing or programs start crashing, and at some point the OS tips over or the GPU driver crashes. You can add loads of swap as a compromise and it works okay, if a bit slow.
In any case, loading a gigantic model just to use system RAM is absurdly slow (due to mem bandwidth), like 1-5 t/s, so it's not practical. It'd take a whole day to process one 86k token request. Just pay a cloud provider $0.01 to do it in 10 seconds.
The point is not how fast it is now. The point is that this opens new possibilities that can be built on. Potentially models trained with slightly different architectures to optimize for this use case. Possibly others come along to improve this path. Possibly HW manufacturers make a few small adjustments that remove bottlenecks. Who knows, the next person may combine CPU compute with this memory sharing to get another token a second. Then the next person does predictive loading into memory to keep that bandwidth 100% maxed and usable. Then the next person does, and the next. Before you know it there is a real thing there that never existed.
This is a great project. I love the possibilities it hints at. Thanks for building it!
It’s architecturally not a good approach. System RAM is much slower, so you should only put data on it that doesn’t need to be accessed often. That knowledge lives at the application layer. Adding a CUDA shim makes system RAM appear like VRAM, which gets things to run, but it will never run very well.
The benchmarks at the bottom mention memory tiering and manually controlling where things go, but if your application already does that, then you probably don’t also need a CUDA shim. The application should control the VRAM to system memory transfers with boring normal code.
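To spell out what I mean by "boring normal code": the application keeps its cold data in pinned system RAM and copies it into VRAM right before it is needed, e.g. (a sketch, buffer names made up):

    // Sketch: app-controlled tiering -- hot buffer in VRAM, cold buffer in
    // pinned system RAM, with an explicit copy when the cold data is needed.
    #include <cuda_runtime.h>

    int main(void) {
        size_t bytes = 1ULL << 30;
        void *cold_host = NULL, *hot_dev = NULL;

        cudaHostAlloc(&cold_host, bytes, cudaHostAllocDefault);  // cold tier
        cudaMalloc(&hot_dev, bytes);                             // hot tier

        // ... fill cold_host on the CPU ...

        // The app, not a shim, decides when this transfer happens.
        cudaMemcpy(hot_dev, cold_host, bytes, cudaMemcpyHostToDevice);
        // ... launch kernels that read hot_dev ...

        cudaFree(hot_dev);
        cudaFreeHost(cold_host);
        return 0;
    }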
Not true for unified-memory systems. And for Strix Halo you need to dedicate a fixed amount, which is annoying.
You’re basically stating that swapping is also a bad idea.
And to take it further, any memory or storage is a bad idea because there's L1 cache/SRAM, which is faster than the rest.
> It’s architecturally not a good approach.

Yes, with current LLMs and current hardware and current supporting software this is a true statement. My point wasn't that this approach suddenly changes that; it was that it makes it easier to explore alternatives that might change that. Let's imagine some possibilities:
- Models that use a lot of weight reuse: If you strategically reuse layers 3-4x, that could give a lot of time for async loading of future weights.
- Models that select experts for several layers at a time: Same thing; while crunching on the current layer you have teed-up future layers that can be transferring in.
- HW makers start improving memory bandwidth: This is already happening, right? AMD and Apple are pushing unified memory architectures with much higher bandwidth, but still not quite there compared to GPUs. This could lead to a hybrid approach that makes those machines much more competitive. Similarly, HW makers could bring back technologies that died on the vine that could help; things like Intel's Optane come to mind. Start making mass storage as fast as system memory is now and the equation may change.
These are quick dart throws that probably have obvious holes in them, but the point is that platforms like this help us explore paths that appeared dead-end until the one change that makes them viable and lets them take over. It may not happen. It may be a dead end. But by that logic we would never go out on a limb and try something new. We need people and tech that challenge assumptions and make it easy to try out ideas, to keep the tech ecosystem evolving. This does that. Even if this particular project doesn't succeed, it is a great thing to do, if for no other reason than that it likely just spurred a bunch of people to try their own crazy hacks for LLM inference. Maybe it even enabled a use case with GPUs that nobody realized existed and has nothing to do with LLMs.
With discrete GPUs, using system RAM is slow not due to mem bandwidth, but due to PCIe bandwidth, which is the bottleneck.
For example, 16x PCIe 4.0: 256 Gb/s, 16x PCIe 5.0: 512 Gb/s, while 2x DDR5-6400 DIMMs: 819 Gb/s. The actual throughput is lower for both PCIe and DDR5, due to communication overhead.
On server/workstation motherboards which may have 4, 8 or 12 DIMMs instead of 2, the ratio between memory bandwidth and PCIe bandwidth becomes proportionally higher, so the memory throughput achievable by the GPU becomes a very small fraction of the system memory bandwidth.
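If you want to see what your own box actually achieves, a quick pinned-memory copy benchmark (a sketch) makes the gap concrete:

    // Sketch: time a 1 GB pinned host->device copy to measure effective
    // PCIe throughput; compare against your DIMMs' theoretical bandwidth.
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
        size_t bytes = 1ULL << 30;
        void *host = NULL, *dev = NULL;
        cudaEvent_t start, stop;
        float ms = 0.0f;

        cudaHostAlloc(&host, bytes, cudaHostAllocDefault);  // pinned system RAM
        cudaMalloc(&dev, bytes);
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms, start, stop);

        printf("H2D: %.1f GB/s\n", (bytes / 1e9) / (ms / 1e3));

        cudaFree(dev);
        cudaFreeHost(host);
        return 0;
    }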
> In any case, loading a gigantic model just to use system RAM is absurdly slow (due to mem bandwidth), like 1-5 t/s, so it's not practical. It'd take a whole day to process one 86k token request.
So don't use it for large requests. Ideal for when you just want to categorise things, for example, "does this task need a shell" or "bucket this email into one of help request, bill due or personal comms".
This is really interesting engineering, but I agree with the other commenters that the benchmarking makes it hard to understand how much each factor contributes.
The ExLlamaV3 EXL3 2bpw (8 GB, full VRAM) row is an order of magnitude faster than the baseline - but the baseline seems to be the 32GB model running with the KV cache shared to system memory only (I think?)
But if an 8GB model gives sufficient quality, then it seems like that would have worked without the shared memory thing?
I think the useful apples-to-apples benchmark is currently the Ollama + GreenBoost shim (baseline) (2-5 tps) vs ExLlamaV3 + GreenBoost cache (8–20 tps) comparison.
It would be really useful to see this compared with llama.cpp's existing CPU/memory offload. There is a note at the start ("Offload layers to CPU — works, but drops token/s by 5–10× because CPU RAM has no CUDA coherence") - but it is unclear whether that 5-10x token speed drop is relative to running a model completely on the GPU or relative to the GreenBoost approach.
I think it is vs GPU, in which case it seems likely the performance is similar to what greenboost is giving but probably much more stable.
How does this differ from anything llama.cpp offers, regarding offloading layers? The repo consistently refers to "DDR4". Is there a reason DDR5 won't work with this?
> The best strategy is to shrink the model until it fits — either with EXL3 quantization or ModelOpt PTQ — and use GreenBoost's DDR4 pool for KV cache only.
Does this make sense? I'd have thought the KV cache is guaranteed to be used 100% of the time, while, say, in a MoE the same can't be said of the weights.
Though I suppose if you're shooting for huge context then having that allocation go into RAM makes sense, especially when it's allocated but not used yet.
The physical bottleneck to system memory remains. Therefore, I assume that better results are achieved by manually adjusting which layers are offloaded.
I would prefer to use system memory to cache different models, focusing on things like embedding, rerankers, and TTS. This is sufficient to run a more complex RAG locally, for example, via Mem0, and then use a larger LLM via the cloud.
Doesn't Windows already do this by default? I can already run models bigger than my GPU VRAM and it will start using up to 50% of my system RAM as "shared memory". This is on a Desktop PC without a shared memory architecture.
Could be a very useful way to do some overnight tasks using spare RAM. Possibly things like LLM-based categorisation, labelling, data cleansing. That's what comes to mind for me anyway.
Nobody mentioning how this project is vibecoded slop?
> The code is really bad, with completely unneeded parts. The LLM (Qwen 2.5 7B) has hardcoded the i9 14700KF topology, and has variables related to it that are never used... It's even funnier that the show hardware function always prints the same string. There are even random pip log files. Why did this slop get coverage here?
This is awesome! Normally (generally speaking), offloading layers to CPU RAM means that the compute for those layers occurs on the CPU instead of the GPU. The CPU is orders of magnitude slower than the GPU.
With this approach the compute occurs on the GPU, with the tradeoff that layers in RAM have to be moved back-and-forth through PCI-DMA. It seems to me that this should offer a speedup vs compute split between GPU and CPU. The amount of speedup will depend on how many layers would have been on CPU compute, minus the reduction due to moving those layers between RAM and the GPU.
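How much of that transfer cost you actually pay depends on whether the copies can hide behind compute. The usual trick is double-buffering the layer weights so the copy of layer i+1 overlaps with the compute of layer i; a sketch (host_weights and launch_layer are hypothetical stand-ins for "weights pinned in system RAM" and "run one layer"):

    // Sketch: two device buffers, separate copy and compute streams, events
    // to order them, so PCIe transfers overlap with kernel execution.
    #include <cuda_runtime.h>

    #define N_LAYERS 32

    extern void *host_weights[N_LAYERS];   // hypothetical: pinned RAM, one blob per layer
    extern void launch_layer(int layer, const void *dev_weights, cudaStream_t s);

    void run_layers(size_t layer_bytes) {
        void *dev_buf[2];
        cudaStream_t copy, compute;
        cudaEvent_t ready[2], done[2];

        cudaStreamCreate(&copy);
        cudaStreamCreate(&compute);
        for (int k = 0; k < 2; k++) {
            cudaMalloc(&dev_buf[k], layer_bytes);
            cudaEventCreate(&ready[k]);
            cudaEventCreate(&done[k]);
            cudaEventRecord(done[k], compute);   // buffer k starts out free
        }

        for (int i = 0; i < N_LAYERS; i++) {
            int cur = i & 1;
            // Don't overwrite this buffer until the layer that last used it finished.
            cudaStreamWaitEvent(copy, done[cur], 0);
            cudaMemcpyAsync(dev_buf[cur], host_weights[i], layer_bytes,
                            cudaMemcpyHostToDevice, copy);
            cudaEventRecord(ready[cur], copy);

            // Compute waits only for its own weights; the host immediately queues
            // the next iteration's copy, which then overlaps with this kernel.
            cudaStreamWaitEvent(compute, ready[cur], 0);
            launch_layer(i, dev_buf[cur], compute);
            cudaEventRecord(done[cur], compute);
        }
        cudaStreamSynchronize(compute);
    }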
What's slower? Compute on the CPU or moving data from RAM to GPU through PCI-DMA?
This has been fun: we can task our nemotron-3-super model to run overnight when our desktops are idle. 4070s and 96GB of RAM work fine. Slow, but it does its job.
"A watchdog kernel thread monitors RAM and NVMe pressure and signals userspace before things get dangerous." - which kind of danger this type of solution can have?
Or, as you said, it makes everything backwards compatible that isn't being regularly updated.
"I turned a $95 AMD APU into a 16GB VRAM GPU and it can run stable diffusion!"
https://en.wikipedia.org/wiki/TurboCache
(Not the same thing 1:1, but worth the joke anyway)
(Still cool, still would benefit from better benchmarks)
(Feels especially deceptive when there is another top story right now with the headline “nvidia nemoclaw”, which is an official project)