> Apple Silicon changes the physics. The CPU and GPU share the same physical memory (Apple's Unified Memory Architecture) ... no bus!
Beware the reality distortion field: This is of course how it's worked on most x86 machines for a long time. And also on most Macs when they were using Intel chips.
Why did all my x86 onboard iGPUs reserve a fixed amount of RAM on boot, inaccessible to the OS? Why do dGPUs bring their own VRAM, and how do you directly manipulate it from the CPU without copying?
Correct me if I'm wrong, but that reserved memory is for the framebuffer? The iBoot bootloader also reserves some memory for the framebuffer.
dGPUs bring their own VRAM because it's a different type of memory, allowing them to get higher performance than they could with DDR. The M4 Max requires 128GB of LPDDR5X to reach its ~500GB/s bandwidth. The RX Vega 64 had that same bandwidth in 2017 with just 8GB of HBM2.
Nope, the reserved memory is what's available to use from the various APIs (VK, GL, etc). More recently there's OS support for flexible on demand allocation by the GPU driver.
Of course the APIs have allowed you to make direct use of pointers to CPU memory for something like a decade. However that requires maintaining two separate code paths because doing so while running on a dGPU is _extremely_ expensive.
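For what it's worth, here's a minimal Swift/Metal sketch of those two code paths: zero-copy wrapping of a host pointer on a unified-memory device versus copying into device-managed storage elsewhere. The function name and the fallback storage mode are my own illustration, not anything from the article.

    import Metal

    // Sketch (macOS): wrap existing host memory for the GPU when the device shares
    // memory with the CPU; fall back to a copy otherwise.
    // makeBuffer(bytesNoCopy:) needs a page-aligned pointer and a length that is a
    // multiple of the page size.
    func gpuBuffer(wrapping hostPointer: UnsafeMutableRawPointer,
                   length: Int,
                   on device: MTLDevice) -> MTLBuffer? {
        if device.hasUnifiedMemory {
            // iGPU / Apple Silicon path: the GPU maps the same physical pages, no copy.
            return device.makeBuffer(bytesNoCopy: hostPointer,
                                     length: length,
                                     options: .storageModeShared,
                                     deallocator: nil)
        } else {
            // dGPU path: the bytes are copied into driver-managed storage and travel
            // over the bus, which is the expensive part.
            return device.makeBuffer(bytes: hostPointer,
                                     length: length,
                                     options: .storageModeManaged)
        }
    }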
As someone who's worked on GPU drivers for shared-memory systems for over 15 years, supporting hardware that was put on the market over 20 years ago: they've "always" (in my experience) been able to dynamically assign memory pages to the GPU.
The "reserved" memory is more about the guaranteed minimum to allow the thing to actually light up, and sometimes specific hardware blocks had more limited requirements (e.g. the display block might require contiguous physical addresses, or the MMU data/page tables themselves) so we would reserve a chunk to ensure they can actually be allocated with those requirements. But they tended to be a small proportion of the total "GPU Memory used".
Sure, sharing the virtual address space is less well supported, but the total amount of memory the GPU can use is flexible at runtime.
To the first question: blame Windows I guess. But even on older chips, GPU code could access memory allocated on the CPU side so this didn't cap the amount of data your GPGPU code could crunch.
I remember this was mostly a BIOS setting for how much memory to allocate to the iGPU - and once set in the BIOS, that memory was not accessible to the underlying OS (other than through GPU I/O).
Agreed, maybe "changes the physics" was too strong; shared CPU/GPU memory is not new.
What is different then is the combination of
1. UMA memory (and yes, iGPU had this, pre-M1)
2. enough bandwidth / GPU throughput for local inference
3. straightforward makeBuffer(bytesNoCopy:) path
So, the novelty isn't the shared memory itself, but the whole chain lining up to make the Wasm linear memory -> Metal-buffer approach practical + performant enough (a rough sketch follows below).
(and not saying there's some Apple Silicon magic here either ... it'd work anywhere there was UMA and no-copy host-pointer path)
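To make that chain concrete, here's a hedged sketch of the Wasm-linear-memory -> Metal-buffer step, assuming the runtime exposes the linear memory's base pointer and keeps the allocation page-aligned (the function and parameter names are hypothetical, not taken from the article):

    import Metal

    // Expose a Wasm module's linear memory to the GPU without copying.
    // Wasm linear memory grows in 64 KiB pages, so a runtime that mmaps it will
    // normally satisfy Metal's page-alignment and length requirements.
    func metalView(ofLinearMemory base: UnsafeMutableRawPointer,
                   byteCount: Int,
                   device: MTLDevice) -> MTLBuffer? {
        // Only worthwhile where CPU and GPU share physical memory; on a dGPU the
        // "shared" buffer would be reached over the bus instead.
        guard device.hasUnifiedMemory else { return nil }
        return device.makeBuffer(bytesNoCopy: base,
                                 length: byteCount,
                                 options: .storageModeShared,
                                 deallocator: nil) // the Wasm runtime keeps ownership of the pages
    }

A kernel dispatched against the returned buffer then reads and writes the same bytes the module sees through its own loads and stores into linear memory.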
Apple Silicon uses unified memory, where the CPU and GPU use the exact same memory and no copies from RAM to VRAM are needed. The article opens by saying just that, and indeed it is the whole point of the article.
I am always a bit baffled why Apple gets credited with this. Unified memory has been a thing for decades. I can still load the biggest models on my 10th gen Intel Core CPU and the integrated GPU can run inference.
The difference being that modern integrated GPUs are just that much faster and can run inference at tolerable speeds.
(Plus NPUs being a thing now, but that also started much earlier. The 10th gen Intel Core architecture already had instructions to deal with "AI" workloads... just very preliminary ones.)
That’s shared, not unified: it's partitioned, with the CPU and GPU copies managed by the driver. Lunar Lake (2024) is getting closer but still not as tightly integrated as Apple, and capped at 32GB (Apple goes up to 512GB). AMD Ryzen AI Max is closer to Apple but still has roughly 3x slower memory.
Shared vs unified is merely a driver implementation detail. Regardless, in practice (IIUC) data is still going to be copied if you perform a transfer using a graphics API because the driver has no way of knowing what the host might do with the pointed-to memory after the transfer.
If you make use of host pointers and run on an iGPU no copy will take place.
My last serious GPU programming was with OpenCL. And if my memory does not fail me the API was quite specific about copying and/or sharing memory on a shared memory system.
I am pretty sure that my old 10th gen CPU/GPU combo has the ability to use the "unified"/zero-copy access mode for the GPU.
I don't think people are crediting Apple with inventing unified memory - I certainly did not. There have been similar systems for decades. What Apple did is popularize this with widely available hardware: GPUs that don't totally suck for inference, combined with RAM that has decent speed at an affordable price. You either had iGPUs, which were slow (plus not exactly the fastest DDR memory) but at least sitting on the same die, or you had fast dGPUs with their own limited amount of VRAM. So the choice was between direct memory access but not powerful, or powerful but strangled by having to go through the PCIe subsystem to access RAM.
The article is talking about one particular optimization that one can implement with Apple Silicon and I at least wasn't aware that it is now possible to do so from WebAssembly - so to completely dismiss it as if it had nothing to do with Apple Silicon is imho not fair.
> on Apple Silicon, a WebAssembly module's linear memory can be shared directly with the GPU: no copies, no serialization, no intermediate buffers
enhance
> no copies, no serialization, no intermediate buffers
Would it kill people to write their own stuff? Why are we doing this? Out of all the things people immediately cede to AI, they cede their human ability to communicate and convey/share ideas. This timeline is bonkers.
I’ve become overly sensitive to it as well because it’s such a reliable indicator that there are other problems in the work.
I’ve wasted so much time looking at interesting repos this year before discovering that one of the main claims was a hallucination, or that when I got to the specific part of the codebase it just had a big note from the LLM that it's a placeholder until it can figure out how to do the requested thing.
The people who have AI write their articles don’t care if it works or if it’s correct. They’re trying to get jobs and want something quick and interesting that will appeal to a lazy hiring manager. We’re just taking the bait too.
> The people who have AI write their articles don’t care if it works or if it’s correct.
I'd build on this: the people who have AI write their articles very likely don't know how their thing works or whether it's correct. High chance they'll stumble when they are expected to speak about whatever it is they are presenting with some authority and demonstration of knowledge. Human to human, not being able to do that obliterates trust. It places the work somewhere near the realm of misinformation, which nobody has any interest in consuming.
Good luck to people who want to fluff expertise and present as more capable for job prospects. The world is shit, and I know there are more people who need income than there are jobs that provide for our basic human needs, but this level of AI crutching is just going to bode poorly for those who think it will get them where they need to go.
Well, there is a long tradition of "testing" developer candidates by asking them to exhibit skills in tasks that they never, ever, do in their work. Like whiteboard coding.
It doesn't have a great success record.
I personally would rather they exhibited expert skills in using tools, and expressing their design insight as a part of that skillset.
Huh, I’m 100% going to interview this way the next time I have to hire an engineer. I can’t think of a better way to get a sense of how a candidate reasons about things, and of their values - do they have a sense of responsibility, conscientiousness, team fit.
All the other things that could be LLM-mediated no longer carry any signal.
I don't know, to me your sentiment sounds a lot like how back in the day they used to say "you can't just use a calculator all the time, use your brain and show the work on pen and paper".
Humans have been using tools to communicate since prehistory. Language itself is a communication tool invented to supersede body language, grunting, and noises. The thought and idea are theirs; it was communicated. Would it be that much different if they used a spellchecker extensively to edit their work?
I get why you're annoyed but is it really such a big deal? random people aren't to blame for whatever other annoyances "AI slop" has created.
On the one hand, it sounds promising to exploit shared-memory properties to speed up inference. On the other hand, the well-established inference engines are perhaps already well optimized to overlap compute and communication efficiently. In that case, the host-device copies are likely not the problem to tackle.
The value would be in actor processes, where you can delegate inference without paying the 'copy tax' for crossing the sandbox boundary.
So, less "inference engine" and more "Tmux for AI agents"
Think pausing, moving, resuming, swapping model backend.
I scoped the post to memory architecture, since it was the least obvious part ... will follow up with one about the actor model aspect.
The whole Apple Silicon thing is (in this case) just added details that don't actually matter.
[1] https://github.com/WebAssembly/memory-control/blob/main/prop...
And yes things like the Amiga Blitter, arcade or console graphics units were already baby GPUs.
That's the same no matter the physical memory system architecture.
There will be a time when it will be problematic for those who over-rely on AI and will struggle in on-site interviews with whiteboard tests.
Also, these folks should be amazed by 8- and 16-bit games development, or games consoles in general.