
My first impressions on ROCm and Strix Halo (blog.marcoinacio.com)

60 points by random_ | 60 comments


[−] spoaceman7777 26d ago
I'm somewhat confused as to why this is on the front page. It doesn't go into any real detail, and the advice it gives is... not good. You should definitely not be quantizing your own GGUFs using an old method like that HF script. There are lots of ways to run LLMs via podman (some even officially recommended by the project!). The chip has been out for almost a year now, and its most notable (and relevant-to-AI) feature isn't mentioned in the article: it's the only x86_64 chip below workstation/server grade with quad-channel RAM, and inference is generally RAM constrained. I'm also quite puzzled by the bit about running PyTorch via uv.

Anyway. I wouldn't recommend following the steps posted there. Poke around Google, or ask your friendly neighborhood LLM for advice on how to set up your Strix Halo laptop/desktop for the tasks described. A good resource to start with would probably be the Unsloth page for whichever model you are trying to run. (There are a few quantization groups competing for top place with GGUFs, and Unsloth is regularly at the top, with incredible documentation on inference, training, etc.)

Anyway, sorry to be harsh. I understand that this is just a blog for jotting down stuff you're doing, which is a great thing to do. I'm mostly just commenting on the fact that this is on the front page of HN for some reason.

[−] pierrekin 26d ago
Thanks for writing this comment. I think seeing someone’s “first impressions” and then someone else’s response to those thoughts is more interesting and feels more socially connected than just reading a “correct” guide, especially when it’s something I’m curious about but wouldn’t necessarily be motivated enough to actually try out myself.
[−] rpdillon 25d ago
Agreed. Been running a Strix Halo box since mid-2025. Lemonade builds of llama.cpp with Unsloth or Bartowski quants have proven to be excellent.
[−] fwipsy 26d ago
Quad-channel RAM is common on consumer desktops. Strix Halo has *8* channels, and also very fast RAM (soldered RAM can be faster than DIMMs because the traces are shorter).
[−] fluoridation 26d ago
Quad-channel memory is not common on consumer desktops; it's strictly a HEDT-and-above feature. The vast majority of consumer desktops have 2 channels or fewer.
[−] adrian_b 26d ago
One should no longer use the word "channel" because the width of a channel differs between various kinds of memories, even among those that can be used with the same CPU (e.g. between DDR and LPDDR or between DDR4 and DDR5).

For instance, now the majority of desktops with DDR5 have 4 channels, not 2 channels, but the channels are narrower, so the width of the memory interface is the same as before.

To avoid ambiguities, one should always write the width of the memory interface.

Most desktop computers and laptop computers have 128-bit memory interfaces.

The cheapest desktop computers and laptop computers, e.g. those with Intel Alder Lake N/Twin Lake CPUs, and also many smartphones and Arm-based SBCs, have 64-bit memory interfaces.

Cheaper smartphones and Arm-based SBCs have 32-bit memory interfaces.

Strix Halo and many older workstations and many cheaper servers have 256-bit memory interfaces.

High-end servers and workstations have 768-bit or 512-bit memory interfaces.

It is expected that future high-end servers will have 1024-bit memory interfaces per socket.

GPUs with private memory usually have memory interfaces between 192-bit and 1024-bit, but newer consumer GPUs usually have narrower memory interfaces than older consumer GPUs, to reduce cost. The narrower memory interface is compensated by faster memory, so the available bandwidth in consumer GPUs has increased much more slowly than the increase in GDDR memory speed would have allowed.

[−] fluoridation 25d ago

>now the majority of desktops with DDR5 have 4 channels, not 2 channels

Source? I just looked up two random X870E boards from Gigabyte and both are dual channel.

>To avoid ambiguities, one should always write the width of the memory interface.

They're incomparable quantities. More channels support more parallel operations, while a wider bus at a constant frequency supports higher throughput.

The bus width is not even that useful of a metric. It's more useful to talk about bits per second, which is the product of bus width and frequency.
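
A quick back-of-the-envelope of that product, with illustrative (not measured) numbers for DDR5-6000 on a standard 128-bit desktop interface:

```python
# Peak theoretical bandwidth = interface width (bytes) * transfer rate (T/s)
width_bits = 128            # typical desktop: 2x64-bit DDR4 or 4x32-bit DDR5
transfers_per_sec = 6000e6  # DDR5-6000
bandwidth_gb_s = (width_bits / 8) * transfers_per_sec / 1e9
print(f"{bandwidth_gb_s:.0f} GB/s")  # ~96 GB/s peak, before any real-world losses
```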

[−] sliken 24d ago
Sadly motherboards, tech journalists, and many other places confuse the difference between a DIMM and a channel. The trick is that in the DDR4 generation they were the same, 64 bits wide. However, a standard DDR5 DIMM is not 1x64-bit, it's actually 2x32-bit. Thus 2 DDR5 DIMMs = 4 channels.

For some workloads the extra channels help, despite having the same bandwidth. This is one of the reasons that it's possible for a DDR5 system to be slightly faster than a DDR4 system, even if the memory runs at the same speed.

[−] fluoridation 24d ago

>However, a standard DDR5 DIMM is not 1x64-bit, it's actually 2x32-bit. Thus 2 DDR5 DIMMs = 4 channels.

Uh, surely that depends on how the motherboard is wired. Just because each DIMM has half its pins on one channel and the other half on another doesn't mean 2 DIMMs = 4 channels. It could just be that the top pins across all the DIMMs are on one channel and the bottom ones are on another.

[−] sliken 24d ago
I think there's a standard wiring for the DIMM and some parts are shared. Each normal DDR5 DIMM has 2 subchannels that are 32 bits each, and there's a new specification for the HUDIMM, which will only enable 1 subchannel and only have half the bandwidth.

I don't think you can wire up DDR5 DIMMs willy-nilly as if they were 2 separate 32-bit DIMMs.

[−] fluoridation 24d ago
Well, I don't know what to tell you. I'm not a computer engineer, but I assume Gigabyte has at least a few of those, and they're labeling the X870E boards with 4 DIMMs as "dual channel". I feel like if they were actually quad-channel they'd jump at the chance to put a bigger number, so I'm compelled to trust the specs.
[−] sliken 24d ago
In computer-manufacturer speak, dual channel = 2 x 64-bit = 128 bits wide.

So with 2 DIMMs or 4 you still get 128-bit wide memory. With DDR4 that means 2 channels x 64 bits each. With DDR5 that means 4 channels x 32 bits each.

Keep in mind that the memory controller is in the CPU; the motherboard's job is just to connect the right pins on the DIMMs to the right pins on the CPU socket. The days of an off-chip memory controller/north bridge are long gone.

So if you look at an AM5 CPU, the spec clearly states:

   * Memory Type: DDR5-only (no DDR4 compatibility).

   * Channels: 2 Channel (Dual-Channel).

   * Memory Width: 2x32-bit sub-channels (128-bit total for 2 sticks).
[−] sliken 24d ago

> Quad-channel RAM is common on consumer desktops

Yes, but tablets, laptops, and normal (non-HEDT) desktops have 4 channels, 4x32-bit = 128 bits wide. DDR5 allows two 32-bit channels on a 64-bit DIMM; the previous-gen DDR4 allowed one 64-bit channel on a 64-bit DIMM.

So Strix Halo (on laptops, tablets, and desktops) allows for a 256-bit wide memory system, providing twice the memory bandwidth of any Ryzen or Intel i3/i5/i7/i9. The Apple Pro (256-bit), Max (512-bit), and Ultra (1024-bit) lines of Apple silicon have greater than 128-bit wide memory systems. On the AMD side it's just the Threadripper (256-bit) and Threadripper Pro (512-bit), but those typically go into workstations that are physically large, expensive, and need substantial cooling.

So Strix Halo is pretty unique (outside of Apple) in providing twice the memory bandwidth of anything else that fits in the tablet, laptop, or small-desktop category.
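
To put those widths in rough numbers, here's a sketch that assumes LPDDR5X-8000 across the board (actual transfer rates differ per platform, so treat these as ballpark peaks):

```python
# Peak bandwidth scales linearly with interface width at a fixed transfer rate.
MT_S = 8000e6  # assumed LPDDR5X-8000; real platforms use different speeds
for name, width_bits in [("typical desktop/laptop", 128),
                         ("Strix Halo", 256),
                         ("Apple Max / Threadripper Pro", 512)]:
    gb_s = (width_bits / 8) * MT_S / 1e9
    print(f"{name:30s} {width_bits:4d}-bit  ~{gb_s:.0f} GB/s peak")
```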

[−] phonon 26d ago
4 DIMMs =/= 4 channels
[−] fwipsy 25d ago
I knew that, but I still thought most desktops with 4 DIMM slots supported quad-channel memory. I guess I was wrong.
[−] seemaze 26d ago
Check out the officially supported project Lemonade[0] by AMD. It has gfx1151-specific builds of vLLM, llama.cpp, ComfyUI, and even a PR to merge a Strix Halo port of Apple’s MLX[1], with a quick and easy install.

[0] https://www.amd.com/en/developer/resources/technical-article...

[1] https://github.com/lemonade-sdk/lemonade/issues/1642

[−] data-ottawa 25d ago
I don’t think Lemonade includes a ComfyUI wrapper; it does have Stable Diffusion support built in, though.
[−] suprjami 26d ago
If you are using quants below Q8 then get them from Unsloth or Bartowski.

They are higher quality than the quants you can make yourself due to their imatrix datasets and selective quantisation of different parts of the model.

For Qwen 3.5 Unsloth did 9 terabytes of quants to benchmark the effects of this:

https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks
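
As a concrete starting point, something like this pulls a single GGUF from a quant repo on Hugging Face; the repo id and filename below are placeholders, so check the actual Unsloth or Bartowski model card for the real ones:

```python
from huggingface_hub import hf_hub_download

# Placeholder repo/filename -- substitute the real quant repo and the
# quant level you want (e.g. a UD Q6_K_XL file).
path = hf_hub_download(
    repo_id="unsloth/SomeModel-GGUF",
    filename="SomeModel-UD-Q6_K_XL.gguf",
)
print(path)  # local cache path you can pass to llama-server -m
```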

[−] sidkshatriya 26d ago

> It seems that things wouldn't work without a BIOS update: PyTorch was unable to find the GPU. This was easily done on the BIOS settings: it was able to connect to my Wifi network and download it automatically.

Call me traditional, but I find it a bit scary for my BIOS to be connecting to WiFi and doing the downloading. It makes me wonder whether the new BIOS blob would be secure, i.e. did the BIOS connect securely over HTTPS? Did it check the appropriate hash/signature, etc.? I would suppose all this is more difficult to do in the BIOS. I would expect better security if this were done in user space in the OS.

I'd much prefer it if the OS did the actual downloading, with the BIOS just doing the installation of the update.
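
The download-and-verify half is easy enough to do in user space; a minimal sketch (the URL and published digest here are placeholders, and the flashing step would still belong to the vendor tooling):

```python
import hashlib
import urllib.request

FW_URL = "https://example.com/vendor/bios-update.bin"  # placeholder URL
EXPECTED_SHA256 = "0" * 64                             # placeholder published digest

# Download over HTTPS in user space, then verify the hash before the
# firmware side is ever allowed to touch the blob.
blob = urllib.request.urlopen(FW_URL).read()
digest = hashlib.sha256(blob).hexdigest()
if digest != EXPECTED_SHA256:
    raise SystemExit(f"hash mismatch: {digest}")
with open("bios-update.bin", "wb") as f:
    f.write(blob)
```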

[−] anko 26d ago
I would be interested to know what speeds you can get from Gemma 4 26B and 31B on this machine, and also how ROCm compares to Triton.
[−] bityard 26d ago
If you just want to run models, most of TFA is taking the scenic route.

All you really need is podman, toolbx, and the Strix Halo toolbox images from https://github.com/kyuz0/amd-strix-halo-toolboxes. Then you just download your GGUFs and hand them to llama-server.

Yes, there are other solutions that are a bit more hand-holdy, but if you already know how to use docker/podman and just want to get something working in an evening, this works too.
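
Once you're inside one of those toolboxes (or any environment with llama.cpp available), that last step really is one command; a minimal sketch with a placeholder model path:

```python
import subprocess

# -m: GGUF path, -c: context size, -ngl: layers offloaded to the GPU,
# --port: port for the local HTTP server.
subprocess.run([
    "llama-server",
    "-m", "/models/your-model-Q6_K.gguf",  # placeholder path
    "-c", "32768",
    "-ngl", "99",
    "--port", "8080",
], check=True)
```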

[−] everlier 26d ago
Owning the GGUF conversion step is good in some circumstances, but running in fp16 is suboptimal for this hardware due to its low-ish bandwidth.

It looks like the context is set to 32k, which is the bare minimum needed for OpenCode with its ~10k initial system prompt. So overall, something like Unsloth's UD Q8 XL or Q6 XL quants frees up a lot of memory and bandwidth, moving into the next tier of usefulness.
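
Rough arithmetic on why the context window eats into that budget; a sketch with made-up but plausible model dimensions (read the real ones out of the GGUF metadata) and an fp16 KV cache:

```python
# KV cache size = 2 (K and V) * layers * kv_heads * head_dim * context * bytes/elem
n_layers, n_kv_heads, head_dim = 48, 8, 128  # illustrative dims, not a specific model
context_len = 32_768
bytes_per_elem = 2                           # fp16 cache; a q8_0 cache roughly halves this
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
print(f"{kv_bytes / 2**30:.1f} GiB")         # ~6.0 GiB at these dims, on top of the weights
```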

[−] thr3at-surfac3 25d ago
The unified memory architecture is what makes Strix Halo interesting for inference workloads: no PCIe bottleneck moving weights between CPU and GPU memory. For anyone getting started, the Unsloth UD quants are the way to go; their imatrix calibration makes a real difference in output quality at Q6/Q8 compared to naive quantization. Curious about the ROCm vs Vulkan situation, though. Has anyone benchmarked the prompt-processing speed difference? For agentic workflows where you're constantly feeding new context, first-token latency matters more than raw tok/s.
[−] IamTC 26d ago
Nice. Thanks for the writeup. My Strix Halo machine is arriving next week. This is handy and helpful.
[−] roenxi 26d ago
I thought the point of something like Strix Halo was to avoid ROCm altogether? AMD's strategy seems to have been to unify GPU/CPU memory and then let people write their own libraries.

The industry looks like it's started to move towards Vulkan. If AMD cards have figured out how to reliably run compute shaders without locking up (never a given in my experience, but that was some time ago), then there shouldn't be a reason to use specialty APIs or software written by AMD outside of drivers.

ROCm was always a bit problematic, but the issue was this: if AMD cards weren't good enough for AMD engineers to reliably support tensor multiplication, then there was no way anyone else was going to be able to do it. It isn't like anyone is confused about multiplying matrices together; it isn't for everyone, but the naive algorithm is a core undergrad topic and the advanced algorithms surely aren't that crazy to implement. It was never a library problem.
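
For reference, the naive algorithm really is just three nested loops; a toy sketch (the hard part has always been making this fast and stable on the GPU, not the math):

```python
def matmul(a, b):
    """Naive O(n^3) matrix multiply: c[i][j] = sum over k of a[i][k] * b[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```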

[−] data-ottawa 25d ago
Linux kernel 7 enables the NPU on Linux. You can use FastFlowLM with Lemonade now.

It is quite slow, but if you want to compute embeddings in the background it’s fine.

I didn’t find it more energy efficient than just using the GPU for time insensitive tasks though.

[−] timmy777 26d ago
Thanks for sharing. However, this missed being a good write-up due to the lack of numbers and data.

I'll give a specific example in my feedback. You said:

`` so far, so good, I was able to play with PyTorch and run Qwen3.6 on llama.cpp with a large context window ``

But there are no numbers, results, or output pastes; no performance figures or timings.

Anyone with RAM can run these models; it will just be impractically slow. The Strix Halo is for decent performance, so sharing your numbers would be valuable here.

Do you mind sharing these? Thanks!

[−] JSR_FDED 26d ago
Perfect. No fluff, just the minimum needed to get things working.
[−] aappleby 26d ago
No benchmarks?