Flash-MoE: Running a 397B Parameter Model on a Laptop (github.com)

by mft_ 120 comments 398 points

[−] tarruda 55d ago
Note that this is not the only way to run Qwen 3.5 397B on consumer devices; there are excellent ~2.5 BPW quants available that make it viable for 128G devices.

I've had great success (~20 t/s) running it on a M1 Ultra with room for 256k context. Here are some lm-evaluation-harness results I ran against it:

    mmlu: 87.86%

    gpqa diamond: 82.32%

    gsm8k: 86.43%

    ifeval: 75.90%

More details of my experience:

- https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/discu...

- https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/discu...

- https://gist.github.com/simonw/67c754bbc0bc609a6caedee16fef8...

Overall an excellent model to have for offline inference.

[−] Aurornis 55d ago
The method in this link is already using a 2-bit quant. They also reduced the number of experts per token from 10 to 4 which is another layer of quality degradation.

In my experience the 2-bit quants can produce output to short prompts that makes sense but they aren’t useful for doing work with longer sessions.

This project couldn’t even get useful JSON out of the model because it can’t produce the right token for quotes:

> 2-bit quantization produces \name\ instead of "name" in JSON output, making tool calling unreliable.

[−] tarruda 55d ago
I can't say anything about the OP method, but I already tested the smol-IQ2_XS quant (which has 2.46 BPW) with the pi harness. I did not do a very long session because token generation and prompt processing get very slow, but I think I worked up to ~70k context and it maintained a lot of coherence in the session. IIRC the GPQA diamond is supposed to exercise long chains of thought and it scored exceptionally well with 82% (the original BF16 official number is 88%: https://huggingface.co/Qwen/Qwen3.5-397B-A17B).

Note that not all quants are the same at a certain BPW. The smol-IQ2_XS quant I linked is pretty dynamic, with some tensors having q8_0 type, some q6_k and some q4_k (while the majority is iq2_xs). In my testing, this smol-IQ2_XS quant is the best available at this BPW range.
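
As a rough sanity check on the BPW figures in this thread, the file size of a quant can be estimated from the parameter count and the average bits per weight. This is a minimal sketch: the 396.35 B parameter count and 2.46 BPW are taken from the comments and benchmark output here, the function name is made up for illustration, and real GGUF files also carry metadata and mixed tensor types that this ignores.

```python
def quant_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate model size in GiB for a given average bits-per-weight."""
    total_bytes = n_params * bits_per_weight / 8  # bits -> bytes
    return total_bytes / 2**30                    # bytes -> GiB


if __name__ == "__main__":
    n_params = 396.35e9  # parameter count as reported by llama-bench in this thread

    # ~2.46 BPW dynamic quant versus the 16-bit original
    print(f"2.46 BPW: {quant_size_gib(n_params, 2.46):.1f} GiB")
    print(f"BF16:     {quant_size_gib(n_params, 16):.1f} GiB")
```

The 2.46 BPW estimate lands within a fraction of a GiB of the 113.41 GiB that llama-bench reports for this quant, which is why it fits on a 128G machine while the 16-bit weights do not.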

Eventually I might try a more practical eval such as terminal bench.

[−] Aurornis 55d ago

> I did not do a very long session

This is always the problem with the 2-bit and even 3-bit quants: They look promising in short sessions but then you try to do real work and realize they’re a waste of time.

Running a smaller dense model like 27B produces better results than 2-bit quants of larger models in my experience.

[−] amelius 55d ago

> This is always the problem with the 2-bit and even 3-bit quants: They look promising in short sessions but then you try to do real work and realize they’re a waste of time.

It would be nice to see a scientific assessment of that statement.

[−] singpolyma3 55d ago
Lots of people seem to use 4bit. Do you think that's worth it vs a smaller model in some cases?
[−] Aurornis 55d ago
4 bit is as low as I like to go. There are KLD and perplexity tests that compare quantizations where you can see the curve of degradation, but perplexity and KLD numbers can be misleading compared to real world use where small errors compound over long sessions.
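
For readers unfamiliar with the two metrics mentioned above, here is a minimal sketch of how KL divergence and perplexity are computed. The logits are made-up numbers for a single token position, not measurements from any real quant; actual comparisons average these quantities over many thousands of positions in a test corpus.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats, with p as the reference (unquantized) distribution."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the observed tokens."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    reference = softmax([4.0, 2.0, 0.5, 0.1])  # hypothetical full-precision logits
    quantized = softmax([3.6, 2.3, 0.7, 0.1])  # hypothetical logits after quantization
    print("KLD:", kl_divergence(reference, quantized))
    print("PPL:", perplexity([0.5, 0.25, 0.125]))
```

Both numbers are per-token averages, which is one reason they can look fine while a long session still drifts: they say nothing about how an early wrong token changes everything generated after it.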

In my anecdotal experience I’ve been happier with Q6 and dealing with the tradeoffs that come with it over Q4 for Qwen3.5 27B.

[−] hnfong 55d ago
Generally the perplexity charts indicate that quality drops significantly below 4-bit, so in that sense 4-bit is the sweet spot if you're resource constrained.
[−] simonw 55d ago
The project doesn't just use 2-bit - that was one of the formats they tried, but when that didn't give good tool calls they switched to 4-bit.
[−] tarruda 55d ago
In my case the 2.46 BPW quant has been working flawlessly for tool calling, so I don't think 2-bit was the culprit for JSON failing.

They did reduce the number of experts, so maybe that was it?

[−] stuaxo 54d ago
There's at least one project they could use to repair the JSON, and another that takes a different approach.
[−] arjie 55d ago
What's the tok/s you get these days? Does it actually work well when you use more of that context?

By the way, it's been a long time since I last saw your username. You're the guy who launched Neovim! Boy what a success. Definitely the Kickstarter/Bountysource I've been a tiny part of that had the best outcome. I use it every day.

[−] tarruda 55d ago

> What's the tok/s you get these days?

I ran llama-bench a couple of weeks ago when there was a big speed improvement on llama.cpp (https://github.com/ggml-org/llama.cpp/pull/20361#issuecommen...):

    % llama-bench -m ~/ml-models/huggingface/ubergarm/Qwen3.5-397B-A17B-GGUF/smol-IQ2_XS/Qwen3.5-397B-A17B-smol-IQ2_XS-00001-of-00004.gguf -fa 1 -t 1 -ngl 99 -b 2048 -ub 2048 -d 0,10000,20000,30000,40000,50000,60000,70000,80000,90000,100000,150000,200000,250000
    ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
    ggml_metal_library_init: using embedded metal library
    ggml_metal_library_init: loaded in 0.008 sec
    ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
    ggml_metal_device_init: GPU name:   MTL0
    ggml_metal_device_init: GPU family: MTLGPUFamilyApple7  (1007)
    ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
    ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
    ggml_metal_device_init: simdgroup reduction   = true
    ggml_metal_device_init: simdgroup matrix mul. = true
    ggml_metal_device_init: has unified memory    = true
    ggml_metal_device_init: has bfloat            = true
    ggml_metal_device_init: has tensor            = false
    ggml_metal_device_init: use residency sets    = true
    ggml_metal_device_init: use shared buffers    = true
    ggml_metal_device_init: recommendedMaxWorkingSetSize  = 134217.73 MB
    | model                          |       size |     params | backend    | threads | n_ubatch | fa |            test |                  t/s |
    | ------------------------------ | ---------: | ---------: | ---------- | ------: | -------: | -: | --------------: | -------------------: |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |           pp512 |        189.67 ± 1.98 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |           tg128 |         19.98 ± 0.01 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d10000 |        168.92 ± 0.55 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d10000 |         18.93 ± 0.02 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d20000 |        152.42 ± 0.22 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d20000 |         17.87 ± 0.01 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d30000 |        139.37 ± 0.28 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d30000 |         17.12 ± 0.01 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d40000 |        128.38 ± 0.33 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d40000 |         16.38 ± 0.00 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d50000 |        118.07 ± 0.55 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d50000 |         15.66 ± 0.00 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d60000 |        108.44 ± 0.38 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d60000 |         14.98 ± 0.01 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d70000 |         98.85 ± 0.18 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d70000 |         14.36 ± 0.00 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d80000 |         91.39 ± 0.49 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d80000 |         13.84 ± 0.00 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d90000 |         85.76 ± 0.24 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d90000 |         13.30 ± 0.00 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | pp512 @ d100000 |         80.19 ± 0.83 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | tg128 @ d100000 |         12.82 ± 0.00 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | pp512 @ d150000 |         54.46 ± 0.33 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | tg128 @ d150000 |         10.17 ± 0.09 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | pp512 @ d200000 |         47.05 ± 0.15 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | tg128 @ d200000 |          9.04 ± 0.02 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | pp512 @ d250000 |         40.71 ± 0.26 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | tg128 @ d250000 |          8.01 ± 0.02 |

    build: d28961d81 (8299)

So it starts at 20 tps tg and 190 tps pp with empty context and ends at 8 tps tg and 40 tps pp with 250k prefill.
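
To put the prompt-processing numbers in perspective, the table can be turned into a rough estimate of how long it takes to ingest a full 250k-token prompt. This is a sketch under a loud assumption: it treats each `pp512 @ dN` rate as the instantaneous prefill speed at depth N and integrates seconds-per-token with the trapezoid rule. The depths and rates are copied from the table above; the function name is made up.

```python
# Depths (tokens already in context) and pp512 rates (tokens/s) from the table above.
DEPTHS = [0, 10_000, 20_000, 30_000, 40_000, 50_000, 60_000, 70_000,
          80_000, 90_000, 100_000, 150_000, 200_000, 250_000]
PP_RATES = [189.67, 168.92, 152.42, 139.37, 128.38, 118.07, 108.44, 98.85,
            91.39, 85.76, 80.19, 54.46, 47.05, 40.71]


def estimate_prefill_seconds(depths, rates):
    """Integrate seconds-per-token over context depth with the trapezoid rule."""
    total = 0.0
    for i in range(len(depths) - 1):
        width = depths[i + 1] - depths[i]
        # Average the per-token cost at both ends of the segment.
        avg_cost = (1.0 / rates[i] + 1.0 / rates[i + 1]) / 2.0
        total += width * avg_cost
    return total


if __name__ == "__main__":
    seconds = estimate_prefill_seconds(DEPTHS, PP_RATES)
    print(f"~{seconds / 60:.0f} minutes to prefill {DEPTHS[-1]} tokens")
```

Under these assumptions the estimate comes out on the order of an hour, which is worth keeping in mind when reading "room for 256k context" as a practical claim rather than a capacity one.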

I suspect that there are still a lot of optimizations to be implemented for Qwen 3.5 on llama.cpp; I wouldn't be surprised if it reaches 25 tps in a few months.

> You're the guy who launched Neovim!

That's me ;D

> I use it every day.

So do I for the past 12 years! Though I admit in the past year I greatly reduced the amount of code I write by hand :/

[−] hnfong 55d ago
Apologies to others for the offtopic comment, but thank you so much for neovim. I started using Vim 25 years ago and I almost don't know how to type without a proper Vi-based editor. I don't write as much code these days, but I write other stuff (which definitely needs to be mostly hand written) in neovim and I feel so grateful that this tool is still receiving love and getting new updates.
[−] tarruda 55d ago

> in neovim and I feel so grateful that this tool is still receiving love and getting new updates.

@justinmk deserves the credit for this!

[−] terhechte 55d ago
Thank you for NeoVim! I also use it every day, mostly for thinking / text / markdown though these days.

Have you compared against MLX? Sometimes I’m getting much faster responses but it feels like the quality is worse (eg tool calls not working, etc)

[−] tarruda 55d ago

> Have you compared against MLX?

I don't think MLX supports similar 2-bit quants, so I never tried 397B with MLX.

However I did try 4-bit MLX with other Qwen 3.5 models and yes it is significantly faster. I still prefer llama.cpp due to it being an all-in-one package:

- SOTA dynamic quants (especially ik_llama.cpp)

- amazing web ui with MCP support

- anthropic/openai compatible endpoints (means it can be used with virtually any harness)

- JSON constrained output which basically ensures tool call correctness.

- routing mode

[−] arjie 55d ago
That's surprisingly fast. Thanks for sharing.
[−] outlog 55d ago
What is the power usage? Maybe https://www.coconut-flavour.com/coconutbattery/ can give you an estimate?
[−] tarruda 55d ago
I don't think I've ever seen the M1 ultra GPU exceed 80w in asitop.

Update: I just did a quick asitop test while inferencing and the GPU power was averaging at 53.55

[−] iwontberude 55d ago
Thank you, I have been using way too many credits for my personal automation.
[−] woile 55d ago
Just a single m1 ultra?
[−] tarruda 55d ago
Yes. Note that the only reason I acquired this device was to run LLMs, so I can dedicate its whole RAM to it. Probably not viable on a 128G device that you are actively using for other things.
[−] Aurornis 55d ago
Reading the details, he is using 2-bit quantization and reduced the number of experts per token from 10 down to 4 to get 5 tokens/sec. Cool proof of concept but it’s far from the quality and performance of the 397B model as normally used. Dropping the number of experts is particularly misleading.

This is some interesting work, but applying such extreme measures to LLMs to get them to run severely degrades quality. I know he claims negligible quality loss, but in my experience 2-bit quantizations are completely useless for real work. You can get them to respond to prompts, but they lose their intelligence and will go around in circles.

He also shows 5-6 tokens per second. Again that’s impressive for a large model on limited hardware but it’s very slow. Between the severely degraded model abilities and the extremely slow output the 397B result should be considered an attempt at proving something can technically run, not evidence that it can run well and produce output you’d expect from a 397B model.

He even mentions the obvious problems with his changes:

> 2-bit quantization produces \name\ instead of "name" in JSON output, making tool calling unreliable.

So right out of the gate this isn’t useful if you want to do anything with it. He could have tried smaller models or less aggressive quantization to get actually useful output from the model, but it wouldn’t look as impressive. It’s honestly getting kind of exhausting to read all of these AI-coded (admitted in the link) and AI-written papers made more for resume building. It would have been interesting to see this work applied to running a useful model that hadn’t been lobotomized instead of applying tricks to get an impressive headline but useless output.

[−] jllyhill 55d ago
To be honest, I'm getting tired of the "laptop" in every one of these clickbait titles turning out to be a $3000 MacBook. Sure, it's impressive to achieve this degree of LLM compression, but I really don't like that the title implies local LLMs are becoming viable for the average person when the actual hardware is out of reach for 99%.
[−] homarp 55d ago
[−] zozbot234 55d ago
The github page mentions that a naïve mmap approach is bottlenecked by per-page overhead. Can this be mitigated by setting up explicit "huge" pages? (2M using the CONT PTE feature if the "native" page size is 16k; 32M using a PMD level block mapping; or 1G using the CONT PMD feature.) Does macOS support this out of the box? Alternatively, one might use a simple mmap and then something like posix_fadvise to set up prefetching of the data.
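
The "simple mmap plus a prefetch hint" alternative can be sketched as follows. This only illustrates the hinting calls, not the project's actual I/O path, and it does not address huge pages. `os.posix_fadvise` and `mmap.madvise` are real standard-library calls, but their availability is platform-dependent (for example, `posix_fadvise` is not exposed on macOS), so both are guarded; `map_with_readahead` is a made-up helper name. Whether such hints actually remove the per-page overhead is not something this thread establishes.

```python
import mmap
import os


def map_with_readahead(path: str) -> mmap.mmap:
    """Map a non-empty file read-only and ask the kernel to prefetch its pages."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        # File-level hint: we expect to read this byte range soon.
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, 0, size, os.POSIX_FADV_WILLNEED)
        mapped = mmap.mmap(fd, size, access=mmap.ACCESS_READ)
    finally:
        # mmap keeps its own duplicate of the descriptor, so ours can be closed.
        os.close(fd)
    # Mapping-level hint: start paging the mapped range in ahead of first touch.
    if hasattr(mmap, "MADV_WILLNEED"):
        mapped.madvise(mmap.MADV_WILLNEED)
    return mapped
```

Both hints are advisory: the kernel is free to ignore them, and page faults are still taken at the native page size, so this is at best a way to overlap I/O with compute.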
[−] justacatbot 55d ago
The quality degradation at 2-bit is a real issue. For actual work tasks, a well-tuned 30B at 4-bit usually outperforms a 70B+ at 2-bit in my experience. The expert reduction on top of that compounds things - you're essentially running a fairly different model. Still interesting to see the upper bound of what consumer hardware can attempt, even if the result isn't production-ready.
[−] bertili 55d ago
Very impressive! I wonder if there is a similar path for Linux using system memory instead of SSD? Hell, maybe even a case for the return of some kind of ROMs of weights?
[−] andai 55d ago

> Metal Compute Shaders — Hand-written Metal kernels

Hand written... by GPT? ;)

[−] RandyOrion 54d ago
This project shows an interesting automated search for engineering solutions, which I would like to see more of.

The experience of utilizing tiered storage (gpu vram, ram, and ssd) is generally poor for a lot of LLM inference engines out there, e.g., llama.cpp, sglang, vllm, etc.

My own experience shows that both weight and KV cache offload to ram on sglang and vllm are unavailable or unusable. Copying extra parameters from the documentation and adding them to already working commands results in errors. Llama.cpp does support weight offload, but the experience is not pleasant: low pcie (gpu <-> ram) utilization, low gpu utilization, and really low tokens per second.

[−] mkw 55d ago
TLDR I took a stab at leveraging Dan's work and making it more practical:

https://github.com/matt-k-wong/mlx-flash

2 bit quantization lobotomizes the model but is impressive nonetheless! Maybe one day we'll be able to have intelligent 2 bit quants... I wonder.

My version supports 4-bit quantization, hybrid streaming (disk + RAM), and arbitrary model compatibility. It has been tested on Mamba2 and sets up the framework for LM Studio integration.

I leveraged this work (credit to Danveloper) and am in the middle of making this work on more practical models and quants. It still uses flash streaming, but with a control knob so you can choose how much or how little RAM to use. In the most extreme case it uses as little RAM as possible but is very slow; in the balanced case you use some RAM and it's much faster.

I designed it around the intelligence-dense Nemotron 3 Nano 30B and Nemotron Cascade 2 30B models (which are smaller but have higher intelligence density), and it can run on low-end 16GB machines, though you can run arbitrarily large models on larger machines (designed for very low end, but capable of high end).
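
One way to picture a RAM "control knob" of this kind is a byte-budgeted LRU cache in front of the disk reads. The sketch below is purely illustrative and is not taken from mlx-flash: the class name, the `load_from_disk` callback, and the idea of caching whole expert blobs are all assumptions made for the example.

```python
from collections import OrderedDict
from typing import Callable


class BudgetedExpertCache:
    """LRU cache of expert weight blobs, bounded by a RAM budget in bytes."""

    def __init__(self, budget_bytes: int, load_from_disk: Callable[[int], bytes]):
        self.budget_bytes = budget_bytes
        self._load = load_from_disk          # hypothetical SSD read for one expert
        self._cache: "OrderedDict[int, bytes]" = OrderedDict()
        self._used = 0
        self.hits = 0
        self.misses = 0

    def get(self, expert_id: int) -> bytes:
        if expert_id in self._cache:
            self._cache.move_to_end(expert_id)  # mark as most recently used
            self.hits += 1
            return self._cache[expert_id]
        self.misses += 1
        blob = self._load(expert_id)
        if len(blob) <= self.budget_bytes:      # never cache what cannot fit at all
            self._cache[expert_id] = blob
            self._used += len(blob)
            while self._used > self.budget_bytes:
                _, evicted = self._cache.popitem(last=False)  # drop least recently used
                self._used -= len(evicted)
        return blob
```

With a budget of zero every access streams from disk (the slow, minimal-RAM extreme), and raising the budget trades memory for fewer reads. Whether such a cache helps in practice depends on how it interacts with the OS page cache.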

[−] JSR_FDED 55d ago
This is a very impressive result. If I understand correctly the bottleneck is the SSD in this architecture - the author seems to get almost 15GB/s - but I seem to remember the max b/w was about 8GB/s. What am I missing?
[−] druide67 53d ago
The finding about removing the 9.8 GB Metal LRU cache for a 38% speedup is the most interesting part. Same lesson as PostgreSQL's advice against application-level buffer pools that compete with the OS page cache: the hardware memory compressor doing 130K decompressions/sec was pure overhead.

Curious about the remaining gap: 5.7 tok/s vs 18.6 theoretical (from SSD bandwidth). Is the ~70% overhead mostly GPU compute on non-expert layers (attention, norm), or is there I/O scheduling room left?
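
The "~70%" figure follows directly from the two rates quoted above, and the same numbers give a per-token time budget. This sketch uses only the 5.7 and 18.6 tok/s values from the comment; the split into "I/O floor" and "everything else" assumes SSD reads and compute are not overlapped, which may not match the real pipeline.

```python
def throughput_gap(measured_tps: float, theoretical_tps: float) -> dict:
    """Break a throughput shortfall down into per-token time components."""
    measured_s = 1.0 / measured_tps       # wall time per token actually observed
    io_floor_s = 1.0 / theoretical_tps    # time per token if SSD bandwidth were the only cost
    return {
        "shortfall_fraction": 1.0 - measured_tps / theoretical_tps,
        "measured_ms_per_token": measured_s * 1000.0,
        "io_floor_ms_per_token": io_floor_s * 1000.0,
        # Assumes no overlap between I/O and compute (a simplification).
        "unexplained_ms_per_token": (measured_s - io_floor_s) * 1000.0,
    }


if __name__ == "__main__":
    for name, value in throughput_gap(5.7, 18.6).items():
        print(f"{name}: {value:.3f}")
```

Under the no-overlap assumption, roughly 120 ms of each ~175 ms token is spent on something other than the bandwidth-limited read, which is the portion the question is asking about.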

[−] spwa4 55d ago
Does this mean that it should be possible to load up a system with ~10 (seems to me at least the number of active experts) SSDs to get 40 tok/s even on truly gigantic models?
[−] shubhamintech 54d ago
4.4 tok/s with reliable structured output is a solid local benchmark, although the question is whether SSD streaming introduces per-token latency variance that messes up tool call parsing downstream. The gap between 400 GB/s unified memory bandwidth and 17.5 GB/s SSD reads means you're in the hot path pretty much every time an expert isn't cached.
[−] qiine 55d ago
It seems strange to me that the only way to use an llm is to fit it entirely in volatile memory from the get go.

To render movies we happily wait for the computer to calculate how lights bounce around, for hours even days.

So why not do the same with AIs? Ask big questions of big models and get the answer to the universe tomorrow?

[−] maxloh 55d ago
Can you add a license to the repo? Legally we couldn't run any code without a license attached to it.
[−] haomingkoo 55d ago
Really interesting approach. Curious how the 2-bit quantization affects the model's reasoning ability on longer chains of thought vs shorter prompts. The benchmarks look solid but real-world usage seems like a different story based on the comments here.
[−] 999900000999 54d ago
If I have a dedicated GPU with 12GB of VRAM and 32 GB of system ram, can I combine the two for LLMs?

So far ollama will use the 12GB and then give up.

[−] m-hodges 55d ago
As frontier models get closer and closer to consumer hardware, what’s the moat for the API-driven $trillion labs?
[−] lostmsu 55d ago
How large is the KV cache?
[−] 383toast 55d ago
yeah 4tok/s is kinda unusable though
[−] breakingcups 55d ago

> No Python. No frameworks. Just C, Objective-C, and hand-tuned Metal shaders.

Welp, I know where those tokens came from.

[−] mannyv 55d ago
Everyone is focused on the bad 2 bit result but who cares? He says don’t use it because it’s bad.
[−] pdyc 55d ago
impressive, i wish someone would take a stab at using this technique on mobile GPUs; even if it does not use storage it would still be a win. I am running llama.cpp on adreno 830 with OpenCL and i am getting a pathetic 2-3 t/s for output tokens
[−] matchbox 55d ago
this is awesome Dan!
[−] utopiah 54d ago
I honestly don't get "why" despite having done similar things myself, e.g. run a model on a VR headset itself.

I mean I've done it because I could, so I imagine others are doing that too. But then... once it's done I don't actually use it. I ticked that box, but when even SOTA models aren't that useful I have a hard time imagining actual positive use cases (... not like offline spam or naughty chat in the woods) that would benefit from such technically impressive demos.

[−] NamlchakKhandro 54d ago
lmao 4.4 tokens per second is hilariously and utterly bad.

anyone suggesting that it's a reasonable speed should find another career
