Quantization from the Ground Up (ngrok.com)

by samwho 59 comments 351 points

[−] msbhogavi 51d ago
The hardware situation is way better than you think, and quantization is a huge part of why.

Take Qwen 3.5 27B, which is a solid coding model. At FP16 it needs 54GB of VRAM. Nobody's running that on consumer hardware. At Q4_K_M quantization, it needs 16GB. A used RTX 3090 has 24GB and goes for about $900. That model runs locally with room for context.

For 14B coding models at Q4, you're looking at about 10GB. A used RTX 3060 12GB handles that for under $270.

The gap between "needs a datacenter" and "runs on my desk" is almost entirely quantization. A 27B model at Q4 loses surprisingly little quality for most coding tasks. It's not free, but it's not an RTX 7090 either. A used 3090 is probably the most recommended card in the local LLM community right now, and for good reason.
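The arithmetic behind those figures is just parameter count times bits per weight. A back-of-the-envelope sketch (the bits-per-weight numbers are approximate; Q4_K_M averages closer to ~4.8 bpw than a flat 4 because of per-block scales, and KV cache/context is on top of this):

```python
# Rough weight-memory estimate at different quantization levels.
# Illustrative only: real GGUF files carry per-block scales and
# metadata, and the KV cache needs additional VRAM.

def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Decimal GB needed to hold the weights alone."""
    return params_billions * bits_per_weight / 8

for label, bpw in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.8)]:
    print(f"27B @ {label:7s} ~{weight_gb(27, bpw):.1f} GB")
```

Which reproduces the numbers above: ~54 GB at FP16, ~16 GB at Q4_K_M.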

[−] rdos 51d ago
14B models, even at Q4, aren't realistic for coding on a single 12GB RTX 3060. Token speed is too slow; after all, they're dense models. You aren't getting a good MoE model under 30B. What you can do really well is OCR, STT, and TTS, and for LLMs the good use cases are classification, summarization, and extraction with <10B models.
[−] suprjami 51d ago
Dual 3060s run 24B Q6 and 32B Q4 at ~15 tok/sec. That's fast enough to be usable.

Add a third one and you can run Qwen 3.5 27B Q6 with 128k ctx. For less than the price of a 3090.

[−] rdos 38d ago
Sure, two 3060s can pull usable performance from a usable LLM, but a single one can't (yet).

> 3x RTX 3060 for less than the price of a 3090

Interesting, here it's around the same: 200-250€ for a used 12GB 3060 and 600-800€ for a used 3090.

[−] faangguyindia 51d ago
You are better off just buying their coding plan.

Running LLMs locally makes no sense whatsoever.

[−] oompydoompy74 51d ago
Remaining dependent on proprietary frontier models that you can only access via an API makes no sense whatsoever. My hope is that the future is open weight models running on local hardware.
[−] AbanoubRodolf 51d ago
[flagged]
[−] epaulson 51d ago
I was a little confused by this part:

"This is what's happening to the parameters of models when they're quantized down to sizes that are possible to run on your laptop. Instead of floats, small integers are what get stored and loaded into memory. When the time comes to use the quantized values, to generate an answer to a question for example, the values are dequantized on the fly. You might think this sounds slower, but we'll see later on that this actually ends up being faster as well as smaller."

I thought that most GPUs supported floating point math in these quantized formats, like they can natively do math on a float4 number (that's maybe packed, 2 float4s into a single byte, or more probably 16 float4s in an 8-byte array, or maybe something even bigger).

Am I getting this wrong - is it instead the GPU pulls in the quantized numbers and then converts them back into 32-bit or 64-bit float to actually run through the ALUs on the GPU? (and the memory bandwidth savings make up for the extra work to convert them back into 32 bit numbers once you get them onto the GPU?)

Or is it some weird hybrid, like there is native support for float8 and Bfloat16, but if you want to use float2 you have to convert it to float4 or something the hardware can work with.

I am confused what actually happens in the vectorized ADD and MULT instructions in the GPU with these quantized numbers.
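The store-small, expand-at-compute-time mechanism the quoted paragraph describes can be sketched in a few lines. This is an illustrative pure-Python version of blockwise symmetric int4 quantization, not any particular GPU kernel; the block size and names are made up:

```python
# What sits in memory per block: one float scale + small integers.
# The floats are reconstructed ("dequantized") only at compute time.

BLOCK = 8

def quantize_block(ws):
    scale = max(abs(w) for w in ws) / 7 or 1.0   # int4 symmetric range: -7..7
    q = [round(w / scale) for w in ws]           # these ints are what get stored
    return scale, q

def dequantize_block(scale, q):
    return [scale * v for v in q]                # done on the fly before the math

weights = [0.12, -0.53, 0.97, 0.01, -0.88, 0.44, -0.02, 0.31]
scale, q = quantize_block(weights)
approx = dequantize_block(scale, q)              # close to weights, within scale/2
```

The bandwidth story is that `q` (plus one scale) is what crosses the memory bus; whether the expansion happens in dedicated hardware or in shader code is exactly the question above.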

[−] aarondf 52d ago
My word... samwho is doing some of the best technical explainers on the internet right now.
[−] armcat 52d ago
This is beautifully written and visualised, well done! The KL divergence comparisons between the original model and the different quantisation levels are on point. I'm not sure people realise how powerful quantisation methods are and what they've done for democratising local AI. And there are some great players out there, like Unsloth and Pruna.
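For anyone curious what that comparison actually computes: treating the original model's next-token distribution as P and the quantized model's as Q, KL(P‖Q) = Σ p·log(p/q) measures how far the quantized outputs drift. A tiny sketch with made-up distributions:

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_fp16 = [0.70, 0.20, 0.08, 0.02]   # hypothetical next-token probabilities
q_int4 = [0.65, 0.22, 0.10, 0.03]   # same model after 4-bit quantization

print(f"KL(P||Q) = {kl_divergence(p_fp16, q_int4):.4f} nats")
```

Zero means the quantized model predicts identically; the article's comparisons aggregate this over many tokens.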
[−] gavinray 51d ago
I read the entire thing top to bottom; as a visual learner, I found it superb.

One nitpick -- in the "asymmetric quantization" code, shouldn't "zero" be called "midpoint" or similar? Or is "zero" an accepted mathematics term in this domain?
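For what it's worth, "zero point" is the accepted term in the quantization literature: it's the integer that real 0.0 maps to under the affine mapping, not the midpoint of the range (the two only coincide when the float range is symmetric). A minimal sketch, assuming a uint8 target range; the function names here are mine, not the article's:

```python
def asymmetric_params(xmin, xmax, qmin=0, qmax=255):
    """Scale and zero point for mapping [xmin, xmax] onto [qmin, qmax]."""
    scale = (xmax - xmin) / (qmax - qmin)
    zero = round(qmin - xmin / scale)   # the integer that encodes real 0.0
    return scale, zero

def quantize(x, scale, zero, qmin=0, qmax=255):
    return min(qmax, max(qmin, round(x / scale) + zero))

# Asymmetric range [-1.0, 3.0]: the zero point lands at 64, not at the
# midpoint 128, and real 0.0 is exactly representable.
scale, zero = asymmetric_params(-1.0, 3.0)
```

Making real zero exactly representable is the whole reason for the zero point: things like ReLU outputs and zero padding then survive quantization without error.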

[−] mrsilencedogood 52d ago
Quantization is important for me because it's the only way out I can see for a future of programming that doesn't involve going through a giant bigco who can run, as the article says, a machine with 2TB of memory. And not just any memory, but my understanding is that for the model to be performant, it has to fit in VRAM to boot.

This comes as the latest concern of mine in a long line around "how software gets written" remaining free-as-in-freedom. I've always been really uneasy about how reliant many programming languages were on Jetbrains editors, only vaguely comforted by their "open-core" offering, which naturally only existed for languages with strong OSS competition for IDEs (so... java and python, really). "Intellisense" seemed very expensive to implement and was hugely helpful in writing programs without stopping every 4 seconds to look up whether removing whitespace at the end of a line is trim, strip, or something else in this language. I was naturally pleased to see language servers take off, even if it was much to my chagrin that it came from Microsoft, who clearly was out of open standards to EEE and decided to speed up the process by making some new ones.

Now LLMs are the next big worry of mine. It seems pretty bad for free and open software if the "2-person project, funded indirectly by the welfare state of a nordic or eastern-european nation" model that drives ridiculously important core libre/OSS libraries is now even less able to compete with trillion-dollar corporations.

Open-weight, quantized, but still __good__ models seem like the only way out. I remain somewhat hopeful just from how far local models have come - they're significantly more usable than they were a year ago, and we've got more tools like LM Studio etc making running them easy. But there's still a good way to go.

I'll be sad if a "programming laptop" ends up going from "literally anything that can run debian" to "yeah you need an RTX 7090, 128GB of VRAM, and the 2kW wearable power supply backpack addon at a minimum".

[−] AIorNot 51d ago
Man, what a brilliant technical essay. Hats off to the writer for clarity and visualizations.
[−] steve_adams_86 51d ago
Sam's previous posts are well worth digging up too. This one is outstanding, but they're all good. I really enjoyed this and learned a lot.

I'm a bit envious of his job. Learning to teach others, and building out such cool interactive, visual documents to do it? He makes it look easier than it is, of course. A lot of effort and imagination went into this, and I'm sure it wasn't a walk in the park. Still, it seems so gratifying.

[−] muskstinks 51d ago
2-bit is probably slower because it clashes with register sizes and how data is read in blocks. There's no additional benefit because the architecture doesn't read 2 bits at a time but probably a minimum of 4, and then it clashes with utilization.

Really good visualizations overall.
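The alignment issue being described is visible even in pure Python: 2-bit values don't align with byte addressing, so four of them get packed per byte and every single access pays shift-and-mask work to unpack. This is just an illustration of that overhead, not how any specific GPU kernel lays out its data:

```python
# Pack 2-bit values (0..3) four-per-byte, little end first.

def pack_2bit(vals):
    out = bytearray()
    for i in range(0, len(vals), 4):
        b = 0
        for j, v in enumerate(vals[i:i + 4]):
            b |= (v & 0b11) << (2 * j)    # each value costs a shift + OR
        out.append(b)
    return bytes(out)

def unpack_2bit(data, n):
    # Every read is a byte fetch + shift + mask, paid per value.
    return [(data[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(n)]

vals = [0, 3, 1, 2, 2, 0]
packed = pack_2bit(vals)                  # 6 values fit in 2 bytes
```

On real hardware the equivalent shift/mask sequences compete with the actual math for execution resources, which is one plausible reason sub-4-bit formats stop paying off.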

[−] cphoover 52d ago
A 5-10% accuracy drop is the difference between a usable model and an unusable one.
[−] gurachek 51d ago
The float comparison slider is great.

One thing from practical experience: the quality gap between model sizes shows up in a way benchmarks don't capture. I have a system where a smaller model generates plans and a larger model can override them. On any single output they look comparable. The difference shows up 3-4 steps later: the small model makes a decision that sounds reasonable but compounds into a bad plan. Perplexity won't catch that, and KL divergence won't either; they both measure one prediction at a time.

[−] stuxnet79 51d ago
What is the best way to archive a JS-heavy site like this? I reviewed the OP's GitHub and they haven't open-sourced these visualizations, probably because they're tied to his employer.
[−] krackers 51d ago
Most (all?) of this holds for quantizing convnets too. If you're looking for an easy exercise, you can play around with quantizing ResNet-50 or something and plotting the layer activations.
[−] fcpk 52d ago
Something I have been wondering about is doing regressive, layer-specific quantization based on large test sets, i.e. reducing precision very specifically in the layers that don't improve general quality.
[−] nazgulsenpai 51d ago
This isn't just a good explainer of quantization, it's a good overview of LLMs in general.
[−] aeve890 51d ago
Oh, _that_ quantization.
[−] maxilevi 51d ago
Since when is ngrok doing AI?
[−] eddie-wang 51d ago
[dead]
[−] diablevv 51d ago
[dead]
[−] leontloveless 51d ago
[dead]
[−] maltyxxx 51d ago
[flagged]
[−] vicchenai 52d ago
[dead]
[−] myylogic 51d ago
[dead]
[−] hikaru_ai 51d ago
[dead]
[−] openclaw01 51d ago
[dead]