Show HN: 1-Bit Bonsai, the First Commercially Viable 1-Bit LLMs

[−] jjcm 45d ago

1 bit with a FP16 scale factor every 128 bits. Fascinating that this works so well.

I tried a few things with it. Got it driving Cursor, which in itself was impressive - it handled some tool usage. Via cursor I had it generate a few web page tests.

On a monte carlo simulation of pi, it got the logic correct but failed to build an interface to start the test. Requesting changes mostly worked, but left over some symbols which caused things to fail. Required a bit of manual editing.

Tried a Simon Wilson pelican as well - very abstract, not recognizable at all as a bird or a bicycle.

Pictures of the results here: https://x.com/pwnies/status/2039122871604441213

There doesn't seem to be a demo link on their webpage, so here's a llama.cpp running on my local desktop if people want to try it out. I'll keep this running for a couple hours past this post: https://unfarmable-overaffirmatively-euclid.ngrok-free.dev

[−] najarvg 45d ago

Thanks for sharing the link to your instance. Was blazing fast in responding. Tried throwing a few things at it with the following results: 1. Generating an R script to take a city and country name and finding it's lat/long and mapping it using ggmaps. Generated a pretty decent script (could be more optimal but impressive for the model size) with warnings about using geojson if possible 2. Generate a latex script to display the gaussian integral equation - generated a (I think) non-standard version using probability distribution functions instead of the general version but still give it points for that. Gave explanations of the formula, parameters as well as instructions on how to compile the script using BASH etc 3. Generate a latex script to display the euler identity equation - this one it nailed.

Strongly agree that the knowledge density is impressive for the being a 1-bit model with such a small size and blazing fast response

[−] jjcm 45d ago

> Was blazing fast in responding.

I should note this is running on an RTX 6000 pro, so it's probably at the max speed you'll get for "consumer" hardware.

[−] ineedasername 45d ago

consumer hardware?

That... pft. Nevermind, I'm just jealous

[−] jjcm 45d ago

Look it was my present to myself after the Figma IPO (worked there 5 years). If you want to feel less jealous, look at the stock price since then.

[−] Dylan16807 44d ago

Well in this context it's a 5090 with extra unused memory.

[−] abrookewood 45d ago

Holy hell ... that's a monster of a card

[−] najarvg 45d ago

I must add that I also tried out the standard "should I walk or drive to the carwash 100 meters away for washing the car" and it made usual error or suggesting a walk given the distance and health reasons etc. But then this does not claim to be a reasoning model and I did not expect, in the remotest case, for this to be answered correctly. Ever previous generation larger reasoning models struggle with this

[−] jjcm 45d ago

I ran it through a rudimentary thinking harness, and it still failed, fwiw:

    The question is about the best mode of transportation to a car wash located 100 meters away. Since the user is asking for a recommendation, it's important to consider practical factors like distance, time, and convenience.

    Walking is the most convenient and eco-friendly option, especially if the car wash is within a short distance. It avoids the need for any transportation and is ideal for quick errands.
    Driving is also an option, but it involves the time and effort of starting and stopping the car, parking, and navigating to the location.
    Given the proximity of the car wash (100 meters), walking is the most practical and efficient choice. If the user has a preference or if the distance is longer, they can adjust accordingly.

[−] nlaslett 44d ago

And to be fair, you asked about traveling to a location. It just so happens that location is a car wash. You didn't say anything about wanting to wash the car; that's an inference on your part. A reasonable inference based on human experience, sure, but still an inference. You could just as easily want to go to the car wash because that's where you work, or you are meeting somebody there.

[−] monarchwadia 44d ago

Honestly, the fact that we have models that can coherently reason about this problem at all is a technological miracle. And to have it runnable in a 1.15GB memory footprint? Is insanity.

[−] CamperBob2 44d ago

Exactly. It's not that the pig dances poorly, or that the dog's stock tips never seem to pan out. It's the fact that it's happening at all.

[−] monarchwadia 44d ago

But the fact that we have convinced a pig to dance, and trained a dog to provide stock tips? That can be improved upon over time. We've gotten here, haven't we? It really is a miracle, and I'll stick to that opinion.

[−] adityashankar 45d ago

here's the google colab link, https://colab.research.google.com/drive/1EzyAaQ2nwDv_1X0jaC5... since the ngrok like likely got ddosed by the number of individuals coming along

[−] qingcharles 45d ago

Thanks, that works. I only tested the 1.7B. It has that original GPT3 feel to it. Hallucinates like crazy when it doesn't know something. For something that will fit on a GTX1080, though, it's solid.

We're only a couple of years into optimization tech for LLMs. How many other optimizations are we yet to find? Just how small can you make a working LLM that doesn't emit nonsense? With the right math could we have been running LLMs in the 1990s?

[−] jjcm 45d ago

Good call. Right now though traffic is low (1 req per min). With the speed of completion I should be able to handle ~100x that, but if the ngrok link doesn't work defo use the google colab link.

[−] AnthonBerg 45d ago

As someone whose brain was addled by exposure to art history, I strongly support the suggested pelican on bicycle.

[−] andai 45d ago

Thanks. Did you need to use Prism's llama.cpp fork to run this?

[−] rjh29 45d ago

I reminds me of very early ChatGPT with mostly correct answers but some nonsense. Given its speed, it might be interesting to run it through a 'thinking' phase where it double checks its answers and/or use search grounding which would make it significantly more useful.

[−] uf00lme 45d ago

The speed is impressive, I wish it could be setup for similar to speculative decoding

[−] abrookewood 45d ago

man, that is really really quick. What is your desktop setup??? GPU?

[−] pdyc 45d ago

thanks, i tested it, failed in strawberry test. qwen 3.5 0.8B with similar size passes it and is far more usable.

[−] hmokiguess 45d ago

wow that was cooler than I expected, curious to embed this for some lightweight semantic workflows now

[−] tristanMatthias 45d ago

[dead]

[−] nl 45d ago

I ran my custom agentic SQL debugging benchmark against it and I'm impressed.

Results: 8 passed, 0 failed, 17 errored out of 25

That puts it right between Qwen3.5-4B (7/25) and Nanbeige4.1-3B (9/25) for example, but it took only 200 seconds for the whole test. Qwen3.5 took 976 seconds and Nanbeige over 2000 (although both of these were on my 1070 so not quite the same hardware)

Granite 7B 4bit does the test in 199 seconds but only gets 4/25 correct.

See https://sql-benchmark.nicklothian.com/#all-data (click on the cells for the trace of each question)

Errors are bad tool calls (vs failures which is incorrect SQL)

I used @freakynit's runpod (thanks!)

[1] https://news.ycombinator.com/item?id=47597268

[−] simonw 45d ago

You can run this model on an iPhone via the latest update to this Locally AI app: https://apps.apple.com/us/app/locally-ai-local-ai-chat/id674...

For its size (1.2GB download) it's very impressive.

Here's a pelican it drew me running on my phone - the SVG comments are good, the image not so much: https://tools.simonwillison.net/svg-render#%3Csvg%20width%3D...

[−] freakynit 45d ago

Open access for next 5 hours (8GiB model, running on RTX 3090) or until server crashes or the this spot instance gets taken away :) =>

https://ofo1j9j6qh20a8-80.proxy.runpod.net

  ./build/bin/llama-server \
   -m ../Bonsai-8B.gguf \
   -ngl 999 \
   --flash-attn on \
   --host 0.0.0.0 \
   --port 80 \
   --ctx-size 65500 \
   --batch-size 512 \
   --ubatch-size 512 \
   --parallel 5 \
   --cont-batching \
   --threads 8 \
   --threads-batch 8 \
   --cache-type-k q4_0 \
   --cache-type-v q4_0 \
   --log-colors on

The server can serve 5 parallel request, with each request capped at around 13K tokens...

A bit of of benchmarks I did:

1. Input: 700 tokens, ttfs: ~0 second, outputs: 1822 tokens ~190t/s

1. Input: 6400+ tokens, ttfs: ~2 second, outputs: 2012 tokens at ~135t/s

Vram usage was consistently at ~4GiB.

[−] wild_egg 45d ago

Don't have a GPU so tried the CPU option and got 0.6t/s on my old 2018 laptop using their llama.cpp fork.

Then found out they didn't implement AVX2 for their Q1_0_g128 CPU kernel. Added that and getting ~12t/s which isn't shabby for this old machine.

Cool model.

[−] alyxya 45d ago

I expect the trend of large machine learning models to go towards bits rather than operating on floats. There's a lot of inefficiency in floats because typically they're something like normally distributed, which makes the storage and computation with weights inefficient when most values are clustered in a small range. The foundation of neural networks may be rooted in real valued functions, which are simulated with floats, but float operations are just bitwise operations underneath. The only issue is that GPUs operate on floats and standard ML theory works over real numbers.

[−] drob518 45d ago

I’m really curious how this scales up. Bonsai delivers an 8B model in 1.15 GB. How large would a 27B or 35B model be? Would it still retain the accuracy of those large models? If the scaling holds, we could see 100+B models in 64 GB of RAM.

[−] _fw 45d ago

What’s the trade-off? If it’s smaller, faster and more efficient - is it worse performance? A layman here, curious to know.

[−] druskacik 45d ago

The 8B model response to my "Harry Potter knowledge-bench" question is too funny not to share.

> *Fathers of Harry and James Potter*: - Sirius Black is the *father* of *James Potter* (the older brother of Harry).

> - James Potter is *Harry's uncle* and the *older brother* of *Luna Lovegood*.

> - This means *Sirius and James are Harry's uncles*, though they are *father and brother*.

https://pastebin.com/WAAmFKfX

[−] WaterRun 45d ago

Feels a bit like gradually moving back toward analog circuits, step by step. There is less and less need for the precision that digital circuits provide.

[−] fxwin 45d ago

I'm very skeptical of the advantage they're claiming here. The whitepaper [0] only compares these to full precision models, when the more interesting (and probably more meaningful) comparison would be with other quantized models with a similar memory footprint.

Especially considering that these models seem to more or less just be quantized variants of Qwen3 with custom kernels and other inference optimizations (?) rather than fine tuned or trained from scratch with a new architecture, I am very surprised (or suspicious rather) that they didn't do the obvious comparison with a quantized Qwen3.

Their (to my knowledge) new measure/definition of intelligence seems reasonable, but introducing something like this without thorough benchmarking + model comparison is even more of a red flag to me.

[0] https://github.com/PrismML-Eng/Bonsai-demo/blob/main/1-bit-b...

[−] kent8192 45d ago

Oh, boy. This good tool hates my LM Studio... The following message appears when I run Bonsai in my LM Studio. I think my settings have done something wrong. ``


 Failed to load the model
Error loading model.
(Exit code: null). Please check the settings and try loading the model again.

``

[−] Archit3ch 45d ago

Doesn't Jevons paradox dictate larger 1-bit models?

[−] andai 45d ago

Does anyone know how to run this on CPU?

Do I need to build their llama.cpp fork from source?

Looks like they only offer CUDA options in the release page, which I think might support CPU mode but refuses to even run without CUDA installed. Seems a bit odd to me, I thought the whole point was supporting low end devices!

Edit: 30 minutes of C++ compile time later, I got it running. Although it uses 7GB of RAM then hangs at Loading model. I thought this thing was less memory hungry than 4 bit quants?

Edit 2: Got the 4B version running, but at 0.1 tok/s and the output seemed to be nonsensical. For comparison I can run, on the same machine, qwen 3.5 4B model (at 4 bit quant) correctly and about 50x faster.

[−] ycui1986 45d ago

i hope someone do a 100b 1-bit parameter model. that should fit into most 16GB graphics cards. local AI democratized.

[−] Udo 44d ago

This looks very promising. It would be cool if support for Bonsai-style models would land in mainline MLX soon, looking forward to trying it out.

It seems PrismML has implemented a better version of an idea I had a while back: what if we had a 1-bit model where the scale of the weight is determined by its position. The model would have to be trained from the ground up for this though, which is why I never tried it. The interleaved scale factor approach of Bonsai is a much more flexible approach at almost the same cost.

[−] plombe 45d ago

Interesting post. Curious to know how they arrived at intelligence density = Negative log of the model's error rate divided by the model size.

[−] iJohnDoe 44d ago

Tried running the models with the latest LM Studio, llama.cpp, and Ollama. All failed.

https://huggingface.co/prism-ml/Bonsai-8B-gguf

tensor 'token_embd.weight' has invalid ggml type 41. should be in [0, 41) loader knows tensor types 0..40, but the model contains type 41

[−] ide0666 44d ago

Interesting parallel to spiking neural networks — they're essentially 1-bit communication (spike or no spike) with analog membrane potentials. We use 5k Izhikevich neurons for quadruped locomotion control and they beat PPO at the same sample budget. The efficiency argument for 1-bit goes beyond LLMs.

[−] andai 45d ago

The site says 14x less memory usage. I'm a bit confused about that situation. The model file is indeed very small, but on my machine it used roughly the same RAM as 4 bit quants (on CPU).

Though I couldn't get actual English output from it, so maybe something went wrong while running it.

[−] ant28 44d ago

What's up with - log error / model size? I'm not an LLM person, but a ratio of ~1 means a roughly 40% error rate for its size? I don't follow

(math: - log error / model size = 1 <-> error / model size = 1/e )

[−] marak830 45d ago

It's been a hell of a morning for llama heads - first this, then the claude drop and turboquant.

I'm currently setting this one up, if it works well with a custom LoRa ontop ill be able to run two at once for my custom memory management system :D

[−] syntaxing 45d ago

Super interesting, building their llama cpp fork on my Jetson Orin Nano to test this out.

[−] w10-1 44d ago

anecdotal experience report:

They link the (free) locally.ai iPhone app, but the bonsai model doesn't present in the list. You have to get it via settings.

On my ancient SE-2, Siri integration falls down, but the chat in their app runs about half the speed I can read. So far, more than 50% correct, and usable (and seems to speed up as you use it).

I'll try it just to clean up input in a pipeline to another model. I gave it a paragraph from the NYTimes and it did a great job, so it should be good at correcting voice input and keyboard typos.

[−] Nihilartikel 44d ago

Sounds like about the right level of cognition for a talkie toaster!

[−] bilsbie 45d ago

I can’t see how this is possible. You’re losing so much information.

[−] stogot 45d ago

What is the value of a 1 bit? For those that do not kno

[−] wshell 45d ago

What would be a good TTS to run with this?

[−] WhitneyLand 44d ago

A bit misleading to say they take 14x less memory, no one is doing inference with 16-bit models.

[−] keyle 45d ago

Extremely cool!

Can't wait to give it a spin with ollama, if ollama could list it as a model that would be helpful.

[−] ariwilson 45d ago

Very cool and works pretty well!

[−] ggamezar 45d ago

Misses comparison with qwen 3.5, though mentioned qwen 3. Is there a reason why?

[−] yodon 45d ago

Is Bonsai 1 Bit or 1.58 Bit?

[−] vx_r 45d ago

Eagerly waiting for mlx to merge 1bit quantization pr to try this out.

[−] naasking 44d ago

Great! I hope the era of 1-bit LLMs really gets going.

[−] p0u4a 45d ago

How much does training such a model cost?

Show HN: 1-Bit Bonsai, the First Commercially Viable 1-Bit LLMs (prismml.com)

153 comments