Lemonade by AMD: a fast and open source local LLM server using GPU and NPU (lemonade-server.ai)

by AbuAssar 111 comments 572 points

[−] dennemark 43d ago
I have been using lemonade for nearly a year already. On Strix Halo I am using nothing else - although kyuz0's toolboxes are also nice (https://kyuz0.github.io/amd-strix-halo-toolboxes/)

Nowadays you get TTS, STT, and text & image generation, and image editing should also be possible. On top of that it can run via ROCm, Vulkan, or on CPU, and on both GPU and NPU. Quite a lot of options. They have a good and pragmatic pace of development. Really recommend this for AMD hardware!

Edit: The OpenAI-compatible (and I think nowadays also Ollama-compatible) endpoints let me use it in VSCode Copilot as well as e.g. Open WebUI. More options are shown in their docs.
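For reference, pointing a plain OpenAI client at it looks roughly like this. A sketch only: the port, the /api/v1 path, and the model id are assumptions from my setup, so check the docs / web UI for what your install actually exposes.

    from openai import OpenAI

    # Base URL and API key are placeholders; your Lemonade endpoint may differ.
    client = OpenAI(base_url="http://localhost:8000/api/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="Qwen3.5-30B-A3B",  # placeholder id; use whatever the server lists
        messages=[{"role": "user", "content": "Summarize what you can do in one line."}],
    )
    print(resp.choices[0].message.content)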

[−] UncleOxidant 43d ago
How much of a speedup might I get for, say, Qwen3.5-122B if I were to run it with Lemonade on my Strix Halo vs running it with Vulkan via llama.cpp?
[−] sawansri 43d ago
You would get similar performance. Lemonade is designed as a turnkey solution (optimized for AMD hardware) for local AI models. The software helps you manage backends (llama.cpp, FLM, whisper.cpp, stable-diffusion.cpp, etc.) for different GenAI modalities from a single utility.

On the performance side, lemonade comes bundled with ROCm and Vulkan. These are sourced from https://github.com/lemonade-sdk/llamacpp-rocm and https://github.com/ggml-org/llama.cpp/releases respectively.
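Since the endpoints are OpenAI-compatible, you can also see what the single server currently exposes across those backends with a couple of lines. Just a sketch assuming a default local install; the port and path may differ on yours:

    from openai import OpenAI

    # Placeholder base URL; adjust to wherever your Lemonade server is listening.
    client = OpenAI(base_url="http://localhost:8000/api/v1", api_key="not-needed")

    for model in client.models.list():
        print(model.id)  # models from the different backends should show up as ids here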

[−] syntaxing 43d ago
Have you used it with any agents or claw? If so, which model do you run?
[−] dennemark 43d ago
I have two Strix Halo devices at hand: privately a Framework Desktop with 128GB, and at work a 64GB HP notebook. The 64GB machine can load Qwen3.5 30B-A3B; with VSCode it needs a bit of initial prompt processing to initialize all those tools, I guess. But the model fights with the other resources I need, so I am not really using it anymore these days. I do want to experiment with it on my home machine, I just don't work on it much right now.

Lemonade has a Web UI to set the context size and llama.cpp args. You need to set the context to a proper number, or just to 0 so that it uses the model's default. If it's too low, it won't work with agentic coding.

I will try some Claw app, but I first need to research the field a bit. I am using different models in Open WebUI though. GPT 120B is fast, but Qwen3.5 27B is also fine.

[−] cpburns2009 43d ago
Qwen3-Coder-Next works well on my 128GB Framework Desktop. It seems better at coding Python than Qwen3.5 35B-A3B, and it's not too much slower (43 tg/s compared to 55 tg/s at Q4).

27B is supposed to be really good but it's so slow I gave up on it (11-12 tg/s at Q4).

[−] UncleOxidant 43d ago
Agreed. Qwen3-Coder-Next seems like the sweet-spot model on my 128GB Framework Desktop. I seem to get better coding results from it vs the 27B, in addition to it running faster.
[−] vlowther 43d ago
The 8 bit MLX unsloth quant of qwen3-coder-next seems to be a local best on an MBB M5 Max with 128GB memory. With oMLX doing prompt caching I can run two in parallel doing different tasks pretty reasonably. I found that lower quants tend to lose the plot after about 170k tokens in context.
[−] lrvick 43d ago
As another data point.

Running Qwen3.5 122B at 35 t/s as a daily driver using Vulkan llama.cpp on kernel 7.0.0rc5 on a Framework Desktop board (Strix Halo, 128GB).

Also running a pair of AMD AI Pro R9700 cards as my workhorses for zimageturbo, Qwen TTS/ASR, and other accessory functions and experiments.

Finally, I have a Radeon 6900 XT running Qwen3.5 32B at 60+ t/s as a fast all-arounder.

If I buy anything nvidia it will be only for compatibility testing. AMD hardware is 100% the best option now for cost, freedom, and security for home users.

[−] rizzo94 42d ago
[dead]
[−] sensitiveCal 43d ago
Feels like this is sitting somewhere between Ollama and something like LM Studio, but with a stronger focus on being a unified “runtime” rather than just model serving.

The interesting part to me isn’t just local inference, but how much orchestration it’s trying to handle (text, image, audio, etc). That’s usually where things get messy when running models locally.

Curious how much of this is actually abstraction vs just bundling multiple tools together. Also wondering if the AMD/NPU optimizations end up making it less portable compared to something like Ollama in practice.

[−] moconnor 43d ago
Is... is this named because they have a lemon they're trying to make the most of?
[−] zozbot234 43d ago
Note that the NPU models/kernels this uses are proprietary and not available as open source. It would be nice to develop more open support for this hardware.
[−] JSR_FDED 43d ago
I’ve read the website and the news announcement, and I still don’t understand what it is. An alternative to LM Studio? Does it support MLX or metal on Macs? I’m assuming it will optimize things for AMD, but are you at a disadvantage using other GPUs?
[−] rpdillon 43d ago
Been running lemonade for some time on my Strix Halo box. It dispatches out to other backends that they include, like diffusion and llama.cpp. I actually don't like their combined server; what I use instead is their llama.cpp build for ROCm.

https://github.com/lemonade-sdk/llamacpp-rocm

But I'm not doing anything with images or audio. I get about 50 tokens a second with GPT OSS 120B. As others have pointed out, the NPU is used for low-powered, small models that are "always on", so it's not a huge win for the standard chatbot use case.

[−] jmillikin 43d ago
Surprising that the Linux setup instructions for the server component don't include Docker/Podman as an option; it's Snap/PPA for Ubuntu and RPM for Fedora.

Maybe the assumption is that container-oriented users can build their own if given native packages?

[−] steffs 43d ago
The multi-modal bundling is the part that stands out more than the raw inference speed. If you are building an app that needs text generation, image generation, and speech recognition, right now the local setup is three separate services with three different APIs and three different model management stories. Having one server handle all of that behind OpenAI-compatible endpoints is a real quality of life improvement for anyone prototyping locally.

The NPU angle is interesting but probably overstated for most use cases. The discussion in the thread confirms what I would expect: NPUs shine for small always-on models and prefill offloading, not for the chatbot workloads most people care about.

Where this gets genuinely compelling is if AMD can make the combined GPU plus NPU scheduling transparent enough that developers do not need to think about which hardware is running which part of the pipeline. That is not a solved problem on any platform yet, and if Lemonade gets it right for even a subset of workloads, it becomes the default choice on AMD hardware regardless of how it benchmarks against Ollama on pure text generation.
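To make that concrete, the consolidation in question looks roughly like this from the client side. Only a sketch: I have not verified that Lemonade exposes the image and audio routes in exactly this OpenAI shape, and the port, paths, and model ids below are placeholders.

    from openai import OpenAI

    # Placeholder endpoint; one local server fronting all three modalities.
    client = OpenAI(base_url="http://localhost:8000/api/v1", api_key="not-needed")

    # Text generation
    chat = client.chat.completions.create(
        model="Qwen3.5-30B-A3B",  # placeholder id
        messages=[{"role": "user", "content": "Name three fruits."}],
    )
    print(chat.choices[0].message.content)

    # Image generation (assumes an OpenAI-style images endpoint is available)
    img = client.images.generate(model="sd-placeholder", prompt="a lemonade stand, watercolor")
    print(len(img.data), "image(s) returned")

    # Speech-to-text (assumes an OpenAI-style transcription endpoint is available)
    with open("note.wav", "rb") as f:
        stt = client.audio.transcriptions.create(model="whisper-placeholder", file=f)
    print(stt.text)
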
[−] nijave 43d ago
Anyone compare to ollama? I had good success with latest ollama with ROCm 7.4 on 9070 XT a few days ago
[−] cpburns2009 43d ago
Just in case anyone isn't aware. NPUs are low power, slow, and meant for small models.
[−] gnarlouse 43d ago
Maybe it's a language barrier problem, but "by AMD" makes me think it's a project distributed by AMD. Is that actually the case? I'm not seeing any reason to believe it is.
[−] freedomben 43d ago
Neat, they have rpm, deb, and a companion AppImage desktop app[1]! Surprised I wasn't aware of this project before. Definitely going to give it a try.

[1]: https://github.com/lemonade-sdk/lemonade/releases/tag/v10.0....

[−] bravetraveler 43d ago
A fun observation: pulling models sends ~200mbit of progress updates to your browser
[−] pantalaimon 43d ago
It's pretty annoying that you need vendor specific APIs and a large vendor specific stack to do anything with those NPUs.

This way software adoption will be very limited.

[−] syntaxing 43d ago
Wow, this is super interesting. This creates a local “Gemini” front end and all. This is more or less a generative AI aggregator, where it installs multiple services for different gen modes. I’m excited to try this out on my Strix Halo. The biggest issue I had was image and audio gen, so this seems like a great option.
[−] kouunji 43d ago
I’m looking forward to trying this. Currently Strix Halo’s NPU isn’t accessible if you’re running Linux, and previously I don’t think Lemonade was either. If this opens up the NPU, that would be great! Resolute Raccoon is adding NPU support as well.
[−] ilaksh 43d ago
Cool but is there a reason they can't just make PRs for vLLM and llama.cpp? Or have their own forks if they take too long to merge?
[−] metalliqaz 43d ago
my most powerful system is Ryzen+Radeon, so if there are tools that do all the hard work of making AI tools work well on my hardware, I'm all for it. I find it very frustrating to get LLMs, diffusion, etc. working fast on AMD. It's way too much work.
[−] Sparkyte 43d ago
What is the lowest process I can implement this on?
[−] LowLevelKernel 43d ago
Which specific NPUs?
[−] robotswantdata 43d ago
Forget all the vibe-coded slop or Ollama. Lemonade is the real deal and very good, been using it for about a year now.

AMD are doing god's work here

[−] ozgrakkurt 43d ago
For people with an AMD card: this is garbage, ROCm is garbage. Just install llama.cpp and run llama-server with the Vulkan option. This is just some slop plus JS/Electron garbage put on top.
[−] 9dc 43d ago
so... what does it do? i dont get it Lol
[−] luxuryballs 43d ago
this is funny, I’m working on building an AI project called lemonade right now