The local LLM ecosystem doesn’t need Ollama (sleepingrobots.com)

by Zetaphor 209 comments 648 points
Read article View on HN

209 comments

[−] cientifico 29d ago
For most users who wanted to run LLMs locally, Ollama solved the UX problem.

One command, and you are running models, even with the ROCm drivers, without having to know anything about them.

If llama.cpp provides such a UX, they failed terribly at communicating that. Starting with the name. Llama.cpp: that's a C++ library! Ollama is the wrapper. That's the mental model. I don't want to build my own program! I just want to have fun :-P

[−] anakaine 29d ago
Llama.cpp now has a GUI installed by default. It previously lacked this. Times have changed.
[−] nikodunk 29d ago
Having read above article, I just gave llama.cpp a shot. It is as easy as the author says now, though definitely not documented quite as well. My quickstart:

brew install llama.cpp

llama-server -hf ggml-org/gemma-4-E4B-it-GGUF --port 8000

Go to localhost:8000 for the Web UI. On Linux it accelerates correctly on my AMD GPU, which Ollama failed to do, though of course everyone's mileage seems to vary on this.

[−] teekert 29d ago
Was hoping it was so easy :) But I probably need to look into it some more.

  llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma4'
  llama_model_load_from_file_impl: failed to load model

Edit: @below, I used nix-shell -p llama-cpp, so it's not brew related. Could indeed be an older version! I'll check.

[−] adrian_b 29d ago
As has been discussed in a few recent threads on HN, whenever a new model is released, running it successfully may require changes in the inference backends, such as llama.cpp.

There are two main reasons. One is the tokenizer: new tokenizer definitions may be mishandled by older tokenizer parsers.

The second is that each model may implement tool invocations differently, e.g. by using different delimiter tokens and different text layouts for describing the parameters of a tool call.

Therefore the Gemma-4 models ran into various problems during the first days after their release, especially the dense 31B model.

Solving these problems required both a new version of llama.cpp (also for other inference backends) and updates in the model chat template and tokenizer configuration files.

So anyone who wants to use Gemma-4 should update to the latest version of llama.cpp and to the latest model files from Hugging Face, because the latest updates were only a couple of days ago.
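
If you want to check whether a local GGUF has actually picked up the updated chat template and tokenizer metadata, you can dump its header. A rough sketch using the gguf-py package that ships with llama.cpp (the file name is a placeholder and the key names are from memory, so treat this as approximate):

  pip install gguf
  # dump the GGUF metadata and look for the embedded chat template / tokenizer keys
  gguf-dump ./gemma-model.gguf | grep -i -E 'chat_template|tokenizer'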

[−] roosgit 29d ago
I just hit that error a few minutes ago. I build my llama.cpp from source because I use CUDA on Linux. So I made the mistake of trying to run Gemma4 on an older version I had and got the same error. It’s possible brew installs an older version which doesn’t support Gemma4 yet.
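
For reference, the build itself is short. A minimal sketch of the CUDA build as I understand the current docs (flag names may shift between versions, and it assumes the CUDA toolkit is already installed):

  git clone https://github.com/ggml-org/llama.cpp
  cd llama.cpp
  # GGML_CUDA=ON enables the CUDA backend
  cmake -B build -DGGML_CUDA=ON
  cmake --build build --config Release -j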
[−] teekert 29d ago
Ah it was indeed just that!

I'm now on:

  $ llama --version
  version: 8770 (82764d8)
  built with GNU 15.2.0 for Linux x86_64

(From Nix unstable)

And this works as advertised, nice chat interface, but no OpenAI API I guess, so no opencode...

[−] homarp 29d ago
Check on the same port, there is an OpenAI API: https://github.com/ggml-org/llama.cpp/tree/master/tools/serv...
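
For example, a quick curl sketch against it (the port and the prompt are just whatever you started llama-server with):

  curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "hello"}]}'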
[−] teekert 29d ago
Good stuff, thanx!
[−] zozbot234 29d ago
And that's exactly why llama.cpp is not usable by casual users. They follow the "move fast and break things" model. With ollama, you just have to make sure you're getting/building the latest version.
[−] Eisenstein 29d ago
It's not possible to run the latest model architectures without 'moving fast'. The only thing broken here is that they are trying to use an old version with a new model.
[−] cyanydeez 29d ago
and Ollama suffered the same fate when wanting to try new models
[−] Eisenstein 29d ago
What fate?
[−] cyanydeez 28d ago
The impedance mismatch between when models are released and when Ollama and other servers gain the capability to run them.
[−] OtherShrezzing 29d ago
While that might be true, for as long as its name is “.cpp”, people are going to think it’s a C++ library and avoid it.
[−] eterm 29d ago
This is the first I'm learning that it isn't just a C++ library.

In fact the first line of the wikipedia article is:

> llama.cpp is an open source software library

[−] RobotToaster 29d ago
It would make sense to just make the GUI a separate project, they could call it llama.gui.
[−] figassis 29d ago
This is correct, and I avoided it for this reason: I did not have the bandwidth to get into any C++ rabbit hole, so I just used whatever seemed to abstract it away.
[−] marssaxman 28d ago
Wait, it isn't? The name very strongly suggests that it is a text file containing C++ source code; is that not the case?
[−] mijoharas 29d ago
Frankly I think the CLI UX and documentation are still much better for Ollama.

It makes a bunch of decisions for you so you don't have to think much to get a model up and running.

[−] zombot 29d ago
I don't care about the GUI so much. Ollama lets me download, adjust and run a whole bunch of models and they are reasonably fast. Last time I compared it with Llama.cpp, finding out how to download and install models was a pain in Llama.cpp and it was also _much_ slower than Ollama.
[−] JKCalhoun 29d ago
"LM Studio… Jan… Msty… koboldcpp…"

Plenty of alternatives listed. Can anyone with experience suggest the likely successor to Ollama? I have a Mac Mini but don't mind a command-line tool.

I think, as was pointed out, Ollama won because of how easy it is to set up and pull down new models. I would expect the same from a replacement.

[−] samus 29d ago
How about kobold.cpp then? Or LMStudio (I know it's not open source, but at least they give proper credit to llama.cpp)?

Re curation: they should strive to not integrate broken support for models and avoid uploading broken GGUFs.

[−] ekianjo 29d ago

> For most users that wanted to run LLM locally, ollama solved the UX problem

This does not absolve them from the license violation

[−] omgitspavel 29d ago
Agree. We can easily compare it with Docker. Of course people can use runc directly, but most choose not to and use docker run instead.

And you could blame Docker in a similar manner. LXC existed for at least five years before Docker, but Docker was just much more convenient for the average user.

UX is a huge factor for adoption of technology. If a project fails at creating the right interface, there is nothing wrong with creating a wrapper.

[−] well_ackshually 29d ago

>solved the UX problem.

>One command

Notwithstanding the fact that there's about zero difference between ollama run model-name and llama-cpp -hf model-name, and that running things in the terminal is already a gigantic UX blocker (Ollama's popularity comes from the fact that it has a GUI), why are you putting the blame back on an open source project that owes you approximately zero communication?

[−] Zetaphor 29d ago
I got tired of repeating the same points and having to dig up sources every time, so here's the timeline (as I know it) in one place with sources.
[−] 0xbadcafebee 29d ago
No mention of the fact that Ollama is about 1000x easier to use. Llama.cpp is a great project, but it's also one of the least user-friendly pieces of software I've used. I don't think anyone in the project cares about normal users.

I started with Ollama, and it was great. But I moved to llama.cpp to have more up-to-date fixes. I still use Ollama to pull and list my models because it's so easy. I then built my own set of scripts to populate a separate cache directory of hardlinks so llama-swap can load the GGUFs into llama.cpp.
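
Roughly this kind of thing, in case the idea is useful. It's a sketch that assumes the default ~/.ollama/models layout and jq, so treat the paths and manifest fields as assumptions rather than a stable interface (and hardlinks only work within one filesystem):

  # hardlink Ollama's hashed blobs into a directory of human-named GGUFs
  OLLAMA_MODELS="$HOME/.ollama/models"
  OUT_DIR="$HOME/gguf-cache"
  mkdir -p "$OUT_DIR"

  for manifest in "$OLLAMA_MODELS"/manifests/registry.ollama.ai/library/*/*; do
    name="$(basename "$(dirname "$manifest")")-$(basename "$manifest")"
    # the layer whose mediaType ends in ".model" points at the GGUF weights blob
    digest="$(jq -r '.layers[] | select(.mediaType | endswith("model")) | .digest' "$manifest")"
    blob="$OLLAMA_MODELS/blobs/${digest/:/-}"
    [ -f "$blob" ] && ln -f "$blob" "$OUT_DIR/$name.gguf"
  done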

[−] u1hcw9nx 29d ago
Two Views of MIT-Style Licenses:

1. MIT-style licenses are "do what you want" as long as you provide a single line of attribution. Including building a big closed-source business around it.

2. MIT-style licenses are "do what you want" under the law, but they carry moral, GPL-like obligations to think about the "community."

To my knowledge Georgi Gerganov, the creator of llama.cpp, has only complained about attribution when it was missing. As an open-source developer, he selected a permissive license and has not complained about other issues, only the lack of credit. It seems he treats the MIT license as the first kind.

The article has other points not related to licensing that are good to know, like the performance issues and the simplicity that make me consider llama.cpp.

[−] usernomdeguerre 29d ago
Do they still not let you change the default model folder? You had to go through this whole song and dance to manually register a model via a pointless dockerfile wannabe that then seemed to copy the original model into their hash storage (again, unable to change where that storage lived).
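
(For anyone who hasn't seen it, the dance is roughly a one-line Modelfile pointing at an existing GGUF, followed by an import that copies the file into the blob store; the file and model names here are placeholders:)

  # register an existing GGUF with Ollama; the file gets copied into its hashed blob storage
  echo 'FROM ./my-model.Q4_K_M.gguf' > Modelfile
  ollama create my-model -f Modelfile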

At the time I dropped it for LMStudio, which to be fair was not fully open source either, but at least exposed the model folder and integrated with HF rather than a proprietary model garden for no good reason.

[−] dizhn 29d ago

> the file gets copied into Ollama’s hashed blob storage, you still can’t share the GGUF with other tool

This is the reason I had stopped using it. I think they might be doing it for deduplication; however, it makes it impossible to use the same model with other tools. Every other tool can just point at the same existing GGUF and go. Whether it's their intention or not, it makes it difficult to try out other tools. Model files are quite large, as you know, and storage and download can become issues. (They are for me.)

[−] zxcholmes 29d ago
The name "llama.cpp" doesn't seem very friendly anymore nowadays... Back then, "llama" probably referred to those models from Facebook, and now those Llama series models clearly can't represent the strongest open-source models anymore...
[−] denismi 29d ago
Hmm..

  pacman -Ss ollama | wc -l                                                                                                              
  16
  pacman -Ss llama.cpp | wc -l
  0
  pacman -Ss lmstudio | wc -l
  0
Maybe some day.
[−] zarzavat 29d ago
It's as if Ollama is trying to create a walled garden, but the garden is outside of their property, so all it achieves is walling themselves in.
[−] blueybingo 29d ago
The article buries what's actually the most practical gotcha: Ollama's hashed blob storage means that if you've been pulling models for months, switching tools requires re-downloading everything, because you can't just point another runtime at those files. Most users won't discover this until they're already invested enough that it genuinely hurts to leave.
[−] flux3125 29d ago
I stopped using Ollama a couple of months ago. Not out of frustration, but because llama.cpp has improved a lot recently with router mode, hot-swapping, a modern and simple web UI, MCP support and lots of other improvements.
[−] FeepingCreature 29d ago
I always avoided Ollama because it smelled like a project that was trying so desperately to own the entire workflow. I guess I dodged a bigger bullet than I knew.
[−] anandkrshnn 17d ago
Thanks for laying out the timeline so clearly. I’ve been using Ollama for months because it ‘just worked,’ but I had no idea about the missing attribution or the hashed blob storage lock-in. The fact that I can’t easily point another tool at the same GGUF file is a dealbreaker for me. Going to try llama-server today.
[−] fy20 29d ago
It feels like a bit of history is missing... If ollama was founded 3 years before llama.cpp was released, what engine did they use then? When did they transition?
[−] song 29d ago
So, on a Mac, what good alternative to Ollama supports MLX for acceleration? My main use case is an old M1 Max MacBook Pro with 64 GB RAM that I use as a model server.
[−] osmsucks 29d ago
I noticed the performance issues too. I started using Jan recently and tried running the same model via llama.cpp vs local ollama, and the llama.cpp one was noticeably faster.
[−] speedgoose 29d ago
I prefer Ollama over the suggested alternatives.

I will switch once we have good user experience on simple features.

A new model is released on HF or the Ollama registry? One ollama pull and it's available. It's underwhelming? ollama rm.

[−] TomGarden 29d ago
The performance issues are crazy. Thanks for sharing this
[−] tosh 29d ago
This is a bit like saying stop using Ubuntu, use Debian instead.

Both llama.cpp and ollama are great and focused on different things and yet complement each other (both can be true at the same time!)

Ollama has great UX and also supports inference via MLX, which has better performance on Apple silicon than llama.cpp

I'm using llama.cpp, ollama, lm studio, mlx etc etc depending on what is most convenient for me at the time to get done what I want to get done (e.g. a specific model config to run, mcp, just try a prompt quickly, …)

[−] dragochat 29d ago
how about the others:

- vLLM https://vllm.ai/ ?

- oMLX https://github.com/jundot/omlx ?

[−] tyfon 29d ago
I think the biggest advantage of Ollama for me is the ability to "hotswap" models for different purposes instead of restarting the server with a different model, combined with the simple "ollama pull model". In other words, it has been quite convenient.

Due to this post I had to search a bit and it seems that llama.cpp recently got router support[1], so I need to have a look at this.

My main use for this is a discord bot where I have different models for different features like replying to messages with images/video or pure text, and non reply generation of sentiment and image descriptions. These all perform best with different models and it has been very convenient for the server to just swap in and out models on request.

[1] https://huggingface.co/blog/ggml-org/model-management-in-lla...

[−] alfiedotwtf 29d ago
I'm a llama.cpp user, but apart from the MIT licensing issue, I personally don't see what the problem here is. Sure, Ollama could have advertised better that llama.cpp was its original backend, but were they obligated to? It's no different from Docker or VMware hitching a ride on kernel primitives, etc.
[−] MyUltiDev 28d ago
The attribution and lock-in arguments are the loud parts of this story, but the quieter production reason to move is concurrency. llama.cpp's server takes parallel N with cont-batching enabled by default, which interleaves tokens from multiple requests inside a single batch and keeps the GPU busy. Ollama defaults its parallel slots low and the interaction is less transparent, so the first time three people share a single model instance you feel it before any of the ethics become relevant. For a 70B Q4_K_M on a workstation, the real ceiling is KV cache fragmentation, and you have to size the context window around the parallel count rather than around one user. What is the highest parallel value anyone here has kept stable on a 70B Q4_K_M before the cache eviction pattern starts hurting quality?
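
For concreteness, the knobs in question look roughly like this (model path and numbers are purely illustrative; tune them against your VRAM):

  # --parallel sets the number of concurrent request slots; --ctx-size is the *total* KV cache,
  # split evenly across slots (32768 / 4 = 8192 tokens per request); continuous batching is on by default
  llama-server -m ./llama-70b-q4_k_m.gguf --parallel 4 --ctx-size 32768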
[−] shantnutiwari 29d ago
Just tried llama.cpp

NO, it is not simpler or even as simple as Ollama.

There are multiple options -- llama-server and the CLI -- and it's not obvious which one to use.

With Ollama, it's one file. And you get the models from their site, where you can browse an easy list.

I don't have the time to go through 20 billion Hugging Face models and decide which one is for me.

Thanks, but I'm sticking with Ollama

[−] utopiah 29d ago
Not sure why VLC doesn't do that.

It's a joke... but also not really? I mean, VLC is "just" an interface for playing videos. Videos are content files one "interacts" with, mostly play/pause and a few other functions like seeking. Because there are different video formats, VLC relies on codecs to decode them, basically delegating the "hard" part to the codecs.

Now... what's the difference here? A model is a codec; the interactions are sending text/images/etc to it, and the output is text/images/etc. It's not even radically different in size, as videos can be huge, like models.

I'm confused as to why this isn't a solved problem, especially (and yes, I'm being a bit sarcastic here, can't help myself) in a time where "AI" supposedly made all the smart, wise developers who rely on it 10x or even 1000x more productive.

Weird.

[−] mrkeen 29d ago

> Red Hat’s ramalama is worth a look too, a container-native model runner that explicitly credits its upstream dependencies front and center. Exactly what Ollama should have done from the start.

  % ramalama run qwen3.5-9b
  Error: Manifest for qwen3.5-9b:latest was not found in the Ollama registry
[−] bashbjorn 28d ago
Oh hey I'm also working on a thing to solve the devx of llama.cpp: https://github.com/nobodywho-ooo/nobodywho

In contrast to Ollama, this is a self-contained library, not a server.

I wrote some quick notes on this blogpost, just to jot down how we think about good open-source citizenship: https://www.nobodywho.ai/posts/notes-on-friends-dont-let-fri...

[−] damnitbuilds 29d ago
I am trying to run models that are on the edge of what my hardware can support. I guess many people are.

So given that, as the author states, Ollama runs LLMs inefficiently, what is the tool that runs them most efficiently on limited hardware?
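
(For context, the kind of thing I'm doing is partial GPU offload, e.g. something like the sketch below with llama-server, where the model path and layer count are placeholders I have to tune:)

  # offload as many layers as fit in VRAM with -ngl, keep the rest on CPU; lower -c to shrink the KV cache
  llama-server -m ./some-model-q4_k_m.gguf -ngl 20 -c 8192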

[−] pplonski86 29d ago
I like the Ollama Cloud service (I'm a paid pro user) because it lets me test several open-source LLMs very fast - I don't need to download anything locally, just change the model name in the API. If I like a model, I can then download it and run it locally with sensitive data. I also like their CLI, because it is simple to use.

The fact that they are trying to make money is normal - they are a company. They need to pay the bills.

I agree that they should improve communication, but I assume it is still small company with a lot of different requests, and some things might be overlooked.

Overall I like the software and services they provide.

[−] mentalgear 29d ago

> Ollama is a Y Combinator-backed (W21) startup, founded by engineers who previously built a Docker GUI that was acquired by Docker Inc. The playbook is familiar: wrap an existing open-source project in a user-friendly interface, build a user base, raise money, then figure out monetization.

    The progression follows the pattern cleanly:

    1. Launch on open source, build on llama.cpp, gain community trust
    2. Minimize attribution, make the product look self-sufficient to investors
    3. Create lock-in, proprietary model registry format, hashed filenames that don’t work with other tools
    4. Launch closed-source components, the GUI app
    5. Add cloud services, the monetization vector
[−] rrhjm53270 29d ago
It is a bit off-topic, but would it be possible to provide a light mode for this blog? I usually work during the daytime, and my pupils have to contract to read it, which makes for a very poor reading experience.
[−] erusev 29d ago
This is partly why we're building LlamaBarn. It's a lightweight macOS menu bar app that runs llama-server under the hood, with models stored as standard GGUFs in your Hugging Face cache — the same location llama-server uses by default. No separate model store, no lock-in.

https://github.com/ggml-org/LlamaBarn

[−] g023 28d ago
I've started creating https://github.com/g023/localmodelrouter/, which offers Ollama-like functionality as a single .py file with minimal dependencies and more focus on letting llama.cpp handle the dirty work.
[−] NamlchakKhandro 29d ago
LM Studio is 1000x easier to use than ollama btw
[−] endymion-light 29d ago
I'm sorry, on a mac, Ollama just works. It lets me use a model and test it quickly. This is like saying stop using google drive, upload everything to s3 instead!

When I'm using Ollama, I honestly don't care about performance; I'm looking to try out a model and then, if it seems good, move it onto a more dedicated stack specifically for it.

[−] thot_experiment 29d ago
I was pretty big on Ollama; it seemed like a great default solution. I had alpha that it was a trash organization, but I didn't listen, because I just liked having a reliable inference backend that didn't require me to install torch. I switched to llama.cpp for everything maybe 6 months ago because of how fucking frustrating every one of my interactions with ollama (the organization) was. I wanna publicly apologize to everyone whose concerns I brushed off. Ollama is a vampire on the culture and their demise cannot come soon enough.

FWIW llama.cpp does almost everything Ollama does, better than Ollama, with the exception of model management. But, like, be real: you can just ask it to write an API of your preferred shape and Qwen will handle it without issue.

[−] iib 29d ago
Has anybody figured out the best flags to compile llama.cpp for ROCm? I'm using the Framework Desktop and the Vulkan backend, because it was easier to compile out of the box, but I feel there are large performance gains on the table by switching to ROCm. Not sure if installing with brew on Ubuntu would be easier.
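
The documented starting point, as far as I can tell (a sketch only: flag names may have changed between versions, the gfx target is a guess and has to match what rocminfo reports for your GPU, and it assumes the ROCm toolchain is installed):

  # HIP/ROCm build sketch; replace gfx1151 with your rocminfo target
  HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release
  cmake --build build --config Release -j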
[−] DeathArrow 29d ago
I see no mention of vLLM in the article.
[−] san_tekart 29d ago
The CLI is great locally, but the architecture fights you in production. Putting a stateful daemon that manages its own blob storage inside a container is a classic anti-pattern. I ended up moving to a proper stateless binary like llama-server for k8s.
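
The stateless pattern ends up looking something like this (a sketch; the image tag and model path are assumptions on my part, based on the server images llama.cpp publishes on GHCR):

  # weights mounted read-only, llama-server is the only process, no daemon-managed blob store
  docker run --rm -p 8080:8080 -v /models:/models:ro \
    ghcr.io/ggml-org/llama.cpp:server \
    -m /models/qwen2.5-7b-instruct-q4_k_m.gguf --host 0.0.0.0 --port 8080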
[−] rothific 29d ago
I've been experimenting with running Gemma with MLX directly within my own harness: https://github.com/cjroth/mlx-harness
[−] WhereIsTheTruth 29d ago
The state of LLM as a service is just depressing

It is a parasitic stack that redirects investment into service wrappers while leaving core infrastructure underfunded

We have to suffer with limits and quotas as if we are living in the Soviet Union

[−] rement 29d ago
I switched to using LlamaBarn to manage local models on macOS.

https://github.com/ggml-org/llamabarn

[−] dhruv3006 29d ago
Ollama is pretty intuitive to use still - don't see why I'd stop.