Ollama is now powered by MLX on Apple Silicon in preview (ollama.com)

by redundantly | 648 points | 354 comments

[−] abu_ameena 45d ago
On-device models are the future. Users prefer them. No privacy issues. No dealing with connectivity, tokens, or changes to vendors' implementations. I have an app using the Foundation Models framework, and it works great. I only wish I could backport it to pre-macOS 26 versions.
[−] raw_anon_1111 45d ago
Users don’t care about “privacy”. If they did, Meta and Alphabet wouldn’t be worth $1T+.

Users really don't matter at all. The revenue for AI companies will be B2B, where the user is not the customer - including coding agents. Most people don't even use computers as their primary "computing device", and most are buying crappy low-end Android phones - no, I'm not saying all Android phones are crappy, but that's what most people are buying, with the average selling price of an Android phone being $300.

[−] sowbug 45d ago
I am concerned that local models will never benefit from the training on live requests that is surely improving cloud-only models.

This might be the cost of privacy, and it might be worth paying, unless cloud models reach an inflection point that makes local models archaic.

[−] throwawayq3423 45d ago
Technologists make the same mistake over and over in thinking the better technology will win: VHS vs. Betamax, etc.

Actual consumers not only don't care, they will not even be aware of the difference.

[−] mrinterweb 45d ago
I think two recent advances make your statement more true. The new Qwen 3.5 series has shown relatively high intelligence density, and Google's new turboquant could result in dramatically smaller, more efficient models without the usual quantization accuracy tradeoff.

I would expect consumer inference ASICs to emerge when model development starts plateauing and "baking" a highly capable, dense model onto a chip makes economic sense.

[−] testing22321 45d ago
I see all these LLM posts about whether a certain model can run locally on certain hardware, and I don't get it.

What are you doing with these local models that run at x tokens/sec?

Do you have the equivalent of ChatGPT running entirely locally? What do you do with it? Why? I honestly don’t understand the point or use case.

[−] jesse23 45d ago
Yes. So far, do we have a working practice where, given a local model and whatever infra we could use, we can actually leverage it for local tasks?
[−] whazor 45d ago
Obviously, hardware-wise the real blocker is memory cost. But there is no reason why future devices couldn't bundle 256GB of memory by default.
[−] babblingfish 46d ago
LLMs on device are the future. It's more secure, it solves the problem of inference demand outstripping data center supply, and it would use less electricity. It's just a matter of getting the performance good enough. Most users don't need frontier-model performance.
[−] Yukonv 46d ago
Good to see Ollama catching up with the times for inference on Mac. MLX-powered inference makes a big difference, especially on M5 as their graphs point out. What has really been a game changer for my workflow is using https://omlx.ai/, which has SSD KV cold caching. I no longer have to worry about a session falling out of memory and needing to prefill again. Combine that with the M5 Max prefill speed and more time is spent on generation than waiting for a 50k+ context window to process.
[−] franze 46d ago
I created "apfel" (https://github.com/Arthur-Ficial/apfel), a CLI for the Apple on-device foundation model (Apple Intelligence). Yeah, it's super limited with its 4k context window and its guardrails' very common false positives (just ask it to describe a color)... but still, using it in bash scripts that just work, without calling home or incurring extra costs, feels super powerful.
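The kind of thing I mean is roughly this (a sketch only; the exact apfel interface may differ from what's in the repo, so treat the argument handling as an assumption):

  #!/usr/bin/env bash
  # Sketch: summarize recent commits entirely on-device -- no network, no API key.
  # Assumes `apfel` takes the prompt as its argument and prints the completion
  # to stdout; check the repo for the real interface.
  set -euo pipefail

  commits=$(git log --oneline -20)
  apfel "Summarize these commits in two sentences: ${commits}"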
[−] LuxBennu 46d ago
Already running qwen 70b 4-bit on an M2 Max 96GB through llama.cpp and it's pretty solid for day-to-day stuff. The MLX switch is interesting because Ollama was basically shelling out to llama.cpp on Mac before, so native MLX should mean better memory handling on Apple Silicon. Curious to see how it compares on the bigger models vs. the GGUF path.
[−] domh 46d ago
I have an M4 Max with 48GB RAM. Anyone have any tips for good local models? Context length? Using the model recommended in the blog post (qwen3.5:35b-a3b-coding-nvfp4) with Ollama 0.19.0, it can take anywhere between 6 and 25 seconds to respond (after lots of thinking) to me asking "Hello world". Is this the best that's currently achievable with my hardware, or is there something that can be configured to get better results?
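The only knobs I've found so far are the generic Ollama ones; no idea whether they even apply to the new MLX backend, and the values below are just guesses for a 48GB machine:

  # Keep the model resident so a request after a pause doesn't trigger a reload.
  export OLLAMA_KEEP_ALIVE=1h

  # Inside the interactive session you can cap the context window; an oversized
  # default makes prefill slower than it needs to be.
  ollama run qwen3.5:35b-a3b-coding-nvfp4
  >>> /set parameter num_ctx 8192

  # --verbose prints prompt eval and eval rates, which shows where the time goes.
  ollama run qwen3.5:35b-a3b-coding-nvfp4 "Hello world" --verbose

Is that the right set of things to be tuning?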
[−] codelion 46d ago
How does it compare to some of the newer MLX inference engines like optiq that support turboquantization? https://mlx-optiq.pages.dev/
[−] xmddmx 45d ago
On an M4 Pro MacBook Pro with 48GB RAM I did this test:

ollama run $model "calculate fibonacci numbers in a one-line bash script" --verbose

  Model                         Prompt eval (tok/s)  Eval (tok/s)
  ----------------------------------------------------------------
  qwen3.5:35b-a3b-q4_K_M         6.6                  30.0
  qwen3.5:35b-a3b-nvfp4         13.2                  66.5
  qwen3.5:35b-a3b-int4          59.4                  84.4

I can't comment on the quality differences (if any) between these three.
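A loop along these lines should reproduce it, assuming the three tags are already pulled:

  # Run the same one-shot prompt against each quant and keep only the timing lines.
  for model in qwen3.5:35b-a3b-q4_K_M qwen3.5:35b-a3b-nvfp4 qwen3.5:35b-a3b-int4; do
    echo "=== $model ==="
    ollama run "$model" "calculate fibonacci numbers in a one-line bash script" --verbose 2>&1 \
      | grep -i "eval rate"
  done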
[−] bwfan123 45d ago
What is the cheapest usable local rig for coding? I don't want fancy agents and such, but something purpose-built for coders, fast enough for my use, and open source, so I can tweak it to my liking. Things are moving fast, and I'm hesitant to put in $3-4K now when it would likely be cheaper if I wait.
[−] a-dub 45d ago
Is local LLM inference on modern MacBook Pros comfortable yet? When I played with it a year or so ago, it worked fairly OK but definitely produced uncomfortable levels of heat.

(Regarding MLX, there were toolkits built on MLX that supported QLoRA fine-tuning and inference, but they also produced a bunch of heat.)

[−] dial9-1 46d ago
Still waiting for the day I can comfortably run Claude Code with local LLMs on macOS with only 16GB of RAM.
[−] daveorzach 46d ago
What are the significant differences between Ollama and LM Studio now? I haven't used Ollama because it was missing MLX when I started using LLM GUIs.
[−] adolph 45d ago
Much of the discussion here is local versus remote. I like seeing things as "and" rather than "or". There will be small things I don't want to burn my Claude tokens on, and other things for which I want access to larger compute resources. And along the way, I'll check results from both to understand the comparative advantage on an ongoing basis.
[−] robotswantdata 46d ago
Why are people still using Ollama? Serious question.

Lemonade or even llama.cpp are much better optimised and arguably just as easy to use.

[−] mfa1999 46d ago
How does this compare to llama.cpp in terms of performance?
[−] jwr 45d ago
Two things: 1) MLX has been available in LM Studio for a long time now, 2) I found that GGUF produced consistently better results in my benchmarking. The difference isn't big, but it's there.
[−] harel 46d ago
What would be the non-Mac computer to run these models locally with the same performance profile? Are there any similar Linux ARM-based computers that can reach the same level?
[−] jedisct1 46d ago
Works really great with https://swival.dev and qwen3.5.
[−] dev_l1x_be 46d ago

> Please make sure you have a Mac with more than 32GB of unified memory.

Time for an upgrade, I guess. If I can run Qwen3.5 locally, then it's time to switch over to local-first LLM usage.
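(For anyone else checking whether their machine clears that bar, this reports the installed unified memory; macOS only, nothing Ollama-specific.)

  # Print installed unified memory in GB; hw.memsize reports bytes.
  echo "$(( $(sysctl -n hw.memsize) / 1024 / 1024 / 1024 )) GB"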

[−] janandonly 46d ago

> Please make sure you have a Mac with more than 32GB of unified memory.

Yeah, I can still save money by buying a cheaper device with less RAM and just paying my PPQ.AI or OpenRouter.com fees.

[−] braum 45d ago
How does Ollama help with Claude Code? Claude Code runs in the terminal, but AFAIK it connects back to Anthropic directly and can't use a local model. I hope I'm missing something obvious.
[−] androiddrew 46d ago
Get turboquant 4-bit implemented and this would be a game changer.
[−] harrouet 46d ago
Being in the market for a new Mac and comparing a refurb M4 Max vs. an M5 _Pro_, I'm interested in how much faster the neural engines actually are, as opposed to the marketing claims.
[−] jiehong 45d ago
This is excellent news!

What I'm waiting for next is MLX-supported speech recognition directly from Ollama. I don't understand why it should be a separate thing entirely.

[−] rurban 45d ago
Does that mean they are now finally a bit faster than llama.cpp? Cannot believe that.
[−] ranjeethacker 45d ago
I used it today; it's working nicely.
[−] brcmthrowaway 46d ago
What is the difference between Ollama, llama.cpp, ggml and gguf?
[−] puskuruk 46d ago
Finally! My local infra has been waiting for this for months!
[−] pyinstallwoes 45d ago
What’s the best local coding model these days?
[−] darshanmakwana 46d ago
Really nice to see this!
[−] AugSun 46d ago
"We can run your dumbed down models faster":

#The use of NVFP4 results in a 3.5x reduction in model memory footprint relative to FP16 and a 1.8x reduction compared to FP8, while maintaining model accuracy with less than 1% degradation on key language modeling tasks for some models.
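Back-of-envelope, those ratios imply roughly the following bits per weight (my arithmetic, not from the post; consistent with 4-bit values plus per-block scale overhead):

  # implied storage per weight, starting from 16-bit and 8-bit baselines
  echo "16 / 3.5" | bc -l   # ~4.57 bits per weight vs FP16
  echo "8 / 1.8"  | bc -l   # ~4.44 bits per weight vs FP8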

[−] DevKoan 45d ago
The Foundation Model point is real. As an iOS developer, what excites me most isn't the performance — it's what on-device inference does to the app architecture.

When you're not making network calls, you stop thinking in "loading states" and start thinking in "local state machines." The UX design space opens up completely. Interactions that need to feel faster than a server round-trip allows are suddenly viable.

The backporting issue is painful though. I've been shipping features wrapped in #available(iOS 26, *) and the fallback UX is basically a different product. It forces you to essentially maintain two app experiences.

Still think this is the right direction — especially for junior devs just learning to ship. Fewer moving parts, less infrastructure to debug.