April 2026 TLDR Setup for Ollama and Gemma 4 26B on a Mac mini (gist.github.com)

by greenstevester 123 comments 330 points

[−] Aurornis 42d ago
If this is your first time using open-weight models right after release, know that there are always bugs in the early implementations, and even in the quantizations.

Every project races to have support ready on launch day so they don't lose users, but the output you get may not be correct. Several problems have already been discovered in tokenizer implementations, and quantizations may have problems too if they use imatrix.

So in the coming weeks you're going to see a lot of "I tried it but it sucks because it can't even do tool calls" and other reports about how the models don't work at all, from people who don't realize they were using broken implementations.

If you want to try cutting-edge open models, you need to be ready to constantly update your inference engine, check your quantization for updates, and re-download when it's changed. The mad rush to support models on launch day means everything gets shipped as soon as it looks like it can produce output tokens, not when it's been tested to be correct.
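
To make that concrete, here's a minimal sketch of the update loop, assuming a local llama.cpp build and a GGUF quantization from Hugging Face (the repo and file names below are placeholders, not real ones):

    # rebuild the engine from the latest upstream commit
    cd llama.cpp && git pull && cmake --build build -j

    # re-fetch the quantization; --force-download replaces a stale cached copy
    huggingface-cli download PLACEHOLDER/some-model-GGUF some-model-Q4_K_M.gguf --force-download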

[−] colechristensen 42d ago
You seem like you know what you're talking about... what inference engine should I use? (linux, 4090)

I keep having "I tried it but it sucks" issues, mostly around tool calling, and it's not clear if it's the model or Ollama. And it's not one model in particular; any of them, really.

[−] embedding-shape 42d ago
For the specific issue parent is talking about, you really need to give various tools a try yourself, and if you're getting really shit results, assume it's the implementation that is wrong, and either find an existing bug tracker issue or create a new one.

Same thing happened when GPT-OSS launched: a bunch of projects had "day-1" support, but in reality that just meant you could load the model. A bunch of them had broken tool calling, some chat prompt templates were broken, and so on. Even llama.cpp, which usually has the most recent support (in my experience), had this issue, and it wasn't until a week or two after release that GPT-OSS could be fairly evaluated with it. Then Ollama and LM Studio updated their bundled llama.cpp some days after that.

So it's a process thing, not "this software is better than that", and it heavily depends on the model.

[−] kamranjon 42d ago
I've had really good success with LM Studio, GLM 4.7 Flash, and the Zed editor, which has a baked-in integration with LM Studio. I'm able to one-shot whole projects this way, and it seems to be constantly improving. A recent update even allowed the agent to ask me if it can do a "research" phase, where it'll actually reach out to websites and read docs and code from GitHub if you allow it. GLM 4.7 Flash has been the most adept at tool calling I've found, but the Qwen 3 and 3.5 models are also fairly good, though they run into more snags than GLM 4.7 Flash.
[−] Aurornis 42d ago
I don't know if any of the engines are fully tested yet.

For new LLMs I've gotten in the habit of building llama.cpp from upstream head and checking for updated quantizations right before I start using them. You can also download llama.cpp CI builds from their releases page, but on Linux it's easy to set up a local build.
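
For reference, the from-source build on Linux with an NVIDIA card is roughly this (per llama.cpp's build docs; adjust the flags for your hardware):

    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    cmake -B build -DGGML_CUDA=ON    # enable the CUDA backend for the 4090
    cmake --build build --config Release -j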

If you don't want to be a guinea pig for untested work, the safe option is to wait 2-3 weeks.

[−] accrual 42d ago
For me, LM Studio on Fedora + Gemma 4 didn't work yesterday afternoon with the release, but worked this morning after the runtimes updated. In fact, there are new runtime updates now as I check again.
[−] vardalab 42d ago
Just use OpenRouter or the Google AI playground for the first week until the bugs are ironed out. You still learn the nuances of the model, and then you can switch to local. In addition, you might pick up enough nuance to see whether quantization is having any effect.
[−] logicallee 42d ago
In case someone would like to know what these are like on this hardware: I tested Gemma 4 26B (the ~20 GB model, the largest Gemma model Google published) and gemma4:e4b (the ~10 GB model) on this exact setup (Mac Mini M4 with 24 GB of RAM, using Ollama), and I livestreamed it:

https://www.youtube.com/live/G5OVcKO70ns

The ~10 GB model is super speedy, loading in a few seconds and giving responses almost instantly. If you just want to see its performance: it says hello around the 2 minute mark in the video (and fast!), and the ~20 GB model says hello around 5 minutes 45 seconds in. You can see the substantial difference in their loading times and speed. I also had each of them complete a difficult coding task; both got it correct, but the 20 GB model was much slower. It's a bit too slow to use day to day on this setup, plus it would take almost all the memory. The 10 GB model fits comfortably on a 24 GB Mac Mini with plenty of RAM left for everything else, and it seems usable for small, useful coding tasks.

[−] neo_doom 42d ago
Huge Claude user here… can someone help me set some realistic expectations if I bought a Mac mini and spun one up? I use Claude primarily for dev work and Home Lab projects. Are the open models good enough to run locally and replace the Claude workload? Or am I better off with my $20/mo Claude subscription?
[−] milchek 42d ago
I tested briefly on a MacBook Pro M4 with 36 GB. I ran it in LM Studio with opencode as the frontend, and it failed over and over on tool calls. Switched back to Qwen. Anyone else on a similar setup have better luck?
[−] anonyfox 42d ago
M5 Air here with 32 GB RAM and 10/10 cores. Has anyone had luck with MLX builds on oMLX so far? I'm not at my machine right now and would love to know if these models already work, including tool calling.
[−] jasonriddle 42d ago
Slightly off topic, but a question for folks.

I'm hoping to replace coding with Claude Sonnet 4.5 with an open-source or open-weights model. Are any of the models in Ollama's cloud offering (https://ollama.com/search?c=cloud), or any of the models on OpenRouter.ai, a close replacement? I know that no model right now matches the full performance and capabilities of Claude Sonnet 4.5, but I want to know how close I can get, and with which model(s).

If there is a model you say can replace it, talk about how long you have been using it, what harness (Claude Code, opencode, etc.), and some strengths and weaknesses you have noticed. I'm not interested in what benchmarks say; I want to hear about real-world use from programmers using these models.
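
(If it helps anyone test candidates quickly: OpenRouter exposes an OpenAI-compatible chat completions endpoint, so swapping models is just changing one string. The model slug below is a placeholder; check openrouter.ai for real ones.)

    curl https://openrouter.ai/api/v1/chat/completions \
      -H "Authorization: Bearer $OPENROUTER_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "vendor/model-placeholder",
        "messages": [{"role": "user", "content": "Write a binary search in Python."}]
      }'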

[−] spencer-p 42d ago
Weird that the steps are for "Gemma 4 12b", which does not exist, and then switch to 26b midway through.

There's also a step to verify that it doesn't fit on the GPU, with ollama ps showing "14%/86% CPU/GPU". Doesn't this mean you'll have really bad performance?

[−] pwr1 42d ago
Running 26B locally is impressive, but the latency math gets rough once you're doing anything beyond chat. We switched from local inference to API calls for image generation specifically because cold start + generation time on consumer hardware made it impractical for any kind of automated workflow.

Local is great for experimentation, but production workloads that need to run reliably at specific times still favor APIs, imo. That said, for privacy-sensitive use cases where data can't leave the machine, setups like this are invaluable.

[−] easygenes 42d ago
Why is Ollama so many people's go-to? Genuinely curious; I've tried it, but it feels overly stripped down / dumbed down vs nearly everything else I've used.

Lately I’ve been playing with Unsloth Studio and think that’s probably a much better “give it to a beginner” default.

[−] aetherspawn 42d ago
Which harness (IDE) works with this if any? Can I use it for local coding right now?
[−] boutell 42d ago
Last night I had to install the v0.20 pre-release of Ollama to use this model, so I'm wondering whether these instructions are accurate.
[−] redrove 42d ago
There is virtually no reason to use Ollama over LM Studio or the myriad of other alternatives.

Ollama is slower, and they started out as a shameless llama.cpp ripoff without giving credit. Now they've "ported" it to Go, which means they're just vibe-code-translating llama.cpp, bugs included.

[−] kristopolous 42d ago
Are you getting tool calls and multimodal working? I don't see it in the quantized Unsloth GGUFs...
[−] Xentyon 42d ago
Nice setup. Running models locally on Mac hardware has gotten surprisingly viable. I'm using a similar stack in Switzerland for testing AI agent workflows — the M-series chips handle inference well for tool-calling tasks.
[−] amelius 42d ago
Has anyone tried to run it on a Jetson Orin AGX with 64GB unified memory?
[−] OkGoDoIt 42d ago
Sorry for being off topic, but why can’t I open this without being logged into GitHub? I thought gists are either completely private or publicly accessible. Are they no longer publicly accessible?
[−] zachperkel 42d ago
how many TPS does a build like this achieve on gemma 4 26b?
[−] renewiltord 42d ago
Just told Claude to sort it out and it ran it. 26 tok/s on the Mac mini I use for a personal claw-type program. Unusable for a local agent, but it's okay.
[−] kilzimir 42d ago
Kinda crazy that I can run a 26B model on a 1500€ laptop (MacBook Air M5 32GB). Does anyone know how I can actually use this in a productive way?
[−] robotswantdata 42d ago
Why are you using Ollama? Just use llama.cpp:

    brew install llama.cpp

Use the built-in CLI, server, or chat interface, and hook it up to any other app.
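
For example (the model file name is a placeholder; Homebrew installs the llama-cli and llama-server binaries):

    # one-off prompt from the terminal
    llama-cli -m ./gemma-4-26b-Q4_K_M.gguf -p "Hello"

    # local OpenAI-compatible server with a built-in web UI
    llama-server -m ./gemma-4-26b-Q4_K_M.gguf --port 8080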

[−] mark_l_watson 42d ago
The article has a few good tips for using Ollama. Perhaps it should note that the Gemma 4 models are not really trained for strong performance with coding agents like OpenCode, Claude Code, pi, etc. The Gemma 4 models are excellent for applications requiring tool use, data extraction to JSON, etc. I asked Gemini Pro about this earlier and Gemini Pro recommended qwen 3.5 models specifically for coding, and backed that up with interesting material on training. This makes sense, and is something that I do: use strong models to build effective applications using small efficient models.