I ran Gemma 4 as a local model in Codex CLI (blog.danielvaughan.com)

by dvaughan 116 comments 285 points

[−] gertlabs 32d ago
Gemma 4 26B really is an outlier in its weight class.

In our little-known, difficult-to-game benchmarks, it scored about as well as GPT 5.2 and Gemini 3 Pro Preview on one-shot coding problems. It had me re-reviewing our entire benchmarking methodology.

But it struggled in the other two sections of our benchmark: agentic coding and non-coding decision making. Tool use, iterative refinement, managing large contexts, and reasoning outside of coding brought the scores back down to reality. It actually performed worse when it had to use tools and a custom harness to write code for an eval vs getting the chance to one-shot it. No doubt it's been overfit on common harnesses and agentic benchmarks. But the main problem is likely scaling context on small models.

Still, incredible model, and incredible speed on an M-series Macbook. Benchmarks at https://gertlabs.com

[−] neonstatic 31d ago
I have very mixed feelings about that model. I want to like it. It's very fast and seems to be fit for many uses. I strongly dislike its "personality", but it responds well to system prompts.

Unfortunately, my experience with it as a coding assistant is very poor. It doesn't understand libraries it seems to know about, it doesn't see the root causes of problems I want it to solve, and it refuses to use MCP tools even when asked. It also has a very strong fixation on the concept of time: the model labels anything past January 2025, which I think is its knowledge cutoff, as "science fiction" or "their fantasy" and role-plays from there.

[−] seemaze 32d ago
That's funny; it failed my usual 'hello world' benchmark for LLMs:

“Write a single file web page that implements a 1 dimensional bin fitting calculator using the best fit decreasing algorithm. Allow the user to input bin size, item size, and item quantity.”

Qwen3.5, Nematron, Step 3.5, and gpt-oss all passed on the first go.
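For anyone curious, the core of best-fit decreasing is only a few lines. Here's a rough Python sketch of the packing logic such a page would need (the function name and interface are mine; the prompt itself asks for a single-file web page around this):

```python
def best_fit_decreasing(bin_size, items):
    """Pack items into bins with best-fit decreasing:
    sort items largest-first, then place each into the
    bin with the least remaining space that still fits it."""
    remaining = []  # remaining capacity of each open bin
    contents = []   # items placed in each bin
    for item in sorted(items, reverse=True):
        if item > bin_size:
            raise ValueError(f"item {item} exceeds bin size {bin_size}")
        # Index of the tightest bin that can still hold the item, if any.
        best = min(
            (i for i, cap in enumerate(remaining) if cap >= item),
            key=lambda i: remaining[i],
            default=None,
        )
        if best is None:
            remaining.append(bin_size - item)  # open a new bin
            contents.append([item])
        else:
            remaining[best] -= item
            contents[best].append(item)
    return contents
```

Expanding item quantities into a flat list and wrapping this in an HTML form is then the easy part, which makes it a nice smoke test.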

[−] datadrivenangel 32d ago
Overall it's a very good open-weights model! That said, I found it makes more dumb coding mistakes than GPT-OSS on my M5, but it's fairly close overall.
[−] prettyblocks 31d ago
For me the vision/OCR is much better than other models in its weight class.
[−] iknowstuff 32d ago
Gemma 31B scoring below 26B-A4B?
[−] gertlabs 32d ago
In one shot coding, surprisingly, yes, by a decent amount. And it isn't a sample size issue. In agentic, no: https://gertlabs.com/?agentic=agentic

My early takeaway is that Gemma 26B-A4B is the best tuned out of the bunch, but being small and with few active params, it's severely constrained by context (large inputs and tasks with large required outputs tank Gemma 26B's performance). We're working on a clean visualization for this; the data is there.

It's not uncommon for a sub-release of a model to show improvements across the board on its model card, but actually have mixed real performance compared to its predecessor (sometimes even being worse on average).

[−] adrian_b 31d ago
In early tests the performance of gemma-4-31B was affected by tokenizer bugs in many of the existing backends, like llama.cpp, which were later corrected by their maintainers.

Moreover, tool invocation had problems that were later corrected by Google in an updated chat template.

So any early benchmarks that showed the dense model as inferior to the MoE model are likely flawed and should be repeated after updating both the inference backend and the model.

All benchmarks that I have seen after the bugs were fixed have shown the dense model as clearly superior in quality, even if much slower.

[−] gertlabs 31d ago
We add samples every week, so I'm curious if the numbers will move.

They did a similar re-release during the Gemini 3.1 Pro Preview rollout, and released a custom-tools version with its own slug, which performs MUCH better on custom harnesses (mostly because the original release could not figure out tool call formatting at all).

[−] mhitza 32d ago

> The finding I did not expect: model quality matters more than token speed for agentic coding.

I'm really surprised how that was not obvious.

Also, instead of limiting context size to something like 32k, you can offload the MoE expert weights to the CPU with --cpu-moe, at the cost of roughly halving token generation speed.
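For example, a llama.cpp server invocation might look like this (the model filename and context size are placeholders; adjust for your setup):

```shell
# Keep the MoE expert weights in system RAM while offloading the
# rest of the layers to the GPU, freeing VRAM for a larger context.
llama-server -m gemma-4-26b-a4b-Q4_K_M.gguf \
  --cpu-moe \
  -c 65536 \
  -ngl 99
```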

[−] triceratops 32d ago
Why would token speed matter for anything other than getting work done faster? It's in the name - "speed".
[−] dminik 32d ago
This would be true if the models always completed their tasks. But since their failure rate is fairly high, a slow model going in the wrong direction for longer can end up costing more time than a faster model, where you can spot it going wrong earlier.
[−] dangoodmanUT 32d ago
Yeah, it’s like drinking coffee when being really tired. You’re still tired, just “faster”, it’s a weird sensation.
[−] kingstnap 32d ago
It's even stranger that it's not obvious to someone who uses Codex extensively every day.

The rate-limiting step is the LLM going down stupid rabbit holes, or overthinking hard and hitting decision paralysis.

The only time raw speed really matters is if you are trying to add many, many lines of new code. But if you are doing that at token-limited rates, you are going to approach the singularity of an AI-slop codebase in no time.

[−] adam_patarino 32d ago
[dead]
[−] tuzemec 32d ago
I'm currently experimenting with running google/gemma-4-26b-a4b with LM Studio (https://lmstudio.ai/) and Opencode on an M3 Ultra with 48 GB RAM. And it seems to be working. I had to increase the context size to 65536 so the prompts from Opencode would work, but no other problems so far.

I tried running the same on an M3 Max with less memory, but couldn't increase the context size enough to be useful with Opencode.

It's also easy to integrate it with Zed via ACP. For now it's mostly simple code review tasks and generating small front-end related code snippets.

[−] usagisushi 32d ago
I have a similar setup. It might be worth checking out pi-coding-agent [0].

The system prompt and tools have very little overhead (<2k tokens), making the prefill latency feel noticeably snappier compared to Opencode.

[0] https://www.npmjs.com/package/@mariozechner/pi-coding-agent#...

[−] fortyseven 32d ago
I've been VERY impressed with Gemma4 (26B at the moment). It's the first time I've been able to use OpenCode via a llamacpp server reliably and actually get shit done.

In fact, I started using it as a coding partner while learning how to use the Godot game engine (and some custom 'skills' I pulled together from the official docs). I purposely avoided Claude and friends entirely, and just used Gemma4 locally this week... and it's really helped me figure out not just coding issues I was encountering, but also helped me sift through the documentation quite readily. I never felt like I needed to give in and use Claude.

Very, very pleased.

[−] segmondy 32d ago
"The reason I had not done this before is that local models could not call tools. "

Rubbish. We have been calling tools locally for two years, and it's flatly false that Gemma 3 scored under 7% in tool calling. Hell, I was getting at least 75% tool-calling success with Llama 3.3.

[−] egorfine 32d ago
Related: I upgraded my M4 Pro 24GB to an M5 Pro 48GB yesterday. The same Gemma 4 MoE model (Q4) runs at about 8x the t/s on the M5 Pro and loads 2x faster from disk into memory.

Gonna run some more tests later today.

[−] zihotki 32d ago
For coding it makes no sense to use any quantization worse than Q6_K, in my experience. More heavily quantized models make more mistakes; that can still be fine for text processing, but not for coding.
[−] meander_water 32d ago
I would have liked to see quality comparisons between the different quantization methods (Q4_K_M, Q8_0, Q6_K) rather than just tok/s.
[−] dajonker 32d ago
I don't really have the hardware to try it out, but I'm curious to see how Qwen3.5 stacks up against Gemma 4 in a comparison like this. Especially this model, fine-tuned to be good at tool calling, which has more than 500k downloads as of this moment: https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-...
[−] 2001zhaozhao 32d ago
I think it might be a good idea to make some kind of local-first harness that is designed to fully saturate some local hardware churning experiments on Gemma 4 (or another local model) 24/7 and only occasionally calls Claude Opus for big architectural decisions and hard-to-fix bugs.

Something like:

* Human + Claude Opus sets up project direction and identifies research experiments that can be performed by a local model

* Gemma 4 on local hardware autonomously performs smaller research experiments / POCs, including autonomous testing and validation steps that burn a lot of tokens but can convincingly prove that the POC works. This is automatically scheduled to fully utilize the local hardware. There might even be a prioritization system to make these POC experiments only run when there's no more urgent request on the local hardware. The local model has an option to call Opus if it's truly stuck on a task.

* Once an approach is proven through the experimentation, human works with Opus to implement into main project from scratch

If you can get a complex harness to work on models of this weight class paired with the right local hardware (maybe your old gaming GPU plus 32 GB of RAM), you can churn through millions of output tokens a day (and probably ~100 million input tokens, though the vast majority are cached). The main cost advantage over cloud models is that you have total control over prompt caching locally, which makes it basically free, whereas most API providers for small models charge full price for input tokens even when the prompt is repeated exactly across requests.
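A minimal Python sketch of the escalation loop in the scheme above: run each task on the local model first and only "phone" the frontier model when the local one is stuck. All names here (`local_llm`, `frontier_llm`, the result dict shape) are hypothetical stand-ins for real API calls:

```python
def run_task(task, local_llm, frontier_llm, max_local_attempts=3):
    """Try the cheap local model up to max_local_attempts times,
    feeding failures back in; escalate to the frontier model only
    if the local model never succeeds."""
    feedback = None
    for _ in range(max_local_attempts):
        result = local_llm(task, feedback)
        if result["success"]:
            return result
        feedback = result.get("error")  # reuse the failure on the next attempt
    # Local model is stuck: escalate to the expensive model.
    return frontier_llm(task, feedback)
```

A real harness would add the scheduling/prioritization layer on top of this, but the cost profile comes from this routing decision.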

[−] blackmanta 32d ago
With an Nvidia Spark or a 128 GB+ memory machine, you can get a good speed-up on the 31B model if you use the 26B MoE as a draft model. It uses more memory, but I've seen acceptance rates around 70%+ using Q8 on both models.
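With llama.cpp this looks roughly like the following; the filenames are placeholders and the draft parameters are just starting points to tune:

```shell
# Dense 31B as the main model, the 26B-A4B MoE as the draft model
# for speculative decoding. --draft-max caps how many tokens the
# draft model speculates per step before the main model verifies them.
llama-server -m gemma-4-31b-Q8_0.gguf \
  -md gemma-4-26b-a4b-Q8_0.gguf \
  --draft-max 16 --draft-min 4
```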
[−] taf2 32d ago
I did this with Qwen 3.5. Tool calling was the biggest issue, but for getting it to work with vLLM and MLX I just asked Codex to help. The bulk of the time was waiting on the download. For vLLM it created a proxy service to translate some Codex idioms to vLLM and vice versa. In practice I got good results on my first prompt, but follow-up questions usually failed due to the model's trouble with tool calling. I need to try again with Gemma 4.
[−] cjbgkagh 32d ago
I've been playing with this for the last few days. The model is fast and pretty smart, and I am hitting the same tool-use issues. This blog post is unusually pertinent. Model speed isn't an issue on my dual 4090s; productivity is mainly limited by intelligence (while high, it's still not high enough for some tasks) and by getting stuck in loops.

What I would like is for it to be able to detect when these things happen and to "Phone a Friend", asking a smarter model for advice.

I'm definitely moving into agent-orchestration territory, where I'll have a number of agents constantly running and working on things so that I am not the bottleneck. I'll have a mix of on-prem and AI providers.

My role now is less coder and more designer / manager / architect, as agents readily go off on tangents and into messes that they're not smart enough to get out of.

[−] vsrinivas 32d ago
Hey - I use the same, w/ both gemma4 and gpt-oss-*; some things I have to do for a good experience:

1) Pin to an earlier version of codex (sorry) - 0.55 is the best experience IME, but YMMV (see https://github.com/openai/codex/issues/11940, https://github.com/openai/codex/issues/8272).

2) Use the older completions endpoint (llama.cpp's responses support is incomplete - https://github.com/ggml-org/llama.cpp/issues/19138)

[−] magic_hamster 32d ago
Ollama is the worst engine you could use for this. Since you are already running on an Nvidia stack for the dense model, you should serve it with vLLM. With 128GB you could try the original safetensors, though you might need to be careful with caches and context length.
[−] bitwize 32d ago
I recently spun up Gemma 4 26B-A4B on my local box and pointed OpenCode at it, and it did reasonably well! My machine is 8 years old, though I had the foresight to double the RAM to 32 GiB before the RAMpocalypse, and I can get a little bit of GPU oomph but not a lot, not with a mere GTX 1070. So it's slow, and nowhere near frontier model quality, but it can generate reasonable code and is good for faffing with!
[−] brcmthrowaway 32d ago
Nothing about omlx?
[−] OutOfHere 32d ago
Gemma 4 is a strongly censored model, so much so that it refuses to answer medical and health-related questions, even basic ones. No one should be using it, and if this is the best Google can do, it should stop now. Other models do not have such ridiculous self-imposed problems.
[−] OsrsNeedsf2P 32d ago
I laughed when I saw the .md table rendering as a service. Blows my mind what people will use
[−] flux3125 32d ago
In my experience, if you're coding or doing something that requires precision, quantizing the KV cache is definitely not worth it.

If you're just chatting or doing less precise things it's 1000% worth it going down to Q8 or sometimes even Q4
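With llama.cpp, this is controlled by the cache-type flags; a sketch (model path is a placeholder, and a quantized V cache needs flash attention enabled):

```shell
# Quantize both halves of the KV cache to q8_0 to roughly halve
# the cache's memory footprint, at some quality cost.
llama-server -m model-Q6_K.gguf \
  -fa on \
  --cache-type-k q8_0 --cache-type-v q8_0
```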

[−] danilop 32d ago
Nice walkthrough and interesting findings! The difference between the MoE and dense models seems bigger than what benchmarks report. That makes sense, because a small gain in tool planning and handling can have a large influence on results.
[−] anactofgod 32d ago
Amazing. Thanks for your detailed posts on the bake-off between the Mac and the GB10, Daniel, and on your learnings. I had trying something similar on both compute platforms on my to-do list. Your post should save me a lot of debugging, sweat, and tears.
[−] karpetrosyan 32d ago
I think local models are not yet that good or fast for complex things, so I am just using local Gemma 4 for some dummy refactorings or something really simple.
[−] mudkipdev 32d ago
Does the large system prompt work fine for this model? If needed, you could use a lightweight CLI like Pi, which only comes with 4 tools by default
[−] dpoloncsak 31d ago
Is it really fair to benchmark LLMs on a Mac using ollama and not MLX? Does ollama make proper use of the M-series yet?
[−] alvsilvao 32d ago
I also tried Gemma 4 on a M1 Macbook Pro. It worked but it was too slow. Great to know that it works on more advanced laptops!
[−] Havoc 32d ago
You can also try speculative decoding with the E2B model. Under some conditions it can result in a decent speed up