You can use llama.cpp's server (llama-server) directly to serve local LLMs and use them from Claude Code or other CLI agents. I've collected full setup instructions for Gemma 4 and other recent open-weight LLMs here, tested on my M1 Max 64 GB MacBook:
https://pchalasani.github.io/claude-code-tools/integrations/...
The 26B-A4B is the most interesting to run on such hardware, and I get nearly double the token-generation speed (40 tok/s) compared to Qwen3.5 35B-A3B. However, the tau2-bench results [1] for this Gemma 4 variant lag far behind the Qwen variant (68% vs 81%), so I don't expect the former to do well on heavy agentic, tool-heavy tasks:
[1] https://news.ycombinator.com/item?id=47616761
Did you have any Anthropic-vs-OpenAI API spec issues with Claude Code? I have been using mlx_vlm and vMLX, and I get 400 Bad Request errors from Claude Code. Presumably you're not seeing those issues with llama-server?
Correct, no issues: for at least a few months now, llama.cpp's server has exposed an Anthropic-style Messages API at /v1/messages, in addition to the OpenAI-compatible API at /v1/chat/completions. Claude Code uses the former.
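If you want to wire that up yourself, a minimal sketch (assuming Claude Code still honors the ANTHROPIC_BASE_URL and ANTHROPIC_AUTH_TOKEN environment variables, and that llama-server is listening on its default port 8080) looks roughly like this:

$ llama-server -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL --jinja --port 8080   # start the local server
$ export ANTHROPIC_BASE_URL=http://127.0.0.1:8080    # route Claude Code's Anthropic-style requests to llama-server
$ export ANTHROPIC_AUTH_TOKEN=local-dummy-key         # placeholder; llama-server won't check it unless started with --api-key
$ claude

You can also sanity-check the Messages endpoint directly before launching Claude Code:

$ curl -s http://127.0.0.1:8080/v1/messages -H 'content-type: application/json' \
    -d '{"model":"gemma","max_tokens":64,"messages":[{"role":"user","content":"ping"}]}'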
At least for Gemma 4 26B-A4B, token-gen speed with oMLX is far worse on my M1 Max 64 GB MacBook compared to llama-server:
Quick benchmark on M1 Max 64GB, Gemma 4 26B-A4B (MoE), comparing matched dynamic 4-bit quants. Workload was Claude Code, which sends ~35K tokens of input context per request (system prompt + tools + user message):

llama.cpp (unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL, llama-server -fa on -c 131072 --jinja --temp 1.0 --top-p 0.95 --top-k 64):
- pp ≈ 395 tok/s
- tg ≈ 40 tok/s

oMLX (unsloth/gemma-4-26b-a4b-it-UD-MLX-4bit, omlx serve --model-dir ~/models/omlx, with sampling.max_context_window and max_tokens bumped to 131072 in ~/.omlx/settings.json, sketched below):
- pp ≈ 350 tok/s
- tg ≈ 5–13 tok/s

Same model family and quant tier. Prompt processing is comparable, but oMLX's token generation is 3–7x slower than llama.cpp's Metal backend. Counter-intuitive given MLX is Apple's native ML framework.
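For reference, here is a guess at what that ~/.omlx/settings.json might look like after the bump. The key names are taken from the comment above; I haven't verified oMLX's actual schema, and nesting max_tokens under sampling is an assumption, so treat this as a sketch:

{
  "sampling": {
    "max_context_window": 131072,
    "max_tokens": 131072
  }
}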
Same. Opencode + oMLX (0.3.4) + unsloth-Qwen3-Coder-Next-mlx-8bit on my M5 Max with 128 GB is the sweet spot for me locally. The prompt-decode caching keeps things coherent and fast even when contexts get north of 100k tokens.
Local models are finally starting to feel pleasant instead of just "possible." The headless LM Studio flow is especially nice because it makes local inference usable from real tools instead of as a demo.
Related note from someone building in this space: I've been working on cloclo (https://www.npmjs.com/package/cloclo), an open-source coding agent CLI, and this is exactly the direction I'm excited about. It natively supports LM Studio, Ollama, vLLM, Jan, and llama.cpp as providers alongside cloud models, so you can swap between local and hosted backends without changing how you work.
Feels like we're getting closer to a good default setup where local models are private/cheap enough to use daily, and cloud models are still there when you need the extra capability.
Just FYI, MoE doesn't really save (V)RAM. You still need all the weights loaded in memory; the model just consults fewer of them per forward pass. So it improves tok/s but not VRAM usage: a 26B-A4B model at 4-bit still occupies roughly 26B × 0.5 bytes ≈ 13 GB of weights, even though only ~4B parameters are active per token.
The interesting bottleneck I keep running into isn’t just model quality — it’s lifecycle management of models in constrained environments (load → run → unload patterns, plus routing between different models depending on task type).
Curious if anyone else is exploring per-request model execution rather than keeping models resident all the time.
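One low-tech way to experiment with the per-request pattern using the tools already in this thread is to skip the resident server entirely and shell out to llama-cli per request: the weights are loaded, one prompt is answered, and the process exits (the model path and prompt here are placeholders):

$ # one-shot, non-resident inference: load, run, unload on every call
$ llama-cli -m ~/models/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf -p "Classify this support ticket: ..." -n 128 --temp 0.2

The obvious trade-off is paying the full model-load time on every request, so this only makes sense when requests are infrequent relative to load time, or when you genuinely need the memory back between calls.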
Using Claude Code seems like a popular frontend currently. I wonder how long until Anthropic releases an update that makes this anywhere from a little to a lot less turnkey? They've been very clear that they aren't exactly champions of this stuff being used outside of very specific ways.
Seems like this might be a great way to do web software testing. We’ve had Selenium and Puppeteer for a long time but they are a bit brittle with respect to the web design. Change something about the design and there’s a high likelihood that a test will break. Seems like this might be able to be smarter about adapting to changes. That’s also a great use for a smaller model like this.
PSA: For those getting stuck in a repetitive loop or just stopping without completing a task, try the interactive template. I just tried it now and it's blowing my already impressive results out of the water (llama.cpp):
Totally agree: an LM Studio headless server on a remote machine, with models controlled from your laptop, is an amazing workflow. But Gemma 4 was not a good model, at least in my trials: asked to "find me the largest text file in all of the current sub folders", it went into a looping tool call forever, even at Q8.
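For anyone who hasn't tried that flow, it's roughly this (assuming the lms CLI is installed on the remote box; command and flag names are from memory, so check lms --help):

$ # on the remote machine: start LM Studio's OpenAI-compatible server without the GUI
$ lms server start
$ lms load <model-key>                     # <model-key> is a placeholder for whichever model you want resident
$ # from the laptop: point any OpenAI-compatible client at it over the LAN
$ curl http://remote-host:1234/v1/models   # remote-host is a placeholder; 1234 is LM Studio's default port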
I want local models to succeed, but today the gap vs cloud models still seems too large. Even with a $2k GPU or a $4k MBP, the quality and speed tradeoff usually isn't sensible.
Credit to Google for releasing Gemma 4, though. I’d love to see local models reach the point where a 32 GB machine can handle high quality agentic coding at a practical speed.
I could see a future in which the major AI labs run a local LLM to offload much of the computational effort currently undertaken in the cloud, leaving the heavy lifting to cloud-hosted models and the easier stuff for local inference.
Qwen3-coder has been better for coding in my experience and has similar sizes. Either way, after a bunch of frustration with the quality and price of CC lately I’m happy there are local options.
omlx serve? If so, how are you bumping up the max context size? I'm not seeing a param to go above 32k?
$ llama-server --reasoning-format auto -fa on -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL --temp 1.0 --top-p 0.95 --top-k 64
$ uvx swival --provider llamacpp
Done.
Why/why not?
It's so janky, and there are far superior CLI coding harnesses out there.