Running Gemma 4 locally with LM Studio's new headless CLI and Claude Code (ai.georgeliu.com)

by vbtechguy 103 comments 407 points

[−] d4rkp4ttern 39d ago
You can use llama.cpp server directly to serve local LLMs and use them in Claude Code or other CLI agents. I’ve collected full setup instructions for Gemma4 and other recent open-weight LLMs here, tested on my M1 Max 64 GB MacBook:

https://pchalasani.github.io/claude-code-tools/integrations/...

The 26BA4B is the most interesting to run on such hardware, and I get nearly double the token-gen speed (40 tok/s) compared to Qwen3.5 35BA3B. However, the tau2-bench results[1] for this Gemma4 variant lag far behind the Qwen variant (68% vs 81%), so I don't expect the former to do well on tool-heavy agentic tasks:

[1] https://news.ycombinator.com/item?id=47616761

[−] peder 39d ago
Did you have any Anthropic-vs-OpenAI API spec issues with Claude Code? I have been using mlx_vlm and vMLX and I get 400 Bad Request errors from Claude Code. Presumably you're not seeing those issues with llama-server?
[−] d4rkp4ttern 39d ago
Correct, no issues: for at least a few months now, llama.cpp's server has exposed an Anthropic Messages API at v1/messages, in addition to the OpenAI-compatible API at v1/chat/completions. Claude Code uses the former.
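
In practice the wiring is just an env var. Roughly, on my setup (model name per the write-up linked upthread; llama-server listens on port 8080 by default, and the auth token is just a placeholder):

  llama-server -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL --jinja -c 131072

  ANTHROPIC_BASE_URL=http://localhost:8080 ANTHROPIC_AUTH_TOKEN=local claude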
[−] selectodude 39d ago
I’ve jumped over to oMLX. A ton of rough edges but I think it’s the future.
[−] d4rkp4ttern 39d ago
At least for Gemma4-26B-A4B, token-gen speed with oMLX is far worse on my M1 Max 64GB MacBook compared to llama-server:

  Quick benchmark on M1 Max 64GB, Gemma 4 26B-A4B (MoE), comparing matched dynamic 4-bit quants. Workload
  was Claude Code, which sends ~35K tokens of input context per request (system prompt + tools + user
  message):

  llama.cpp (unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL, llama-server -fa on -c 131072 --jinja --temp 1.0
  --top-p 0.95 --top-k 64):
  - pp ≈ 395 tok/s
  - tg ≈ 40 tok/s

  oMLX (unsloth/gemma-4-26b-a4b-it-UD-MLX-4bit, omlx serve --model-dir ~/models/omlx, with
  sampling.max_context_window and max_tokens bumped to 131072 in ~/.omlx/settings.json):
  - pp ≈ 350 tok/s
  - tg ≈ 5–13 tok/s

  Same model family and quant tier. Prompt processing is comparable, but oMLX's token generation is 3–7x
  slower than llama.cpp's Metal backend. Counter-intuitive given MLX is Apple's native ML framework.
[−] unstatusthequo 36d ago
Check out vMLX if you use Apple Silicon. https://github.com/jjang-ai/mlxstudio
[−] vlowther 39d ago
Same. Opencode + oMLX (0.3.4) + unsloth-Qwen3-Coder-Next-mlx-8bit on my M5 Max w 128GB is the sweet spot for me locally. The prompt decode caching keeps things coherent and fast even when contexts get north of 100k tokens.
[−] peder 39d ago
Have you been using omlx serve? If so, how are you bumping up the max context size? I'm not seeing a param to go above 32k?
[−] d4rkp4ttern 39d ago
You can set it in ~/.omlx/settings.json. Ask a code agent to figure it out by pointing it at the oMLX repo.
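
Rough sketch of what mine ended up looking like (treat the exact nesting as approximate; these are the two values I bumped, per my benchmark comment above):

  {
    "sampling": {
      "max_context_window": 131072,
      "max_tokens": 131072
    }
  }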
[−] tatrions 39d ago
[flagged]
[−] seifbenayed1992 39d ago
Local models are finally starting to feel pleasant instead of just "possible." The headless LM Studio flow is especially nice because it makes local inference usable from real tools instead of as a demo.

Related note from someone building in this space: I've been working on cloclo (https://www.npmjs.com/package/cloclo), an open-source coding agent CLI, and this is exactly the direction I'm excited about. It natively supports LM Studio, Ollama, vLLM, Jan, and llama.cpp as providers alongside cloud models, so you can swap between local and hosted backends without changing how you work.

Feels like we're getting closer to a good default setup where local models are private/cheap enough to use daily, and cloud models are still there when you need the extra capability.

[−] SeriousM 39d ago
How does cloclo differ from pi-mono?
[−] seifbenayed1992 39d ago
pi-mono is a great toolkit — coding agent CLI, unified LLM API, web UI, Slack bot, vLLM pods.

cloclo is a runtime for agent toolkits. You plug it into your own agents and it gives them multi-agent orchestration (AICL protocol), 13 providers, skill registry, native browser/docs/phone tools, memory, and an NDJSON bridge. Zero native deps.

[−] trvz 40d ago

  ollama launch claude --model gemma4:26b
[−] datadrivenangel 40d ago
It's amazing how simple this is, and it just works if you have ollama and claude installed!
[−] gcampos 40d ago
You need to increase the context window size or the tool-calling feature won't work.
[−] mil22 40d ago
For those wondering how to do this:

  OLLAMA_CONTEXT_LENGTH=64000 ollama serve
or, if you're using the app, open the Ollama app's Settings dialog and adjust it there.

Codex also works:

  ollama launch codex --model gemma4:26b
[−] pshirshov 40d ago
For some reason, that doesn't work for me: Claude never returns from some endless loop. Nemotron, GLM and Qwen 3.5 work just fine; Gemma doesn't.
[−] trvz 40d ago
Since that defaults to the q4 variant, try the q8 one:

  ollama launch claude --model gemma4:26b-a4b-it-q8_0
[−] pshirshov 40d ago
Even tried gemma4:31b, and gemma4:31b with 128k context (I have 72 GiB VRAM). Nothing. I'm cursed, I guess. That's ollama-rocm, if that matters (I had weird bugs on Vulkan; maybe Gemma misbehaves on Radeons somehow?..).

UPD: tried ollama-vulkan. It works: gemma4:31b-it-q8_0 with 64k context!

[−] alfiedotwtf 39d ago
The default context is 128k for the smaller Gemma 4s and 256k for the bigger ones, so you're cutting off context and it doesn't know how to continue.

Bump it to native (or -c 0 may work too)
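
For example with llama-server, where -c 0 means "use the context length baked into the model" (for Ollama, use the OLLAMA_CONTEXT_LENGTH approach mentioned elsewhere in the thread):

  llama-server -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL -c 0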

[−] pshirshov 39d ago
In that case the model descriptor on ollama.com is incorrect, because it defaults to 16k. So I have to manually change that to 64/128k. I think you are talking about maximum context size.
[−] trvz 39d ago
No, the default context in Ollama varies by the memory available: https://docs.ollama.com/context-length
[−] martinald 40d ago
Just FYI, MoE doesn't really save (V)RAM. You still need all the weights loaded in memory; it just means you consult fewer of them per forward pass. So it improves tok/s but not VRAM usage.
[−] IceWreck 40d ago
It does if you use an inference engine that can offload some of the experts from VRAM to CPU RAM. That means I can fit a 35-billion-param MoE in, say, a 12 GB VRAM GPU + 16 GB of system RAM.
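
With llama.cpp the usual trick is to keep everything on the GPU except the expert tensors; a rough sketch (the model filename is a placeholder, and the tensor-name regex / flag spelling can vary by build, so check --help on your version):

  # keep attention and shared weights on the GPU, push MoE expert tensors to system RAM
  llama-server -m qwen3.5-35b-a3b-Q4_K_M.gguf -ngl 99 -ot ".ffn_.*_exps.=CPU"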
[−] Yukonv 40d ago
With that you're taking a significant performance penalty and become severely I/O-bottlenecked. I've been able to stream Qwen3.5-397B-A17B from my M5 Max (12 GB/s SSD read) using the Flash MoE technique at the brisk pace of 10 tokens per second. As tokens are generated, different experts need to be consulted, resulting in a lot of I/O churn. So while feasible, it's only great for batch jobs, not interactive usage.
[−] IceWreck 40d ago

> So while feasible it's only great for batch jobs not interactive usage.

I mean, yeah, true, but it depends on how big the model is. The example I gave (Qwen 3.5 35BA3B) was fitting a 35B Q4_K_M model (say 20 GB in size) in 12 GB VRAM. With a 4070 Ti + high-speed 32 GB DDR5 RAM you can easily get 700 tok/s prompt processing and 55-60 tok/s generation, which is quite fast.

On the other hand, if I try to fit a 120B model in 96 GB of DDR5 + the same 12 GB VRAM, I get 2-5 tok/s generation.

[−] zozbot234 40d ago
Your 120B model likely has way more active parameters, so it can probably only fit a few shared layers in the VRAM of your dGPU. You might be better off running that model on a unified-memory platform: slower than dedicated VRAM, but a lot more of it.
[−] IceWreck 39d ago
Yep, I understand I was giving an example to the person I was replying to.
[−] zozbot234 40d ago
10 tok/s is quite fine for chatting, though less so for interactive agentic workloads. So the technique itself is still worthwhile for running a huge model locally.
[−] functional_dev 39d ago
This confused me at first as well: inactive experts skip compute, but the weights are still loaded. So memory usage does not shrink at all.

I found this visualisation helpful - https://vectree.io/c/sparse-activation-patterns-and-memory-e...

[−] charcircuit 40d ago
You never need to have all weights in memory. You can swap them in from RAM, disk, the network, etc. MOE reduces the amount of data that will need to be swapped in for the next forward pass.
[−] martinald 40d ago
Yes, you're technically right, but in reality you'd be swapping the (vast?) majority of them in and out per inference request, which would create an enormous bottleneck for the author's use case.
[−] charcircuit 40d ago
You don't have to keep only the actively used experts in VRAM. You can load as many weights as will fit. If there is a "cache miss" you pay the price of swapping in the weights, but on a hit you don't.
[−] zozbot234 40d ago
With unified memory, reading from RAM to GPU compute buffer is not that painful, and you can use partial RAM caching to minimize the impact of other kinds of swapping.
[−] mikkupikku 39d ago
In practical terms, is this kind of architecture available to consumers except through Apple?
[−] vbtechguy 40d ago
Here is how I set up Gemma 4 26B for local inference on macOS so it can be used with Claude Code.
[−] ashwanth_megas 33d ago
The interesting bottleneck I keep running into isn’t just model quality — it’s lifecycle management of models in constrained environments (load → run → unload patterns, plus routing between different models depending on task type).

Curious if anyone else is exploring per-request model execution rather than keeping models resident all the time.
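
For the load → run → unload pattern, the closest off-the-shelf knob I've seen is Ollama's per-request keep_alive: set it to 0 and the model is evicted right after the response. A quick sketch (model name just as an example):

  curl http://localhost:11434/api/generate -d '{
    "model": "gemma4:26b",
    "prompt": "hello",
    "keep_alive": 0
  }'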

[−] jonplackett 40d ago
So wait what is the interaction between Gemma and Claude?
[−] asymmetric 40d ago
Is a framework desktop with >48GB of RAM a good machine to try this out?
[−] jedisct1 39d ago
Running Gemma 4 with llama.cpp and Swival:

$ llama-server --reasoning auto --fit on -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL --temp 1.0 --top-p 0.95 --top-k 64

$ uvx swival --provider llamacpp

Done.

[−] Someone1234 40d ago
Using Claude Code as the frontend seems popular currently. I wonder how long until Anthropic releases an update to make it anywhere from a little to a lot less turn-key? They've been very clear that they aren't exactly champions of this stuff being used outside of very specific ways.
[−] drob518 39d ago
Seems like this might be a great way to do web software testing. We’ve had Selenium and Puppeteer for a long time but they are a bit brittle with respect to the web design. Change something about the design and there’s a high likelihood that a test will break. Seems like this might be able to be smarter about adapting to changes. That’s also a great use for a smaller model like this.
[−] alfiedotwtf 38d ago
PSA: For those getting stuck in a repetitive loop or just stopping without completing a task, try the interactive template. I just tried it now and it's blowing my already impressive results out of the water (llama.cpp):

    --jinja --chat-template-file models/templates/google-gemma-4-31B-it-interleaved.jinja
[−] bicepjai 37d ago
Totally agree: an LM Studio headless server on a remote machine, with models controlled from your laptop, is an amazing workflow. But Gemma 4 was not a good model, at least in my trials: asked to “find me the largest text file in all of the current sub folders”, it went into a looping tool call forever, even with Q8.
[−] pseudosavant 39d ago
I want local models to succeed, but today the gap vs cloud models still seems too large. Even with a $2k GPU or a $4k MBP, the quality and speed tradeoff usually isn't sensible.

Credit to Google for releasing Gemma 4, though. I’d love to see local models reach the point where a 32 GB machine can handle high quality agentic coding at a practical speed.

[−] ttul 39d ago
I could see a future in which the major AI labs run a local LLM to offload much of the computational effort currently undertaken in the cloud, leaving the heavy lifting to cloud-hosted models and the easier stuff for local inference.
[−] janalsncm 39d ago
Qwen3-coder has been better for coding in my experience and has similar sizes. Either way, after a bunch of frustration with the quality and price of CC lately I’m happy there are local options.
[−] AbuAssar 39d ago
oMLX gives better performance than Ollama on Apple Silicon.
[−] Imanari 39d ago
How well do the Gemma 4 models perform on agentic coding? What are your impressions?
[−] aetherspawn 40d ago
Can you use the smaller Gemma 4B model as speculative decoding for the larger 31B model?

Why/why not?
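
In llama.cpp terms I'd imagine trying the draft-model flags, something like this (filenames made up, flags from memory, untested):

  llama-server -m gemma-4-31B-it-Q4_K_M.gguf \
    -md gemma-4-4B-it-Q4_K_M.gguf --draft-max 16

My (possibly wrong) understanding is that the draft and main models need a compatible tokenizer/vocab for the drafted tokens to be verifiable, which should be the case within the same Gemma 4 family, so I'd expect it to at least run.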

[−] tiku 39d ago
I hate that my M5 with 24 GB has so much trouble with these models. Not getting any good speeds, even with simple models.
[−] _2fnr 39d ago
[flagged]
[−] edinetdb 40d ago
[flagged]
[−] meidad_g 39d ago
[flagged]
[−] techpulselab 40d ago
[dead]
[−] aimemobe 33d ago
[flagged]
[−] meidad_g 40d ago
[flagged]
[−] aplomb1026 40d ago
[dead]
[−] maxbeech 39d ago
[dead]
[−] inzlab 40d ago
Awesome. The lighter the hardware that can run big software, the more novel it feels.
[−] smcleod 39d ago
Did you try the MLX model instead? In general MLX tends to provide much better performance than GGUF/llama.cpp on macOS.
[−] NamlchakKhandro 40d ago
I don't know why people bother with Claude Code.

It's so janky; there are far superior CLI coding harnesses out there.