Google releases Gemma 4 open models

[−] danielhanchen 43d ago

Thinking / reasoning + multimodal + tool calling.

We made some quants at https://huggingface.co/collections/unsloth/gemma-4 for folks to run them - they work really well!

Guide for those interested: https://unsloth.ai/docs/models/gemma-4

Also note to use temperature = 1.0, top_p = 0.95, top_k = 64 and the EOS is "". "<|channel>thought\n" is also used for the thinking trace!

[−] evilelectron 43d ago

Daniel, your work is changing the world. More power to you.

I setup a pipeline for inference with OCR, full text search, embedding and summarization of land records dating back 1800s. All powered by the GGUF's you generate and llama.cpp. People are so excited that they can now search the records in multiple languages that a 1 minute wait to process the document seems nothing. Thank you!

[−] danielhanchen 43d ago

Oh appreciate it!

Oh nice! That sounds fantastic! I hope Gemma-4 will make it even better! The small ones 2B and 4B are shockingly good haha!

[−] qingcharles 42d ago

Just switched from 3.1 Flash Lite to Gemma-4 31B on the AI Studio API since there is a generous 1500/day on non-billed projects. It's doing fantastic.

[−] polishdude20 43d ago

Hey in really interested in your pipeline techniques. I've got some pdfs I need to get processed but processing them in the cloud with big providers requires redaction.

Wondering if a local model or a self hosted one would work just as well.

[−] evilelectron 43d ago

I run llama.cpp with Qwen3-VL-8B-Instruct-Q4_K_S.gguf with mmproj-F16.gguf for OCR and translation. I also run llama.cpp with Qwen3-Embedding-0.6B-GGUF for embeddings. Drupal 11 with ai_provider_ollama and custom provider ai_provider_llama (heavily derived from ai_provider_ollama) with PostreSQL and pgvector.

People on site scan the documents and upload them for archival. The directory monitor looks for new files in the archive directories and once a new file is available, it is uploaded to Drupal. Once a new content is created in Drupal, Drupal triggers the translation and embedding process through llama.cpp. Qwen3-VL-8B is also used for chat and RAG. Client is familiar with Drupal and CMS in general and wanted to stay in a similar environment. If you are starting new I would recommend looking at docling.

[−] lwhi 43d ago

Are you linking any of the processes using the Drupal AI module suite?

[−] evilelectron 42d ago

Yes, they are all linked using Drupal's AI modules. I have an OpenCV application that removes the old paper look, enhances the contrast and fixes the orientation of the images before they hit llama.cpp for OCR and translation.

[−] chrisweekly 43d ago

Disclaimer: I'm an AI novice relative to many here. FWIW last wknd I spent a couple hours setting up self-hosted n8n with ollama and gemma3:4b [EDIT: not Qwen-3.5], using PDF content extraction for my PoC. 100% local workflow, no runtime dependency on cloud providers. I doubt it'd scale very well (macbook air m4, measly 16GB RAM), but it works as intended.

[−] patrickk 43d ago

For those who wish to do OCR on photos, like receipts, or PDFs or anything really, Paperless-NGX works amazingly well and runs on a potato.

[−] polishdude20 43d ago

How do you extract the content? OCR? Pdf to text then feed into qwen?

I tried something similar where I needed a bunch of tables extracted from the pdf over like 40 pages. It was crazy slow on my MacBook and innacurate

[−] philipkglass 43d ago

If you have a basic ARM MacBook, GLM-OCR is the best single model I have found for OCR with good table extraction/formatting. It's a compact 0.9b parameter model, so it'll run on systems with only 8 GB of RAM.

https://github.com/zai-org/GLM-OCR

Use mlx-vlm for inference:

https://github.com/zai-org/GLM-OCR/blob/main/examples/mlx-de...

Then you can run a single command to process your PDF:

  glmocr parse example.pdf

  Loading images: example.pdf
  Found 1 file(s)
  Starting Pipeline...
  Pipeline started!
  GLM-OCR initialized in self-hosted mode
  Using Pipeline (enable_layout=true)...

  === Parsing: example.pdf (1/1) ===

My test document contains scanned pages from a law textbook. It's two columns of text with a lot of footnotes. It took 60 seconds to process 5 pages on a MBP with M4 Max chip.

After it's done, you'll have a directory output/example/ that contains .md and .json files. The .md file will contain a markdown rendition of the complete document. The .json file will contain individual labeled regions from the document along with their transcriptions. If you get all the JSON objects with

  "label": "table"

from the JSON file, you can get an HTML-formatted table from each "content" section of these objects.

It might still be inaccurate -- I don't know how challenging your original tables are -- but it shouldn't be terribly slow. The tables it produced for me were good.

I have also built more complex work flows that use a mixture of OCR-specialized models and general purpose VLM models like Qwen 3.5, along with software to coordinate and reconcile operations, but GLM-OCR by itself is the best first thing to try locally.

[−] polishdude20 43d ago

Thanks! Just tried it on a 40 page pdf. Seems to work for single images but the large pdf gives me connection timeouts

[−] philipkglass 43d ago

I also get connection timeouts on larger documents, but it automatically retries and completes. All the pages are processed when I'm done. However, I'm using the Python client SDK for larger documents rather than the basic glmocr command line tool. I'm not sure if that makes a difference.

[−] polishdude20 43d ago

Yeah looks like the cli also retries as well. I was able to get it working using a higher timeout.

[−] davidbjaffe 42d ago

Cool! For GLM-OCR, do you use "Option 2: Self-host with vLLM / SGLang" and in that case, am I correct that there is no internet connection involved and hence connection timeouts would be avoided entirely?

[−] philipkglass 42d ago

When you self-host, there's still a client/server relationship between your self-hosted inference server and the client that manages the processing of individual pages. You can get timeouts depending on the configured timeouts, the speed of your inference server, and the complexity of the pages you're processing. But you can let the client retry and/or raise the initial timeout limit if you keep running into timeouts.

That said, this is already a small and fast model when hosted via MLX on macOS. If you run the inference server with a recent NVidia GPU and vLLM on Linux it should be significantly faster. The big advantage with vLLM for OCR models is its continuous batching capability. Using other OCR models that I couldn't self-host on macOS, like DeepSeek 2 OCR or Chandra 2, vLLM gave dramatic throughput improvements on big documents via continuous batching if I process 8-10 pages at a time. This is with a single 4090 GPU.

[−] chrisweekly 43d ago

1. Correction: I'd planned to use Qwen-3.5 but ended up using gemma3:4b.

2. The n8n workflow passes a given binary pdf to gemma, which (based on a detailed prompt) analyzes it and produces JSON output.

See https://github.com/LinkedInLearning/build-with-ai-running-lo... if you want more details. :)

[−] tehologist 43d ago

Python pdftools to convert to images and tesseract to ocr them to text files. Fast free and can run on CPU.

[−] jorl17 43d ago

Seconded, would also love to hear your story if you would be willing

[−] Breza 42d ago

I'm very active in family history and this kind of project is massively helpful, thank you

[−] wok4899 41d ago

This is a very interesting project. If it's publicly available, would you mind sharing it? I would love to understand how it works.

Ps: found your other comments, thanks.

[−] irishcoffee 42d ago

> your work is changing the world

I realize this may have been hyperbole, but it sure isn't changing the world.

[−] a96 40d ago

For relatively small values of changing or world, it sure is.

In the world of local models, Unsloth is one of the most significant projects there is.

[−] akavel 43d ago

I'm trying to disable "thinking", but it doesn't seem to work (in llama.cpp). The usual --reasoning-budget 0 doesn't seem to change it, nor --chat-template-kwargs '{"enable_thinking":false}' (both with --jinja). Am I missing something?

EDIT: Ok, looks like there's yet another new flag for that in llama.cpp, and this one seems to work in this case: --reasoning off.

FWIW, I'm doing some initial tries of unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL, and for writing some Nix, I'm VERY impressed - seems significantly better than qwen3.5-35b-a3b for me for now. Example commandline on a Macbook Air M4 32gb RAM:

  llama-cli -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL  -t 1.0 --top-p 0.95 --top-k 64 -fa on --no-mmproj --reasoning-budget 0 -c 32768 --jinja --reasoning off

(at release b8638, compiled with Nix)

[−] danielhanchen 43d ago

Oh very cool! Will check the --reasoning off flag as well!

Yep the models are really good!

[−] Imustaskforhelp 43d ago

Daniel, I know you might hear this a lot but I really appreciate a lot of what you have been doing at Unsloth and the way you handle your communication, whether within hackernews/reddit.

I am not sure if someone might have asked this already to you, but I have a question (out of curiosity) as to which open source model you find best and also, which AI training team (Qwen/Gemini/Kimi/GLM) has cooperated the most with the Unsloth team and is friendly to work with from such perspective?

[−] danielhanchen 43d ago

Thanks a lot for the support :)

Tbh Gemma-4 haha - it's sooooo good!!!

For teams - Google haha definitely hands down then Qwen, Meta haha through PyTorch and Llama and Mistral - tbh all labs are great!

[−] Imustaskforhelp 43d ago

Now you have gotten me a bit excited for Gemma-4, Definitely gonna see if I can run the unsloth quants of this on my mac air & thanks for responding to my comment :-)

[−] danielhanchen 43d ago

Thanks! Have a super good day!!

[−] genpfault 43d ago

llama.cpp (b8642) auto-fits ~200k context on this 24GB RX 7900 XTX & it shows a solid 100+ tok/s ("S_TG t/s") on the first 32k of it, nice!

    ./llama-batched-bench -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL \
    -npp 1000,2000,4000,8000,16000,32000,64000,96000,128000 -ntg 128 -npl 1 -c 0
    |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
    |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
    |  1000 |    128 |    1 |   1128 |    0.416 |  2404.87 |    1.064 |   120.29 |    1.480 |   762.20 |
    |  2000 |    128 |    1 |   2128 |    0.755 |  2649.86 |    1.075 |   119.04 |    1.830 |  1162.83 |
    |  4000 |    128 |    1 |   4128 |    1.501 |  2665.72 |    1.093 |   117.08 |    2.594 |  1591.49 |
    |  8000 |    128 |    1 |   8128 |    3.142 |  2545.85 |    1.114 |   114.87 |    4.257 |  1909.47 |
    | 16000 |    128 |    1 |  16128 |    6.908 |  2316.00 |    1.189 |   107.65 |    8.097 |  1991.73 |
    | 32000 |    128 |    1 |  32128 |   16.382 |  1953.31 |    1.278 |   100.12 |   17.661 |  1819.16 |
    | 64000 |    128 |    1 |  64128 |   43.427 |  1473.74 |    1.453 |    88.12 |   44.879 |  1428.89 |
    | 96000 |    128 |    1 |  96128 |   82.227 |  1167.50 |    1.623 |    78.86 |   83.850 |  1146.42 |
    |128000 |    128 |    1 | 128128 |  133.237 |   960.69 |    1.797 |    71.25 |  135.034 |   948.86 |

[−] spwa4 42d ago

~50 tok/s on M1 Max 64Gb

[−] l2dy 43d ago

FYI, screenshot for the "Search and download Gemma 4" step on your guide is for qwen3.5, and when I searched for gemma-4 in Unsloth Studio it only shows Gemma 3 models.

[−] trashcan2137 43d ago

  and the EOS is "". "<|channel>thought\n" is also used for the thinking trace!

Can someone explain this to me? Why is this faux-XML important here?

[−] rizzo94 42d ago

Huge fan of the Unsloth quants! Having reasoning and tool calling this accessible locally is a massive leap forward.

The main hurdle I've found with local tool calling is managing the execution boundaries safely. I’ve started plugging these local models into PAIO to handle that. Since it acts as a hardened execution layer with strict BYOK sovereignty, it lets you actually utilize Gemma-4's tool calling capabilities without the low-level anxiety of a hallucination accidentally wiping your drive. It’s the perfect secure gateway for these advanced local models.

[−] Wowfunhappy 43d ago

Hi! Do you ever make quants of the base models? I'm interested in experimenting with them in non-chat contexts.

[−] zaat 43d ago

Thank you for your work.

You have an answer on your page regarding "Should I pick 26B-A4B or 31B?", but can you please clarify if, assuming 24GB vRAM, I should pick a full precision smaller model or 4 bit larger model?

[−] kapimalos 43d ago

Noob question. Why I would use this version over the original model?

[−] Kye 43d ago

I haven't tried a local model in a while. I can only fit E4B in VRAM (8GB), but it's good enough that I can see it replacing Claude.ai for some things.

[−] pentagrama 43d ago

Hey, I tried to use Unsloth to run Gemma 4 locally but got stuck during the setup on Windows 11.

At some point it asked me to create a password, and right after that it threw an error. Here’s a screenshot: https://imgur.com/a/sCMmqht

This happened after running the PowerShell setup, where it installed several things like NVIDIA components, VS Code, and Python. At the end, PowerShell tell me to open a http://localhost URL in my browser, and that’s where I was prompted to set the password before it failed.

Also, I noticed that an Unsloth icon was added to my desktop, but when I click it, nothing happens.

For context, I’m not a developer and I had never used PowerShell before. Some of the steps were a bit intimidating and I wasn’t fully sure what I was approving when clicking through.

The overall experience felt a bit rough for my level. It would be great if this could be packaged as a simple .exe or a standalone app instead of going through terminal and browser steps.

Are there any plans to make something like that?

[−] sillysaurusx 43d ago

Temperature 1.0 used to be bad for sampling. 0.7 was the better choice, and the difference in results were noticeable. You may want to experiment with this.

[−] sixhobbits 43d ago

Thanks for this, I gave this guide to my Claude and he oneshot the unsloth and gemma4 set up on the old macbook he runs on. It's way faster than I expected, haven't tried out local models for a few generations but will be very nice when they become useful

[−] egeres 43d ago

Thank you and your brother for all the amazing work, it's really inspiring to others <3

[−] zkmon 43d ago

How does Gemma 4 26B A4B compare with Qwen3.5 35B A3B for same quants(4)

[−] mmaunder 40d ago

This comment deserves it's own HN post. Thanks!

[−] jquery 43d ago

Awesome!! Thank you SO much for this.

[−] nnucera 43d ago

Wow! Thank you very much!

[−] zobzu 43d ago

neat, time to update my spam filter model hehe

[−] simonw 43d ago

I ran these in LM Studio and got unrecognizable pelicans out of the 2B and 4B models and an outstanding pelican out of the 26b-a4b model - I think the best I've seen from a model that runs on my laptop.

https://simonwillison.net/2026/Apr/2/gemma-4/

The gemma-4-31b model is completely broken for me - it just spits out "---\n" no matter what prompt I feed it. I got a pelican out of it via the AI Studio API hosted model instead.

[−] scrlk 43d ago

Comparison of Gemma 4 vs. Qwen 3.5 benchmarks, consolidated from their respective Hugging Face model cards:

    | Model          | MMLUP | GPQA  | LCB   | ELO  | TAU2  | MMMLU | HLE-n | HLE-t |
    |----------------|-------|-------|-------|------|-------|-------|-------|-------|
    | G4 31B         | 85.2% | 84.3% | 80.0% | 2150 | 76.9% | 88.4% | 19.5% | 26.5% |
    | G4 26B A4B     | 82.6% | 82.3% | 77.1% | 1718 | 68.2% | 86.3% |  8.7% | 17.2% |
    | G4 E4B         | 69.4% | 58.6% | 52.0% |  940 | 42.2% | 76.6% |   -   |   -   |
    | G4 E2B         | 60.0% | 43.4% | 44.0% |  633 | 24.5% | 67.4% |   -   |   -   |
    | G3 27B no-T    | 67.6% | 42.4% | 29.1% |  110 | 16.2% | 70.7% |   -   |   -   |
    | GPT-5-mini     | 83.7% | 82.8% | 80.5% | 2160 | 69.8% | 86.2% | 19.4% | 35.8% |
    | GPT-OSS-120B   | 80.8% | 80.1% | 82.7% | 2157 |  --   | 78.2% | 14.9% | 19.0% |
    | Q3-235B-A22B   | 84.4% | 81.1% | 75.1% | 2146 | 58.5% | 83.4% | 18.2% |  --   |
    | Q3.5-122B-A10B | 86.7% | 86.6% | 78.9% | 2100 | 79.5% | 86.7% | 25.3% | 47.5% |
    | Q3.5-27B       | 86.1% | 85.5% | 80.7% | 1899 | 79.0% | 85.9% | 24.3% | 48.5% |
    | Q3.5-35B-A3B   | 85.3% | 84.2% | 74.6% | 2028 | 81.2% | 85.2% | 22.4% | 47.4% |

    MMLUP: MMLU-Pro
    GPQA: GPQA Diamond
    LCB: LiveCodeBench v6
    ELO: Codeforces ELO
    TAU2: TAU2-Bench
    MMMLU: MMMLU
    HLE-n: Humanity's Last Exam (no tools / CoT)
    HLE-t: Humanity's Last Exam (with search / tool)
    no-T: no think

[−] neonstatic 43d ago

Prompt:

> what is the Unix timestamp for this: 2026-04-01T16:00:00Z

Qwen 3.5-27b-dwq

> Thought for 8 minutes 34 seconds. 7074 tokens.

> The Unix timestamp for 2026-04-01T16:00:00Z is:

> 1775059200 (my comment: Wednesday, 1 April 2026 at 16:00:00)

Gemma-4-26b-a4b

> Thought for 33.81 seconds. 694 tokens.

> The Unix timestamp for 2026-04-01T16:00:00Z is:

> 1775060800 (my comment: Wednesday, 1 April 2026 at 16:26:40)

Gemma considered three options to solve this problem. From the thinking trace:

> Option A: Manual calculation (too error-prone).

> Option B: Use a programming language (Python/JavaScript).

> Option C: Knowledge of specific dates.

It then wrote a python script:

  from datetime import datetime, timezone
  date_str = "2026-04-01T16:00:00Z"
  # Replace Z with +00:00 for ISO format parsing or just strip it
  dt = datetime.strptime(date_str, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
  ts = int(dt.timestamp())
  print(ts)

Then it verified the timestamp with a command:

  date -u -d @1775060800

All of this to produce a wrong result. Running the python script it produced gives the correct result. Running the verification date command leads to a runtime error (hallucinated syntax). On the other hand Qwen went straight to Option A and kept overthinking the question, verifying every step 10 times, experienced a mental breakdown, then finally returned the right answer. I think Gemma would be clearly superior here if it used the tools it came up with rather than hallucinating using them.

[−] canyon289 43d ago

Hi all! I work on the Gemma team, one of many as this one was a bigger effort given it was a mainline release. Happy to answer whatever questions I can

[−] chrislattner 43d ago

If you want the fastest open source implementation on Blackwell and AMD MI355, check out Modular's MAX nightly. You can pip install it super fast, check it out here: https://www.modular.com/blog/day-zero-launch-fastest-perform...

-Chris Lattner (yes, affiliated with Modular :-)

[−] NitpickLawyer 43d ago

Best thing is that this is Apache 2.0 (edit: and they have base models available. Gemma3 was good for finetuning)

The sizes are E2B and E4B (following gemma3n arch, with focus on mobile) and 26BA4 MoE and 31B dense. The mobile ones have audio in (so I can see some local privacy focused translation apps) and the 31B seems to be strong in agentic stuff. 26BA4 stands somewhere in between, similar VRAM footprint, but much faster inference.

[−] antirez 43d ago

Featuring the ELO score as the main benchmark in chart is very misleading. The big dense Gemma 4 model does not seem to reach Qwen 3.5 27B dense model in most benchmarks. This is obviously what matters. The small 2B / 4B models are interesting and may potentially be better ASR models than specialized ones (not just for performances but since they are going to be easily served via llama.cpp / MLX and front-ends). Also interesting for "fast" OCR, given they are vision models as well. But other than that, the release is a bit disappointing.

[−] originalvichy 43d ago

The wait is finally over. One or two iterations, and I’ll be happy to say that language models are more than fulfilling my most common needs when self-hosting. Thanks to the Gemma team!

[−] swalsh 43d ago

I gave the same prompt (a small rust project that's not easy, but not overly sophisticated) to both Gemma-4 26b and Qwen 3.5 27b via OpenCode. Qwen 3.5 ran for a bit over an hour before I killed it, Gemma 4 ran for about 20 minutes before it gave up. Lots of failed tool calls.

I asked codex to write a summary about both code bases.

"Dev 1" Qwen 3.5

"Dev 2" Gemma 4

Dev 1 is the stronger engineer overall. They showed better architectural judgment, stronger completeness, and better maintainability instincts. The weakness is execution rigor: they built more, but didn’t verify enough, so important parts don’t actually hold up cleanly.

Dev 2 looks more like an early-stage prototyper. The strength is speed to a rough first pass, but the implementation is much less complete, less polished, and less dependable. The main weakness is lack of finish and technical rigor.

If I were choosing between them as developers, I’d take Dev 1 without much hesitation.

Looking at the code myself, i'd agree with codex.

[−] d4rkp4ttern 43d ago

For token-generation speed, a challenging test is to see how it performs in a code-agent harness like Claude Code, which has anywhere between 15-40K tokens from the system prompt itself (+ tools/skills etc).

Here the 26B-A4B variant is head and shoulders above recent open-weight models, at least on my trusty M1 Max 64GB MacBook.

I set up Claude Code to use this variant via llama-server, with 37K tokens initial context, and it performs very well: ~40 tokens/sec, far better than Qwen3.5-35B-A3B, though I don't know yet about the intelligence or tool-calling consistency. Prompt processing speed is comparable to the Qwen variant at ~400 tok/s.

My informal tests, all with roughly 30K-37K tokens initial context:

    ┌────────────────────┬───────────────┬────────────┐
    │       Model        │ Active Params │ tg (tok/s) │
    ├────────────────────┼───────────────┼────────────┤
    │ Gemma-4-26B-A4B    │ 4B            │ ~40        │
    ├────────────────────┼───────────────┼────────────┤
    │ GPT-OSS-20B        │ 3.6B          │ ~17-38     │
    ├────────────────────┼───────────────┼────────────┤
    │ Qwen3-30B-A3B      │ 3B            │ ~15-27     │
    ├────────────────────┼───────────────┼────────────┤
    │ GLM-4.7-Flash      │ 3B            │ ~12-13     │
    ├────────────────────┼───────────────┼────────────┤
    │ Qwen3.5-35B-A3B    │ 3B            │ ~12        │
    ├────────────────────┼───────────────┼────────────┤
    │ Qwen3-Next-80B-A3B │ 3B            │ ~3-5       │
    └────────────────────┴───────────────┴────────────┘

Full instructions for running this and other open-weight models with Claude Code are here:

https://pchalasani.github.io/claude-code-tools/integrations/...

[−] minimaxir 43d ago

The benchmark comparisons to Gemma 3 27B on Hugging Face are interesting: The Gemma 4 E4B variant (https://huggingface.co/google/gemma-4-E4B-it) beats the old 27B in every benchmark at a fraction of parameters.

The E2B/E4B models also support voice input, which is rare.

[−] nl 43d ago

Gemma-4-E4B-it scored 15/25 on my https://sql-benchmark.nicklothian.com/#all-data (agentic SQL generation).

The naming is a bit odd - E4B is "4.5B effective, 8B with embeddings", so despite the name it is probably best compared with the 8B/9B class models and is competitive with them.

Qwen3.5-9B also scores 15/25 in thinking mode for example. The best 9B model I've found is Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2 which gets to 17/25

gemma-4-E2B (4bit quant) scored 12/25, but is really a 5B model. That's the same as NVIDIA-Nemotron-3-Nano-4B which is the best 4B model I've found (yes, better than Qwen 4B).

That's a great score for a small model.

[−] Analog24 43d ago

So the "E2B" and "E4B" models are actually 5B and 8B parameters. Are we really going to start referring to the "effective" parameter count of dense models by not including the embeddings?

These models are impressive but this is incredibly misleading. You need to load the embeddings in memory along with the rest of the model so it makes no sense o exclude them from the parameter count. This is why it actually takes 5GB of RAM to run the "2B" model with 4-bit quantization according to Unsloth (when I first saw that I knew something was up).

[−] mudkipdev 43d ago

Can't wait for gemma4-31b-it-claude-opus-4-6-distilled-q4-k-m on huggingface tomorrow

[−] karimf 43d ago

I'm curious about the multimodal capabilities on the E2B and E4B and how fast is it.

In ChatGPT right now, you can have a audio and video feed for the AI, and then the AI can respond in real-time.

Now I wonder if the E2B or the E4B is capable enough for this and fast enough to be run on an iPhone. Basically replicating that experience, but all the computations (STT, LLM, and TTS) are done locally on the phone.

I just made this [0] last week so I know you can run a real-time voice conversation with an AI on an iPhone, but it'd be a totally different experience if it can also process a live camera feed.

https://github.com/fikrikarim/volocal

[−] bertili 43d ago

The timing is interesting as Apple supposedly will distill google models in the upcoming Siri update [1]. So maybe Gemma is a lower bound on what we can expect baked into iPhones.

[1] https://news.ycombinator.com/item?id=47520438

[−] stevenhubertron 43d ago

Still pretty unusable on Raspberry Pi 5, 16gb despite saying its built for it, from the E4B model

  total duration:       12m41.34930419s
  load duration:        549.504864ms
  prompt eval count:    25 token(s)
  prompt eval duration: 309.002014ms
  prompt eval rate:     80.91 tokens/s
  eval count:           2174 token(s)
  eval duration:        12m36.577002621s
  eval rate:            2.87 tokens/s

Prompt: whats a great chicken breast recipe for dinner tonight?

[−] Deegy 43d ago

So what's the business strategy here?

Google is the only USA based frontier lab releasing open models. I know they aren't doing it out of the goodness of their hearts.

[−] mikewarot 42d ago

I updated Ollama (again) and changed my windows swap file settings to use up to 200 Gb of C: (an SSD). On the largest model (gemma4:31b), I seem to be getting about 5 tokens per second. This is amazing to me, because I'm using a $100 computer, without any fancy GPU. I love watching it "think".

Consider this is thousands of times faster than any written conversations in the past. Those involved pieces of paper being transported, read, considered, replies written, then transported back.

If it'll write code that doesn't completely suck, I think even this is good enough. What do you consider the lowest acceptable rate of generating tokens/second?

[−] try-working 43d ago

The biggest story here is that this is Google handing Qwen the SOTA crown for small and medium models.

For the first time ever, a Chinese lab is at the frontier. Google and Nvidia are significantly behind, not just on benchmarks but real-world performance like tool calling accuracy.

[−] aggregator-ios 43d ago

I tested the E2B and E4B models and they get close but inaccurate (non working) results when generating jq queries from natural language.

This is of importance to me as I work on https://jsonquery.app and would prefer to use a model that works well with browser inference.

gemma-4-26b-a4b-it and gemma-4-31b-it produced accurate results in a few of my tests. But those are 50-60GB in size. Chrome has a developer preview that bundles Gemini Nano (under 2GB) and it used to work really well, but requires a few switches to be manually switched on, and has recently gotten worse in quality when testing for jq generation.

[−] ceroxylon 43d ago

Even with search grounding, it scored a 2.5/5 on a basic botanical benchmark. It would take much longer for the average human to do a similar write-up, but they would likely do better than 50% hallucination if they had access to a search engine.

[−] jwr 43d ago

Really looking forward to testing and benchmarking this on my spam filtering benchmark. gemma-3-27b was a really strong model, surpassed later by gpt-oss:20b (which was also much faster). qwen models always had more variance.

[−] VadimPR 43d ago

Gemma 3 E4E runs very quick on my Samsung S26, so I am looking forward to trying Gemma 4! It is fantastic to have local alternatives to frontier models in an offline manner.

[−] rvz 43d ago

Open weight models once again marching on and slowly being a viable alternative to the larger ones.

We are at least 1 year and at most 2 years until they surpass closed models for everyday tasks that can be done locally to save spending on tokens.

[−] Reubend 43d ago

I would suggest that people stop overfocusing on benchmarks, and give this a try. Gemma 4 is performing really well for me, and seems to hallucinate much less than other models I tried in this size range.

[−] vicchenai 43d ago

The 4B being this capable is honestly surprising. Ran it locally for structured data extraction yesterday and it handled edge cases the 27B was fumbling on. Didn't expect to swap down that fast.

[−] Igor_Wiwi 43d ago

I created a blog post specifically about running these models locally on your machine (1 liner but getting gguf may take some time): https://igorstechnoclub.com/running-gemma-4-locally-in-almos...

[−] simonw 43d ago

Anyone figured out a recipe to run Gemma 4 E2B or E4B against audio files locally on a Mac?

Google releases Gemma 4 open models (deepmind.google)

474 comments