Gemma 4 on iPhone (apps.apple.com)

by janandonly 234 comments 868 points

[−] karimf 40d ago
This app is cool and it showcases some use cases, but it still undersells what the E2B model can do.

I just made a real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B. I posted it on /r/LocalLLaMA a few hours ago and it's gaining some traction [0]. Here's the repo [1]

I'm running it on a MacBook instead of an iPhone, but based on the benchmark here [2], you should be able to run the same thing on an iPhone 17 Pro.

[0] https://www.reddit.com/r/LocalLLaMA/comments/1sda3r6/realtim...

[1] https://github.com/fikrikarim/parlor

[2] https://huggingface.co/litert-community/gemma-4-E2B-it-liter...
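
For the curious, the core of it is conceptually just this loop (heavily simplified; gemma_generate and speak below are placeholder stubs, not the actual LiteRT-LM / TTS calls in the repo):

    # Conceptual sketch only: gemma_generate() and speak() are stand-in stubs,
    # not the real LiteRT-LM and TTS plumbing used in the repo.
    import queue

    def gemma_generate(audio_chunk: bytes, frame: bytes | None) -> str:
        return "placeholder reply"  # real version: multimodal Gemma E2B call

    def speak(text: str) -> None:
        print(f"[TTS] {text}")  # real version: hand the text to a TTS engine

    audio_in: queue.Queue = queue.Queue()  # filled by a mic callback after VAD

    def run_loop(latest_frame: bytes | None = None) -> None:
        while True:
            chunk = audio_in.get()  # block until a speech segment arrives
            reply = gemma_generate(chunk, latest_frame)  # audio + frame in, text out
            speak(reply)  # voice out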

[−] dang 40d ago
Re-upped here:

Show HN: Real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B - https://news.ycombinator.com/item?id=47652007

[−] karimf 39d ago
Oh wow, that's awesome. Thanks a lot, dang!
[−] nothinkjustai 40d ago
Parlor is so cool, especially since you’re offering it for free. And a great use case for local LLMs.
[−] karimf 40d ago
Thanks! Although, I can't claim any credit for it. I just spent a day gluing together what other people have built. Huge props to the Gemma team for building an amazing model, and an inference engine focused on edge devices [0]

[0] https://github.com/google-ai-edge/LiteRT-LM

[−] storus 40d ago
That's cool! You can add SoulX-FlashHead for real-time AI head animation as well if you want to simulate a teacher.
[−] karimf 40d ago
Thanks for sharing! I'm still torn about it. Sure it'll feel more natural if you have the AI head animation, but I don't want people to get attached to it. I don't want to make the loneliness epidemic even worse.
[−] fjb040911 38d ago
[dead]
[−] PullJosh 40d ago
This is awesome!

1) I am able to run the model on my iPhone and get good results. Not as good as Gemini in the cloud, but good.

2) I love the “mobile actions” tool calls that allow the LLM to turn on the flashlight, open maps, etc. (my guess at what those look like under the hood is at the end of this comment). It would be fun if they added Siri Shortcuts support. I want the personal automation that Apple promised but never delivered.

3) I am so excited for local models to be normalized. I build little apps for teachers and there are stringent privacy laws involved that mean I strongly prefer writing code that runs fully client-side when possible. When I develop apps and websites, I want easy API access to on-device models for free. I know it sort of exists on iOS and Chrome right now, but as far as I’m aware it’s not particularly good yet.
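
Re (2), my guess at a “mobile actions” tool declaration, assuming the standard function-calling schema (the app's actual internal format is unknown to me):

    # My guess at a "mobile actions" tool declaration; this follows the common
    # function-calling schema, not (necessarily) the app's actual internal format.
    flashlight_tool = {
        "name": "set_flashlight",
        "description": "Turn the device flashlight on or off.",
        "parameters": {
            "type": "object",
            "properties": {
                "on": {"type": "boolean", "description": "Desired flashlight state"}
            },
            "required": ["on"],
        },
    }

    # The model emits JSON like {"name": "set_flashlight", "arguments": {"on": true}},
    # the app runs the native action, then feeds the result back into the chat.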

[−] buzzerbetrayed 40d ago
For me the hallucination and gaslighting are like taking a step back in time a couple of years. It even fails the “r’s in strawberry” question. How nostalgic.

It’s very impressive that this can run locally. And I hope we will continue to be able to run couple-year-old-equivalent models locally going forward.

[−] dimmke 40d ago
I haven't seen anybody else post it in this thread, but this is running on 8GB of RAM. It's not the full Gemma 4 32B model. It's a completely different thing from the full Gemma 4 experience you'd get running the flagship model, almost to the point of being misleading.

It's their E2B and E4B variants (so 2B and 4B, and quantized on top of that):

https://ai.google.dev/gemma/docs/core/model_card_4#dense_mod...

[−] zozbot234 40d ago
The relevant constraint when running on a phone is power, not really RAM footprint. Running the tiny E2B/E4B models makes sense, this is essentially what they're designed for.
[−] trvz 40d ago
It absolutely is RAM…

So much so that it's what pushed Apple to increase their base RAM.

[−] Shawnj2 39d ago
Depends on the phone. I have trouble fitting models into memory on my iPhone 13 before iOS kills the app. I imagine newer phones with more RAM don't have this issue, especially with some new flagship phones having 16+ GB of memory.
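
Back of the envelope for why (weights only; KV cache and runtime overhead come on top, and iOS reserves a chunk of RAM for itself):

    # rough weight-memory estimate: params (in billions) * bits per weight / 8 = GB
    def weight_gb(params_b: float, bits: int) -> float:
        return params_b * bits / 8

    print(weight_gb(4, 4))   # 4B model at 4-bit -> ~2.0 GB, tight but plausible
    print(weight_gb(4, 16))  # 4B model at fp16  -> ~8.0 GB, hopeless on older phones
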
[−] bigyabai 39d ago
Between the GPU, NPU and big.LITTLE cores, many phones have no fewer than 4 different power profiles they can run inference at. It's about as solved as it will get without an architectural overhaul.
[−] 1f60c 40d ago
Strangely, reasoning is not on by default. If you enable it, it answers as you'd expect.
[−] shtack 40d ago
With reasoning on I found E4B to be solid, but E2B was completely unusable across several tests.
[−] janandonly 40d ago
OP here. It is my firm belief that the only realistic use of AI in the future is either local, on-device, and nearly free, or in the cloud but way more expensive than it is today.

The latter option will only be used for tasks where humans are even more expensive or much slower.

This Gemma 4 model gives me hope for a future Siri (or similar) with deep iPhone and macOS integration, “Her”-style (as in the movie).

[−] crazygringo 40d ago

> or in the cloud but way more expensive than it is today.

Why? It's widely understood that the big players are making profit on inference. The only reason they still have losses is that training is so expensive, but you need to do that regardless of whether the models run in the cloud or on your device.

If you think about it, it's always going to be cheaper and more energy-efficient to have dedicated cloud hardware to run models. Running them on your phone, even if possible, is just going to suck up your battery life.

[−] mbesto 40d ago

> It's widely understood that the big players are making profit on inference.

This is most definitely not widely understood. We still don't know. There are tons of discussions where people disagree on whether it really is profitable. Unless you have proof, don't say "this is widely understood".

[−] victorbjorklund 39d ago
You can look at open source models hosted by various companies that have no reason to host them at a loss.
[−] mbesto 39d ago
Uber ran their ridesharing at a loss for years. This is a very common way to gain market share.
[−] victorbjorklund 39d ago
What market share? We're talking about commodity models, where the host doesn't matter at all on OpenRouter etc.
[−] BeetleB 39d ago
Uber had massive VC investment and a moat. The companies he's referring to likely have little VC investment and zero moat.
[−] petesergeant 39d ago
I don’t have “proof” but the existence of so many providers of free models on OpenRouter strongly suggests inference is running at a profit. There’s no winner-takes-all angle to being a faceless provider there (often the consumer doesn’t know who fulfilled the request), so there’s just no incentive at all for these small provider companies to exist unless inference is profitable under the right conditions.
[−] pixelispoint 39d ago

> but the existence of so many providers of free models on OpenRouter strongly suggests inference is running at a profit

I don't think it suggests a profit, but rather a _hope_ for a _future_ profit, and a commitment to a strategy that may or may not pan out. Capitalism rewards those who are early to the party and commit to their bit.

[−] int_19h 39d ago
I recently had Codex working for 80+ hrs nonstop (as in, literally a single running session in response to a single prompt!).

Even at a $200 monthly subscription, that kind of stuff burns through tokens at a rate where it's very difficult to believe they're even breaking even, never mind turning a profit.

[−] dominotw 39d ago
That's nuts. What was it doing for 80 hrs?
[−] int_19h 34d ago
The project is a semantic parser for Lojban that emits Lean. The specific task was to add the ability to go in reverse - from (a subset of) Lean back to Lojban. So the bot had a corpus of something like 25K test cases that it had to make roundtrip, and instructions to keep going until the test suite is green.
[−] fsiefken 30d ago
Cool, I'm having all kinds of ideas about the use cases. Would using Loglish (Ben Goertzel's English adaptation) be practical? https://www.goertzel.org/new_research/Loglish.htm

I'm using that to communicate with my AI. Perhaps one day we'll speak a New Ithkuil variant with AI.

[−] shinycode 39d ago
Probably asked what’s the Answer to the Ultimate Question of Life, the Universe, and Everything
[−] igtt 40d ago
The reality is we can’t trust accounting earnings anyway.

We need to see the cash flows.

[−] zozbot234 40d ago
The big players are plausibly making profits on raw API calls, not subscriptions. These are quite costly compared to third-party inference on open models, but even setting that up is a hassle, and you as an end user aren't getting any subsidy. Running inference locally will make a lot of sense for most light and casual users once the subsidies for subscription access cease.

Also, while datacenter-based scaleout of a model over multiple GPUs running large batches is more energy-efficient, it ultimately creates a single point of failure you may wish to avoid.

[−] janalsncm 40d ago

> It's widely understood that the big players are making profit on inference.

If you add in the cost of training, it’s not profitable.

Not including the cost of training is a bit like saying the only cost of a cup of coffee is the paper cup it’s in. The only way OpenAI gets to charge for inference is by selling a product people can’t get much cheaper elsewhere, which means billions in R&D costs. But because of competition, each model effectively has a “shelf life”.

[−] tybit 40d ago
At least Anthropic claims they are profitable on a per-model basis. But since both revenue and training costs are growing exponentially, and they have to pay for model N's training today while only getting revenue from model N-1 today, the offset makes things look worse than they are.

Obviously that doesn't help them turn a profit until they can stop growing training costs exponentially.

So it’s really a race to see whether growth in revenue or training costs decelerates first.

[−] tatrions 40d ago
[flagged]
[−] jfoster 40d ago
They will always be training new models, so if training is expensive, that's just part of the business they are in.

Vast amounts of capital have been poured in, but they continue to raise more. Presumably because they need more.

Is the capital being invested without any expectation of ROI?

[−] jrflowers 40d ago

> It's widely understood that the big players are making profit on inference.

I love the whole “they're making money if you ignore training costs” bit. It's always great to see somebody say something like “if you look at the amount of money they're spending it looks bad, but if you look away it looks pretty good”, like it's the money version of a solar eclipse.

[−] skybrian 40d ago
The reason it matters is that if they are making a profit on inference, then when people use their services more, it cuts their losses. They might even break even eventually and start making a profit without raising the price.

But if they're losing money on inference, they will lose more money when people use their services more. There's no way to turn that around at that price.

[−] drawfloat 39d ago
We don't even have any evidence inference excluding training is actually profitable.
[−] victorbjorklund 39d ago
That's called a sunk cost. The marginal cost is what sets the lower limit: they will always be able to sell at (or above) the marginal cost of inference.
[−] huijzer 40d ago
Laptop/desktop could work. Most systems are on the charger most of the time anyway.
[−] nothinkjustai 40d ago

> It's widely understood that the big players are making profit on inference.

Are they? Or are they just saying that to make their offerings more attractive to investors?

Plus, I think most people using agents for coding are on subscriptions, which are definitely not profitable.

Locally running models that are snappy and mostly as capable as current sota models would be a dream. No internet connection required, no payment plans or relying on a third party provider to do your job. No privacy concerns. Etc etc.

[−] nl 40d ago

> Plus I think most people using agents for coding are using subscriptions which they are definitely not profitable in.

Where on earth do people get this idea? Subscriptions based around obscure, vendor-defined "credits" are the perfect business model for vendors. They can change how much you can use whenever they want.

It's likely they occasionally make a loss on some users, but in general subscriptions are highly profitable for AI companies:

> Anthropic last month projected it would generate a 40% gross profit margin from selling AI to businesses and application developers in 2025

and

> OpenAI projected a gross margin of around 46% in 2025, including inference costs of both paying and nonpaying ChatGPT users.

https://archive.is/aKFYZ#selection-1075.0-1083.119

[−] nothinkjustai 40d ago
Both of those companies are losing hella money, dude. Just cuz they say they “expect” to be profitable doesn't mean they are.
[−] zozbot234 40d ago
You can pick models that are snappy, or models that are as capable as SOTA. You don't really get both unless you spend extremely unreasonable amounts of money on what is essentially a datacenter-scale inference platform of your own, meant to service hundreds of users at once. (I don't care how many agent harnesses you spin up at once, you aren't going to get the same utilization as hundreds of concurrent users.)

This assessment might change if local AI frameworks start working seriously on support for tensor-parallel distributed inference, then you might get away with cheaper homelab-class hardware and only mildly unreasonable amounts of money.

[−] _pdp_ 40d ago
If you can run free models on consumer devices, why do you think cloud providers can't do the same, except better and bundled with a ton of value worth paying for?
[−] amelius 40d ago
A local model running on a phone owned and controlled by the vendor is still not really exciting, imho.

It may be physically "local" but not in spirit.

[−] 0dayman 40d ago
this is not the first step towards your dream
[−] kennywinker 40d ago
Did you really watch “Her” and think this is a future that should happen??

Seriously????

[−] pmarreck 40d ago
Impressive model, for sure. I've been running it on my Mac; now I get to have it locally on my iPhone? I need to test this. Wait, it does agent skills and mobile actions, all local to the phone? Whaaaat? (Have to check it out later! Anyone have any tips yet?)

I don't normally do the whole "abliterated" thing (dealignment), but after discovering https://github.com/p-e-w/heretic , I couldn't resist trying it with this model a couple days ago (made a repo to make it easier, actually: https://github.com/pmarreck/gemma4-heretical ) and... Wow. It worked. And... not having a built-in nanny is fun!

It's also possible to make an MLX version of it, which runs a little faster on Macs, but won't work through Ollama unfortunately. (LM Studio maybe.)
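
For anyone who wants to try the MLX route, the Python side is roughly this, using mlx-lm's load/generate (the model path below is hypothetical; point it at whatever converted weights you end up with):

    # Rough MLX sketch via the mlx-lm Python API. The model path is hypothetical;
    # substitute your own converted (and optionally quantized) weights.
    from mlx_lm import load, generate

    model, tokenizer = load("pmarreck/gemma4-heretical-mlx-4bit")  # hypothetical repo
    text = generate(model, tokenizer, prompt="Why is the sky blue?", max_tokens=256)
    print(text)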

Runs great on my M4 Macbook Pro w/128GB and likely also runs fine under 64GB... smaller memories might require lower quantizations.

I specifically like dealigned local models because if I have to get my thoughts policed when playing in someone else's playground, like hell am I going to be judged while messing around in my own local open-source one too. And there's a whole set of ethically-justifiable but rule-flagging conversations (loosely categorizable as "sensitive", "ethically-borderline-but-productive", or "violating sacred cows") that are now possible with this, at a level never possible before.

Note: I tried to hook this one up to OpenClaw and ran into issues

To answer the obvious question- Yes, this sort of thing enables bad actors more (as do many other tools). Fortunately, there are far more good actors out there, and bad actors don't listen to rules that good actors subject themselves to, anyway.

[−] amai 39d ago
The cooperation between Apple and Google is going to crush the competition: https://blog.google/company-news/inside-google/company-annou...

The combination of Apple's hardware and Google's software is unbeatable.

[−] rock_artist 39d ago
I really believe in the future of local models.

As both an app developer and a user, my main concern for now is bloat. Until we have something like Apple's Foundation Models, where multiple apps can share the same model, we get something as horrible as Electron in spirit: every app ships a fully blown model (the browser, in the Electron story) instead of reusing a shared one.

On desktops we've had DLL hell for years, but with sandboxed apps on mobile devices it becomes a bigger issue that I guess will/should be addressed by the OS.

For my app, I've been trying to add some logic based on a large model, but bloating a simple Swift app with 2-3 GB of model weights (or even a few hundred MB) feels wrong and conflicts with code-reuse principles.

[−] al_borland 40d ago
I find it odd that they're using the term “edge” to brand this if its target is the general public.

I've been to a few tech conferences and saw the term used there for the first time. It took me a little while to see the pattern and understand what it meant. I've never heard the term used outside those circles. “Local” seems like the term average users would be familiar with. Normal people don't call their stuff “edge devices”.

[−] dhbradshaw 40d ago
My son just started using the 2B on his Android. I mentioned that it was an impressively compact model, and the next thing I knew he had figured out how to run it on his inexpensive 2024 Motorola and was using it to practice reading and writing foreign languages.
[−] TGower 40d ago
These new models are very impressive. There should be a massive speedup coming as well: AI Edge Gallery is running on the GPU, but the NPUs in recent high-end processors should be much faster. The A16 chip, for example (MacBook Neo and iPhone 16 series), has 35 TOPS of Neural Engine vs 7 TFLOPS of GPU. Similar story for Qualcomm.
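
Back of the envelope, with the caveat that TOPS (usually int8) and TFLOPS (fp16/fp32) aren't directly comparable:

    # Naive ceiling only: NPU TOPS are typically int8 while GPU TFLOPS are floating
    # point, so the real-world gain will be smaller than this ratio suggests.
    npu_tops, gpu_tflops = 35, 7
    print(f"theoretical headroom: {npu_tops / gpu_tflops:.0f}x")  # ~5x
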
[−] allpratik 40d ago
Nice! Tried it on an iPhone 16 Pro, getting 30 TPS from the Gemma-4-E2B-it model.

The phone got considerably hot while inferencing, though. It's quite impressive performance, and I can't wait to try it in one of my personal apps.

[−] deckar01 40d ago
It doesn’t render Markdown or LaTeX. The scrolling is unusable during generation. E4B failed to correctly account for convection and conduction when reasoning about the effects of thermal radiation (31B was very good). After 3 questions in a session (with thinking), E4B went off the rails and started emitting nonsense fragments before the stated token limit was hit (unless it isn’t actually checking).
[−] hadrien01 40d ago
Is it just me, or does the App Store website look... fake? The text in the header ("Productiviteit", "Alleen voor iPhone") looks pixelated, like it was edited in Paint; the header background flickers; the app icon and screenshots are very low quality; and the title of the website is incomplete ("App Store voor iPho...").
[−] two_handfuls 40d ago
The description says it's private, but the legalese it makes you agree to makes no such promise. Rather, the opposite:

> We collect information about your activity in our services

Source: https://policies.google.com/privacy#infocollect

[−] orf 39d ago
I’d recommend locally.ai[1] - it’s really good and has a wide range of models. Also has shortcuts support.

1. https://apps.apple.com/gb/app/locally-ai-local-ai-chat/id674...

[−] burnto 40d ago
My iPhone 13 can’t run most of these models. A decent local LLM is one of the few reasons I can imagine actually upgrading earlier than typically necessary.
[−] carbocation 40d ago
It would be very helpful if the chat logs could (optionally) be retained.
[−] haizhung 39d ago
I encourage everybody with an iPhone to try this. If you're like me and don't have the time to tinker with the latest and greatest all the time, this app lowers the barrier to entry significantly and provides a glimpse into what's possible locally, on device.

Honestly, I was extremely impressed by the speed and quality of the answers considering this thing runs on a phone. It honestly makes me want to sit down and spin up my own homegrown AI setup to go fully independent. Crazy.

[−] znnajdla 30d ago
One really good use case for me is good, fast, offline translation. Both Apple Translate and Google Translate are worse in quality than a decent LLM and don’t work well offline. Gemma 4 is surprisingly good and often faster than waiting for an API call.
[−] dwa3592 40d ago
I think with this Google starts a new race: best local model that runs on phones.
[−] _nagu_ 39d ago
If this works smoothly on iPhone, it could change how we think about mobile apps. Less backend dependency, more on-device intelligence.
[−] MysticOracle 40d ago
Crashes for me on a couple of different iDevices (2 generations behind) after only 2-3 chats. Probably not enough RAM.

Saw this one on X the other day, updated with Gemma 4. They have the built-in Apple Foundation model, Qwen3.5, and other models:

Locally AI - https://locallyai.app/

[−] davecahill 40d ago
I really like Enclave for on-device models - looks like they're about to add Gemma 4 too: https://enclaveai.app/blog/2026/04/02/gemma-4-release-on-dev...
[−] satvikpendem 40d ago
This is also on Android and has an option to use AICore with the NPU which can run much faster than even the GPU models.
[−] rudedogg 40d ago
This is fun. FYI, you don't have to sign in/up with a Google account; I hesitated downloading it for that reason.
[−] thot_experiment 40d ago
Gemma 4 E4B is an incredible model for all the home assistant stuff I previously used Qwen3.5 35BA4B + Whisper for, while leaving me wayy more empty vram for other bullshit. It works as a drop-in replacement for all of my "turn the lights off" or "when's the next train" type queries and does a good job of tool use (sketch of what that looks like at the end of this comment). This is really the first time vramlets get a model that's reliably useful day to day, locally.

I'm curious/worried about the audio capability. I'm still using Whisper since the audio support hasn't landed in llama.cpp, and I'm not excited enough to temporarily rewire my stuff to use vLLM or whatever their reference impl is. The vision capabilities of Gemma are notably much, much worse than Qwen's (thus far; could be impl-specific issues? even the big MoE and dense Gemma are much worse). Hopefully the audio is at least on par with Whisper medium.
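
For reference, the "turn the lights off" path is basically this (simplified; the URL, token, and entity names are placeholder examples, and the tool-call parsing glue is omitted):

    # Simplified: forward a model tool call to Home Assistant's REST API.
    # URL, token, and entity_id are placeholder examples, not my actual setup.
    import requests

    HA_URL = "http://homeassistant.local:8123"
    HA_TOKEN = "..."  # a long-lived access token from your HA profile

    def call_service(domain: str, service: str, entity_id: str) -> None:
        requests.post(
            f"{HA_URL}/api/services/{domain}/{service}",
            headers={"Authorization": f"Bearer {HA_TOKEN}"},
            json={"entity_id": entity_id},
            timeout=10,
        )

    # e.g. when the model emits {"name": "lights_off", "arguments": {"room": "living_room"}}:
    # call_service("light", "turn_off", "light.living_room")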

[−] danielrmay 40d ago
I spent some time getting Gemma4-e4b working via llama.cpp on iPhone and I'm really impressed so far! I posted a short video of an example application on LinkedIn here https://www.linkedin.com/feed/update/urn:li:activity:7446746... (or X: https://x.com/danielrmay/status/2040971117419192553)
[−] MagicMoonlight 39d ago
It seems really capable. A few more iterations of this and you won’t even need a subscription.

All it needs is web search so that it can get up to date information.

[−] neurostimulant 40d ago
I was able to sweet-talk the gemma-4-e2b-it model on an iPhone 15 into solving an hCaptcha screenshot. This small model is surprisingly capable!
[−] XCSme 40d ago
Gemma 4 is great: https://aibenchy.com/compare/google-gemma-4-31b-it-medium/go...

I assume it is the 26B A4B one, if it runs locally?

[−] lemonish97 39d ago
I hope they add a web search tool to the agent skills too. Most of my LLM usage on my phone is just quick lookups and search summarization. I'd love to do these with a local model rather than Google AI Mode or any other cloud-based inference tool.
[−] dzhiurgis 40d ago
I recently got my first practical use out of it. I was on a plane, filling in a landing card (what a silly thing those are). I looked up my hotel address using a Qwen model on my iPhone 16 Pro. It was accurate. I was quite impressed.

After some back and forth the chat app started to crash tho, so YMMV.

[−] totetsu 39d ago
I have been looking at ARGmax https://www.argmaxinc.com/#SDK for running on Apple devices, but I'm not sure yet what's involved in porting a model to work with their SDK.
[−] Sharmaji000 40d ago
They still didn't release the training recipe, data, methodology, etc., unlike DeepSeek. This was mostly released to build a developer ecosystem around their Android built-in AI. Still good and interesting, but not exactly philanthropic toward open-source progress.
[−] pseudosavant 39d ago
It'd be fun to explore creating a Gemma 4 LLM API server app so you could use your iPhone's processing for agentic coding on a laptop. I don't know how useful it would be, but it'd be fun.
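
Assuming such an app exposed an OpenAI-compatible endpoint (entirely hypothetical; no such app exists yet), the laptop side would just be:

    # Hypothetical laptop-side client: assumes the phone app served an
    # OpenAI-compatible API on the local network. No such app exists yet.
    import requests

    PHONE = "http://192.168.1.42:8080"  # the iPhone's LAN address (example)

    resp = requests.post(
        f"{PHONE}/v1/chat/completions",
        json={
            "model": "gemma-4-e4b-it",
            "messages": [{"role": "user", "content": "Refactor this function to be pure."}],
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])
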
[−] rotexo 40d ago
E4B is pretty good at extracting tables of items from receipt scans and inferring categories. I wish this could be called from within a Shortcut, to just select a photo and add the extracted table to the clipboard.
[−] derwiki 39d ago
I asked it about the “Altamont Free Concert” (the exact name of the Wikipedia article), and it’s been a while since I’ve seen a hallucination this rich. Doesn’t give me confidence to use it.
[−] gdzie-jest-sol 39d ago
I also need a normal server on the local network, so I can run the chat on another device while the 'counting' happens on the iPhone.

Second idea: audio input in other languages, like Czech, Polish, or French.

[−] modeless 39d ago
It's so ridiculous that Google made a custom SoC for their phones, touting its AI performance, even calling it Tensor, and Apple is still faster at running Google's own model.

Google really ought to shut down their phone chip team. Literally every chip from them has been a disappointment. As much as I hate to say it, sticking with Qualcomm would have been the right choice.

[−] rcarmo 39d ago
This is fun. I just wish I could add more skills; the UX is too dumbed down. But knowing there's a run_js tool, there's a lot that can be done here.
[−] nickvec 40d ago
Extremely impressed by how fast responses are on iPhone 17 Pro Max. Can’t wait for this to be used for Siri’s brain one of these days (hopefully!)
[−] sshrajesh 39d ago

> Note: I tried to hook this one up to OpenClaw and ran into issues

Anyone worked on hooking up OpenClaw to gemma4 running locally?

[−] mc7alazoun 40d ago
Would it work locally on a Mac Pro M4 with 24GB? If so, I'd really appreciate a step-by-step guide.
[−] rickdg 40d ago
How do these compare to Apple's Foundation Models, btw?
[−] garff 40d ago
How new of an iPhone model is needed?