iPhone 17 Pro Demonstrated Running a 400B LLM (twitter.com)

by anemll 326 comments 713 points


[−] cmiles8 53d ago
To the extent that the present LLM movement reaches a steady state conclusion it’s highly likely to be open source models on your own hardware that are “good enough” for 95% of use cases.

That blows up the whole “industrial complex” being developed around massive data centers, proprietary models, and everything that goes with that. Complete implosion.

Apple has sat on the sidelines for much of this as it seems clear they know the end game is everyone just does this stuff locally on their phone or computer and then it’s game over for everything going on now.

[−] draxil 53d ago
I assume you mean open weight models? I wish we had better open source models. It would make LLMs far less icky if we had nice clean open trained models. A breakthrough on the cost of training would be nice.
[−] mr_toad 53d ago
Still need massive amounts of compute for training. Nobody is going to be training 400B models on a phone any time soon.
[−] firstbabylonian 54d ago

> SSD streaming to GPU

Is this solution based on what Apple describes in their 2023 paper 'LLM in a flash' [1]?

1: https://arxiv.org/abs/2312.11514
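That paper's core trick can be sketched in a few lines of Python: keep the weights on disk, memory-map the file, and page in only the experts the router picks for the current token. Expert count, sizes, and file layout below are invented for illustration; this is not the anemll implementation.

```python
import mmap
import os
import tempfile

N_EXPERTS = 8
EXPERT_BYTES = 1 << 20  # pretend each expert is 1 MiB of packed weights

def load_expert(mm: mmap.mmap, expert_id: int) -> bytes:
    """Slice one expert's weights out of the mapped file on demand."""
    start = expert_id * EXPERT_BYTES
    return mm[start:start + EXPERT_BYTES]  # the OS pages in only these bytes

# Build a dummy weight file so the sketch runs end to end.
path = os.path.join(tempfile.gettempdir(), "toy_weights.bin")
with open(path, "wb") as f:
    f.write(bytes(N_EXPERTS * EXPERT_BYTES))

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    active = [3, 5]  # the router picked these experts for this token
    weights = [load_expert(mm, e) for e in active]
    resident_mib = sum(len(w) for w in weights) / 2**20
    print(f"{resident_mib:.0f} MiB resident for this token, "
          f"out of {N_EXPERTS * EXPERT_BYTES / 2**20:.0f} MiB on disk")
```

The real system adds sparsity prediction and smart caching on top, but the memory win comes from exactly this: resident bytes scale with the active experts, not the file size.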

[−] CrzyLngPwd 54d ago
I had a dream that everyone had super intelligent AIs in their pockets, and yet all they did was doomscroll and catfish...shortly before everything was destroyed.
[−] andix 54d ago
My iPad Air with M2 can run local LLMs rather well. But it gets ridiculously hot within seconds and starts throttling.
[−] yencabulator 53d ago
Qwen3.5-397B-A17B behaves more like a 17B parameter model. Omitting the MoE part from the headline makes it a lie and stupid hype.

Quantizing is also a cheat code that makes the numbers lie; next up, someone is going to claim to be running a large model when they're really running a 1-bit quantization of it.
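For concreteness, the raw weight storage is just params × bits / 8 bytes, so the same "400B" label covers wildly different footprints. A back-of-the-envelope sketch (the 400B is the headline's nominal count; the rest is arithmetic, not a measurement of the demo):

```python
def model_bytes(n_params: float, bits_per_weight: float) -> float:
    """Raw weight storage; ignores quantization scale/zero-point overhead."""
    return n_params * bits_per_weight / 8

params = 400e9  # nominal 400B parameters
for bits in (16, 8, 4, 2, 1):
    print(f"{bits:>2}-bit: {model_bytes(params, bits) / 2**30:7.0f} GiB")
```

A 1-bit quant is 16× smaller than fp16 with the same parameter count on the label, which is the "numbers lie" point exactly.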

[−] EruditeCoder108 53d ago
This is less about “running a 400B model on a phone” and more about clever engineering around constraints. What’s actually happening is:

- In mixture-of-experts, only a small subset of weights is active per token
- Aggressive quantization
- Streaming weights from storage instead of loading everything into RAM

So the effective working set is much smaller than 400B. That said, the trade-offs are obvious: very low token throughput, high latency, and heavy reliance on storage bandwidth. It’s more of a proof-of-concept than something usable.
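A rough working-set estimate for a sparse MoE: shared layers plus only the routed experts get touched per token. The expert count and shared fraction below are assumptions picked to land in the neighborhood of the ~17B active parameters the model name (A17B) advertises; none of these numbers come from the demo itself.

```python
def active_params(total: float, n_experts: int, experts_per_token: int,
                  shared_frac: float) -> float:
    """Parameters touched per token: shared layers + routed slice of experts."""
    shared = total * shared_frac
    expert_pool = total - shared
    return shared + expert_pool * experts_per_token / n_experts

total = 400e9
active = active_params(total, n_experts=256, experts_per_token=8,
                       shared_frac=0.02)
print(f"~{active / 1e9:.0f}B parameters touched per token")
print(f"~{active * 4 / 8 / 2**30:.1f} GiB to stream per token at 4-bit")
```

Even so, several GiB per token has to come off storage, which is why throughput is storage-bandwidth-bound.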
[−] cj00 54d ago
It’s 400B but it’s mixture of experts so how many are active at any time?
[−] lainproliant 54d ago
This reminds me of how excited people were to get models running locally when llama.c first hit.
[−] russellbeattie 54d ago
I have some macro opinions about Apple - not sure if I'm correct, but tell me what you think.

Apple has always seen RAM as an economic advantage for their platform: make the development effort to ensure that the OS and apps work well with minimal memory, and save billions every year in hardware costs. In 2026, iPhones still come with 8GB of RAM; Pro/Max models come with 12GB.

The problem is that AI workloads (ML/LLM training and inference) are areas where you can't get around the need for copious amounts of fast working memory. (Hence the critical shortage of RAM at the moment, as AI data centers consume as many memory chips as possible.)

Unless there's something I don't know (which is more than possible) Apple can't code their way around this problem, nor create specialized SoCs with ML cores that obviate the need for lots and lots of RAM.

So, it's going to be interesting to see whether they accept this reality and we start seeing future iPhones with 16GB, 32GB, or more as standard in order to make AI performant. And whether they give up on adding AI to the billions of iPhones with minimal RAM already out there.

As a side note, 8GB of RAM hasn't been enough for a decade. It prevents basic tasks like keeping web tabs live in the background. My pet peeve is having just a few websites open and having the page refresh when swapping between them because of aggressive memory management.

To me, Apple's obvious strength is pushing AI to the edge as much as possible. While other companies are investing in massive data centers which will have millions of chips that will be outdated within the next couple years, Apple will be able to incrementally improve their ML/AI features by running on the latest and greatest chips every year. Apple has a huge advantage in that they can design their chips with a mega high speed bus, which is just as important as the quantity of RAM.
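The bus-speed point can be made quantitative: a memory-bound decoder tops out near bandwidth / bytes-touched-per-token. The bandwidth figures below are illustrative round numbers for phone-class LPDDR versus an M-series-style unified-memory bus, not Apple's actual specs.

```python
def tokens_per_sec_ceiling(bandwidth_gb_s: float, active_params: float,
                           bits_per_weight: float) -> float:
    """Upper bound on decode speed when every step re-reads the active weights."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# ~17B active parameters at 4 bits = 8.5 GB touched per token.
for name, bw in (("phone-class LPDDR", 60.0), ("unified-memory bus", 400.0)):
    t = tokens_per_sec_ceiling(bw, 17e9, 4)
    print(f"{name:>18} @ {bw:>5.0f} GB/s: ~{t:.1f} t/s ceiling")
```

Both ceilings assume the weights fit in RAM; the demo's 0.6 t/s sits far below either because it streams from SSD, which supports the point that bus and memory capacity matter together.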

But all that depends on Apple's willingness to accept that RAM isn't an area they can skimp on any more, and I'm not sure they will.

Sorry for the brain dump. I'd love to be educated on this in case I'm totally off base.

[−] alnah 53d ago
It's a nice experiment, but I really wonder what the use case is. Privacy, yes. Local, yes. But then? Will people really use an LLM on their iPhone when they can use LLM infrastructure with bigger models for complex tasks? I mean, it really looks cool, but I don't think it's gonna be the future of local AI either. Maybe someone who can build a very specialized local model for one particular task can enjoy that. Not sure it's gonna be massively used by common mortals... But for sure, for the industry, there is maybe a direction where we could have different very specialized models on our devices that could interoperate and then provide something useful. We'll see. Interesting though! Maybe we still need some years, or decades, before we have devices and laptops good enough to run good models.
[−] illwrks 54d ago
I installed Termux on an old Android phone last week (running LineageOS), and then using Termux installed Ollama and a small model. It ran terribly, but it did run.
[−] PinkMilkshake 53d ago
"That is a profound observation, and you are absolutely right..."

With all the money you will save on subscription fees you should be able to afford treatment for your psychosis!

[−] pshc 53d ago
Even though it's a quantized-to-hell Mixture of Experts, honestly, it's crazy this model can run semi-coherently on a phone.
[−] redwood 54d ago
It will be funny if we go back to lugging around brick-size batteries with us everywhere!
[−] groby_b 53d ago
For small values of "running".

Don't get me wrong, it's an awesome achievement, but 0.6 tokens/s at presumably fairly heavy compute (and battery drain), on a mobile device? There aren't too many use cases for that :)

[−] _air 54d ago
This is awesome! How far away are we from a model of this capability level running at 100 t/s? It's unclear to me if we'll see it from miniaturization first or from hardware gains
[−] lofaszvanitt 53d ago
I miss the old days when words appeared one by one, just like images loading line by line in the old modem days.
[−] causal 54d ago
Run an incredible 400B parameters on a handheld device.

0.6 t/s, wait 30 seconds to see what these billions of calculations get us:

"That is a profound observation, and you are absolutely right ..."

[−] avazhi 53d ago
Qwen's MoE models are god awful when they are only running 2B parameters or whatever they downscale to while active. It isn't a 400B model if there are orders of magnitude fewer parameters active when you're actually inferencing...
[−] echelon 53d ago
"0.6 t/s"

This is a toy.

We need to build open infrastructure in the cloud capable of hosting a robust ecosystem of open weights.

And then we need to build very large scale open weights.

That's the only way we don't get owned by the hyperscalers.

At the edge isn't going to happen in a meaningful way to save us.

[−] einpoklum 53d ago
I read this title as: "iPhone 17 Pro demonstrated being an overpriced phone".
[−] nailer 53d ago
[−] kampak212 46d ago
I run Qwen 2.5 GGUF on an iPhone 16e in production. A handful of them. They're on the App Store.
[−] fudged71 53d ago
If you don't follow anemll, they also have a usable version of OpenClaw running on iPhone.

With hardware and model improvements, the future is bright.

[−] vedaba 53d ago
I just use mine to doomscroll on Instagram and look at the fluorescent orange color like I’m holding lava
[−] gulugawa 53d ago
This sounds incredibly dangerous.

Local LLMs are going to make people sit on their phones instead of talking to real people.

[−] smlacy 53d ago
Total gimmick. I guess we're "making progress", but this will never lead to any useful application other than "Yes, you're absolutely right" bots. What's needed for real applications is 10,000× the input token context and 10× the output token speed, so we're off by a factor of ... 100,000×?
[−] HardCodedBias 54d ago
The power draw is going to be crazy (today).

Practical LLMs on mobile devices are at least a few years away.

[−] yalogin 54d ago
Apple’s unified memory architecture plays a huge part in this. This will trigger a large scale rearchitecture of mobile hardware across the board. I am sure they are already underway.

I understand this is for a demo, but do we really need a 400B model on a mobile device? A 10B model would do fine, right? What do we miss with a pared-down one?

[−] r4m18612 54d ago
Impressive. Running a 400B model on-device, even at low throughput, is pretty wild.
[−] dv_dt 54d ago
CPU, memory, storage, and time tradeoffs rediscovered by AI model developers. There is something new here: add GPU to the trade space.
[−] ashwinnair99 54d ago
A year ago this would have been considered impossible. The hardware is moving faster than anyone's software assumptions.
[−] zharknado 53d ago
“Flash” MOE is named for the sloth character in Zootopia I presume?
[−] latexr 53d ago
You can really feel the sycophantic drivel when it’s coming at 0.6 tokens per second.

> That is a profound observation, and you are absolutely right

Twenty seconds and a hot phone for that.

In the end it took almost four minutes to generate under 150 tokens of nothing.

Impressive that they got it to run, but that’s about the only thing.

[−] gnarlouse 53d ago
It's like the sloth from Zootopia
[−] skiing_crawling 54d ago
I can't understand why this is a surprise to anyone. An iPhone is still a computer; of course it can run any model that fits in storage, albeit very slowly. The implementation is impressive, I guess, but I don't see how this is a novel capability. And at 0.6 t/s, it's not cost-efficient hardware for the job. The iPhone can also render Pixar movies if you let it run long enough, mine bitcoin at a pathetic hashrate, and do weather simulations, but not in time for the forecast to be relevant.
[−] konaraddi 53d ago
How? Are there instructions?
[−] 1970-01-01 54d ago
"400 bytes should be enough for anybody"
[−] gary_cli 53d ago
good