Suggestion for the maintainers: the comparison table currently lists some fairly old models: Qwen 2.5 14B, Mixtral 8x7B, and Llama 3.3 70B.
A lot of people are reporting incredible results with the Qwen 3.5 MoE models on Apple hardware right now (streaming experts - see https://simonwillison.net/2026/Mar/24/streaming-experts/) - it would be great to get some of those models into that table.
For a lot of local workloads, sub-1 tok/s is useless in foreground and perfectly acceptable in background. If the choice is “this crashes” vs “this finishes overnight,” that’s still a meaningful capability jump.
The MoE point matters here: sparse activation means you're not reading all 2 TB per forward pass, but the access pattern flips from sequential to random, which is exactly the worst case for NVMe. I've been thinking about this a lot for agent inference workloads, where you want consistent latency more than peak throughput.
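Some back-of-envelope arithmetic makes the sparse-activation point concrete. The model shape below (expert count, top-k, expert fraction) is purely illustrative, not any specific model's real configuration:

```python
# Bytes touched per token for a hypothetical 1T-param MoE at 4-bit quant.
# All shape numbers here are illustrative assumptions, not real model specs.
total_params = 1_000e9        # 1T parameters on disk
n_experts, topk = 64, 4       # 64 experts per MoE layer, 4 routed per token
expert_frac = 0.85            # fraction of params living in expert FFNs
bytes_per_param = 0.5         # ~4 bits per parameter

dense_bytes = total_params * bytes_per_param
active_params = (total_params * (1 - expert_frac)
                 + total_params * expert_frac * (topk / n_experts))
active_bytes = active_params * bytes_per_param

print(f"dense read/token:  {dense_bytes / 1e9:.0f} GB")
print(f"sparse read/token: {active_bytes / 1e9:.0f} GB")
```

So sparsity cuts the per-token read by roughly 5x in this sketch, but those remaining reads are scattered across whichever experts the router picks, which is the random-access pattern NVMe handles worst.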
This is a pretty cool project! Essentially this is like using Swap memory to extend your RAM, but in a 'smart' way so you don't overload the NVMe unnecessarily.
I do wonder how the 'smarts' pan out in practice, because putting a ton of stress on your NVMe during generation is probably not the best choice for its longevity.
"As much memory as possible" is right for model capacity but misses bandwidth. Apple Silicon has distinct tiers: M4 Pro at 273 GB/s, M4 Max at 546 GB/s, M4 Ultra at 819 GB/s. Bandwidth determines tok/s once the model fits in memory. An M4 Max gives you 2x the decode speed of an M4 Pro on the same model.
For what Hypura does, the Max is the sweet spot. 64GB loads a 70B at Q4 with room to spare, and double the bandwidth of the Pro means generation is actually usable instead of just technically possible.
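The bandwidth-bound decode claim is simple arithmetic: every generated token has to stream the full set of active weights from memory, so bandwidth divided by model size gives a rough tok/s ceiling. A sketch (0.5 bytes/param is a simplification; real Q4 formats carry some overhead):

```python
# Rough decode-speed ceiling: tok/s ≈ memory bandwidth / model bytes,
# assuming the whole model is read once per token (dense decode).
GB = 1e9
model_bytes = 70e9 * 0.5          # 70B params at ~4 bits/param ≈ 35 GB

for chip, bw in [("M4 Pro", 273 * GB), ("M4 Max", 546 * GB)]:
    print(f"{chip}: ~{bw / model_bytes:.1f} tok/s ceiling")
```

That's roughly 7.8 tok/s on the Pro versus 15.6 on the Max for the same 70B Q4 model, which is the difference between "technically possible" and "actually usable."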
> Consumer hardware (MacBook Pro, Mac Studio) ships with fast unified memory and NVMe storage, but limited capacity. A 32 GB M1 Max cannot naively load a 40 GB model — the OS will swap-thrash until the OOM killer intervenes.
macOS doesn't have an "OOM killer" in that sense. (It has an out of swap space killer but it's pretty weak.)
So what will happen is that either wiring the memory will fail, or else the machine will get really slow and eventually panic.
Nice work on the scheduler. Have you benchmarked parallel inference across multiple models? Running GPT, Claude and Gemini simultaneously on the same input is where latency becomes a real constraint.
Thanks for this project.
Prioritizing MoE models and adding an intelligent NVMe cache could improve efficiency, especially on the M4 Max where bandwidth makes usage more realistic.
You do not provide any comparison to llama.cpp with mmap.
You do not explain how any kind of predictor can work for MoE experts.
You do not explain how prediction can even be useful. I can predict the layers used in a dense model (all of them are used in order), but that doesn't help me much. It's still bottlenecked on bandwidth (hint: MoE doesn't change this).
There needs to be something like this from Ollama. At the moment Ollama has a lot of flaws that prevent it from getting great performance (my understanding is it needs better GPU/CPU splits, etc.). But Ollama is the only way to host an LLM and have it switch out on demand. Sigh.
OS paging would be significantly worse here. The kernel's page fault handler is reactive — it doesn't know
you're about to read layer 47's FFN weights, so it can't prefetch. You stall on every fault, wait for the
4KB/16KB page to load, then resume. With 80 layers of dense FFN streaming, that's thousands of cold faults per
token.
What makes this approach faster is that the model's access pattern is completely deterministic during
inference. You know exactly which tensors are needed next because transformer layers execute sequentially. So
you can issue large sequential reads and prefetch the next layer while the current one is computing on Metal.
The OS page cache can't do that — it has no concept of "layer N+1 comes after layer N."
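The overlap described above is essentially double buffering: kick off the read for layer N+1 on an I/O thread, compute layer N, then swap. A minimal sketch of the idea, where `load_layer` and `compute` are hypothetical placeholders for the real NVMe reads and Metal kernels:

```python
# Double-buffered layer prefetch: issue the sequential read for the next
# layer while the current layer computes. load_layer/compute are
# hypothetical stand-ins, not any real inference API.
from concurrent.futures import ThreadPoolExecutor

def load_layer(i):
    return f"weights[{i}]"         # placeholder for one large sequential read

def compute(x, weights):
    return x + 1                   # placeholder for the Metal forward pass

def forward(x, n_layers=80):
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(load_layer, 0)   # warm up: fetch the first layer
        for i in range(n_layers):
            weights = pending.result()       # blocks only if I/O fell behind
            if i + 1 < n_layers:
                pending = io.submit(load_layer, i + 1)  # overlap next read
            x = compute(x, weights)          # compute while that read runs
    return x

print(forward(0))  # prints 80
```

In steady state the I/O for layer N+1 hides behind the compute for layer N, so you pay the NVMe latency once at startup instead of on every layer, which is exactly what a reactive page-fault handler can't do.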
For MoE it's even more stark. The OS would page in all 8 experts on the first token that routes to each one,
then evict them under memory pressure with LRU, which has no idea that expert 3 fires 10x more often than
expert 7. The neuron cache here is basically a domain-specific replacement policy.
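A domain-specific replacement policy here can be as simple as evicting the least-frequently-routed expert instead of the least-recently-used one. A toy sketch (class and function names are hypothetical, not Hypura's actual API):

```python
# Frequency-weighted expert cache: evict the expert the router fires
# least often, instead of plain LRU. All names here are hypothetical.
from collections import Counter

class ExpertCache:
    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn       # stands in for an NVMe read of one expert
        self.cache = {}              # expert_id -> weights
        self.hits = Counter()        # routing frequency per expert

    def get(self, expert_id):
        self.hits[expert_id] += 1
        if expert_id not in self.cache:
            if len(self.cache) >= self.capacity:
                # Evict the resident expert with the fewest routing hits.
                victim = min(self.cache, key=lambda e: self.hits[e])
                del self.cache[victim]
            self.cache[expert_id] = self.load_fn(expert_id)
        return self.cache[expert_id]

cache = ExpertCache(capacity=2, load_fn=lambda e: f"expert-{e}")
for e in [3, 3, 3, 7, 3, 5]:        # expert 3 is hot; 7 and 5 are cold
    cache.get(e)
print(sorted(cache.cache))          # [3, 5] -- the hot expert stays resident
```

Under this policy the frequently-fired expert 3 survives the eviction that kicks out rarely-used expert 7, which is precisely the decision LRU under memory pressure gets wrong.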
Maybe the 1T parameter Kimi K2.5 too if you can get that to work, see https://twitter.com/seikixtc/status/2036246162936910322 and https://twitter.com/danpacary/status/2036480556045836603
Come on, "Run" is not the right word. "Crawl" is.
Headlines like that are misleading.