LLM Architecture Gallery (sebastianraschka.com)

by tzury 41 comments 586 points

[−] libraryofbabel 62d ago
This is great - always worth reading anything from Sebastian. I would also highly recommend his Build an LLM From Scratch book. I feel like I didn’t really understand the transformer mechanism until I worked through that book.

On the LLM Architecture Gallery, it's interesting to see the variations between models, but I think the 30,000 ft view is that in the seven years since GPT-2 there have been a lot of improvements to LLM architectures but no fundamental innovations in that area. The best open-weight models today still look a lot like GPT-2 if you zoom out: a bunch of attention layers and feed-forward layers stacked up.
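
To make the "zoom out" concrete, here is roughly the block that every model in the gallery still repeats N times (a minimal PyTorch sketch; sizes and details are illustrative, not any particular model's code):

    import torch
    import torch.nn as nn

    class Block(nn.Module):
        # one decoder block: pre-norm self-attention + feed-forward, each with a residual
        def __init__(self, d=768, n_heads=12):
            super().__init__()
            self.ln1 = nn.LayerNorm(d)
            self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
            self.ln2 = nn.LayerNorm(d)
            self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

        def forward(self, x):
            n = x.size(1)
            # causal mask: True = blocked, so each token only attends to earlier tokens
            mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
            h = self.ln1(x)
            a, _ = self.attn(h, h, h, attn_mask=mask)
            x = x + a
            return x + self.mlp(self.ln2(x))

    # GPT-2 small is roughly twelve of these stacked; newer models mostly change the
    # sizes, the norms/activations, and how the attention and FFN pieces are implemented
    model = nn.Sequential(*[Block() for _ in range(12)])
    print(model(torch.randn(1, 16, 768)).shape)  # torch.Size([1, 16, 768])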

Another way of putting this is that the astonishing improvements in the capabilities of LLMs that we've seen over the last seven years have come mostly from scaling up and, critically, from new training methods like RLVR, which is responsible for coding agents going from barely working to amazing in the last year.

That's not to say that architectures aren't interesting or important, or that the improvements aren't useful, but it is a little bit of a surprise, even though by now it shouldn't be: it's probably just another instance of the Bitter Lesson.

[−] imjonse 62d ago

> On the LLM Architecture Gallery, it's interesting to see the variations between models, but I think the 30,000 ft view is that in the seven years since GPT-2 there have been a lot of improvements to LLM architectures but no fundamental innovations in that area.

Hybrid architectures like Qwen3.5 do contain one such fundamental innovation: linear attention variants, which, after years of showing up only in papers and toy models, now replace much of the core of the transformer, the self-attention mechanism. In Qwen3.5 in particular, only one of every four layers is a standard self-attention layer.
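
I haven't read the Qwen3.5 code, but the core idea of a linear-attention layer is easy to show (a non-causal sketch for brevity; the real layers are causal, gated, and more elaborate). The point is that associativity removes the n x n score matrix:

    import torch
    import torch.nn.functional as F

    def softmax_attention(q, k, v):
        # standard attention: materializes an (n, n) score matrix -> O(n^2 * d)
        scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
        return scores.softmax(dim=-1) @ v

    def linear_attention(q, k, v, eps=1e-6):
        # kernelized variant (phi = elu + 1): compute (k^T v) first, which is only
        # (d, d), so the cost is O(n * d^2) and grows linearly with sequence length
        q, k = F.elu(q) + 1, F.elu(k) + 1
        kv = k.transpose(-2, -1) @ v                                  # (d, d)
        z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps   # per-token normalizer
        return (q @ kv) / z

    n, d = 1024, 64
    q, k, v = (torch.randn(n, d) for _ in range(3))
    print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)

In the hybrid layout, only the occasional full-attention layer still pays the quadratic cost.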

MoEs are another fundamental innovation - also from a Google paper.
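
The gist, for anyone who hasn't looked at one: the dense FFN in each block becomes several FFNs plus a small router, and each token only runs through its top-k experts. A bare-bones sketch that ignores load balancing, capacity limits, and so on (names and sizes are made up):

    import torch
    import torch.nn as nn

    class MoEFFN(nn.Module):
        def __init__(self, d=512, n_experts=8, top_k=2):
            super().__init__()
            self.router = nn.Linear(d, n_experts)
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
                for _ in range(n_experts)])
            self.top_k = top_k

        def forward(self, x):                       # x: (tokens, d)
            logits = self.router(x)                 # (tokens, n_experts)
            weights, idx = logits.softmax(-1).topk(self.top_k, dim=-1)
            weights = weights / weights.sum(-1, keepdim=True)  # renormalize over chosen experts
            out = torch.zeros_like(x)
            for slot in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e        # tokens that picked expert e in this slot
                    if mask.any():
                        out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
            return out

    print(MoEFFN()(torch.randn(10, 512)).shape)     # torch.Size([10, 512])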

[−] phanarch 61d ago
I'd push back slightly on the "no fundamental innovations" read though — the innovations that stuck (MoE, GQA, RoPE) are almost entirely ones that improve GPU utilization: better KV-cache efficiency, more parallelism in attention, cheaper to serve per parameter. Mamba and SSM-based hybrids are interesting but kept running into hardware friction.
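
The KV-cache part is easy to see with back-of-the-envelope numbers (dimensions here are illustrative, not any particular model):

    # KV cache per token = 2 (K and V) * n_layers * n_kv_heads * head_dim * bytes_per_elem
    def kv_cache_gb(n_kv_heads, n_layers=80, head_dim=128, ctx=128_000, bytes_per=2):
        return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per / 1e9

    print(kv_cache_gb(n_kv_heads=64))  # MHA: every query head has its own K/V -> ~335 GB at 128k context
    print(kv_cache_gb(n_kv_heads=8))   # GQA: 8 query heads share one K/V head -> ~42 GB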
[−] iroddis 62d ago
This is amazing, such a nice presentation. It reminds me of the Neural Network Zoo [1], which was also a nice visualization of different architectures.

[1] https://www.asimovinstitute.org/neural-network-zoo/

[−] wood_spirit 62d ago
Lovely!

Is there a sort order? It would be so nice to understand the threads of evolution and revolution in the progression. A bit of a family tree and influence layout? It would also be nice to have a scaled view so you can sense the difference in sizes over time.

[−] gasi 62d ago
So cool — thanks for sharing! Here’s a zoomable version of the diagram: https://zoomhub.net/LKrpB
[−] 7777777phil 61d ago
This is amazing. I just spent some time scrolling through these, and most of the evolution is about inference cost, not capability: GQA, MoE routing, sliding window attention, all trading theoretical capacity for practical efficiency.
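
Sliding window attention is the clearest example of that trade: the mask just gets truncated so each token only sees the last W positions, which caps both attention compute and KV-cache growth at long context. A toy version of the mask (illustrative only, not any particular model's code):

    import torch

    def sliding_window_mask(seq_len, window):
        # True = blocked. Causal mask plus a band: token i attends to tokens (i - window + 1 .. i).
        i = torch.arange(seq_len)[:, None]
        j = torch.arange(seq_len)[None, :]
        return (j > i) | (j <= i - window)

    # each row has at most `window` allowed (zero) positions, ending at the diagonal
    print(sliding_window_mask(6, 3).int())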

Tbh this might be the last generation of architectures designed entirely by humans. I dug into that (1) and might add another paragraph based on this if I find the time. The Big LLM Architecture Comparison (2) by Sebastian Raschka already inspired the OG image for the blog - thanks again!

(1) https://philippdubach.com/posts/the-last-architecture-design...

(2) https://magazine.sebastianraschka.com/p/the-big-llm-architec...

[−] charcircuit 62d ago
I'm surprised at how similar all of them are, with the main differences being the sizes of the layers.