LLM Architecture Gallery

(sebastianraschka.com)

296 points | by tzury 12 hours ago

21 comments

libraryofbabel 4 hours ago
This is great - always worth reading anything from Sebastian. I would also highly recommend his Build an LLM From Scratch book. I feel like I didn’t really understand the transformer mechanism until I worked through that book.
On the LLM Architecture Gallery, it’s interesting to see the variations between models, but I think the 30,000ft view of this is that in the last seven years since GPT-2 there have been a lot of improvements to LLM architecture but no fundamental innovations in that area. The best open weight models today still look a lot like GPT-2 if you zoom out: it’s a bunch of attention layers and feed forward layers stacked up.
Another way of putting this is that the astonishing improvements in LLM capabilities that we’ve seen over the last 7 years have come mostly from scaling up and, critically, from new training methods like RLVR (reinforcement learning with verifiable rewards), which is responsible for coding agents going from barely working to amazing in the last year.
That’s not to say that architectures aren’t interesting or important or that the improvements aren’t useful, but it is a little bit of a surprise, even though it shouldn’t be at this point because it’s probably just a version of the Bitter Lesson.
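To make the "attention layers and feed forward layers stacked up" picture concrete, here is a minimal sketch of a GPT-2-style pre-norm decoder stack in plain Python. It is an illustration under loud simplifications, not any model's actual implementation: single-head attention, ReLU instead of GELU, no biases, no learned layer-norm parameters, no embeddings or output head. All function names (`block`, `run_stack`, `init_block`, and so on) are hypothetical.

```python
import math
import random

def matmul(a, b):
    # (n x d) times (d x m) on nested lists
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(x, eps=1e-5):
    # Normalize each token vector; learned scale/shift omitted (assumption)
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def causal_attention(x, wq, wk, wv, wo):
    # Single-head attention; token i only attends to tokens 0..i
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        weights = softmax(scores)
        out.append([sum(w * v[j][c] for j, w in enumerate(weights))
                    for c in range(d)])
    return matmul(out, wo)

def feed_forward(x, w1, w2):
    # Position-wise MLP with ReLU (GPT-2 itself uses GELU)
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

def add(x, y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def block(x, p):
    # Pre-norm residual block: attention sublayer, then feed-forward sublayer
    x = add(x, causal_attention(layer_norm(x), p["wq"], p["wk"], p["wv"], p["wo"]))
    x = add(x, feed_forward(layer_norm(x), p["w1"], p["w2"]))
    return x

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0, 0.02) for _ in range(cols)] for _ in range(rows)]

def init_block(d_model, d_ff, rng):
    return {
        "wq": rand_matrix(d_model, d_model, rng),
        "wk": rand_matrix(d_model, d_model, rng),
        "wv": rand_matrix(d_model, d_model, rng),
        "wo": rand_matrix(d_model, d_model, rng),
        "w1": rand_matrix(d_model, d_ff, rng),
        "w2": rand_matrix(d_ff, d_model, rng),
    }

def run_stack(x, blocks):
    # The whole "architecture": the same block applied N times
    for p in blocks:
        x = block(x, p)
    return x

rng = random.Random(0)
blocks = [init_block(8, 32, rng) for _ in range(3)]
tokens = rand_matrix(5, 8, rng)  # 5 token vectors of width 8
hidden = run_stack(tokens, blocks)
```

Most of the variations in the gallery swap out pieces inside `block` (attention variant, normalization placement, dense versus routed feed-forward) while leaving `run_stack` essentially unchanged.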
iroddis 6 hours ago
This is amazing, such a nice presentation. It reminds me of the Neural Network Zoo [1], which was also a nice visualization of different architectures.
[1] https://www.asimovinstitute.org/neural-network-zoo/
jawarner 15 minutes ago
Looks like this may have received the HN Hug of Death. I'm getting a "Too Many Requests" error trying to load the images.
- brianjking 14 minutes ago
  I'm getting that trying to load the content at all, text included.
wood_spirit 7 hours ago
Lovely!
Is there a sort order? It would be so nice to understand the threads of evolution and revolution in the progression. Maybe a family tree or influence layout? It would also be nice to have a scaled view so you can sense the difference in sizes over time.
- krackers 6 hours ago
  There is https://magazine.sebastianraschka.com/p/technical-deepseek which shows the evolution of the DeepSeek family.
gasi 6 hours ago
So cool — thanks for sharing! Here’s a zoomable version of the diagram: https://zoomhub.net/LKrpB
nxobject 4 hours ago
Thank you so much! As a (bio)statistician, I've always wanted a "modular" way to go from "neural networks approximate functions" to a high-level understanding about how machine learning practitioners have engineered real-life models.
Slugcat 5 hours ago
What tool was used to draw the diagrams?
LuxBennu 4 hours ago
Interesting collection. The architecture differences show up in surprising ways when you actually look at prompt patterns across models. Longer context windows don't just let you write more, they change what kind of input structure works best.
jasonjmcghee 3 hours ago
What's the structurally simplest architecture that has worked to a reasonably competitive degree?
- loveparade 3 hours ago
  Competitiveness doesn't really come from architecture, but from scale, data, and fine-tuning data. There has been little innovation in architecture over the last few years, and most of it aims at making training or inference more efficient (fitting in more data), not at making models "fundamentally smarter".
- bigyabai 3 hours ago
  If your definition of "competitive" is loose enough, you can write your own Markov chain in an evening. Transformer models rely on a lot of prior art that has to be learned incrementally.
  - jasonjmcghee 3 hours ago
    Not that loose lol.
    I’m thinking it’s still a Llama-style dense decoder-only transformer.
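For reference, the "Markov chain in an evening" baseline mentioned in this sub-thread really is tiny. The sketch below is one possible word-level version; the function names `train_markov` and `generate` and the toy corpus are hypothetical, chosen only for illustration.

```python
import random
from collections import defaultdict

def train_markov(tokens, order=2):
    # Map each run of `order` tokens to the list of tokens seen right after it
    table = defaultdict(list)
    for i in range(len(tokens) - order):
        state = tuple(tokens[i:i + order])
        table[state].append(tokens[i + order])
    return table

def generate(table, seed, length, rng):
    # Repeatedly sample a follower of the last `order` tokens
    out = list(seed)
    order = len(seed)
    for _ in range(length):
        candidates = table.get(tuple(out[-order:]))
        if not candidates:
            break  # unseen state: nothing to sample from
        out.append(rng.choice(candidates))
    return out

corpus = "the cat sat on the mat".split()
table = train_markov(corpus, order=1)
sample = generate(table, seed=("cat",), length=3, rng=random.Random(0))
```

The model is just a lookup table of observed continuations, which is why it only counts as "competitive" under a very loose definition.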
travisgriggs 4 hours ago
Darn. I clicked here hoping we were having LLMs design skyscrapers, dams, and bridges.
I even brought my popcorn :(
arikrahman 2 hours ago
Thank you for the high quality diagrams!
jrvarela56 4 hours ago
Would be awesome to see something like this for agents/harnesses
charcircuit 5 hours ago
I'm surprised at how similar all of them are, with the main differences being the size of the layers.
mvrckhckr 7 hours ago
What a great idea and nice execution.
neuroelectron 4 hours ago
An older post from this blog (the linked article was updated recently): https://news.ycombinator.com/item?id=44622608
stainlu 1 hour ago
[flagged]
- lambda 40 minutes ago
  Where are you seeing dense? Most of the larger competitive models are sparse. Sure, the smaller models are dense, but over 30B it's pretty much all sparse MoE.
  And there are still plenty of hybrid architectures. Nemotron 3 Super 120B A12B just came out; it's mostly Mamba with a few attention layers, and it's pretty competitive for its size class.
  But yeah, these different architectures seem to be relatively small micro-optimizations for how a model performs on different hardware, or different tradeoffs in how it scales with the context window; most of the actual differentiation seems to be in the training pipeline.
  We are seeing substantial increases in performance without continuing to scale up further. We've hit 1T parameters in open models, but smaller models still outperform them thanks to better and better training pipelines.
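As a rough illustration of the dense versus sparse MoE distinction raised above, here is a minimal sketch of top-k expert routing for a single token. It rests on stated assumptions: router logits are passed in directly (in a real model a learned linear layer computes them from the token), the softmax is taken over the selected experts only (models differ on this), and the experts are toy scaling functions. The names `top_k_route` and `moe_layer` are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    # Pick the k highest-scoring experts and normalize their weights
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_layer(token, router_logits, experts, k=2):
    # Only the k selected experts run; the rest cost nothing for this token
    out = [0.0] * len(token)
    for idx, weight in top_k_route(router_logits, k):
        expert_out = experts[idx](token)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out

# Toy experts: each one just scales the token by a different factor
experts = [lambda t, s=s: [s * v for v in t] for s in (1.0, 2.0, 3.0, 4.0)]
result = moe_layer([1.0, 2.0], [0.0, 5.0, 5.0, -1.0], experts, k=2)
```

With 4 experts and `k=2`, half of the expert parameters are touched per token, which is the same idea behind a name like "120B A12B": roughly 120B total parameters with about 12B active per token.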
FailMore 9 hours ago
Thanks! This is cool. Can you tell me if you learnt anything interesting or surprising when pulling this together? As in, did it teach you something about LLM architecture that you didn't know before you began?