Transformers Are Bayesian Networks (arxiv.org)

by Anon84 33 comments 40 points

[−] warypet 52d ago
I found this earlier today when looking for research and ended up reporting it for citing fake sources. Please correct me if I'm wrong, but I couldn't find "[9] Jongsuk Jung, Jaekyeom Kim, and Hyunwoo J. Choi. Rethinking attention as belief propagation. In International Conference on Machine Learning (ICML), 2022." anywhere else on the internet.
[−] kurthr 52d ago
Yep, nothing by even a subset of those authors. The closest paper from that conference:

"Rethinking Attention-Model Explainability through Faithfulness Violation Test" by Yibing Liu, Haoliang Li, Yangyang Guo, Chenqi Kong, Jing Li, Shiqi Wang

https://proceedings.mlr.press/v162/liu22i.html

https://icml.cc/virtual/2022/spotlight/18082

[−] measurablefunc 52d ago
It's "vibe" research. Most of it is basically pure nonsense.
[−] kleiba 52d ago
Care to elaborate?
[−] handedness 51d ago
The Coefficient of Sketch here feels pretty high: https://xcancel.com/gregcoppola5d
[−] getnormality 52d ago

> Transformers are the dominant architecture in AI, yet why they work remains poorly understood. This paper offers a precise answer: a transformer is a Bayesian network.

Why would being a Bayesian network explain why transformers work? Bayesian networks existed long before transformers and never achieved their performance.
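
To make that concrete (my own toy reading of the claim, not the paper's construction): any autoregressive model already defines a Bayesian network via the chain rule, p(x_1..x_n) = prod_t p(x_t | x_<t), i.e. a DAG in which each token's parents are all earlier tokens. So the label by itself can't be what explains why the transformer parameterization of those conditionals works so well.

    # Toy chain-rule model over a 2-symbol vocabulary (made-up numbers).
    # The factorization below is exactly Bayesian-network semantics,
    # whether the conditionals come from a lookup table or a transformer.
    from math import prod

    cond = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"a": 0.2, "b": 0.8},
        ("b",): {"a": 0.7, "b": 0.3},
        ("a", "b"): {"a": 0.5, "b": 0.5},
        ("b", "a"): {"a": 0.1, "b": 0.9},
    }

    def joint(seq):
        # p(x_1..x_n) = prod_t p(x_t | x_1..x_{t-1})
        return prod(cond[tuple(seq[:t])][seq[t]] for t in range(len(seq)))

    print(joint(["a", "b", "a"]))  # 0.6 * 0.8 * 0.5 = 0.24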

[−] malcolmgreaves 56d ago

> Hallucination is not a bug that scaling can fix. It is the structural consequence of operating without concepts.

NNs are as close to continuous as we can get with discrete computing. They’re flexible and adaptable and can contain many “concepts.” But their chief strength is also their chief weakness: these “concepts” are implicit. I wonder if we can get a hybrid architecture that has the flexibility of NNs while retaining discrete concepts like a knowledge base does.
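
A very rough sketch of the kind of hybrid I mean, purely hypothetical (the KB, names, and threshold are all made up): let the network propose, but only assert an answer when an explicit, discrete knowledge base backs it up, and abstain otherwise.

    # Hypothetical neuro-symbolic hybrid: the NN proposes, a discrete KB grounds.
    KB = {
        ("paris", "capital_of"): "france",
        ("tokyo", "capital_of"): "japan",
    }

    def nn_propose(question):
        # stand-in for a neural generator: returns (guess, confidence)
        return ("france", 0.62) if "paris" in question else ("austria", 0.55)

    def answer(question, entity, relation):
        guess, confidence = nn_propose(question)
        fact = KB.get((entity, relation))
        if fact is not None:
            return fact                   # explicit concept wins when present
        if confidence > 0.9:
            return guess                  # fall back to the NN only when very sure
        return "no supported answer"      # abstain rather than hallucinate

    print(answer("what is paris the capital of?", "paris", "capital_of"))
    print(answer("what is sydney the capital of?", "sydney", "capital_of"))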

[−] westurner 56d ago
https://news.ycombinator.com/item?id=45256179 :

> Which statistical models disclaim that their output is insignificant if used with non-independent features? Naive Bayes [...]

Ironic, then, because if transformers are Bayesian networks, then we're using Bayesian networks for non-independent features.
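
Worth spelling out, though: Naive Bayes is the degenerate case whose graph forces the features to be conditionally independent given the class, while a general Bayesian network encodes the dependencies explicitly. A toy contrast (my numbers, nothing from the linked threads): with two perfectly correlated binary features, the naive factorization treats the duplicate as fresh evidence and gets the joint wrong.

    # Two binary features where x2 is always a copy of x1 (fully dependent).
    p_x1 = {0: 0.5, 1: 0.5}

    def true_joint(x1, x2):
        # dependence modelled explicitly: x2 carries no new information
        return p_x1[x1] if x1 == x2 else 0.0

    def naive_joint(x1, x2):
        # independence assumed: the copy is counted as independent evidence
        return p_x1[x1] * p_x1[x2]

    print(true_joint(1, 1), naive_joint(1, 1))  # 0.5 vs 0.25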

From "Quantum Bayes' rule and Petz transpose map from the minimum change principle" (2025) https://news.ycombinator.com/item?id=45074143 :

> Petz recovery map: https://en.wikipedia.org/wiki/Petz_recovery_map :

> In quantum information theory, a mix of quantum mechanics and information theory, the Petz recovery map can be thought of as a quantum analog of Bayes' theorem
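
For reference, the usual form of the map (my transcription, so check the article): for a channel N and prior state σ,

    P_{σ,N}(ω) = σ^{1/2} N†( N(σ)^{-1/2} ω N(σ)^{-1/2} ) σ^{1/2}

where N† is the adjoint of N; when all the operators commute this collapses to the classical posterior p(a|b) = p(b|a) p(a) / p(b).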

But there aren't yet enough qubits for quantum LLMs: https://news.ycombinator.com/item?id=47203219#47250262

"Transformer is a holographic associative memory" (2025) https://news.ycombinator.com/item?id=43028710#43029899

[−] wklm 52d ago
I like their definition of hallucination.
[−] tug2024 52d ago
[dead]