I found this earlier today when looking for research and ended up reporting it for citing fake sources. Please correct me if I'm wrong, but I couldn't find "[9] Jongsuk Jung, Jaekyeom Kim, and Hyunwoo J. Choi. Rethinking attention as belief propagation. In International Conference on Machine Learning (ICML), 2022." anywhere else on the internet.
"Rethinking Attention-Model Explainability through Faithfulness Violation Test"
Yibing Liu, Haoliang Li, Yangyang Guo, Chenqi Kong, Jing Li, Shiqi Wang
https://proceedings.mlr.press/v162/liu22i.html
https://icml.cc/virtual/2022/spotlight/18082
> Transformers are the dominant architecture in AI, yet why they work remains poorly understood. This paper offers a precise answer: a transformer is a Bayesian network.
Why would being a Bayesian network explain why transformers work? Bayesian networks existed long before transformers and never achieved their performance.
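To make the skepticism concrete (my own sketch, not the paper's construction): about the only precise sense in which softmax attention looks Bayesian is that, with a uniform prior over keys and dot-product logits read as log-likelihoods, the attention weights are exactly a posterior over which key matches the query. That's a reading of softmax, though, not an explanation of transformer performance.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 4
q = rng.normal(size=d)          # one query vector
K = rng.normal(size=(3, d))     # three key vectors

# Attention logits: scaled dot products, read as log-likelihoods
# log p(q | key_i) up to a constant.
logits = K @ q / np.sqrt(d)

# With a uniform prior over keys, Bayes' rule gives
# p(key_i | q) ∝ p(q | key_i) — which is exactly the softmax.
attn = softmax(logits)

prior = np.ones(3) / 3
posterior = prior * np.exp(logits)
posterior /= posterior.sum()

assert np.allclose(attn, posterior)
```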
> Hallucination is not a bug that scaling can fix. It is the structural consequence of operating without concepts.
NNs are as close to continuous as we can get with discrete computing. They’re flexible and adaptable and can contain many “concepts.” But their chief strength is also their chief weakness: these “concepts” are implicit. I wonder if we can get a hybrid architecture that has the flexibility of NNs while retaining discrete concepts like a knowledge base does.
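One toy version of such a hybrid is retrieval: a continuous scorer chooses among discrete, human-readable entries, so the "concept" the system acts on is an explicit knowledge-base item rather than something implicit in the weights. A minimal sketch, with random unit vectors standing in for a learned encoder:

```python
import numpy as np

# Toy hybrid: a continuous similarity model selects from a discrete,
# inspectable knowledge base. The result is an explicit KB entry, not
# a pattern buried in the weights. (Illustrative only: the embeddings
# are random stand-ins for a trained encoder.)

keys = [
    "water boils at 100 C at sea level",
    "transformers use attention",
    "Paris is the capital of France",
]

rng = np.random.default_rng(1)
d = 8
emb = {}
for k in keys:
    v = rng.normal(size=d)
    emb[k] = v / np.linalg.norm(v)   # unit vectors -> cosine scoring

def retrieve(query_vec):
    # Continuous scoring over discrete entries; the winner is an
    # explicit, auditable fact.
    scores = {k: float(emb[k] @ query_vec) for k in keys}
    return max(scores, key=scores.get)

query = emb["transformers use attention"] + 0.01 * rng.normal(size=d)
print(retrieve(query))  # nearest discrete fact to the noisy query
```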
> Which statistical models disclaim that their output is insignificant if used with non-independent features? Naive Bayes [...]
Ironic, then, because if transformers are Bayesian networks, we're using Bayesian networks on non-independent features.
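The quoted disclaimer is easy to demonstrate: under the conditional-independence assumption, feeding Naive Bayes the same feature twice (maximal dependence) double-counts the evidence and inflates its confidence. A minimal sketch with made-up likelihoods:

```python
import numpy as np

# Naive Bayes assumes features are conditionally independent given the
# class. Observe the *same* binary feature twice (maximal dependence)
# and the evidence gets double-counted.

p_f_given_c = np.array([0.9, 0.3])   # p(feature=1 | class A), p(feature=1 | class B)
prior = np.array([0.5, 0.5])

def nb_posterior(n_observations):
    # Independence assumption: likelihoods just multiply.
    like = p_f_given_c ** n_observations
    post = prior * like
    return post / post.sum()

honest = nb_posterior(1)   # one genuine observation
double = nb_posterior(2)   # the same observation counted twice
print(honest[0], double[0])  # 0.75 vs 0.9: duplication inflates confidence
```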
From "Quantum Bayes' rule and Petz transpose map from the minimum change principle" (2025) https://news.ycombinator.com/item?id=45074143 :
> Petz recovery map: https://en.wikipedia.org/wiki/Petz_recovery_map :
> In quantum information theory, a mix of quantum mechanics and information theory, the Petz recovery map can be thought of as a quantum analog of Bayes' theorem
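For reference, the Petz recovery map for a channel $\mathcal{N}$ and prior state $\sigma$, as given on the linked Wikipedia page, is:

```latex
\mathcal{P}_{\sigma,\mathcal{N}}(X)
  = \sigma^{1/2}\,
    \mathcal{N}^{\dagger}\!\left(
      \mathcal{N}(\sigma)^{-1/2}\, X\, \mathcal{N}(\sigma)^{-1/2}
    \right)
    \sigma^{1/2}
```

When all the operators commute, this reduces to the classical Bayes update $p(x \mid y) = p(x)\,p(y \mid x)/p(y)$.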
But there aren't yet enough qubits for quantum LLMs: https://news.ycombinator.com/item?id=47203219#47250262
"Transformer is a holographic associative memory" (2025) https://news.ycombinator.com/item?id=43028710#43029899