Let's say you're a company providing an internet connection to a business. The business trusts you, so there's only compression of bits over the wire, not encryption, and you're aware of the compression scheme the business is using to send its bits to you. You're charging the business a premium for using the line you manage, but you also lease the line, so it's in your interest to compress what they give you as well as possible so as to make a profit.
Say the business's compression scheme is imperfect. They have a Huffman coding built from their (imperfect) model of the tokens they send, call it q(x) (that is, they think token x shows up with probability q(x)). You've determined the true distribution, p(x) (token x actually shows up with probability p(x)).
The business's tokens show up with probability p(x), but they encode token x with roughly -lg(q(x)) bits, giving an average token bit size of:
-\sum _ x p(x) lg(q(x))
If you then use an optimal Huffman encoding, you will send tokens with average bit length of:
-\sum _ x p(x) lg(p(x))
How many bits, on average, do you save? Just the difference:
-\sum _ x p(x) lg(q(x)) - (-\sum _ x p(x) lg(p(x))) = \sum _ x p(x) lg(p(x)/q(x))
Which is the Kullback-Leibler divergence.
To me, this is a much more intuitive explanation. I made a blog post about it [0], if anyone cares.
[0] https://mechaelephant.com/dev/Kullback-Leibler-Divergence.ht...
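To make the accounting concrete, here's a minimal Python sketch (my own illustration, using idealized code lengths of -lg q(x) bits rather than actual Huffman codewords, with made-up token frequencies):

    import math

    # Hypothetical token frequencies (illustrative only).
    p = {"a": 0.5, "b": 0.25, "c": 0.25}   # true distribution
    q = {"a": 0.25, "b": 0.25, "c": 0.5}   # the business's model

    # Average bits per token if codewords are -lg q(x) bits long
    # but tokens actually arrive with frequency p(x).
    bits_with_q = -sum(p[x] * math.log2(q[x]) for x in p)

    # Average bits per token with a code matched to p.
    bits_with_p = -sum(p[x] * math.log2(p[x]) for x in p)

    # The per-token saving is exactly D(p||q).
    kl = sum(p[x] * math.log2(p[x] / q[x]) for x in p)

    print(bits_with_q - bits_with_p)   # 0.25
    print(kl)                          # 0.25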
To rephrase what you wrote in plain English: you are Amazon, a client stores .zip files in an S3 bucket and pays by the byte, you re-compress and store the data as .7z files, and the KL divergence is related to zip_file_size - 7z_file_size, your "win".
Wow, this is really great. I just realised last week that MLE can be motivated with the KL divergence between the true distribution and the approximation. My mind was blown by how obvious that connection was.
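In case it helps anyone else, a quick sketch of that connection (standard argument, my own summary): fit a model q_\theta to samples from the true distribution p. Then

argmax_\theta E_{x~p}[lg q_\theta(x)] = argmin_\theta ( -\sum _ x p(x) lg(q_\theta(x)) ) = argmin_\theta ( H(p) + D(p||q_\theta) ) = argmin_\theta D(p||q_\theta)

since H(p) does not depend on \theta. Replacing the expectation over p with the empirical average over the observed samples turns the left-hand side into the average log-likelihood, so maximising likelihood is (up to sampling error) minimising the KL divergence from the true distribution to the model.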
If you ask me, the quickest way to explain KL divergence is this:
If two distributions are the same, the KL divergence is 0.
KL quantifies how many nats of difference there are between a target distribution and a source distribution.
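A tiny worked example of both points (numbers made up), in nats since natural log is used:

    import math

    def kl_nats(p, q):
        # D(p||q) in nats; assumes dicts over the same outcomes, with q(x) > 0 wherever p(x) > 0.
        return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

    fair   = {"heads": 0.5, "tails": 0.5}
    biased = {"heads": 0.7, "tails": 0.3}

    print(kl_nats(fair, fair))     # 0.0  -- identical distributions
    print(kl_nats(fair, biased))   # ~0.087 nats between target and source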
It’s always good to read through the original information theoretic work. Most of AI is copycats with more compute anyways.
Unfortunately all these intuitions rely on a distinction between a "true" distribution P and a "false" distribution Q. So they don't work for a subjective probability interpretation where it doesn't make sense to speak of a true or false distribution.
KL(P||Q) penalizes Q heavily when it assigns low probability to things P considers likely, but barely cares when Q wastes probability on rare events. That's why KL regularization in RLHF pushes models toward typical, average-sounding outputs.
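A rough sketch of how that penalty typically enters the objective (implementations differ in which direction of the KL they use and how they estimate it; the numbers and the beta value here are made up):

    import math

    def kl(p, q):
        # D(p||q) over one token's vocabulary distribution, in nats.
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    # Toy 3-token vocabulary: the fine-tuned policy has drifted toward token 0.
    policy    = [0.80, 0.15, 0.05]
    reference = [0.50, 0.30, 0.20]

    beta = 0.1      # penalty strength (hypothetical value)
    reward = 1.0    # reward-model score for the sampled output
    shaped = reward - beta * kl(policy, reference)
    print(shaped)   # reward minus the price paid for drifting from the reference model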
For those wondering where this is practically relevant: this is the basic metric used to compare quantizations of various LLM models - what is the KL divergence of a 4-bit quantization versus an 8-bit one versus the original 16-bit one.
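As a toy illustration of that measurement (this just perturbs some logits to stand in for quantization error; it is not how any actual quantizer works, and in practice the divergence is typically averaged over many token positions of an evaluation text):

    import math

    def softmax(logits):
        m = max(logits)
        exps = [math.exp(z - m) for z in logits]
        s = sum(exps)
        return [e / s for e in exps]

    def kl(p, q):
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    # Toy next-token logits from the full-precision model.
    logits_fp16 = [2.13, 0.87, -0.42, -1.95]

    # Crude stand-in for quantization error: round each logit to one decimal place.
    logits_quant = [round(z, 1) for z in logits_fp16]

    p = softmax(logits_fp16)    # reference distribution
    q = softmax(logits_quant)   # distribution after "quantization"
    print(kl(p, q))             # near zero => the quantized model predicts almost the same tokens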
> D(P||Q) = measure of how much our model Q differs from the true distribution P. In other words, we care about how much P and Q differ from each other in the world where P is true, which explains why KL-div is not symmetric.
I don't think this particular interpretation actually makes sense or would explain why KL divergence is not symmetric.
First of all, the "difference" between P and Q would be the same independently of whether P, Q, or some other distribution is the "true" distribution.
For example, assume we have a coin and P(Heads)=0.4 and Q(Heads)=0.6. Now the difference between the two distributions is clearly the same irrespective of whether P, Q or neither is "true". So this interpretation doesn't explain why the KL divergence is asymmetric.
Second, there are plausible cases where it arguably doesn't even make sense to speak of a "true" distribution in the first place.
For example, consider the probability that there was once life on Mars. Assume P(Life)=0.4 and Q(Life)=0.6. What would it even mean for P to be "true"? P and Q could simply represent the subjective beliefs of two different people, without any requirement of assuming that one of these probabilities could be "correct".
Clearly the KL divergence can still be calculated and presumably sensibly interpreted even in the subjective case. But the interpretations in this article don't help us here since they require objective probabilities where one distribution is the "true" one.
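For what it's worth, a quick numeric check of the coin example (my own sketch, in nats): this particular pair happens to give the same value in both directions because the two coins are mirror images of each other, while a more lopsided pair shows the two directions coming apart:

    import math

    def kl_bernoulli(p, q):
        # D(P||Q) for coins with P(Heads)=p and Q(Heads)=q, in nats.
        return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

    print(kl_bernoulli(0.4, 0.6))   # ~0.081
    print(kl_bernoulli(0.6, 0.4))   # ~0.081  -- same, by the symmetry of this particular pair

    print(kl_bernoulli(0.1, 0.5))   # ~0.368
    print(kl_bernoulli(0.5, 0.1))   # ~0.511  -- different: KL is not symmetric in general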
Nice writeup. One thing I've been exploring is how information-theoretic measures connect to physics — specifically, the KL divergence between a "true" vacuum distribution and a perturbed one gives you coupling constants. In the Fibonacci-structured potential V(s) = v⁴(s−s₀)²/(1−s−s²), the strong coupling αₛ = 1/(2φ³) emerges exactly as the curvature at the vacuum divided by 2. The information-geometric interpretation is that αₛ measures how "distinguishable" the vacuum is from the pole — a Fisher metric on the space of potentials.
Probably a stretch, but it's interesting how divergence measures keep showing up in unexpected places.
Apologies for the snark, but I can't fathom how someone who is aware of the definition of KL doesn't see the likelihood in it.
> So minimising the cross entropy over theta is the same as maximising KL(P,Q)
Minimising*