The way I understand this is that adding random variables is a smoothing operation on their densities (more generally their distributions, but let me speak of densities only).
A little more formally, additions over random variables are convolutions of their densities. Repeated additions are repeated convolutions.
A single convolution can be understood as a matrix multiplication by a specific symmetric matrix. Repeated convolutions are therefore repeated matrix multiplications.
Anyone familiar with linear algebra will know that repeated matrix multiplication by a non-degenerate matrix reveals its eigenvectors.
The Gaussian distribution is such an eigenvector. Just like an eigenvector, it is also a fixed point -- multiplying again by the same matrix will lead to the same vector, just scaled. The Gaussian distribution convolved is again a Gaussian distribution.
The addition operation in averaging is a matrix multiplication in the distribution space and the division by the 'total' in the averaging takes care of the scaling.
Linear algebra is amazing.
PageRank is an eigenvector of the normalised web adjacency matrix. The Gaussian distribution is the eigenvector of the infinite averaging matrix. Essentially the same idea.
Convolution alone does not smooth. E.g., consider a random variable supported on the points 0 and 1 (delta masses at 2 points). No matter how many convolutions you do, you still have support on the integers - not smooth at all. You need appropriate rescaling to get a Gaussian.
Also, convolving a distribution with itself is NOT a linear operation, hence cannot be described by a matrix multiplication with a fixed matrix.
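A small numeric sketch of both points, treating plain Python lists as probability mass functions on 0, 1, 2, ... (the helper name `convolve` and the example weights are illustrative assumptions, not anything from the article):

```python
def convolve(p, q):
    """PMF of X + Y for independent X ~ p and Y ~ q, both on 0, 1, 2, ..."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

coin = [0.5, 0.5]  # delta masses at the points 0 and 1

# Repeated convolution: the sum of 4 coins still lives on the integers 0..4.
pmf = coin
for _ in range(3):
    pmf = convolve(pmf, coin)
print(pmf)  # binomial(4, 1/2) weights, no mass between integers

# Self-convolution f -> f*f is not linear: doubling f quadruples f*f,
# whereas a linear map would only double it.
f = [0.2, 0.8]
doubled = [2 * x for x in f]
lhs = convolve(doubled, doubled)        # (2f)*(2f)
rhs = [2 * x for x in convolve(f, f)]   # 2(f*f)
print(lhs, rhs)
```

Convolving with one fixed kernel is linear in the other argument; it is the map that convolves a distribution with itself that fails linearity here.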
If you are not speaking in jest (I strongly suspect you are), knowledge of linear algebra is one of the biggest bangs for the buck one can get as an investment in mathematical knowledge.
So humble and basic a field. So wide its consequences and scope.
Their point was that "familiarity" apparently means different things for different people :P Someone using linalg in computer graphics applications may say they're familiar with it even though they've never heard the term "eigenvector". I'm not actually sure about what you mean – how does repeated multiplication reveal eigenvectors?
Ah… that "diagonalizable" is doing some heavy lifting there! I was wondering how exactly you’re going to make, say, a rotation matrix to converge anything to anything that’s not already an eigenvector. And rotation matrices certainly aren’t degenerate! Though apparently non-diagonalizable matrices can be called defective which is such a dismissive term :( Poor rotation matrices, why are they dissed so?!
Take logarithm of the eigenvalues and you get back the angle. This to me had solidified the notion that angles are essentially a logarithmic notion ... Made more rigorous by the notion of exponential maps
My first sentence was in jest. I've used LA for various things, but haven't had many dealings with eigenvectors. So that information was genuinely new to me.
I remember my professor talking about eigenvectors in Linear Algebra and it's been 50 years - though I barely remember anything else from that class. It was taught very early on in the course and eventually we used them all the time to solve problems.
Great article. Personally I have been learning more about the mathematics of beyond-CLT scenarios (fat tails, infinite variance etc)
The great philosophical question is why the CLT applies so universally. The article explains it well as a consequence of the averaging process.
Alternatively, I’ve read that natural processes tend to exhibit Gaussian behaviour because there is a tendency towards equilibrium: forces, homeostasis, central potentials and so on, and this equilibrium drives the measurable into the central region.
For processes such as prices in financial markets, with complicated feedback loops and reflexivity (in the Soros sense), the probability mass tends to end up in the non-central region, where the CLT does not apply.
This applies even when the variance is not finite.
Note that independence and identical distribution are not necessary for the Central Limit Theorem to hold. They are sufficient conditions, not necessary ones; however, they do speed up the convergence a lot.
The Gaussian distribution is a special case of an infinitely divisible distribution and is the most analytically tractable one in that family.
Whereas averaging gives you a Gaussian as long as the original distribution is somewhat benign, the MAX operator also has nice limiting properties: maxima converge to one of three forms of limiting distribution, Gumbel being one of them.
The general form of the limiting distribution when you take the MAX of a sufficiently large sample is the family of extreme value distributions.
The article doesn't share the actual math, but also not the relatively easy intuition. When you roll a pair of dice, there are more combinations that add up to 7 than any other number. Change the numbers on the dice (change the 1 to a 6, e.g.), there's again more combinations that add up to some numbers than to others. The histogram of the number of combinations that add up to different results is a bell curve. That's why it pops up everywhere you have addition of independent events.
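The counting argument can be checked by brute force. This is a minimal sketch assuming fair six-sided dice; the function name `sum_counts` is made up for illustration. With only two dice the histogram is a triangle peaked at 7, and the rounded bell shape only starts to show as more dice are added:

```python
from collections import Counter
from itertools import product

def sum_counts(faces, n_dice):
    """Count how many ordered rolls of n_dice dice produce each total."""
    return Counter(sum(roll) for roll in product(faces, repeat=n_dice))

# Two dice: 7 has the most combinations (6 of the 36 ordered rolls).
two = sum_counts(range(1, 7), 2)
for total in sorted(two):
    print(f"{total:2d} {'#' * two[total]}")

# Three dice: the peak flattens and rounds off around 10 and 11.
three = sum_counts(range(1, 7), 3)
for total in sorted(three):
    print(f"{total:2d} {'#' * three[total]}")
```

Passing a different `faces` sequence (for example `[6, 2, 3, 4, 5, 6]`) reproduces the "change the numbers on the dice" variation.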
It's sad that even introductory statistics courses skip this simple intuition.
"Order in Apparent Chaos. I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the "Law of Frequency of Error." The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason."
On opening the article, I was somehow expecting a mention of the large deviations formalism, which was (is?) fashionable in the late 20th century and gives a nice information-theoretic view of the CLT. Or something like that. There's a ton of deep math there. So having a biostatistician say "look, the CLT is cool" is a bit underwhelming.
Edit: see eg John Baez's write-up What is Entropy? about the entropy maximization principle, where gaussians make an entrance.
Causes mostly add up: molecular kinetic energies aggregate to temperature, collisions to pressure, imperfections to measurement errors, etc.
So, normal or CLT is the attractor state for the unexceptional world.
BUT for the exceptional world, causes multiply or cascade: earthquake magnitudes, network connectivity, etc. So, you get log-normal or fat-tailed.
Sorry, does the article actually give reasons why the bell curve is "everywhere"?
For simplicity, take n independent, identically distributed random variables that are uniform on the interval [-1/2,1/2], so the probability density function, f(x), is 1 on [-1/2,1/2].
The Fourier transform of f(x), F(w), is essentially sin(w)/w. Taking only the first few terms of the Taylor expansion, ignoring constants, gives (1-w^2).
Convolution is multiplication in Fourier space, so you get (1-w^2)^n. Squinting, (1-w^2)^n ~ (1-n w^2 / n)^n ~ exp(-n w^2). The Fourier transform of a Gaussian is a Gaussian, so the result holds.
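The squinting can be made a little more concrete numerically. The sketch below assumes the exact transform of the uniform density on [-1/2,1/2], which is sin(w/2)/(w/2) = 1 - w^2/24 + ..., and rescales the sum by sqrt(n) so the variance stays at 1/12; the function names are illustrative only. It compares the n-fold product against the transform of a Gaussian with the same variance:

```python
import math

def phi_uniform(w):
    """Fourier transform (characteristic function) of Uniform[-1/2, 1/2]."""
    if w == 0.0:
        return 1.0
    return math.sin(w / 2) / (w / 2)

def phi_scaled_sum(w, n):
    """Transform of (X_1 + ... + X_n) / sqrt(n): an n-fold product."""
    return phi_uniform(w / math.sqrt(n)) ** n

def phi_gauss(w):
    """Transform of a zero-mean Gaussian with variance 1/12."""
    return math.exp(-w * w / 24)

# Largest gap between the two transforms over a grid of w in [0, 6].
grid = [k / 10 for k in range(61)]
errors = {}
for n in (1, 4, 16, 64):
    errors[n] = max(abs(phi_scaled_sum(w, n) - phi_gauss(w)) for w in grid)
    print(n, errors[n])
```

The gap shrinks roughly like 1/n, which is the (1 - w^2/n)^n -> exp(-w^2) step with the constants put back in.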
Unfortunately I haven't worked it out myself but I've been told if you fiddle with the exponent of 2 (presumably choosing it to be in the range of (0,2]), this gives the motivation for Levy stable distributions, which is another way to see why fat-tailed/Levy stable distributions are so ubiquitous.
This is one of my favorite philosophical questions to ponder. I always ask it in interviews as a warmup to get candidates' thoughts. I’ve noticed that interviewees often curl up, thinking it’s a technical question, so I’ve been revising the question from one interview to the next to make it less scary. The interviews are for data scientist roles.
> suppose that a large sample of observations is obtained, each observation being randomly produced in a way that does not depend on the values of the other observations, and the average (arithmetic mean) of the observed values is computed. If this procedure is performed many times, resulting in a collection of observed averages, the central limit theorem says that if the sample size is large enough, the probability distribution of these averages will closely approximate a normal distribution.
A requirement is multiple independent influences. An example of what shouldn't target a normal distribution is a single course's grade outcomes: having one teacher and a defined curriculum goes against that. Yes, there is variability in student effort and aptitude. But a top-tier university selects its students based on some merit, so its student body isn't random. There are airheads who were dragged over the finish line with connections and family money, and some students fall prey to substance abuse and mental illness. I'd argue for a different distribution, recognizing that a skilled teacher can get a class grade distribution centered around at least a B, if not a B+ or A-. I feel grading on the curve and limiting A's to a fixed percentage can encourage bad test design or, worse, bad grading.
Hot take: bell curves are everywhere exactly because the math is simple.
The causal chain is: the math is simple -> teachers teach simple things -> students learn what they're taught -> we see the world in terms of concepts we've learned.
The central limit theorem generalizes beyond simple math to hard math:
Levy alpha stable distributions when variance is not finite, the Fisher-Tippett-Gnedenko theorem and Gumbel/Fréchet/Weibull distributions regarding extreme values. Those curves are also everywhere, but we don't see them because we weren't taught them because the math is tough.
Okay at my core I'm an inductionist. However this article is a mere tautology at best.
The article doesn't explain why. It explains a bunch of cases and works backwards to show that the original premise was true. This sounds fine but the end of the article specifically mentioned that this is dangerous because the world doesn't always work like this.
This is the problem with induction, it might work in 99% of cases, I've never seen a Black Swan so there must not be any black swans?
Deduction has more value when it comes to math specifically... I'll admit that as an inductionist.
A little disappointing. All about the history of bell curves, but I don't think it does a very good job explaining why the bell curve appears or the CLT is as it is.
I flinch at "everywhere", particularly when people keep asserting they are places that they aren't (and in fact can't be). Nothing with a hard zero can be normally distributed, for instance, but people will keep insisting quantities with a hard zero are.
100-year floods are not happening more often in most cases; it is just that the central limit theorem teaches us the 10-year flood is almost as high as the 100- or even 1000-year flood.
It's not a bad article, but I have to point something out:
> Laplace distilled this structure into a simple formula, the one that would later be known as the central limit theorem. No matter how irregular a random process is, even if it’s impossible to model, the average of many outcomes has the distribution that it describes. “It’s really powerful, because it means we don’t need to actually care what is the distribution of the things that got averaged,” Witten said. “All that matters is that the average itself is going to follow a normal distribution.”
This is not really true, because the central limit theorem requires a huge assumption: that the random process has finite variance. I believe that distributions that don't satisfy that assumption, which we can call heavy-tailed distributions, are much more common in the real world than this discussion suggests. Pointing out that infinities don't exist in the real world is also missing the point, since a distribution that just has a huge but finite variance will require a correspondingly huge number of samples to start behaving like a normal distribution.
Apart from the universality, the normal distribution has a pretty big advantage over others in practice, which is that it leads to mathematical models that are tractable. To go into slightly more detail, in mathematical modeling, often you define some mathematical model that approximates a real-world phenomenon, but which has some unknown parameters, and you want to determine those parameters in order to complete the model. To do that, you take measurements of the real phenomenon, and you find values for the parameters that best fit the measurements. Crucially, the measurements don't need to be exact, but the distribution of the measurement errors is important. If you assume the errors are independent and normally distributed, then you get a relatively nice optimization problem compared to most other things. This is, in my opinion, about as much responsible for the ubiquity of normal distributions in mathematical modeling as the universality from the central limit theorem.
However, as most people who solve such problems realize, sometimes we have to contend with these things called "outliers," which by another name are really samples from a heavy-tailed distribution. If you don't account for them somehow, then Bad Things(TM) are likely to happen. So either we try to detect and exclude them, or we replace the normal distribution with something that matches the real data a bit better.
Anyway, to connect this all back to the central limit theorem, it's probably fair to say measurement errors tend to be the combined result of many tiny unrelated effects, but the existence of outliers is pretty strong evidence that some of those effects are heavy-tailed and thus we can't rely on the central limit theorem giving us a normal distribution.
a vast amount of fluff for less than a college statistics professor would (hopefully) be able to impart with a chalkboard in 10 minutes, when Quanta has the ability to prepare animated diagrams like 3Blue1Brown but chooses not to use it
they could go down myriad paths, like how it provides that random walks on square lattices are asymptotically isotropic, or give any other simple easy-to-understand applications (like getting an asymptotic on the expected # of rolls of an n-sided die before the first reoccurring face) or explain what a normal distribution is, but they only want to tell a story to convey a feeling
they are a blight upon this world for not using their opportunity to further public engagement in a meaningful way
Bell curves are everywhere because all distributions of any properties clump in some way at some level. The basics of any probability shows this. The result is you “seeing” bell curves everywhere. Aka clumps.
I address scaling, very peripherally, towards the end. Of course, depending on how you scale you end up with distinctly different limit laws.
> Linear algebra is amazing.
The entire control systems theory is basically various applications of linear algebra. Like Kalman Filter that got us to the moon. Simply amazing.
> Anyone familiar with linear algebra will know that repeated matrix multiplication by non degenerate matrices reveals it's eigenvectors.
TIL that I'm not "familiar" with linear algebra ;)
But seriously, thanks for sharing that knowledge.
But which one ? The one with the largest eigenvalue among all eigenvectors not orthogonal to b.
https://en.wikipedia.org/wiki/Power_iteration
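For readers who, like the commenter above, haven't met this before, here is a minimal pure-Python sketch of power iteration. The 2x2 matrix, the step count and the function name are illustrative assumptions; `b` is the starting vector mentioned above:

```python
def power_iteration(A, b, steps=100):
    """Repeatedly multiply b by A, renormalizing to unit length each step."""
    for _ in range(steps):
        b = [sum(a * x for a, x in zip(row, b)) for row in A]
        norm = sum(x * x for x in b) ** 0.5
        b = [x / norm for x in b]
    # Rayleigh quotient b.(A b) estimates the matching eigenvalue.
    Ab = [sum(a * x for a, x in zip(row, b)) for row in A]
    eigenvalue = sum(x * y for x, y in zip(b, Ab))
    return eigenvalue, b

# Symmetric matrix with eigenvalues 3 (eigenvector [1, 1]) and 1 ([1, -1]).
A = [[2.0, 1.0], [1.0, 2.0]]
lam, v = power_iteration(A, [1.0, 0.0])
print(lam, v)  # converges to the dominant pair: 3 and [1, 1] / sqrt(2)
```

The start vector `[1.0, 0.0]` is not orthogonal to `[1, 1]`, so the iteration picks out the largest eigenvalue; starting exactly at `[1.0, -1.0]` would stay on the other eigenvector instead.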
My expression of gratitude was sincere.
Phone autocorrect always interferes and I get tired and lazy about correcting it back. It does get it right most of the time.
https://en.wikipedia.org/wiki/Infinite_divisibility_(probabi...
https://en.wikipedia.org/wiki/Stable_distribution
https://en.wikipedia.org/wiki/Generalized_extreme_value_dist...
Very useful for studying record values -- severest floods, world records of 100m sprints, world records of maximum rainfall in a day etc
> the “steadfast order of the universe” that eventually overcame any and all deviations from the bell.
I can’t believe the author wrote that without explaining why it’s called the bell curve.
I find the article spends a lot of time talking about repeating games without really getting to the meat of it.
If you throw a die a million times, the results still follow a uniform distribution.
It isn’t until you start summing random events that the normal distribution occurs.
https://galton.org/cgi-bin/searchImages/galton/search/books/...
He has several other related videos also.
https://www.youtube.com/@3blue1brown/search?query=convolutio...
https://en.wikipedia.org/wiki/Galton_board
at the (I think) Boston Science Museum when I was a kid. They have some pretty cool videos on Youtube if you're curious.
> Place a measuring cup in your backyard every time it rains and note the height of the water when it stops: Your data will conform to a bell curve.
That strikes me as unlikely, actually: that the amount of water to fall (per area) across rain showers ("when it stops") is normally distributed. Why would the author think that?
Also, not much of "the math that explains" the CLT in the article. The basic conditions are:
The samples you add together must be
- sufficiently independent
- sufficiently well-behaved in the sense of not having huge outliers (finite variance is good enough for this)
Not sure either condition holds for rainfall.
Unfortunately, many "researchers" blindly assume that many real life phenomena follow Gaussian, which they don't... So then their models are skewed
This is a tautology to the extreme.