I went through grad school in a very frequentist environment. We “learned” Bayesian methods but we never used them much.
In my professional life I’ve never personally worked on a problem that I felt wasn’t adequately approached with frequentist methods. I’m sure other people’s experiences are different depending on the problems you gravitate towards.
In fact, I tend to get pretty frustrated with Bayesian approaches, because when I do turn to them it tends to be in situations that are already quite complex and large. In basically every one of those instances I've been unable to make the Bayesian approach work: it won't converge, or the sampler says it will take days and days to run. I can almost always just resort to some resampling method that might take a few hours, but it runs and gives me sensible results.
I realize this is heavily biased by basically only attempting it on super-complex problems, but it has sort of soured me on even trying anymore.
To be clear I have no issue with Bayesian methods. Clearly they work well and many people use them with great success. But I just haven’t encountered anything in several decades of statistical work that I found really required Bayesian approaches, so I’ve really lost any motivation I had to experiment with it more.
I think Rafael Irizarry put it best over a decade ago -- while historically there was a feud between self-declared "frequentists" and "Bayesians", people doing statistics in the modern era aren't interested in playing sides, but use a combination of techniques originating in both camps: https://simplystatistics.org/posts/2014-10-13-as-an-applied-...
The author makes a comparison to Haskell, which I think might be a little misleading.
Haskell is a little more complicated to learn but also more expressive than other programming languages; this is where the comparison works.
But where it breaks down is safety. If your Haskell code runs, it's more likely to be correct because of all the type system goodness.
That's the reverse of the situation with Bayesian statistics, which is more like C++. It has all kinds of cool features, but they all come with superpowered footguns.
Frequentist statistics is more like Java. No one loves it but it allows you to get a lot of work done without having to track down one of the few people who really understand Haskell.
As a data scientist, I find applied Bayesian methods to be incredibly straightforward for most of the common problems we see, like A/B testing and online estimation of parameters. I dislike that people usually first introduce Bayesian methods theoretically, which can be a lot for beginners to wrap their heads around. Why not just start from the blissful elegance of updating your parameter's prior distribution with your observed data to magically get your parameter's estimate?
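As a concrete (made-up) illustration of that "update the prior, read off the estimate" workflow, here's roughly what a conjugate update looks like for a conversion rate in an A/B test; the prior and the counts are invented for illustration:

```python
# Minimal sketch of a conjugate Bayesian update for an A/B test.
# All numbers here are made up for illustration.
from scipy import stats

# Prior belief about the conversion rate: Beta(2, 20) says "probably a few percent".
prior_a, prior_b = 2, 20

# Observed data for one variant: 50 conversions out of 1000 visitors.
conversions, visitors = 50, 1000

# Conjugacy: Beta prior + Binomial likelihood -> Beta posterior.
post_a = prior_a + conversions
post_b = prior_b + (visitors - conversions)
posterior = stats.beta(post_a, post_b)

print(f"posterior mean: {posterior.mean():.4f}")
print(f"95% credible interval: {posterior.interval(0.95)}")
```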
This article made me enthusiastic to dive into Bayesian statistics (again). A quick search led me to Think Bayes [1], which also introduces the concepts using Python, and seems to have a little more depth.

[1] https://allendowney.github.io/ThinkBayes2/
Nice writeup. Something that clicked for me reading this is how much the prior/likelihood/posterior dynamic mirrors transfer learning in deep learning. The prior is basically your pre-trained weights: broad knowledge you bring to the table before seeing any task-specific data. The likelihood is your fine-tuning step. And the Bernstein-von Mises result at the end is essentially saying "with enough fine-tuning data, your pre-training washes out."
Obviously the analogy isn't perfect (priors are explicit and interpretable, pre-trained weights are not), but I think it's a useful mental model for anyone coming from an ML background who finds Bayesian stats unintuitive. Regularization being secretly Bayesian was the other thing that made it click for me. If you've ever tuned a Ridge regression lambda, you were doing informal prior selection.
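If you want to see the "washes out" part concretely, here's a toy sketch: coin flips under two deliberately opposite Beta priors, with all numbers invented for illustration. As the sample grows, the two posterior means end up in nearly the same place.

```python
# Sketch: two very different Beta priors converge to nearly the same posterior
# mean as the data grows -- the informal version of Bernstein-von Mises.
import numpy as np

rng = np.random.default_rng(0)
true_p = 0.3
priors = {"skeptical": (1, 9), "optimistic": (9, 1)}

for n in (10, 100, 10_000):
    flips = rng.binomial(1, true_p, size=n)
    heads = flips.sum()
    means = {
        name: (a + heads) / (a + b + n)   # Beta posterior mean
        for name, (a, b) in priors.items()
    }
    print(n, {k: round(v, 3) for k, v in means.items()})
```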
I think it would be interesting if frequentist stats could come up with more generative models. Today's high-level generative machine learning models all rely on Bayesian modeling.
Most ML practitioners use L1/L2 daily without realizing they're making Bayesian prior assumptions. Gaussian prior = Ridge, Laplace prior = Lasso. Once you see it that way, "choosing a regularization strength" is really "choosing how informative your prior is."
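A quick numerical sanity check of that correspondence, on synthetic data with no intercept, where alpha plays the role of noise variance over prior variance (the data and the alpha value are arbitrary):

```python
# Sketch: the MAP estimate under a Gaussian prior matches Ridge regression
# (no intercept; alpha = noise_variance / prior_variance). Synthetic data.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + rng.normal(scale=0.5, size=200)

alpha = 2.0  # plays the role of sigma^2 / tau^2

# Closed-form MAP / penalized least squares solution.
w_map = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# sklearn's Ridge minimizes ||y - Xw||^2 + alpha * ||w||^2 -- the same problem.
w_ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_

print(np.allclose(w_map, w_ridge))  # should print True
```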
Nicely done.
I have the same challenge with Bayesian stats and usually do not understand why there is such controversy. It isn’t a question of either/or, except in the minds of academics who rarely venture out into the real world, or have to balance intellectual purity with getting a job done.
In the very first example, a practitioner would consciously have to decide (i.e. make the assumption) whether the number of sides on the die (n) is known and deterministic. Once that decision is made, the framework with which observations are evaluated and statistical reasoning is applied will forever be conditional on that assumption, unless it is revised. Practitioners are generally OK with that, whether it leads to ‘Bayesian’ or ‘frequentist’ analysis, and move on.
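For what it's worth, here's a rough sketch of what the "n is unknown" branch of that decision looks like. I'm assuming the example amounts to inferring n from observed rolls; the candidate dice and the rolls below are made up for illustration:

```python
# Rough sketch of treating the number of sides n as unknown rather than fixed.
# Candidate dice and observed rolls are invented for illustration.
import numpy as np

candidates = np.array([4, 6, 8, 12, 20])                # hypotheses for n
prior = np.full(len(candidates), 1 / len(candidates))   # uniform prior over n
rolls = [3, 5, 6, 2, 6, 4]                              # observed faces

# Likelihood of each roll is 1/n if the face fits on the die, else 0.
likelihood = np.array([
    np.prod([1 / n if r <= n else 0.0 for r in rolls]) for n in candidates
])

posterior = prior * likelihood
posterior /= posterior.sum()

for n, p in zip(candidates, posterior):
    print(f"n = {n:2d}: posterior = {p:.3f}")
```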