I’d be interested in @Radford Neal’s take on this dialogue (context).
OK. My views now are not far from those of some time ago, expressed at https://glizen.com/radfordneal/res-bayes-ex.html
With regard to machine learning, for many problems of small to moderate size, some Bayesian methods, such as those based on neural networks or mixture models that I’ve worked on, are not just theoretically attractive, but also practically superior to the alternatives.
This is not the case for large-scale image or language models, for which any close approximation to true Bayesian inference is very difficult computationally.
However, I think Bayesian considerations have provided more insight than frequentism in this context. My results from 30 years ago showing that infinitely-wide neural networks with appropriate priors work well without overfitting have been a better guide to what works than the rather absurd discussions by some frequentist statisticians of that time about how one should test whether a network with three hidden units is sufficient, or whether instead the data justifies adding a fourth hidden unit. Though, as commented above, recent large-scale models are really more a success of empirical trial-and-error than of any statistical theory.
One can also look at Vapnik’s frequentist theory of structural risk minimization from around the same time period. This was widely seen as justifying the use of support vector machines (though as far as I can tell, there is no actual formal justification), which were once quite popular for practical applications. But SVMs are not so popular now, having perhaps been superseded by the mathematically-related Bayesian method of Gaussian process regression, whose use in ML was inspired by my work on infinitely-wide neural networks. (Other methods, like boosted decision trees, may also be more popular now.)
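For readers who haven't seen Gaussian process regression, here is a minimal sketch of the standard predictive equations with a squared-exponential covariance. This is my own toy illustration (made-up data and hyperparameters), not code from anything referenced above:

```python
# Minimal GP regression sketch: condition a joint Gaussian prior over
# function values on noisy training observations.
import numpy as np

def sq_exp_kernel(xa, xb, lengthscale=1.0, signal_sd=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    d = xa[:, None] - xb[None, :]
    return signal_sd**2 * np.exp(-0.5 * (d / lengthscale)**2)

rng = np.random.default_rng(2)
x_train = np.linspace(-3, 3, 20)
y_train = np.sin(x_train) + rng.normal(0, 0.1, size=x_train.shape)  # toy data
x_test  = np.linspace(-4, 4, 9)
noise_var = 0.1**2

# Standard GP posterior predictive for the latent function at x_test.
K    = sq_exp_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
K_s  = sq_exp_kernel(x_train, x_test)
K_ss = sq_exp_kernel(x_test, x_test)
alpha = np.linalg.solve(K, y_train)
post_mean = K_s.T @ alpha
post_cov  = K_ss - K_s.T @ np.linalg.solve(K, K_s)
print(np.round(post_mean, 2))
print(np.round(np.sqrt(np.diag(post_cov)), 2))   # pointwise predictive sd
```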
One reason that thinking about Bayesian methods can be fruitful is that they involve a feedback process:
1. Think about what model is appropriate for your problem, and what prior for its parameters is appropriate. These should capture your prior beliefs.
2. Gather data.
3. Figure out some computational method to get the posterior, and predictions based on it.
4. Check whether the posterior and/or predictions make sense, compared to your subjective posterior (informally combining prior and data). Perhaps also look at performance on a validation set, which is not necessary in Bayesian theory, but is a good idea in practice given human fallibility and computational limitations.
5. You can also try proving theoretical properties of the prior and/or posterior implied by (1), or of the computational method of step (3), and see whether they are what you were hoping for.
6. If the result doesn’t seem acceptable, go back to (1) and/or (3).
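As a toy illustration of this loop (my own sketch, not anything from the original discussion), here it is for a conjugate Normal model, where step 3 happens to be available in closed form; in realistic models that step would be MCMC or some other approximation:

```python
# Toy Bayesian workflow: prior -> data -> posterior -> checking.
# All numbers are made up for the sake of the example.
import numpy as np

rng = np.random.default_rng(1)

# Step 1: model y_i ~ N(mu, sigma^2) with sigma known, prior mu ~ N(mu0, tau0^2).
mu0, tau0 = 0.0, 10.0     # a fairly vague prior belief about mu
sigma = 2.0               # assumed known noise level

# Step 2: gather data (here, simulated), holding some back for validation.
y = rng.normal(3.0, sigma, size=50)
y_train, y_val = y[:40], y[40:]

# Step 3: closed-form posterior for mu given the training data.
n = len(y_train)
post_var  = 1.0 / (1.0 / tau0**2 + n / sigma**2)
post_mean = post_var * (mu0 / tau0**2 + y_train.sum() / sigma**2)

# Step 4: check the posterior and predictions against intuition and the
# held-out data (not required by Bayesian theory, but a useful sanity check).
pred_sd = np.sqrt(post_var + sigma**2)            # posterior predictive sd
val_z = (y_val - post_mean) / pred_sd             # standardized validation residuals
print(f"posterior for mu: mean={post_mean:.2f}, sd={np.sqrt(post_var):.2f}")
print(f"validation residuals (should look roughly standard normal): {np.round(val_z, 2)}")

# Steps 5-6: if the residuals looked badly calibrated, or the posterior
# contradicted beliefs we actually hold, we would revisit the model/prior
# (step 1) or the computational method (step 3).
```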
Prior beliefs are crucial here. There’s a tension between what works and what seems like the right prior. When these seem to conflict, you may gain a better understanding of why the original prior didn’t really capture your beliefs, or you may realize that your computational methods are inadequate.
So, for instance, infinitely wide neural networks with independent finite-variance priors on the weights converge to Gaussian processes, with no correlations between different outputs. This works reasonably well, but isn’t what many people were hoping and expecting—no “hidden features” learned about the input. And non-Bayesian neural networks sometimes perform better than the corresponding Gaussian process.
Solution: Don’t use finite-variance priors. As I recommended 30 years ago. With infinite-variance priors, the infinite-width limit is a non-Gaussian stable process, in which individual units can capture significant hidden features.
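To make the contrast concrete, here is a rough numerical sketch (my own, using the usual one-hidden-layer tanh setup, not code from the source): with hidden-to-output weights v_j ~ N(0, 1/H) the output is a sum of many small contributions and no single unit matters, while with Cauchy weights of scale 1/H a few units dominate and can act as hidden features.

```python
# Contrast finite-variance vs. heavy-tailed priors on the output weights
# of a wide one-hidden-layer tanh network.
import numpy as np

rng = np.random.default_rng(0)
H = 10_000                        # hidden units: a "wide" network
x = np.array([-1.0, 0.0, 1.0])    # a few input points

def sample_network(output_prior):
    """Draw one random one-hidden-layer tanh network, evaluated at x."""
    a = rng.normal(0.0, 1.0, size=H)            # input-to-hidden weights
    b = rng.normal(0.0, 1.0, size=H)            # hidden biases
    h = np.tanh(np.outer(a, x) + b[:, None])    # hidden activations, shape (H, len(x))
    v = output_prior(H)                         # hidden-to-output weights
    return v @ h, v

# Finite variance: v_j ~ N(0, 1/H).  Sums of many such terms are close to
# Gaussian, so the prior over functions approaches a Gaussian process and
# no individual hidden unit contributes much.
gauss_prior  = lambda H: rng.normal(0.0, 1.0 / np.sqrt(H), size=H)

# Infinite variance: Cauchy with scale 1/H.  The limit is a non-Gaussian
# stable process, and a handful of units with large weights dominate.
cauchy_prior = lambda H: rng.standard_cauchy(size=H) / H

for name, prior in [("Gaussian", gauss_prior), ("Cauchy", cauchy_prior)]:
    f, v = sample_network(prior)
    top_share = np.abs(v).max() / np.abs(v).sum()
    print(f"{name:8s} prior: f(x) = {np.round(f, 3)}, "
          f"largest unit's share of output weight = {top_share:.3f}")
```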