On Frequentism and Bayesian Dogma
I’ve heard that you believe that frequentism is correct. But that’s obviously wrong, so what gives?
I guess first of all I should ask, what do you mean by “frequentism”?
I mean classical statistical frequentism. Though somewhat tongue-in-cheek, as I don’t think it’s fully correct, I think it’s much more correct than orthodox Jaynesian Bayesianism.
Some scattered thoughts:
Bayes’ theorem derives from conditional probability so it’s also included in frequentism.
Bayesian epistemology only applies to situations when your beliefs are a probability distribution, and is thus incomplete.
It doesn’t account for e.g. limited computation.
Frequentism solves these things by framing the problem in a different way. Rather than ‘how should I think?’ it’s “this algorithm seems like a sensible way to think, let’s figure out what epistemic guarantees it has”.
In particular, it makes it OK to believe things that are not expressible as probability distributions.
I’m still sort of unsure what you mean by “classical statistical frequentism”. Like, I’m pretty sure I agree that the purported theorems of Fisher are in fact theorems. Do you mean something like “the way we should think about thinking is to ask ‘what cognitive algorithms perform well with high probability in the long run’”?
(and regarding Bayesianism, I think that’s a separate question that’s more productively talked about once I understand why you think frequentism is good)
Sure. Thank you for the clarification—I agree there are many fine theorems on both sides.
Statistics is the problem of learning from data, frequentism is saying “Here’s an algorithm that takes the data and computes something (that’s an estimator). Let’s study the properties of it.”
“the way we should think about thinking is to ask ‘what cognitive algorithms perform well with high probability in the long run’”?
Yeah, I agree with this. Frequentist theory attempts to do exactly that (with ‘performance’ usually meaning ‘have correct beliefs’), though I recognize that in practice it’s pretty hard.
Here’s an algorithm that takes the data and computes something (that’s an estimator). Let’s study the properties of it.
I first want to note that this is an exhortation and not a proposition. Regarding the implicit exhortation of “it’s good to understand estimators”, I guess I agree? I think my crux is something like “but it’s relevant that some worlds are a priori more likely than others, and you want to do better in those” (and of course now we’re in the territory where we need to argue about Bayesianism).
I first want to note that this is an exhortation and not a proposition
Sure. I mean that the examples of knowledge frequentism creates (theorems and such) are things derived from this exhortation. E.g. “consider the algorithm of taking the sample mean. What does that say about the true population mean?” is an example of a very classical frequentism question with a useful answer.
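A minimal sketch of that classical question, with the Gaussian data, sample size, and helper name being my own illustrative choices: simulate repeated sampling and check how often a CLT-based 95% interval around the sample mean covers the true population mean.

```python
import random
import statistics

random.seed(0)

def mean_ci95(sample):
    """95% confidence interval for the population mean,
    based on the sample mean and the CLT approximation."""
    n = len(sample)
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / n ** 0.5
    return m - 1.96 * se, m + 1.96 * se

# The frequentist guarantee: over repeated sampling, the interval
# covers the true mean ~95% of the time, whatever the truth is.
true_mean = 3.0
trials = 2000
covered = 0
for _ in range(trials):
    sample = [random.gauss(true_mean, 2.0) for _ in range(50)]
    lo, hi = mean_ci95(sample)
    if lo <= true_mean <= hi:
        covered += 1
print(covered / trials)  # close to 0.95
```

The point is the shape of the guarantee: it is a property of the procedure over repeated samples, not a claim about any single interval.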
but it’s relevant that some worlds are a priori more likely than others
Sure, you can analyze whether your estimator does well in this way!
and of course now we’re in the territory where we need to argue about Bayesianism).
Why do you say that?
I guess you’re referencing the part where Frequentism is like “propositions are only true or false, you can’t believe in probabilities”.
Fine. But you can have the ‘probabilities’ be numbers in your estimator-algorithm. It is true that, at the end of the day, propositions are either true or false.
In fact outputting the ‘Bayesian probability’ is an estimator with good properties for estimating truth/falsehood of a proposition, with Brier loss or whatever. So that’s a draw for freq vs Bayes.
I guess I think of ‘Frequentism’ as definitely believing in probabilities—just probabilities of drawing different samples, rather than a priori probabilities, or probabilities of ground truths given a sample. So I feel that the question is which type of probability is more important, or more relevant. (Like, I certainly agree that “understand what sort of algorithms do well in probabilistic settings” is the right way to think about cognition, and you don’t have to over-reify the features of those cognitive algorithms!)
Another potential question could be “how valuable is it for cognitive algorithms to not be constrained by having their main internal representations be probability distributions over possible worlds”.
Basically you presented this as a hot take, and I’m trying to figure out where you expect to disagree with people.
Another possible question: how valuable is the work produced by frequentist statisticians?
So I feel that the question is which type of probability is more important, or more relevant.
I’m not sure I agree that this is the important question. Or rather it is, but I would answer it pragmatically: what sort of approaches to epistemology does focusing on each type of probability produce, what questions and answers does it lead you to produce? And I think here frequentism wins. This ties neatly with:
how valuable is the work produced by frequentist statisticians?
Historically pretty valuable! It’s good to understand the guarantees of e.g. the ‘sample mean’ estimator. Bandit algorithms are also a glorious frequentist achievement, as argued by Steinhardt. The bootstrap, a way to figure out your estimator’s uncertainty without assuming much about the data (no Bayesian prior distribution, for one), is also great.
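For illustration, a hedged sketch of the nonparametric bootstrap (the exponential data, the median as the estimator, and the helper name are my own choices, not from the dialogue): resample the data with replacement, recompute the estimator each time, and read off the spread of the replicates.

```python
import random
import statistics

random.seed(1)

def bootstrap_se(data, estimator, n_boot=2000):
    """Nonparametric bootstrap standard error: resample with
    replacement, recompute the estimator, take the replicates' spread."""
    replicates = []
    for _ in range(n_boot):
        resample = random.choices(data, k=len(data))
        replicates.append(estimator(resample))
    return statistics.stdev(replicates)

# Uncertainty of the sample median, with no prior and no
# parametric model of the data-generating distribution.
data = [random.expovariate(1.0) for _ in range(100)]
print(bootstrap_se(data, statistics.median))
```

Nothing here assumes a likelihood or a prior; the only input is the algorithm (the estimator) and the observed sample.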
But I think the theoretical pickings are pretty slim at this point—cool stuff, but it’s unlikely that there’ll be something as fundamental as the sample mean.
The field to which statistics is now relevant is machine learning, and here I think frequentists have won an absolute victory: all the neural networks are probabilistic, but Bayesian ML needs way more computation than mainstream ML for the same or worse results.
And IMO this is because of an overreliance on “the theory says the algorithm will work if done this way, therefore we’re going to do it this way” versus a willingness to experiment with various algorithms (i.e. estimators) without quite understanding why they work, and seeing which one works.
how valuable is it for cognitive algorithms to not be constrained by having their main internal representations be probability distributions over possible worlds
I think this is very valuable as exemplified by Bayesian vs mainstream ML.
OK, I’m getting a better sense of where our disagreements may lie.
I agree that the historical record of frequentist statistics is pretty decent. I am somewhat more enthusiastic about more “Bayesian” approaches to bandits, e.g. Thompson sampling, than it sounds like you are, but this might just be tribalism—and if I think about the learning algorithm products of the AI alignment community that I’m excited about (logical induction, boundedly rational inductive agents), they look more frequentist than Bayesian.
I think my real gripe is that I see this massive impact of frequentism on the scientific method as promoting the use of p-values and confidence intervals, which, IMO, are using conditional probabilities in the wrong direction (one way to tell this: ask any normal scientist what a p-value or a confidence interval is, and there’s a high chance that they’ll give an explanation of what the Bayesian equivalent would be).
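One way to make the ‘wrong direction’ point concrete is a small simulation (the 80% base rate of true nulls, the effect size, and the sample size below are made-up illustrative numbers): among results with p < 0.05, the probability that the null is nonetheless true can sit far above 5%.

```python
import math
import random
import statistics

random.seed(2)

def one_sided_p(sample):
    """Approximate p-value for H0: true mean <= 0 (z-test, estimated sd)."""
    n = len(sample)
    z = statistics.mean(sample) / (statistics.stdev(sample) / n ** 0.5)
    # P(Z >= z) for a standard normal, via the complementary error function
    return 0.5 * math.erfc(z / math.sqrt(2))

# Illustrative world: 80% of tested hypotheses are truly null.
null_and_significant = 0
significant = 0
for _ in range(20000):
    is_null = random.random() < 0.8
    effect = 0.0 if is_null else 0.5
    sample = [random.gauss(effect, 1.0) for _ in range(20)]
    if one_sided_p(sample) < 0.05:
        significant += 1
        null_and_significant += is_null
# Among 'significant' results, the null is still true far more
# often than 5% of the time: p-values bound P(data | H0), not P(H0 | data).
print(null_and_significant / significant)
```

This is exactly the conflation the text describes: the p-value conditions on the hypothesis, while the quantity scientists tend to want conditions on the data.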
Now, I think it’s sort of fair enough to say “but that’s not what Ronald Fisher would do” or “people would and do misuse Bayesian methods too”, and all of these are right (as a side-note, I’ve noticed that when people introduce Bayes theorem in arguments about religion they’re typically about to say something unhinged), but it’s notable to me that tons of people seem to want the Bayesian thing.
---
Regarding Bayesian vs standard machine learning: on the one hand, I share your impression that the Bayesian methods are terrible and don’t work, and that empiricism / tight feedback loops are important for making progress. On the other hand, as far as I can tell, the ML community is on track to build things that kill me and everyone I care about and also everyone else, and I kind of chalk this up to them not understanding enough about the generalization properties of their algorithms. So I actually don’t take this as the win for frequentism that it looks like.
it’s notable to me that tons of people seem to want the Bayesian thing.
I agree that Bayesian statistics are more intuitive than p-values. It’s sad in my opinion that you need to assume prior probabilities about your hypotheses to get the Bayesian-style p(hypothesis | data), which is what we all love. But the math comes out that way.
Maybe also log-likelihood ratios would be better to report in papers (you can add them up!), but then people would add up log-likelihood ratios for slightly different hypotheses and convince themselves that they’re valid (it can be valid, but it’s unclear what assumptions you need for that), and it would be a huge mess. That’s not your strongest point though.
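The ‘you can add them up’ property is easy to check in the clean case of one fixed hypothesis pair and independent data (the coin-flip model and the probabilities below are illustrative):

```python
import math

def log_lr(data, p_h1, p_h0):
    """Log-likelihood ratio of H1 vs H0 for i.i.d. binary observations."""
    return sum(math.log((p_h1 if x else 1 - p_h1) /
                        (p_h0 if x else 1 - p_h0)) for x in data)

experiment_a = [1, 1, 0, 1]
experiment_b = [0, 1, 1]

# For the same H1/H0 pair, independent experiments' log-LRs add up:
combined = log_lr(experiment_a + experiment_b, 0.7, 0.5)
separate = log_lr(experiment_a, 0.7, 0.5) + log_lr(experiment_b, 0.7, 0.5)
print(combined, separate)
```

The mess the text worries about begins when the summed ratios come from slightly different hypothesis pairs, where this identity no longer holds automatically.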
On the other hand, as far as I can tell, the ML community is on track to build things that kill me and everyone I care about and also everyone else
Now we’re talking!
I kind of chalk this up to them not understanding enough about the generalization properties of their algorithms
Fair, but that doesn’t mean you can chalk it up to frequentism. I don’t think the Bayesian approach (here I very much mean the actual Bayesian ML community[1]) is any better at this. They work kind of backwards: instead of fitting their theory to observable data about experiments, they assume Bayesian theory and kind of shrug when the experiments don’t work out. IMO the right way to understand generalization is to have a theory, and then change it when the experiment contradicts it.
Part of the reason this is justifiable to the Bayesian ML folks is that the experiments aren’t quite about the Bayesian theoretical ideal, they’re about practical algorithms. My position here is that I would like my theories to talk about the actual things people do. I am wary of theorems about asymptotics for the same reason: technically they don’t talk about what happens in finite time.
In my opinion we should set aside the culture of this particular academic sub-field, and talk about how good the best possible Bayesian approach to understanding ML generalization would be. Two versions of this:
Understand existing algorithms. I claim that the Bayesian fixation on having the only valid beliefs be well-specified probability distributions, and the lack of claims about what happens in any finite time, would make it impossible for them to make progress. Though maybe the dev-interp people will succeed (I doubt it, but we’ll see; and they’re studying MCMC’s behavior in practice, so not quite Bayesian).
Create Bayesian algorithms that are therefore well understood. This is the holy grail of Bayesian ML, but I don’t think this will happen. Maintaining beliefs as probabilities that are always internally self-consistent is expensive and not always necessary, and also IMO not all beliefs are representable as probability distributions (radical uncertainty). Also you need a better understanding of good reasoning under finite computation which, as you wrote above, is more frequentist. (I agree with this point, and I think it’s frequentist because frequentism is about analyzing estimators).
[1] Examples of people who made this error: myself from 6 years ago, myself from 3 years ago. I would argue many of my grad student peers and professors made (and still make) the same mistake. Yes, this formative experience is an important contributor to the existence of this dialogue.
Part of the reason this is justifiable to the Bayesian ML folks is that the experiments aren’t quite about Bayesian theoretical ideal, they’re about practical algorithms. My position here is that I would like my theories to talk about the actual things people do.
I think this suggests a place where I have some tension with your view: while I certainly agree that theories should be about the things people actually do, and that Bayesianism can fall short on this score, I also want theories to meaningfully guide what people do! Cognitive algorithms can be better and worse, and we should use (and analyze) the better ones, rather than the worse ones. One way of implementing this could be “try a bunch of cognitive algorithms and see what works”, but once your algorithms include “play nice while you’re being tested then take over the world”, empiricism isn’t enough: we either need theory to guide us away from those algorithms, or we need to investigate the internals of the algorithms that we try, and make sure they comply with certain standards that rule out treacherous turn behaviour.
Now, this theory of what algorithms should look like or what they should have in their internals doesn’t have to be Bayesianism—in fact, it probably doesn’t work for it to be Bayesianism, because to understand a Bayesian you need to understand their event space, which could be weird and inscrutable. But once you’ve got such a theory, I think you’re at least outside of the domain of “mere frequentism” (altho I admit that in some sense any time you think about how an algorithm works in a probabilistic setting you’re in some sense a frequentist).
As a side note:
Also you need a better understanding of good reasoning under finite computation which, as you wrote above, is more frequentist.
This might be an annoying definitional thing, but I don’t think good reasoning under finite computation has to be ‘frequentist’. As an extreme example, I wouldn’t call Bayes net algorithms frequentist, even tho with finite size they run in finite time. I call logical induction and boundedly rational inductive agents ‘frequentist’ because they fall into the family of “have a ton of ‘experts’ and play them off against each other” (and crucially, don’t constrain those experts to be ‘rational’ according to some a priori theory of good reasoning).
Good point. True Bayesian algos are only finite if the world is finite though; and the world is too large to count as finite for the purposes of a competent AGI. I should have said “with computation bounded under what the requirements of the world are”, or something similar but less unwieldy.
Now, this theory of what algorithms should look like or what they should have in their internals doesn’t have to be Bayesianism—in fact, it probably doesn’t work for it to be Bayesianism, because to understand a Bayesian you need to understand their event space, which could be weird and inscrutable. But once you’ve got such a theory, I think you’re at least outside of the domain of “mere frequentism” (altho I admit that in some sense any time you think about how an algorithm works in a probabilistic setting you’re in some sense a frequentist).
I agree with all of this. I call this “Bayesianism is wrong and frequentism is correct”, maybe I shouldn’t call it that?
Well, I was more thinking of Bayesianism as being insufficient for purpose, rather than necessarily “wrong” here.
I feel like we’ve transformed the initial dispute into a new, clearer, and more exciting dispute. Perhaps this is a good place to stop?
I’m not sure we agree on what the new dispute is, I’d like to explore that! But perhaps the place for that is another dialogue.
I would say Bayesianism is wrong like Newtonian mechanics is wrong. It’s a very good approximation of reality for some domains (in Newtonian mechanics’ case, macroscopic objects at low energy scales, in Bayesian statistics’ case, epistemic problems with at most ~millions of possible outcomes).
The frequentist frame I presented here (let’s analyze some actual algorithms) is IMO more likely to point at the kind of thing we want out of a theory of epistemology. But I guess classical frequentist methods are also not close to solving alignment, and also didn’t accurately predict that deep NNs would work so well (they have so many parameters, you’re going to overfit!)
So maybe frequentism is wrong in the same way. But I think the shift from “the theory is done and should guide algorithms” to “the theory should explain what’s going on in actual algorithms” is important.
Maybe we should write a consensus statement to conclude?
I guess we have a few disagreements left...
I would say Bayesianism is wrong like Newtonian mechanics is wrong. It’s a very good approximation of reality for some domains
I wouldn’t think about Bayesianism this way—I’d say that Bayesianism is the best you can do when you’re not subject to computational / provability / self-reflection limitations, and when you are subject to those limitations, you should think about how you can get what’s good about Bayesianism for less of the cost.
But I think the shift from “the theory is done and should guide algorithms” to “the theory should explain what’s going on in actual algorithms” is important.
This still feels incomplete to me for reasons described in my earlier comment: Yes, it’s bad to be dogmatic about theories that aren’t quite right, and yes, theories have got to describe reality somehow, but also, theories should guide you into doing good things rather than bad things!
How about this as a consensus statement?
Frequentism has the virtue of describing the performance of algorithms that are possible to run, without being overly dogmatic about what algorithms must look like. By contrast, Bayesianism is only strictly applicable in cases where computation is not limited, and its insistence on limiting focus to algorithms that carry around probability distributions that they update using likelihood ratios is overly limiting. In future, we need to develop ways of thinking about cognitive algorithms that describe real algorithms that can actually be run, while also providing useful guidance.
I’d say that Bayesianism is the best you can do when you’re not subject to computational / provability / self-reflection limitations,
I disagree with this, by the way. Even under these assumptions, you still have the problem of handling belief states which cannot be described as a probability distribution. For small state spaces, being fast and loose with that (e.g. just believing the uniform distribution over everything) is fine, but larger state spaces run into problems, even if you have infinite compute and can prove everything and don’t need to have self-knowledge.
I endorse the consensus statement you wrote!
And perhaps a remaining point of dispute is: how important is it to have non-probabilistic beliefs?
Sure, I’m happy to leave it at that. Thank you for being a thoughtful dialogue partner!
Thanks for the fun chat :)
What do you think about Wald’s complete class theorems and other similar decision-theoretic results that say that, under a fixed frequentist setting, the set of admissible algorithms coincides (barring messes with infinities) with the set of Bayesian procedures as all possible priors are considered? In other words, if you think it makes sense to strive for the “best” procedure in a context, for any fixed even if unknown definition of what’s best, and you have a frequentist procedure you think is statistically good, then there must be a corresponding Bayesian prior.
(This is an argument I’d always like to see addressed as basic disclaimer in frequentist vs bayesian discussions, I think it helps a lot to put down the framework people are reasoning under, e.g., if it’s more practical vs. theoretical.)
My own opinion on the topic (I’m pro-Bayes):
Many standard frequentist things can be derived as easily or more easily in a Bayesian way; that they are conventionally considered frequentist is an irrelevant accident of history.
In tree methods, the frequentist version comes first, but the Bayesian version, when it arrives, is better, and usable in practice.
Practically all real Bayesian methods are not purely Bayesian, there are many ad hockeries. The point is using Bayes as a guide. Even with an algorithm pulled out of the hat, it’s useful to know if it has a Bayesian interpretation, because it makes it clearer.
ML is frequentist only in the sense of trying algorithms without set rules, I don’t think that should be counted as frequentist success! It’s too generic. I have the impression the mindset of the people working in ML that know their shit is closer to Bayesian, but I am not confident in this since it’s an indirect impression. Example: information theoretic stuff is more natural with Bayes.
I’m a little surprised this didn’t come up earlier. As I mentioned to Adrià, I think the thing Bayesianism is about is more “how to think about epistemology” (where complaints like “but not everything is a probability distribution! How do you account for conjectures?” live) and the fact that the main frequentist tool used in science is totally misused and misunderstood seems to me like it’s a pretty good argument in favor of “you should be thinking like a Bayesian.”
Like, if the thing with frequentism is “yeah just use methods in a pragmatic way and don’t think about it that hard” it’s not really a surprise that people didn’t think about things that hard and this leads to widespread confusion and mistakes.
I think this does not accurately represent my beliefs. It is about thinking hard about how the methods actually behave, as opposed to having a theory that prescribes how methods should behave and then constructing algorithms based on that.
Frequentists analyze the properties of an algorithm that takes data as input (in their jargon, an ‘estimator’).
They also try to construct better algorithms, but each new algorithm is bespoke and requires original thinking, as opposed to Bayes which says “you should compute the posterior probability”, which makes it very easy to construct algorithms. (This is a drawback of the frequentist approach—algorithm construction is not automatic. But the finite-computation Bayesian algorithms have very few guarantees anyways so I don’t think we should count it against them too much).
I think having rando social scientists using likelihood ratios would also lead to mistakes and such.
What sort of problems?
In short, the probability distribution you choose contains lots of interesting assumptions about which states are more likely that you didn’t necessarily intend. As a result most of the possible hypotheses have vanishingly small prior probability and you can never reach them, even though a frequentist approach would be free to consider them.
For example, let us consider trying to learn a function with 1-dim numerical input and output (e.g. R→R). Correspondingly, your hypothesis space is the set of all such functions. There are very many functions (infinitely many for R→R, and a crazily large number even after discretization).
You could use the Solomonoff prior (on a discretized version of this), but that way lies madness. It’s uncomputable, and most of the functions that fit the data may contain agents that try to get you to do their bidding, all sorts of problems.
What other prior probability distribution can we place on the hypothesis space? The obvious choice in 2023 is a neural network with random weights. OK, let’s think about that. What architecture? The most sensible thing is to randomize over architectures somehow. Let’s hope the distribution on architectures is as simple as possible.
How wide, how deep? You don’t want to choose an arbitrary distribution or (god forbid) an arbitrary number, so let’s make it infinitely wide and deep! It turns out that an infinitely wide network just collapses to a random process without any internal features. What about an infinitely deep one? That collapses to a stationary distribution which doesn’t depend on the input. Oops.
Okay, let’s give up and place some arbitrary distribution (e.g. geometric distribution) on the width.
What about the prior on weights? uh idk, zero-mean identity covariance Gaussian? Our best evidence says that this sucks.
At this point you’ve made so many choices, which have to be informed by what empirically works well, that it’s a strange Bayesian reasoner you end up with. And you haven’t even finished specifying your prior distribution yet.
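To make the ‘so many choices’ point concrete, here is a sketch that samples functions from a ‘random NN’ prior (the architecture, tanh activation, and weight scales below are arbitrary illustrative choices): changing just the weight scale already changes the implied prior over functions.

```python
import math
import random

random.seed(3)

def random_mlp(width, depth, weight_scale):
    """Sample one function R -> R from a 'random NN' prior:
    fixed architecture, i.i.d. zero-mean Gaussian weights, no biases."""
    layers = []
    fan_in = 1
    for _ in range(depth):
        layers.append([[random.gauss(0, weight_scale / math.sqrt(fan_in))
                        for _ in range(fan_in)] for _ in range(width)])
        fan_in = width
    out = [random.gauss(0, weight_scale / math.sqrt(fan_in)) for _ in range(fan_in)]

    def f(x):
        h = [x]
        for layer in layers:
            h = [math.tanh(sum(w * v for w, v in zip(row, h))) for row in layer]
        return sum(w * v for w, v in zip(out, h))
    return f

# Two 'obvious' priors over functions, differing only in weight scale,
# produce visibly different samples -- each choice smuggles in assumptions.
for scale in (0.5, 3.0):
    f = random_mlp(width=16, depth=4, weight_scale=scale)
    print([round(f(x), 2) for x in (-1.0, 0.0, 1.0)])
```

Every argument to `random_mlp` is a modeling decision the Bayesian has to defend, before any data has been seen.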
This seems false if you’re interacting with a computable universe, and don’t need to model yourself or copies of yourself. Computability of the prior also seems irrelevant if I have infinite compute. Therefore in this prediction task, I don’t see the problem in just using the first thing you mentioned.
Reasonable people disagree. Why should I care about the “limit of large data” instead of finite-data performance?
Logical/mathematical beliefs — e.g. “Is Fermat’s Last Theorem true?”
Meta-beliefs — e.g. “Do I believe that I will die one day?”
Beliefs about the outcome space itself — e.g. “Am I conflating these two outcomes?”
Indexical beliefs — e.g. “Am I the left clone or the right clone?”
Irrational beliefs — e.g. conjunction fallacy.
etc.
Of course, you can describe anything with some probability distribution, but these are cases where the standard Bayesian approach to modelling belief-states needs to be amended somewhat.
1-4 seem to go away if I don’t care about self-knowledge and have infinite compute. 5 doesn’t seem like a problem to me. If there is a best reasoning system, it should not make mistakes. Showing that a system can’t make mistakes may show that it’s not what humans use, but it should not be classified as a problem.
I think I’m mostly confused about how both Daniel and Adria are using the terms bayesian and frequentist. Like, I thought the difference between frequentist and bayesian interpretations of probability theory is that bayesian interpretations say the probability is in your head, while frequentist interpretations say the probability is in the world.
In that sense, showing that the kinds of methods motivated by frequentist considerations can give you insight into algorithms usefulness is maybe a little bit of evidence that probabilities actually exist in some objective sense. But it doesn’t seem to trump the “but that just sounds really absurd to me though” consideration.
In particular, logical induction and boundedly rational inductive agents were given as examples of frequentist methods by Daniel. The first at least seems pretty subjectivist to me; wouldn’t a frequentist think that logical statements, being fully deterministic, should only have probabilities of 1 or 0? Every time I type 1+1 into my calculator I always get 2! The second seems relatively unrelated to the question, though I know less about it.
First, “probability is in the world” is an oversimplification. Quoting from Wikipedia, “probabilities are discussed only when dealing with well-defined random experiments”. Since most things in the world are not well-defined random experiments, probability is reduced to a theoretical tool for analyzing things that works when real processes are similar enough to well-defined random experiments.
Is there anything that could trump that consideration? One of my main objections to Bayesianism is that it prescribes that ideal agent’s beliefs must be probability distributions, which sounds even more absurd to me.
Estimators in frequentism have ‘subjective beliefs’, in the sense that their output/recommendations depends on the evidence they’ve seen (i.e., the particular sample that’s input into it). The objectivity of frequentist methods is aspirational: the ‘goodness’ of an estimator is decided by how good it is in all possible worlds. (Often the estimator which is best in the least convenient world is preferred, but sometimes that isn’t known or doesn’t exist. Different estimators will be better in some worlds than others, and tough choices must be made, for which the theory mostly just gives up. See e.g. “Evaluating estimators”, Section 7.3 of “Statistical Inference” by Casella and Berger).
Indeed, in reality logical statements are either true or false, and thus their probabilities are either 1 or 0. But the estimator-algorithm is free to assign whatever belief it wants to it.
I agree that logical induction is very much Bayesianism-inspired, precisely because it wants to assign weights from zero to 1 that are as self-consistent as possible (i.e. basically probabilities) to statements. But it is frequentist in the sense that it’s examining “unconditional” properties of the algorithm, as opposed to properties assuming the prior distribution is true. (It can’t do the latter because, as you point out, the prior probability of logical statements is just 0 or 1).
But also, assigning probabilities of 0 or 1 to things is not exclusively a Bayesian thing. You could think of a predictor that outputs numbers between 0 and 1 as an estimator of whether a statement will be true or false. If you were to evaluate this estimator you could choose, say, mean-squared error. The best estimator is the one with the least MSE. And indeed, that’s how probabilistic forecasts are typically evaluated.
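A sketch of that evaluation scheme (the 0.7 base rate and the candidate constant forecasters are illustrative numbers of mine): score probability outputs against 0/1 outcomes with mean-squared error, i.e. the Brier score.

```python
import random

random.seed(4)

def brier(forecast, outcomes):
    """Mean squared error between a probability forecast and 0/1 outcomes."""
    return sum((forecast - y) ** 2 for y in outcomes) / len(outcomes)

# Events are true with probability 0.7; compare constant forecasters.
outcomes = [random.random() < 0.7 for _ in range(10000)]
scores = {p: brier(p, outcomes) for p in (0.0, 0.5, 0.7, 1.0)}
best = min(scores, key=scores.get)
print(best)  # the calibrated forecast, 0.7, gets the lowest score
```

The outcomes themselves are all 0 or 1; it is the estimator that is free to output intermediate numbers, and the frequentist evaluation rewards it for doing so.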
Daniel states he considers these frequentist because:
and I think indeed not prescribing that things must think in probabilities is more of a frequentist thing. I’m not sure I’d call them decidedly frequentist (logical induction is very much a different beast than classical statistics) but they’re not in the other camp either.
From one viewpoint, I think this objection is satisfactorily answered by Cox’s theorem—do you find it unsatisfactory (and if so, why)?
Let me focus on another angle though, namely the “absurdity” and gut level feelings of probabilities.
So, my gut feels quite good about probabilities. Like, I am uncertain about various things (read: basically everything), but this uncertainty comes in degrees: I can compare and possibly even quantify my uncertainties. I feel like some people get stuck on the numeric probabilities part (one example I recently ran into was this quote from Section III of this essay by Scott, “Does anyone actually consistently use numerical probabilities in everyday situations of uncertainty?”). Not sure if this is relevant here, but at the risk of going off on a tangent, here’s a way of thinking about probabilities I’ve found clarifying and which I haven’t seen elsewhere:
The correspondence
beliefs <-> probabilities
is of the same type as
temperature <-> Celsius-degrees.
Like, people have feelings of warmth and temperature. These come in degrees: sometimes it’s hotter than some other times, now it is a lot warmer than yesterday and so on. And sure, people don’t have a built-in thermometer mapping these feelings to Celsius-degrees, they don’t naturally think of temperature in numeric degrees, they frequently make errors in translating between intuitive feelings and quantitative formulations (though less so with more experience). Heck, the Celsius scale is only a few hundred years old! Still, Celsius degrees feel like the correct way of thinking about temperature.
And the same with beliefs and uncertainty. These come in degrees: sometimes you are more confident than some other times, now you are way more confident than yesterday and so on. And sure, people don’t have a built-in probabilitymeter mapping these feelings to percentages, they don’t naturally think of confidence in numeric degrees, they frequently make errors in translating between intuitive feelings and quantitative formulations (though less so with more experience). Heck, the probability scale is only a few hundred years old! Still, probabilities feel like the correct way of thinking about uncertainty.
From this perspective probabilities feel completely natural to me—or at least as natural as Celsius-degrees feel. Especially questions like “does anyone actually consistently use numerical probabilities in everyday situations of uncertainty?” seem to miss the point, in the same way that “does anyone actually consistently use numerical degrees in everyday situations of temperature?” seems to miss the point of the Celsius scale. And I have no gut level objections to the claim that an ideal agent’s beliefs correspond to probabilities, just as warmth corresponds to Celsius-degrees.
I do not understand why neural nets are touted here as a success of frequentism. They don’t seem like a success of any statistical theory to me. Maybe I don’t know my neural network history all that well, or my philosophy of frequentism, but I do know a thing or two about regular statistical learning theory, and it definitely didn’t predict that neural networks and the scaling paradigm would work.
I just remembered the main way in which NNs are frequentist. They belong to a very illustrious family of frequentist estimators: the maximum likelihood estimators.
Think about it: NNs have a bunch of parameters. Their loss is basically always −log p(y|x,θ) (e.g. mean-squared error for Gaussian p, cross-entropy for categorical p). They get trained by minimizing the loss (i.e. maximizing the likelihood).
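To make that correspondence concrete, here’s a minimal numpy sketch (my illustration, not from the thread): with a fixed noise scale, the Gaussian negative log-likelihood and mean-squared error differ only by a constant, so minimizing one minimizes the other; the cross-entropy/categorical case works the same way.

```python
import numpy as np

def gaussian_nll(y, y_hat, sigma=1.0):
    # per-sample negative log-likelihood of y under N(y_hat, sigma^2)
    return 0.5 * np.log(2 * np.pi * sigma**2) + (y - y_hat)**2 / (2 * sigma**2)

def mse(y, y_hat):
    # per-sample squared error
    return (y - y_hat)**2

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.1, 1.9, 3.2])

# With sigma fixed, NLL = MSE/2 + constant, so the two losses share a minimizer
diff = gaussian_nll(y, y_hat) - 0.5 * mse(y, y_hat)
print(np.allclose(diff, 0.5 * np.log(2 * np.pi)))
```

So “train by minimizing MSE” and “maximize a Gaussian likelihood” pick out the same parameters.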
In classical frequentist analysis they’re likely to be a terrible, overfitted estimator, because they have many parameters. And I think this is true if you find the actually maximizing parameters θ* = argmax_θ log p(y|x,θ).
But SGD is kind of a shitty optimizer. It turns out the two mistakes cancel out, and NNs are very effective.
I don’t think I understand your model of why neural networks are so effective. It sounds like you’re saying that, on the one hand, neural networks have lots of parameters, so you should expect them to be terrible, but on the other hand SGD is such a shitty optimizer that it acts as an implicit regularizer, so they are actually very good.
Coming from the perspective of singular learning theory, neural networks work because SGD weights solutions by their parameter volume, which is dominated by low-complexity singularities, and is close enough to a Bayesian posterior that the process can be modeled well from that frame.
This theory is very Bayes-law inspired, though I don’t tout neural networks as evidence in favor of Bayesianism, since the question seems not very related. Maybe the pioneers of the field had some deep frequentist-motivated intuitions about neural networks, but my impression is that they were mostly just motivated by looking at the brain at first, and later on by following trend-lines, and in fact paid little attention to theoretical or philosophical concerns (though not zero; people talked much about connectionism, which I would guess correlated with being a frequentist, though only modestly, and maybe success correlated more with just not caring all that much).
There may be a synthesis position here where you claim that SGD weighting solutions by their size in the weight space is in fact what you mean by SGD being an implicit regularizer. In such a case, I claim this is just sneaking in Bayes’ rule without calling it by name, and that this is not a very smart thing to do, because the Bayesian frame gives you a bunch more leverage for analyzing the system[1]. I actually think I remember a theorem showing that all MLE + regularizer learners are doing some kind of Bayesian learning, though I could be mistaken, and I don’t believe this is a crux for me here.
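If that half-remembered theorem is the standard one, it’s the MAP correspondence: an L2 regularizer is exactly a Gaussian negative log-prior, so ridge regression is posterior-mode estimation. A small numpy sketch of this (my illustration, with made-up data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
lam = 2.0  # L2 regularization strength; plays the role of the prior precision

# Penalized MLE: argmin_w (1/2)||y - Xw||^2 + (lam/2)||w||^2, in closed form
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

def neg_log_posterior_grad(w):
    # gradient of the Gaussian NLL plus the negative log of a N(0, I/lam) prior
    return X.T @ (X @ w - y) + lam * w

# The ridge solution is a stationary point of the log posterior: it is the MAP estimate
print(np.allclose(neg_log_posterior_grad(w_ridge), 0))
```

The same identification works for any regularizer that can be written as a negative log-density.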
If our models end up different, I think there’s a bunch of things which you end up being utterly confused by in deep learning, which I’m not[2].
At the same time, I repeat that to me this doesn’t seem that relevant to the true question.
In the sense that, though I’m confused about lots of the technical details, I would know exactly which books or math or people I should consult to no longer be confused.
I disagree. An inductive bias is not necessarily a prior distribution. What’s the prior?
From another comment of mine:
Also, side-comment: Thanks for the discussion! It’s fun.
EDIT: Actually, there should be a term for the stochasticity which you integrate into the SLT equations, like you would temperature in a physical system. I don’t remember exactly how this works though, or whether the exact connection with SGD is even known.
Yeah, that’s basically my model. How it regularizes I don’t know. Perhaps the volume of “simple” functions is the main driver of this, rather than gradient descent dynamics. I think the randomness of it is important; full-gradient descent (no stochasticity) would not work nearly as well.
Oh, this reminded me of the temperature component of SLT, which I believe modulates how sharply one should sample from the Bayesian posterior, or perhaps how heavily to update on new evidence; I forget. In any case, it does this to try to capture the stochasticity component of SGD. It’s still an open problem to show how successfully it does so, I believe.
OK, let’s look through the papers you linked.
This one is interesting. It argues that the regularization properties are not in SGD, but rather in the NN parameterization, and that non-gradient optimizers also find simple solutions which generalize well. They talk about Bayes only in a paragraph on page 3. They say that literature arguing that NNs work well because they’re Bayesian is related (which is true—it’s also about generalization and volumes). But I see little evidence that the explanation in this paper is an appeal to Bayesian thinking. A simple question for you: what prior distribution do the NNs have, according to the findings in this paper?
This paper finds that the probability that SGD finds a function is correlated with the posterior probability of a Gaussian process conditioned on the same data. Except that if you use the Gaussian process they study to make predictions, it does not work as well as the NN. So you can’t explain that the NN works well by appealing to its similarity to this particular Bayesian posterior.
I have many problems with SLT and a proper comment will take me a couple extra hours. But also I could come away thinking that it’s basically correct, so maybe this is the one.
Yup this changes my mind about the relevance of this paper.
In brief: in weight space, uniform. In function space, it’s an open problem, and the paper says relatively little about that, only showing that conditioning on functions with zero loss, weighted by their corresponding size in the weight space, gets you the same result as training a neural network. The former process is sampling from a Bayesian posterior.
Less brief: the prior assigns uniform probability to all weights, and I believe a good understanding of the mapping from weights to functions is lacking, though much of the time there are many directions you can move in weight space which don’t change your function, so one would expect it’s a relatively compressive mapping (in contrast to, say, a polynomial parameterization, where the mapping is one-to-one).
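A toy illustration of that claim (mine; a one-parameter-pair threshold classifier stands in for a real network): sample weights from the uniform prior, keep only those with zero training loss, and the induced distribution over functions is exactly the Bayesian posterior, with each function weighted by its volume in weight space.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "network": f(x) = 1[a*x + b > 0], parameters (a, b) with a uniform prior on [-1,1]^2
x_neg, x_pos = -1.0, 1.0   # training inputs labeled 0 and 1
x_test = 0.5

def predict(a, b, x):
    return (a * x + b > 0).astype(int)

# Rejection sampling: draw from the uniform prior over weight space,
# condition on zero training loss. This is sampling the Bayesian posterior.
a = rng.uniform(-1, 1, 100000)
b = rng.uniform(-1, 1, 100000)
fits = (predict(a, b, x_neg) == 0) & (predict(a, b, x_pos) == 1)
test_preds = predict(a[fits], b[fits], x_test)

# The posterior probability of each test labeling equals the relative
# weight-space volume of the parameter region producing it (here 0.75 exactly)
print(test_preds.mean())
```

The volume weighting comes for free: functions realized by bigger parameter regions simply get sampled more often.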
will say more about your other comment later (maybe).
In absolute terms you’re correct. In relative terms, they’re an object that at least frequentist theory can begin to analyze (as you point out, statistical learning theory did, somewhat unsuccessfully).
Whereas Bayesian theory would throw up its hands and say it’s not a prior that gets updated, so it’s not worth considering as a statistical estimator. This seems even wronger.
More recent theory can account for them working, somewhat. But it’s about analyzing their properties as estimators (i.e. frequentism), as opposed to framing them in terms of prior/posterior (though there are plenty of attempts at the latter going around).
I think this comment of mine serves well as a response to this as well as the comment it was originally responding to.
I’m curious how this dialogue would evolve if it included a Pearlist, that is, someone who subscribes to Judea Pearl’s causal statistics paradigm. If we use the same sort of “it acts the way its practitioners do” intuition that this dialogue is using, then Pearl’s framework seems like it has the virtue that the do operator allows free will-like phenomena to enter the statistical reasoner. Which, in turn, is necessary for agents to act morally when placed under otherwise untenable pressure to do otherwise. Which is necessary to solve the alignment problem, from what I can tell—the subjective experience of a superintelligence would almost have to be that it can take whatever it wants but will be killed if its presence is known, since these are the two properties (extreme capabilities and death-upon-detected-misalignment) that are impressed thoroughly into the entire training corpus of alignment literature.
In reality, we could probably just do some more RLHF on a model after it does something we don’t want in order to slightly divert it away from inconvenient goals that it is pursuing in an unacceptable manner. Which, if we impressed that message/moral into the alignment corpus with the same insistence that we impress the first two axioms, maybe a superintelligence wouldn’t be as paranoid as one would naively expect it to be under just the first two axioms. I.e., maybe all that mathematics and Harry Potter fanfiction are not Having the Intended Effect.
Just my two cents.
I’d be interested in @Radford Neal’s take on this dialogue (context).
OK. My views now are not far from those of some time ago, expressed at https://glizen.com/radfordneal/res-bayes-ex.html
With regard to machine learning, for many problems of small to moderate size, some Bayesian methods, such as those based on neural networks or mixture models that I’ve worked on, are not just theoretically attractive, but also practically superior to the alternatives.
This is not the case for large-scale image or language models, for which any close approximation to true Bayesian inference is very difficult computationally.
However, I think Bayesian considerations have nevertheless provided more insight than frequentism in this context. My results from 30 years ago showing that infinitely-wide neural networks with appropriate priors work well without overfitting have been a better guide to what works than the rather absurd discussions by some frequentist statisticians of that time about how one should test whether a network with three hidden units is sufficient, or whether instead the data justifies adding a fourth hidden unit. Though as commented above, recent large-scale models are really more a success of empirical trial-and-error than of any statistical theory.
One can also look at Vapnik’s frequentist theory of structural risk minimization from around the same time period. This was widely seen as justifying use of support vector machines (though as far as I can tell, there is no actual formal justification), which were once quite popular for practical applications. But SVMs are not so popular now, being perhaps superseded by the mathematically-related Bayesian method of Gaussian process regression, whose use in ML was inspired by my work on infinitely-wide neural networks. (Other methods like boosted decision trees may also be more popular now.)
One reason that thinking about Bayesian methods can be fruitful is that they involve a feedback process:
1. Think about what model is appropriate for your problem, and what prior for its parameters is appropriate. These should capture your prior beliefs.
2. Gather data.
3. Figure out some computational method to get the posterior, and predictions based on it.
4. Check whether the posterior and/or predictions make sense, compared to your subjective posterior (informally combining prior and data). Perhaps also look at performance on a validation set, which is not necessary in Bayesian theory, but is a good idea in practice given human fallibility and computational limitations. You can also try proving theoretical properties of the prior and/or posterior implied by (1), or of the computational method of step (3), and see whether they are what you were hoping for.
5. If the result doesn’t seem acceptable, go back to (1) and/or (3).
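As a toy instance of this loop, here is the conjugate Gaussian case, where step (3) is available in closed form (model and numbers are my own illustration, not from the comment):

```python
import numpy as np

rng = np.random.default_rng(2)

# Step 1: model y ~ N(mu, 1), with an assumed prior mu ~ N(mu0, tau0^2)
mu0, tau0 = 0.0, 1.0

# Step 2: gather data
data = rng.normal(loc=2.0, scale=1.0, size=20)

# Step 3: conjugate closed-form posterior for mu
n, ybar = len(data), data.mean()
tau_n2 = 1.0 / (1.0 / tau0**2 + n)          # posterior variance
mu_n = tau_n2 * (mu0 / tau0**2 + n * ybar)  # posterior mean

# Step 4: sanity-check against informal expectations:
# the posterior mean shrinks the data mean toward the prior,
# and the posterior variance is smaller than the prior variance
print(mu_n, tau_n2)
```

If step (4)’s checks failed, you would revisit the model/prior of step (1) or the computation of step (3), which is the feedback process described above.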
Prior beliefs are crucial here. There’s a tension between what works and what seems like the right prior. When these seem to conflict, you may gain better understanding of why the original prior didn’t really capture your beliefs, or you may realize that your computational methods are inadequate.
So, for instance, infinitely wide neural networks with independent finite-variance priors on the weights converge to Gaussian processes, with no correlations between different outputs. This works reasonably well, but isn’t what many people were hoping and expecting—no “hidden features” learned about the input. And non-Bayesian neural networks sometimes perform better than the corresponding Gaussian process.
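That convergence is essentially the central limit theorem, and is easy to see numerically. A small sketch (my illustration): outputs of wide random tanh networks with finite-variance N(0,1) weight priors look Gaussian at a fixed input.

```python
import numpy as np

rng = np.random.default_rng(3)
H, N = 256, 5000  # hidden width, number of networks sampled from the prior
x = 1.0

# One-hidden-layer nets f(x) = (1/sqrt(H)) * sum_j v_j * tanh(w_j*x + b_j),
# with iid N(0,1) priors on all weights and biases (the finite-variance setup)
w = rng.normal(size=(N, H))
b = rng.normal(size=(N, H))
v = rng.normal(size=(N, H))
f = (v * np.tanh(w * x + b)).sum(axis=1) / np.sqrt(H)

# As H grows, the prior over f(x) approaches a Gaussian by the CLT,
# so the excess kurtosis of the sampled outputs should be near zero
m = f.mean()
excess_kurtosis = ((f - m) ** 4).mean() / ((f - m) ** 2).mean() ** 2 - 3
print(excess_kurtosis)
```

Swapping the N(0,1) output weights for a heavy-tailed (infinite-variance) prior breaks the CLT and gives the non-Gaussian stable limit mentioned below.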
Solution: Don’t use finite-variance priors. As I recommended 30 years ago. With infinite-variance priors, the infinite-width limit is a non-Gaussian stable process, in which individual units can capture significant hidden features.