A tangential question: Does the overfitting issue from Bayesian statistics have an analog in Bayesian epistemology, i.e. when we only deal with propositional subjective degrees of belief, not with random variables and models?
I think the problem is the same in both cases. Roughly speaking, there is some “appropriate amount” of belief updating to try to fit your experiences, and this appropriate amount is described by Bayes’ rule under ideal conditions where
it’s computationally feasible to perform the full Bayesian update, and
the correct model is within the class of models you’re performing the update over.
If either of these is not true, then in general you don’t know which update is good. If your class of models is particularly bad, it can be preferable to stick to an ignorance prior and perform no update at all.
Asymptotically, all update rules within the tempered Bayes paradigm (Bayes but likelihoods are raised to an exponent that’s not in general equal to 1) in a stationary environment (i.i.d. samples and such) converge to MLE, where you have guarantees of eventually landing in a part of your model space which has minimal KL divergence with the true data generating process. However, this is an asymptotic guarantee, so it doesn’t necessarily tell us what we should be doing when our sample is finite. Moreover, this guarantee is no longer valid if the data-generating process is not stationary, e.g. if you’re drawing one long string of correlated samples from a distribution instead of many independent samples.
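To make this concrete, here is a minimal sketch of a tempered Bayesian update over a finite class of Bernoulli models; the model class, data, and tempering exponent are invented purely for illustration, and setting the exponent to 1 recovers ordinary Bayes:

```python
import numpy as np

# Minimal sketch: tempered Bayesian updating over a finite class of Bernoulli
# models. With beta = 1 this is ordinary Bayes; other exponents temper the
# likelihood. Model class and data are made up for illustration.
thetas = np.array([0.2, 0.5, 0.8])              # candidate Bernoulli parameters
prior = np.ones(len(thetas)) / len(thetas)      # ignorance prior
data = np.array([1, 1, 0, 1, 1, 1, 0, 1])       # i.i.d. coin flips
beta = 0.7                                      # tempering exponent

log_post = np.log(prior)
for x in data:
    log_lik = np.log(np.where(x == 1, thetas, 1 - thetas))
    log_post += beta * log_lik                  # tempered Bayes step
log_post -= log_post.max()
posterior = np.exp(log_post)
posterior /= posterior.sum()

# As the sample grows, the posterior concentrates on the theta whose model has
# minimal KL divergence from the true data-generating process.
print(dict(zip(thetas.tolist(), posterior.round(3).tolist())))
```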
Using Bayes’ rule at least gets the right credence ratios between the different models you’re considering, but it’s not clear if this is optimal from the point of view of e.g. an agent trying to maximize expected utility in an environment.
I think in practice the way people deal with these problems is to use a “lazily evaluated” version of the Bayesian paradigm. They start with an initial class of models M, and perform usual Bayes until they notice that none of the models in M seem to fit the data very well. They then search for an expanded class of models M′⊃M which can still fit the data well while trying to balance between the increased dimensionality of the models in M′ and their better fit with data, and if a decent match is found, they keep using M′ from that point on, etc.
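A rough sketch of that loop (the `.loglik` interface, the misfit test via a predictive log-score threshold, and the reset to a uniform prior after expanding to M′ are all simplifications of my own, not part of the description above):

```python
import numpy as np

def lazily_evaluated_bayes(data_stream, model_class, expand, fit_threshold=-3.0):
    """Sketch of the 'lazily evaluated' Bayesian loop described above.

    model_class: list of models, each assumed to expose .loglik(x) -> float
    expand:      callable returning a larger model class M' that contains M
    """
    log_post = np.zeros(len(model_class))       # uniform prior, in log space
    scores = []
    for x in data_stream:
        log_liks = np.array([m.loglik(x) for m in model_class])
        # Predictive log score of the current posterior mixture, before updating.
        w = np.exp(log_post - log_post.max())
        w /= w.sum()
        scores.append(np.log(np.dot(w, np.exp(log_liks))))
        log_post += log_liks                    # ordinary Bayes step
        # If none of the models in M fit the recent data well, move to M'.
        if len(scores) >= 50 and np.mean(scores[-50:]) < fit_threshold:
            model_class = expand(model_class)
            log_post = np.zeros(len(model_class))
            scores = []
    return model_class, log_post
```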
You may well be right about all this! But I meant something else here, namely "Bayesian epistemology" in the sense of Bayesianism in philosophy, without any statistics, i.e. without notions of (a class of) models, sampling, or a true data-generating process. It seems Bayesian statisticians and epistemologists often have problems understanding each other.
I admit this derails the discussion a bit, since I lack the statistical background and can't comment on your thoughts in this post. So feel free to ignore what follows.
To clarify, epistemologists usually identify their type of Bayesianism with the two epistemic rationality assumptions:
Probabilism (synchronic norm): At any point in time, all beliefs of an agent should satisfy the axioms of probability theory, where a probability function P describes the degrees of belief (at some point in time) of an agent, and the objects of belief are propositions which may be combined with propositional logical connectives like negation, conjunction etc. (Similar to sets/”events” in standard statistics.)
Conditionalization (diachronic norm): If an agent has some degree of belief in a proposition H and observes a new piece of evidence E (where E is also a proposition), they should update their belief in H according to the principle of conditionalization Pnew(H):=Pold(H|E), where Pold and Pnew describe the agent's beliefs at the points in time immediately before and after "learning" E.
(Conditionalization implies that Pnew(E)=1, so this norm is only plausible when the evidence is learned with certainty, like direct experience. It also implies "rigidity", namely that Pnew(H|E)=Pold(H|E). Note also that if Bayes' theorem is applied to the right-hand side of the principle of conditionalization, we get what is usually called Bayes' rule or simply Bayesian updating.)
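For concreteness, a small numerical check of those two implications; the joint distribution over H and E is arbitrary:

```python
# A tiny joint distribution over H and E; the numbers are arbitrary.
P_old = {('H', 'E'): 0.05, ('H', 'not-E'): 0.55,
         ('not-H', 'E'): 0.15, ('not-H', 'not-E'): 0.25}

def prob(P, pred):
    return sum(p for world, p in P.items() if pred(world))

# Conditionalize on E: zero out the not-E worlds and renormalize.
P_E = prob(P_old, lambda w: w[1] == 'E')
P_new = {w: (p / P_E if w[1] == 'E' else 0.0) for w, p in P_old.items()}

old_H_given_E = prob(P_old, lambda w: w == ('H', 'E')) / P_E
new_H_given_E = prob(P_new, lambda w: w == ('H', 'E')) / prob(P_new, lambda w: w[1] == 'E')

print(prob(P_new, lambda w: w[1] == 'E'))   # 1.0: Pnew(E) = 1
print(old_H_given_E, new_H_given_E)         # 0.25, 0.25: rigidity
print(prob(P_new, lambda w: w[0] == 'H'))   # 0.25: Pnew(H) = Pold(H|E)
```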
Now probabilism is relatively uncontroversial, at least as an idealization which doesn't take cognitive/computational constraints into account. But conditionalization, the updating rule, is more controversial for various reasons: some epistemologists reject any updating rule outright (radical probabilism), or they replace conditionalization with some other updating rule. Others accept it but add a rule for cases where evidence isn't learned with certainty.
Relevant for the topic of overfitting is that conditionalization could be interpreted as "fitting" H and E in the wrong way (though I'm not sure whether this has any connection to "over"fitting). Recall that conditionalization implies rigidity, Pnew(H|E)=Pold(H|E), i.e. the conditional probabilities always stay fixed and only P(H) changes.
But that perhaps sounds too dogmatic. For example, assume that at first you think E and H are strongly negatively dependent, i.e. Pold(E and H) << Pold(E) × Pold(H). You also believe H, i.e. Pold(H) is high, and disbelieve E, i.e. Pold(E) is low. This means you expect that E would strongly disconfirm H, i.e. Pold(H|E) is low.
If you now indeed observe E (Pnew(E)=1), that would be strong evidence against H, i.e. Pnew(H) << Pold(H), and conditionalization concurs. But it seems the observation of E would also be some evidence against E and H being as strongly negatively dependent as you assumed, i.e. evidence against P(H|E) being as low as you previously thought. That would mean you should increase the conditional probability, i.e. Pnew(H|E) > Pold(H|E). But conditionalization forbids that: you can't update the conditional probabilities here, only the marginal probability of H.
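With concrete, made-up numbers, the example looks like this:

```python
# The example above with concrete, made-up numbers.
# Strong negative dependence: P(E and H) << P(E) * P(H).
P_old_H, P_old_E, P_old_E_and_H = 0.9, 0.1, 0.002   # P(E) * P(H) = 0.09 >> 0.002

P_old_H_given_E = P_old_E_and_H / P_old_E           # 0.02: E would strongly disconfirm H

# Observe E and conditionalize:
P_new_H = P_old_H_given_E          # 0.02 << 0.9, so H is indeed strongly disconfirmed
P_new_H_given_E = P_old_H_given_E  # rigidity: the conditional itself must stay at 0.02

# The complaint: observing the "unexpected" E should arguably also count against
# the assumed strong negative dependence, i.e. push P(H|E) up a bit, which
# conditionalization does not allow.
print(P_old_H, P_new_H)            # 0.9 -> 0.02
```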
To make this clearer, here is an analogy with "outright beliefs": Assume for a moment that beliefs don't come in degrees (as in belief revision theory or epistemic logic), and that you either "believe" a proposition p, or you "disbelieve" p (which just means you "believe" not-p), or you neither believe nor disbelieve p (you are agnostic about p). Then your beliefs can be described simply as a set Bel of propositions (the propositions you believe). A plausible rationality constraint is that Bel, your set of beliefs, should be logically consistent. (This is the analogue of the norm of probabilism, which says that your degrees of belief should be probabilistically coherent.)
Now assume your belief set Bel contains initially exactly these three propositions, mirroring those in the probabilistic example above:
Belold ={[If E then not-H], [H], [not-E]}
(In terms of degrees of belief, [If E then not-H] corresponds to "Pold(H|E) is low", [H] corresponds to "Pold(H) is high", and [not-E] corresponds to "Pold(E) is low".)
Then assume again that you observe/learn E. So you replace [not-E] with [E]:
Belnew ={[If E then not-H], [H], [E]}
But now your belief set is logically inconsistent! Any two entail the negation of the third. We need an updating rule which handles such cases, beyond just dropping the old “disbelief” in E and adding the new belief in E.
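For concreteness, a brute-force truth-table check of these claims, treating the propositions as functions of the truth values of E and H:

```python
from itertools import combinations, product

# Propositions as functions of the truth values of E and H.
if_E_then_not_H = lambda E, H: (not E) or (not H)
just_H = lambda E, H: H
just_E = lambda E, H: E
not_E = lambda E, H: not E

def consistent(beliefs):
    return any(all(b(E, H) for b in beliefs) for E, H in product([True, False], repeat=2))

def entails(premises, conclusion):
    return all(conclusion(E, H)
               for E, H in product([True, False], repeat=2)
               if all(p(E, H) for p in premises))

bel_old = [if_E_then_not_H, just_H, not_E]
bel_new = [if_E_then_not_H, just_H, just_E]

print(consistent(bel_old))   # True
print(consistent(bel_new))   # False: the new belief set is inconsistent
# Any two members of bel_new entail the negation of the third:
for i, j in combinations(range(3), 2):
    k = ({0, 1, 2} - {i, j}).pop()
    print(entails([bel_new[i], bel_new[j]], lambda E, H, k=k: not bel_new[k](E, H)))  # True x3
```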
Option 1: abandon [H] and keep [If E then not-H] (and [E]). This corresponds to the effect of conditionalization in the degrees-of-belief case.
Option 2: abandon [If E then not-H] and keep [H] (and [E]).
Both of these options remove the inconsistency, but each is arbitrarily biased against either [If E then not-H] or [H].
Option 3: don't add [E] in the first place and be content with removing just [not-E].
Option 4: remove both [If E then not-H] and [H] while keeping only [E].
Now it seems clear that neither option 1 nor option 2 is an appropriate updating rule, since each is arbitrarily biased against one of the old beliefs. But conditionalization corresponds to option 1. So this seems to be evidence that conditionalization is unacceptable.
Note that in the probabilistic case, an equivalent of option 4 does seem quite plausible. Instead of strongly lowering the credence P(H), we can lower both P(H) and P(not-H|E) by a medium amount. The latter is equivalent to increasing P(H|E) by a medium amount, which is incompatible with conditionalization, since conditionalization requires that P(H|E) stay constant during updating.
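Since Pnew(E)=1 forces Pnew(H)=Pnew(H|E), option 4 amounts to choosing a new value of P(H|E) somewhere between the old conditional and the old marginal. How exactly to split the revision isn't specified above; the geometric mean below is just one illustrative choice, using the same made-up numbers as before:

```python
import math

# Option 4 in the probabilistic case, with the same made-up numbers as before.
# Once P_new(E) = 1, the whole update reduces to choosing P_new(H) = P_new(H|E).
P_old_H, P_old_H_given_E = 0.9, 0.02

# Option 1 (conditionalization): all of the revision falls on P(H).
opt1_new_H = P_old_H_given_E                               # 0.02, strongly lowered
opt1_new_H_given_E = P_old_H_given_E                       # unchanged (rigidity)

# Option 4: spread the revision over both P(H) and P(H|E). The geometric mean
# is just one illustrative way to split it.
opt4_new_H_given_E = math.sqrt(P_old_H_given_E * P_old_H)  # ~0.13, raised
opt4_new_H = opt4_new_H_given_E                            # lowered, but less drastically

print(opt1_new_H, opt4_new_H)   # 0.02 vs ~0.134
```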
Anyway, this is my case for conditionalization (and therefore Bayes’ rule) producing “the wrong fit” between E and H. Though I’m unsure whether this can be interpreted as overfitting somehow.