If zero and one aren’t probabilities, how does Bayesian conditioning work? My understanding is that a Bayesian has to be certain of the truth of whatever proposition she conditions on when updating.
Zero and one are probabilities. The apparent opposite claim is hyperbole intended to communicate something else, but people on LessWrong persistently make the mistake of taking it literally. For examples of 0 and 1 appearing unavoidably in probability theory, consider P(A|A) = 1 and P(A|~A) = 0. If someone disputes either of these formulae, the onus is on them to rebuild probability theory in a way that avoids them. As far as I know, no one has even attempted this.
But P(A|B) = P(A&B)/P(B) for any positive value of P(B). You can condition on evidence all day without ever needing to assert a certainty about anything. Your conclusions will all be hypothetical, of the form “if this is the prior over A and this B is the evidence, this is the posterior over A”. If the evidence is uncertain, this can be incorporated into the calculation, giving conclusions of the form “given this prior over A and this probability distribution over possible evidence B, this is the posterior over A.”
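To make the hypothetical form concrete, here is a minimal sketch in Python; all the numbers (the prior over A and the likelihoods of B) are invented for illustration:

```python
# A minimal sketch of hypothetical conditioning. All numbers are invented.

p_a = 0.3                  # prior P(A)
p_b_given_a = 0.8          # likelihood P(B|A)
p_b_given_not_a = 0.2      # likelihood P(B|~A)

# P(B) by the law of total probability
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# P(A|B) = P(A&B)/P(B) -- defined whenever P(B) > 0
p_a_given_b = (p_b_given_a * p_a) / p_b

print(f"P(A|B) = {p_a_given_b:.3f}")   # ~0.632
```

No certainty is asserted anywhere here: the output is just “given these inputs, this is the posterior.”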
If you are uncertain even of the probability distribution over B, then a hard-core Bayesian will say that that uncertainty is modelled by a distribution over distributions of B, which can be folded down into a distribution over B. Soft-core Bayesians will scoff at this, and turn to magic, a.k.a. model checking, human understanding, etc. Hard-core Bayesians will say that these only work to the extent that they approximate to Bayesian inference. Soft-core Bayesians aren’t listening at this point, but if they were they might challenge the hard-core Bayesians to produce an actual method that works better.
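For concreteness, a small sketch of the hard-core move of folding a distribution over distributions of B down into a single distribution over B; the candidate distributions and the weights over them are made up:

```python
# Sketch: a "distribution over distributions" of B collapsed into a single
# distribution over B. Candidate models and their weights are invented.

# Three candidate values of P(B) under three candidate models
candidate_p_b = [0.2, 0.5, 0.9]
# Our uncertainty over which candidate model is right
model_weights = [0.3, 0.5, 0.2]

# Fold down: P(B) = sum over models of P(B | model) * P(model)
p_b = sum(p * w for p, w in zip(candidate_p_b, model_weights))
print(f"Folded-down P(B) = {p_b:.3f}")  # 0.06 + 0.25 + 0.18 = 0.49
```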
My understanding is that a Bayesian has to be certain of the truth of whatever proposition she conditions on when updating.
This isn’t necessary. In many circumstances, you can approximate the probability of an observation you’re updating on to 1, such as an observation that a coin came up heads. An observation never literally has a probability of 1 (you could be hallucinating, or be a brain in a jar, etc.). Sometimes observations are uncertain enough that you can’t approximate them to 1, but you can still do the math to update on them (“Did I really see a mouse? I might have imagined it. Update on 0.7 probability observation of mouse.”)
Yeah, but if your observation does not have a probability of 1 then Bayesian conditionalization is the wrong update rule. I take it this was Alex’s point. If you updated on a 0.7 probability observation using Bayesian conditionalization, you would be vulnerable to a Dutch book. The correct update rule in this circumstance is Jeffrey conditionalization. If P1 is your distribution prior to the observation and P2 is the distribution after the observation, the update rule for a hypothesis H given evidence E is:
P2(H) = P1(H | E) P2(E) + P1(H | ~E) P2(~E)
If P2(E) is sufficiently close to 1, the contribution of the second term in the sum is negligible and Bayesian conditionalization is a fine approximation.
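A small sketch of the rule, with invented values for P1(H|E) and P1(H|~E), showing that as P2(E) approaches 1 the Jeffrey update approaches plain Bayesian conditionalization:

```python
# Jeffrey conditionalization as given above, plus the limiting behaviour
# when P2(E) -> 1. Conditional probabilities are invented for illustration.

p1_h_given_e = 0.9      # P1(H | E)
p1_h_given_not_e = 0.1  # P1(H | ~E)

def jeffrey_update(p2_e: float) -> float:
    """P2(H) = P1(H|E) P2(E) + P1(H|~E) P2(~E)."""
    return p1_h_given_e * p2_e + p1_h_given_not_e * (1 - p2_e)

print(jeffrey_update(0.7))    # uncertain observation: 0.9*0.7 + 0.1*0.3 = 0.66
print(jeffrey_update(0.999))  # near-certain: ~0.8992, close to P1(H|E) = 0.9
```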
This “Jeffrey conditionalization” is a strange distinction. A little Google searching shows that someone got their name attached to conditioning on E and ~E. To me that’s just a straight application of probability theory. It’s not like I just fell off the turnip truck, but I’ve never heard anyone give this a name before.
To get a marginal, you condition on what you know, and sum across the other things you don’t. I dislike the endless multiplication of terms for special cases where the general form is clear enough.
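For instance (a toy joint distribution, numbers invented):

```python
# The general form: condition on what you know, sum over what you don't.
# Joint P(A, B) as a dict keyed by outcome pairs; values invented.

joint = {("a", "b"): 0.24, ("a", "not_b"): 0.06,
         ("not_a", "b"): 0.14, ("not_a", "not_b"): 0.56}

# Marginal P(A=a): sum across B
p_a = sum(p for (a, _), p in joint.items() if a == "a")

# Conditional P(A=a | B=b): restrict to B=b and renormalise
p_b = sum(p for (_, b), p in joint.items() if b == "b")
p_a_given_b = joint[("a", "b")] / p_b

print(p_a, p_a_given_b)  # 0.3, ~0.632
```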
I dislike the endless multiplication of terms for special cases where the general form is clear enough.
I don’t know. I like having names for things. Makes it easier to refer to them. And to be fair to Jeffrey, while the update rule itself is a trivial consequence of probability theory (assuming the conditional probabilities are invariant), his reason for explicitly advocating it was the important epistemological point that absolute certainty (probability 1) is a sort of degenerate epistemic state. Think of his name being attached to the rule as recognition not of some new piece of math but of an insight into the nature of knowledge and learning.
If you observe X then the thing you update on is “I observed X” and not just “X”. Just because you observed something doesn’t mean it was necessarily the case (you could be hallucinating etc.). So while you don’t assign probability 1 to “X” you do assign probability 1 to “I observed X”, which is fine.
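A toy sketch of this two-level move, with made-up reliability numbers: the observation event gets probability 1 and is conditioned on outright, while X itself stays uncertain:

```python
# Sketch of updating on "I observed X" rather than on "X" itself.
# Reliability numbers are invented.

p_x = 0.5                    # prior P(X)
p_obs_given_x = 0.95         # P("I observed X" | X)  -- senses usually work
p_obs_given_not_x = 0.05     # P("I observed X" | ~X) -- hallucination etc.

# We assign probability 1 to the observation event and condition on it:
p_obs = p_obs_given_x * p_x + p_obs_given_not_x * (1 - p_x)
p_x_given_obs = p_obs_given_x * p_x / p_obs

print(f"P(X | I observed X) = {p_x_given_obs:.3f}")  # 0.95, still short of 1
```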