You’re right, you could have an event in the event space which is just “the virtua-evidence update [such-and-such]”. I’m actually going to pull out this trick in a future follow-up post.
I note that that’s not how Pearl or Jeffrey understand these updates. And it’s a peculiar thing to do—something happens to make you update a particular amount, but you’re just representing the event by the amount you update. Virtual evidence as-usually-understood at least coins a new symbol to represent the hard-to-articulate thing you’re updating on.
That’s not quite what I had in mind, but I can see how my ‘continuously valued’ comment might have thrown you off. A more concrete example might help: consider Example 2 in this paper. It posits three events:
b—my house was burgled
a—my alarm went off
z—my neighbor calls to tell me the alarm went off
Pearl’s method is to take what would be uncertain information about a (via my model of my neighbor and the fact she called me) and transform it into virtual evidence (which includes the likelihood ratio). What I’m saying is that you can just treat z as being an event itself, and do a Bayesian update from the likelihood P(z|b)=P(z|a)P(a|b)+P(z|~a)P(~a|b), etc. This will give you the exact same posterior as Pearl. Really, the only difference in these formulations is that Pearl only needs to know the ratio P(z|a):P(z|~a), whereas traditional Bayesian update requires actual values. Of course, any set of values consistent with the ratio will produce the right answer.
The slightly more complex case (and why I mentioned continuous values) is in section 5 where the message includes probability data, such as a likelihood ratio. Note that the continuous value is not the amount you update (at least not generally), because its not generated from your own models, but rather by the messenger. Consider event z99, where my neighbor calls to say she’s 99% sure the alarm went off. This doesn’t mean I have to treat P(z99|b):P(z99|~b) as 99:1 - I might model my neighbor as being poorly calibrated (or as not being independent of other information I already have), and use some other ratio.
In what sense? What technical claim about Bayesian updates are you trying to refer to?
Definitely the second one, as optimal update policy. Responding to your specific objections:
This is only true if the only information we have coming in is a sequence of propositions which we are updating 100% on.
As you’ll hopefully agree with at this point, we can always manufacture the 100% condition by turning it into virtual evidence.
This optimality property onlymakes sense if we believe something like grain-of-truth.
I believe I previously conceded this point—the true hypothesis (or at least a ‘good enough’ one) must have a nonzero probability, which we can’t guarantee.
But properties such as calibration and convergence also have intuitive appeal
Re: calibration—I still believe that this can be included if you are jointly estimating your model and your hypothesis.
Re: convergence—how real of a problem is this? In your example you had two hypotheses that were precisely equally wrong. Does convergence still fail if the true probability is 0.500001 ?
(By the way, I really appreciate your in-depth engagement with my position.)
Likewise! This has certainly been educational, especially in light of this:
Sadly, the actual machinery of logical induction was beyond the scope of this post, but there are answers. I just don’t yet know a good way to present it all as a nice, practical, intuitively appealing package.
The solution is too large to fit in the margins, eh? j/k, I know there’s a real paper. Should I go break my brain trying to read it, or wait for your explanation?
The solution is too large to fit in the margins, eh? j/k, I know there’s a real paper. Should I go break my brain trying to read it, or wait for your explanation?
Oh, I definitely don’t have a better explanation of that in the works at this point.
Re: convergence—how real of a problem is this? In your example you had two hypotheses that were precisely equally wrong. Does convergence still fail if the true probability is 0.500001 ?
My main concern here is if there’s an adversarial process taking advantage of a system, as in the trolling mathematicians work.
In the case of mathematical reasoning, though, the problem is quite severe—as is hopefully clear from what I’ve linked above, although an adversary can greatly exacerbate the problem, even a normal non-adversarial stream of evidence is going to keep flipping the probability up and down by non-negligible amounts. (And although the post offers a solution, it’s a pretty dumb prior, and I argue all priors which avoid this problem will be similarly dumb.)
Re: calibration—I still believe that this can be included if you are jointly estimating your model and your hypothesis.
I don’t get this at all! What do you mean?
As you’ll hopefully agree with at this point, we can always manufacture the 100% condition by turning it into virtual evidence.
As I discussed earlier, I agree, but with the major caveat that the likelihoods from the update aren’t found by first determining what you’re updating on, and then looking up the likelihoods for that; instead, we have to determine the likelihoods and then update on the virtual-evidence proposition with those likelihoods.
That’s not quite what I had in mind, but I can see how my ‘continuously valued’ comment might have thrown you off. A more concrete example might help: consider Example 2 in this paper. It posits three events:
[...]
What I’m saying is that you can just treat z as being an event itself, and do a Bayesian update from the likelihood [...]
Hmmmm. Unfortunately I’m not sure what to say to this one except that in logical induction, there’s not generally a pre-existing z we can update on like that.
Take the example of calibration traders. These guys can be described as moving probabilities up and down to account for the calibration curves so far. But that doesn’t really mean that you “update on the calibration trader” and e.g. move your 90% probabilities down to 89% in response. Instead, what happens is that the system takes a fixed-point, accounting for the calibration traders and also everything else, finding a point where all the various influences balance out. This point becomes the next overall belief distribution.
So the actual update is a fixed-point calculation, which isn’t at all a nice formula such as multiplying all the forces pushing probabilities in different directions (finding the fixed point isn’t even a continuous function).
We can make it into a Bayesian update on 100%-confident evidence by modeling it as virtual evidence, but the virtual evidence is sorta pulled from nowhere, just an arbitrary thing that gets us from the old belief state to the new one. The actual calculation of the new belief state is, as I said, the big fixed point operation.
You can’t even say that we’re updating on all those forces pulling the distribution in different directions, because there are more than one fixed points of those forces. We don’t want to have uncertainty about which of those fixed points we end up in; that would give the wrong thing. So we really have to update on the fixed-point itself, which is already the answer to what to update to; not some information which we have pre-existing beliefs about and can figure out likelihood ratios for to figure out what to update to.
So that’s my real crux, and any examples with telephone calls and earthquakes etc are merely illustrative for me. (Like I said, I don’t know how to actually motivate any of this stuff except with actual logical uncertainty, and I’m surprised that any philosophers would have become convinced just from other sorts of examples.)
Hmmmm. Unfortunately I’m not sure what to say to this one except that in logical induction, there’s not generally a pre-existing z we can update on like that.
So that’s my real crux, and any examples with telephone calls and earthquakes etc are merely illustrative for me. (Like I said, I don’t know how to actually motivate any of this stuff except with actual logical uncertainty, and I’m surprised that any philosophers would have become convinced just from other sorts of examples.)
I agree that the logical induction case is different, since it’s hard to conceive of likelihoods to begin with. Basically, logical induction doesn’t even include what I would call virtual evidence. But many of the examples you gave do have such a z. I think I agree with your crux, and my main critique here is just in the examples of overly dogmatic Bayesian who refuses to acknowledge the difference between a and z. I won’t belabor the point further.
I’ve thought of another motivating example, BTW. In wartime, your enemy deliberately sends you some verifiably true information about their force dispositions. How should you update on that? You can’t use a Bayesian update, since you don’t actually have a likelihood model available. We can’t even attempt to learn a model from the information, since we can’t be sure its representative.
I don’t get this at all! What do you mean?
By model M, I mean an algorithm that generates likelihood functions, so M(H,Z) = P(Z|H).
So any time we talk about a likelihood P(Z|H), it should really read P(Z|H,M). We’ll posit that P(H,M) = P(H)P(M) (i.e. that the model says nothing about our priors), but this isn’t strictly necessary.
E(P(Z|H,M)) will be higher for a well calibrated model than a poorly calibrated model, which means that we expect P(H,M|Z) to also be higher. When we then marginalize over the models to get a final posterior on the hypothesis P(H|Z), it will be dominated by the well-calibrated models: P(H|Z) = SUM_i P(H|M_i,Z)P(M_i|Z).
BTW, I had a chance to read part of the ILA paper. It barely broke my brain at all! I wonder if the trick of enumerating traders and incorporating them over time could be repurposed to a more Bayesianish context, by instead enumerating models M. Like the trading firm in ILA, a meta-Bayesian algorithm could keep introducing new models M_k over time, with some intuition that the calibration of the best model in the set would improve over time, perhaps giving it all those nice anti-dutch book properties. Basically this is a computable Solomonoff induction, that slowly approaches completeness in the limit. (I’m pretty sure this is not an original idea. I wouldn’t be surprised if something like this contributed to the ILA itself).
Of course, its pretty unclear how this would work in the logical induction case. This might all be better explained in its own post.
That’s not quite what I had in mind, but I can see how my ‘continuously valued’ comment might have thrown you off. A more concrete example might help: consider Example 2 in this paper. It posits three events:
b—my house was burgled
a—my alarm went off
z—my neighbor calls to tell me the alarm went off
Pearl’s method is to take what would be uncertain information about a (via my model of my neighbor and the fact she called me) and transform it into virtual evidence (which includes the likelihood ratio). What I’m saying is that you can just treat z as being an event itself, and do a Bayesian update from the likelihood P(z|b)=P(z|a)P(a|b)+P(z|~a)P(~a|b), etc. This will give you the exact same posterior as Pearl. Really, the only difference in these formulations is that Pearl only needs to know the ratio P(z|a):P(z|~a), whereas traditional Bayesian update requires actual values. Of course, any set of values consistent with the ratio will produce the right answer.
The slightly more complex case (and why I mentioned continuous values) is in section 5 where the message includes probability data, such as a likelihood ratio. Note that the continuous value is not the amount you update (at least not generally), because its not generated from your own models, but rather by the messenger. Consider event z99, where my neighbor calls to say she’s 99% sure the alarm went off. This doesn’t mean I have to treat P(z99|b):P(z99|~b) as 99:1 - I might model my neighbor as being poorly calibrated (or as not being independent of other information I already have), and use some other ratio.
Definitely the second one, as optimal update policy. Responding to your specific objections:
As you’ll hopefully agree with at this point, we can always manufacture the 100% condition by turning it into virtual evidence.
I believe I previously conceded this point—the true hypothesis (or at least a ‘good enough’ one) must have a nonzero probability, which we can’t guarantee.
Re: calibration—I still believe that this can be included if you are jointly estimating your model and your hypothesis.
Re: convergence—how real of a problem is this? In your example you had two hypotheses that were precisely equally wrong. Does convergence still fail if the true probability is 0.500001 ?
Likewise! This has certainly been educational, especially in light of this:
The solution is too large to fit in the margins, eh? j/k, I know there’s a real paper. Should I go break my brain trying to read it, or wait for your explanation?
Oh, I definitely don’t have a better explanation of that in the works at this point.
My main concern here is if there’s an adversarial process taking advantage of a system, as in the trolling mathematicians work.
In the case of mathematical reasoning, though, the problem is quite severe—as is hopefully clear from what I’ve linked above, although an adversary can greatly exacerbate the problem, even a normal non-adversarial stream of evidence is going to keep flipping the probability up and down by non-negligible amounts. (And although the post offers a solution, it’s a pretty dumb prior, and I argue all priors which avoid this problem will be similarly dumb.)
I don’t get this at all! What do you mean?
As I discussed earlier, I agree, but with the major caveat that the likelihoods from the update aren’t found by first determining what you’re updating on, and then looking up the likelihoods for that; instead, we have to determine the likelihoods and then update on the virtual-evidence proposition with those likelihoods.
Hmmmm. Unfortunately I’m not sure what to say to this one except that in logical induction, there’s not generally a pre-existing z we can update on like that.
Take the example of calibration traders. These guys can be described as moving probabilities up and down to account for the calibration curves so far. But that doesn’t really mean that you “update on the calibration trader” and e.g. move your 90% probabilities down to 89% in response. Instead, what happens is that the system takes a fixed-point, accounting for the calibration traders and also everything else, finding a point where all the various influences balance out. This point becomes the next overall belief distribution.
So the actual update is a fixed-point calculation, which isn’t at all a nice formula such as multiplying all the forces pushing probabilities in different directions (finding the fixed point isn’t even a continuous function).
We can make it into a Bayesian update on 100%-confident evidence by modeling it as virtual evidence, but the virtual evidence is sorta pulled from nowhere, just an arbitrary thing that gets us from the old belief state to the new one. The actual calculation of the new belief state is, as I said, the big fixed point operation.
You can’t even say that we’re updating on all those forces pulling the distribution in different directions, because there are more than one fixed points of those forces. We don’t want to have uncertainty about which of those fixed points we end up in; that would give the wrong thing. So we really have to update on the fixed-point itself, which is already the answer to what to update to; not some information which we have pre-existing beliefs about and can figure out likelihood ratios for to figure out what to update to.
So that’s my real crux, and any examples with telephone calls and earthquakes etc are merely illustrative for me. (Like I said, I don’t know how to actually motivate any of this stuff except with actual logical uncertainty, and I’m surprised that any philosophers would have become convinced just from other sorts of examples.)
I agree that the logical induction case is different, since it’s hard to conceive of likelihoods to begin with. Basically, logical induction doesn’t even include what I would call virtual evidence. But many of the examples you gave do have such a z. I think I agree with your crux, and my main critique here is just in the examples of overly dogmatic Bayesian who refuses to acknowledge the difference between a and z. I won’t belabor the point further.
I’ve thought of another motivating example, BTW. In wartime, your enemy deliberately sends you some verifiably true information about their force dispositions. How should you update on that? You can’t use a Bayesian update, since you don’t actually have a likelihood model available. We can’t even attempt to learn a model from the information, since we can’t be sure its representative.
By model M, I mean an algorithm that generates likelihood functions, so M(H,Z) = P(Z|H).
So any time we talk about a likelihood P(Z|H), it should really read P(Z|H,M). We’ll posit that P(H,M) = P(H)P(M) (i.e. that the model says nothing about our priors), but this isn’t strictly necessary.
E(P(Z|H,M)) will be higher for a well calibrated model than a poorly calibrated model, which means that we expect P(H,M|Z) to also be higher. When we then marginalize over the models to get a final posterior on the hypothesis P(H|Z), it will be dominated by the well-calibrated models: P(H|Z) = SUM_i P(H|M_i,Z)P(M_i|Z).
BTW, I had a chance to read part of the ILA paper. It barely broke my brain at all! I wonder if the trick of enumerating traders and incorporating them over time could be repurposed to a more Bayesianish context, by instead enumerating models M. Like the trading firm in ILA, a meta-Bayesian algorithm could keep introducing new models M_k over time, with some intuition that the calibration of the best model in the set would improve over time, perhaps giving it all those nice anti-dutch book properties. Basically this is a computable Solomonoff induction, that slowly approaches completeness in the limit. (I’m pretty sure this is not an original idea. I wouldn’t be surprised if something like this contributed to the ILA itself).
Of course, its pretty unclear how this would work in the logical induction case. This might all be better explained in its own post.