> Why can’t virtual evidence messages be part of the event space? Is it because they are continuously valued?
Ah, now there’s a good question.
You’re right, you could have an event in the event space which is just “the virtual-evidence update [such-and-such]”. I’m actually going to pull out this trick in a future follow-up post.
I note that that’s not how Pearl or Jeffrey understand these updates. And it’s a peculiar thing to do—something happens to make you update a particular amount, but you’re just representing the event by the amount you update. Virtual evidence as-usually-understood at least coins a new symbol to represent the hard-to-articulate thing you’re updating on.
But that’s not much of an argument against the idea, especially when virtual evidence is already a peculiar practice.
Note that re-framing virtual evidence updates as Bayesian updates doesn’t change the fact that you’re now allowing essentially arbitrary updates rather than using Bayes’ Law to compute your update. You’re adjusting the formalism around an update to make Bayes’ Law true of it, not using Bayes’ Law to compute the update. So Bayes’ Law is still de-throned in the sense that it’s no longer the formula used to compute updates.
> As to why one would want to have Bayesian updates be normative: one answer is that they maximize our predictive power, given sufficient compute. Given the name of this website, that seems a sufficient reason.
I’m going to focus on this even though you more or less concede the point later on in your reply. (By the way, I really appreciate your in-depth engagement with my position.)
In what sense? What technical claim about Bayesian updates are you trying to refer to?
One candidate is the dominance of Solomonoff induction over any computable machine learning technique. This has several problems:
Solomonoff induction only gets you this guarantee in the sequential prediction setting, whereas logical induction gets a qualitatively similar guarantee in a broader setting.
The “sufficiently much computational power” required by Solomonoff induction is infinite, whereas the computational power required by logical induction is finite. So logical induction comes closer to offering real advice for bounded agents, rather than an idealized decision theory whose application to bounded agents requires further insights.
Thus, logical induction is more suited to handling logical uncertainty, which is critical for handling bounded agents. Logical uncertainty becomes less confusing when non-Bayesian updates are allowed.
Solomonoff induction can display irrational behaviors, such as miscalibration, if the grain-of-truth assumption fails. These failures of rationality can be independently concerning, even given a nice guarantee about Bayes loss.
Note that I did not put the point “Solomonoff’s dominance property requires grain-of-truth” in this list, because it doesn’t—in a Bayes Loss sense, Solomonoff still does almost as well as any computable machine learning technique even in non-realizable settings where it may not converge to any one computable hypothesis. (Even though better calibration does tend to improve Bayes loss, we know it isn’t losing too much to this.)
Another candidate is that Bayes’ law is the optimal update policy when starting with a particular prior and trying to minimize Bayes loss. But again this has several problems.
This is only true if the only information we have coming in is a sequence of propositions which we are updating 100% on. Realistically, we can revise the prior to be better just by thinking about it (due to our logical uncertainty), making non-Bayesian updates relevant.
This optimality property only makes sense if we believe something like grain-of-truth. “I’ve listed all the possibilities in my prior, and I’m hedging against them according to my degree of belief in them. What more do you want from me?”
Rationality isn’t just about minimizing Bayes loss. Bayes loss is a compelling property in part because it gets us other intuitively compelling properties (such as Bayes’ Law). But properties such as calibration and convergence also have intuitive appeal, and we can back this appeal up with Dutch Book and Dutch-Book-like arguments.
A second answer you hint at here:
>The second seems more practical for the working Bayesian.
> As a working Bayesian myself, having a practical update rule is quite useful! As far as I can tell, I don’t see a good alternative in what you have provided.
Sadly, the actual machinery of logical induction was beyond the scope of this post, but there are answers. I just don’t yet know a good way to present it all as a nice, practical, intuitively appealing package.
> You’re right, you could have an event in the event space which is just “the virtual-evidence update [such-and-such]”. I’m actually going to pull out this trick in a future follow-up post.
> I note that that’s not how Pearl or Jeffrey understand these updates. And it’s a peculiar thing to do—something happens to make you update a particular amount, but you’re just representing the event by the amount you update. Virtual evidence as-usually-understood at least coins a new symbol to represent the hard-to-articulate thing you’re updating on.
That’s not quite what I had in mind, but I can see how my ‘continuously valued’ comment might have thrown you off. A more concrete example might help: consider Example 2 in this paper. It posits three events:
b—my house was burgled
a—my alarm went off
z—my neighbor calls to tell me the alarm went off
Pearl’s method is to take what would be uncertain information about a (via my model of my neighbor and the fact she called me) and transform it into virtual evidence (which includes the likelihood ratio). What I’m saying is that you can just treat z as being an event itself, and do a Bayesian update from the likelihood P(z|b)=P(z|a)P(a|b)+P(z|~a)P(~a|b), etc. This will give you the exact same posterior as Pearl. Really, the only difference in these formulations is that Pearl only needs to know the ratio P(z|a):P(z|~a), whereas traditional Bayesian update requires actual values. Of course, any set of values consistent with the ratio will produce the right answer.
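To make the equivalence concrete, here is a minimal numerical sketch in Python (the numbers are made up for illustration, not taken from the paper): it computes P(b|z) once by updating directly on z, and once Pearl-style from only the ratio P(z|a):P(z|~a), and the two agree.

```python
# Illustrative numbers only (not from the paper).
P_b = 0.001             # prior probability of burglary
P_a_given_b = 0.95      # alarm fires given burglary
P_a_given_not_b = 0.01  # alarm fires given no burglary
P_z_given_a = 0.80      # neighbor calls given the alarm went off
P_z_given_not_a = 0.05  # neighbor calls given no alarm

# Direct Bayesian update, treating the call z as an ordinary event.
P_z_given_b = P_z_given_a * P_a_given_b + P_z_given_not_a * (1 - P_a_given_b)
P_z_given_not_b = P_z_given_a * P_a_given_not_b + P_z_given_not_a * (1 - P_a_given_not_b)
posterior_direct = (P_z_given_b * P_b) / (P_z_given_b * P_b + P_z_given_not_b * (1 - P_b))

# Pearl-style virtual evidence: only the ratio P(z|a):P(z|~a) is needed.
ratio = P_z_given_a / P_z_given_not_a
odds_b = (P_b / (1 - P_b)) * (
    (ratio * P_a_given_b + (1 - P_a_given_b))
    / (ratio * P_a_given_not_b + (1 - P_a_given_not_b))
)
posterior_virtual = odds_b / (1 + odds_b)

print(posterior_direct, posterior_virtual)  # identical up to floating-point error
```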
The slightly more complex case (and why I mentioned continuous values) is in section 5, where the message includes probability data, such as a likelihood ratio. Note that the continuous value is not the amount you update (at least not generally), because it’s not generated from your own models, but rather by the messenger. Consider event z99, where my neighbor calls to say she’s 99% sure the alarm went off. This doesn’t mean I have to treat P(z99|b):P(z99|~b) as 99:1; I might model my neighbor as being poorly calibrated (or as not being independent of other information I already have), and use some other ratio.
> In what sense? What technical claim about Bayesian updates are you trying to refer to?
Definitely the second one, Bayes’ law as the optimal update policy. Responding to your specific objections:
> This is only true if the only information we have coming in is a sequence of propositions which we are updating 100% on.
As you’ll hopefully agree with at this point, we can always manufacture the 100% condition by turning it into virtual evidence.
> This optimality property only makes sense if we believe something like grain-of-truth.
I believe I previously conceded this point—the true hypothesis (or at least a ‘good enough’ one) must have a nonzero probability, which we can’t guarantee.
> But properties such as calibration and convergence also have intuitive appeal
Re: calibration—I still believe that this can be included if you are jointly estimating your model and your hypothesis.
Re: convergence—how real of a problem is this? In your example you had two hypotheses that were precisely equally wrong. Does convergence still fail if the true probability is 0.500001?
> (By the way, I really appreciate your in-depth engagement with my position.)
Likewise! This has certainly been educational, especially in light of this:
> Sadly, the actual machinery of logical induction was beyond the scope of this post, but there are answers. I just don’t yet know a good way to present it all as a nice, practical, intuitively appealing package.
The solution is too large to fit in the margins, eh? j/k, I know there’s a real paper. Should I go break my brain trying to read it, or wait for your explanation?
> The solution is too large to fit in the margins, eh? j/k, I know there’s a real paper. Should I go break my brain trying to read it, or wait for your explanation?
Oh, I definitely don’t have a better explanation of that in the works at this point.
> Re: convergence—how real of a problem is this? In your example you had two hypotheses that were precisely equally wrong. Does convergence still fail if the true probability is 0.500001?
My main concern here is whether there’s an adversarial process taking advantage of the system, as in the trolling-mathematicians work.
In the case of mathematical reasoning, though, the problem is quite severe. As is hopefully clear from what I’ve linked above, an adversary can greatly exacerbate the problem, but even a normal, non-adversarial stream of evidence is going to keep flipping the probability up and down by non-negligible amounts. (And although the post offers a solution, it’s a pretty dumb prior, and I argue that all priors which avoid this problem will be similarly dumb.)
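A toy simulation of the kind of oscillation I mean (my own illustrative numbers, not from the linked post): say the prior contains only two hypotheses, Bernoulli coins with biases 0.4 and 0.6, and the data actually comes from a coin with bias 0.500001. Each observation shifts the posterior log-odds by a fixed ±log(1.5), so the posterior performs an almost unbiased random walk; the drift contributed by the extra 0.000001 is so small that, for any feasible number of observations, the probability keeps swinging between values near 0 and near 1.

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = 0.500001
h_low, h_high = 0.4, 0.6        # the only two hypotheses in the prior
log_odds = 0.0                   # log P(h_high)/P(h_low), starting at even odds
posteriors = []

for _ in range(1_000_000):
    x = rng.random() < p_true
    # Bayes' rule in log-odds form: every observation adds a fixed +/- log(1.5),
    # so the posterior on h_high follows a (nearly) unbiased random walk.
    log_odds += np.log(h_high / h_low) if x else np.log((1 - h_high) / (1 - h_low))
    posteriors.append(1.0 / (1.0 + np.exp(-log_odds)))

print(min(posteriors), max(posteriors))  # swings from near 0 to near 1
```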
> Re: calibration—I still believe that this can be included if you are jointly estimating your model and your hypothesis.
I don’t get this at all! What do you mean?
> As you’ll hopefully agree with at this point, we can always manufacture the 100% condition by turning it into virtual evidence.
As I discussed earlier, I agree, but with the major caveat that the likelihoods for the update aren’t found by first determining what you’re updating on and then looking up the likelihoods for it; instead, we have to determine the likelihoods first, and then update on the virtual-evidence proposition with those likelihoods.
> That’s not quite what I had in mind, but I can see how my ‘continuously valued’ comment might have thrown you off. A more concrete example might help: consider Example 2 in this paper. It posits three events:
> [...]
> What I’m saying is that you can just treat z as being an event itself, and do a Bayesian update from the likelihood [...]
Hmmmm. Unfortunately I’m not sure what to say to this one except that in logical induction, there’s not generally a pre-existing z we can update on like that.
Take the example of calibration traders. These guys can be described as moving probabilities up and down to account for the calibration curves so far. But that doesn’t really mean that you “update on the calibration trader” and e.g. move your 90% probabilities down to 89% in response. Instead, what happens is that the system takes a fixed-point, accounting for the calibration traders and also everything else, finding a point where all the various influences balance out. This point becomes the next overall belief distribution.
So the actual update is a fixed-point calculation, which isn’t at all a nice formula such as multiplying all the forces pushing probabilities in different directions (finding the fixed point isn’t even a continuous function).
We can make it into a Bayesian update on 100%-confident evidence by modeling it as virtual evidence, but the virtual evidence is sorta pulled from nowhere, just an arbitrary thing that gets us from the old belief state to the new one. The actual calculation of the new belief state is, as I said, the big fixed point operation.
You can’t even say that we’re updating on all those forces pulling the distribution in different directions, because there is more than one fixed point of those forces. We don’t want to have uncertainty about which of those fixed points we end up in; that would give the wrong thing. So we really have to update on the fixed point itself, which is already the answer to what to update to, rather than some information we have pre-existing beliefs about and whose likelihood ratios we could use to figure out what to update to.
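If a concrete toy helps, here is a deliberately tiny sketch (my own, and emphatically not the actual logical-induction construction): a couple of hypothetical traders each exert pressure on the price of a single sentence as a function of the current price, and the next belief is whatever price makes all the pressures balance, found by search rather than by multiplying per-trader updates together.

```python
import numpy as np

def net_pressure(p):
    """Combined demand from two hypothetical traders at price p (toy example)."""
    calibration_trader = 0.6 - p        # pulls the price toward a calibration target
    momentum_trader = 0.5 * (p - 0.5)   # amplifies departures from one half
    return calibration_trader + momentum_trader

# The next belief state is the price where the pressures balance out,
# found here by brute-force search over a grid rather than by any update formula.
grid = np.linspace(0.0, 1.0, 100_001)
p_next = grid[np.argmin(np.abs(net_pressure(grid)))]
print(p_next)  # ~0.7, not any single trader's "update" applied to the old price
```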
So that’s my real crux, and any examples with telephone calls and earthquakes etc are merely illustrative for me. (Like I said, I don’t know how to actually motivate any of this stuff except with actual logical uncertainty, and I’m surprised that any philosophers would have become convinced just from other sorts of examples.)
> Hmmmm. Unfortunately I’m not sure what to say to this one except that in logical induction, there’s not generally a pre-existing z we can update on like that.
> So that’s my real crux, and any examples with telephone calls and earthquakes etc are merely illustrative for me. (Like I said, I don’t know how to actually motivate any of this stuff except with actual logical uncertainty, and I’m surprised that any philosophers would have become convinced just from other sorts of examples.)
I agree that the logical induction case is different, since it’s hard to conceive of likelihoods to begin with. Basically, logical induction doesn’t even include what I would call virtual evidence. But many of the examples you gave do have such a z. I think I agree with your crux, and my main critique here is just of the example of an overly dogmatic Bayesian who refuses to acknowledge the difference between a and z. I won’t belabor the point further.
I’ve thought of another motivating example, BTW. In wartime, your enemy deliberately sends you some verifiably true information about their force dispositions. How should you update on that? You can’t use a Bayesian update, since you don’t actually have a likelihood model available. We can’t even attempt to learn a model from the information, since we can’t be sure it’s representative.
> I don’t get this at all! What do you mean?
By model M, I mean an algorithm that generates likelihood functions, so M(H,Z) = P(Z|H).
So any time we talk about a likelihood P(Z|H), it should really read P(Z|H,M). We’ll posit that P(H,M) = P(H)P(M) (i.e. that the model says nothing about our priors), but this isn’t strictly necessary.
E(P(Z|H,M)) will be higher for a well-calibrated model than for a poorly calibrated one, which means that we expect P(H,M|Z) to also be higher. When we then marginalize over the models to get a final posterior on the hypothesis P(H|Z), it will be dominated by the well-calibrated models: P(H|Z) = SUM_i P(H|M_i,Z) P(M_i|Z).
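A minimal sketch of what I mean, with made-up numbers: treat each model M_i as a table of likelihoods P(Z|H,M_i), form the joint posterior over (H, M), and marginalize out M. The better-calibrated model ends up carrying most of the weight in P(H|Z).

```python
import numpy as np

# Made-up toy setup: two hypotheses, two candidate models.
P_H = np.array([0.5, 0.5])   # prior over hypotheses H
P_M = np.array([0.7, 0.3])   # prior over models M (independent of H)

# likelihoods[i, j] = P(Z | H_j, M_i) for the observed data Z.
likelihoods = np.array([
    [0.9, 0.2],   # model 0: sharp, mostly right about which hypothesis fits
    [0.5, 0.5],   # model 1: poorly calibrated / uninformative
])

# Joint posterior P(H, M | Z) is proportional to P(Z | H, M) P(H) P(M).
joint = likelihoods * P_H[None, :] * P_M[:, None]
joint /= joint.sum()

P_H_given_Z = joint.sum(axis=0)  # marginalize over models: SUM_i P(H|M_i,Z) P(M_i|Z)
P_M_given_Z = joint.sum(axis=1)  # the better-fitting model gains posterior weight
print(P_H_given_Z, P_M_given_Z)
```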
BTW, I had a chance to read part of the ILA paper. It barely broke my brain at all! I wonder if the trick of enumerating traders and incorporating them over time could be repurposed to a more Bayesian-ish context, by instead enumerating models M. Like the trading firm in ILA, a meta-Bayesian algorithm could keep introducing new models M_k over time, with some intuition that the calibration of the best model in the set would improve over time, perhaps giving it all those nice anti-Dutch-book properties. Basically this is a computable Solomonoff induction that slowly approaches completeness in the limit. (I’m pretty sure this is not an original idea; I wouldn’t be surprised if something like this contributed to the ILA itself.)
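Very roughly, the kind of scheme I have in mind (a rough sketch under my own assumptions, with plenty of details glossed over, e.g. how late-arriving models get scored on earlier data):

```python
import numpy as np

class GrowingMixture:
    """A Bayesian mixture that keeps admitting new models over time (toy sketch)."""

    def __init__(self):
        self.models = []        # each model: a function z -> P(z | M)
        self.log_weights = []   # unnormalized log posterior weights

    def add_model(self, model, log_prior):
        # New models enter with some prior mass, e.g. 2^-k for the k-th model,
        # loosely analogous to new traders joining the firm in ILA.
        self.models.append(model)
        self.log_weights.append(log_prior)

    def update(self, z):
        # Ordinary Bayesian update of every model currently in the mixture.
        # (Models that join late are only scored from their arrival onward here,
        # which is one simple choice among several.)
        for i, m in enumerate(self.models):
            self.log_weights[i] += np.log(m(z))

    def predict(self, z):
        # Posterior-weighted prediction, increasingly dominated by the best models.
        w = np.exp(np.array(self.log_weights) - np.max(self.log_weights))
        w /= w.sum()
        return float(sum(wi * m(z) for wi, m in zip(w, self.models)))

# Toy usage: admit a new Bernoulli model with bias k/(k+1) at each step.
mix = GrowingMixture()
for k, z in enumerate([1, 1, 0, 1, 1, 1, 0, 1], start=1):
    bias = k / (k + 1)
    mix.add_model(lambda obs, b=bias: b if obs == 1 else 1 - b, log_prior=-k * np.log(2))
    mix.update(z)
print(mix.predict(1))
```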
Of course, it’s pretty unclear how this would work in the logical induction case. This might all be better explained in its own post.