The solution is too large to fit in the margins, eh? j/k, I know there’s a real paper. Should I go break my brain trying to read it, or wait for your explanation?
Oh, I definitely don’t have a better explanation of that in the works at this point.
Re: convergence—how real a problem is this? In your example you had two hypotheses that were precisely equally wrong. Does convergence still fail if the true probability is 0.500001?
My main concern here is if there’s an adversarial process taking advantage of a system, as in the trolling mathematicians work.
In the case of mathematical reasoning, though, the problem is quite severe, as is hopefully clear from what I’ve linked above: although an adversary can greatly exacerbate the problem, even a normal, non-adversarial stream of evidence will keep flipping the probability up and down by non-negligible amounts. (And although the post offers a solution, it’s a pretty dumb prior, and I argue that all priors which avoid this problem will be similarly dumb.)
Re: calibration—I still believe that this can be included if you are jointly estimating your model and your hypothesis.
I don’t get this at all! What do you mean?
As you’ll hopefully agree with at this point, we can always manufacture the 100% condition by turning it into virtual evidence.
As I discussed earlier, I agree, but with the major caveat that the likelihoods for the update aren’t found by first determining what you’re updating on and then looking up the likelihoods for that; instead, we have to determine the likelihoods first, and then update on the virtual-evidence proposition with those likelihoods.
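To pin down the order of operations, here’s a minimal sketch (the rain numbers are invented): we first choose the likelihood ratios that encode the force of the experience, and only then run an ordinary Bayesian update on the virtual proposition with those likelihoods.

```python
def virtual_evidence_update(prior, likelihoods):
    """Bayes-update on a 'virtual' proposition v.

    The likelihoods P(v | H_i) are supplied directly -- chosen to encode
    the force of some experience -- rather than read off from a
    pre-existing event; this is the order of operations described above.
    Only their ratios matter.
    """
    unnorm = [p * l for p, l in zip(prior, likelihoods)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Hypothetical numbers: a glimpse out the window that sort of looked like rain.
prior = [0.2, 0.8]          # P(rain), P(no rain)
lam   = [3.0, 1.0]          # chosen ratio P(glimpse|rain) : P(glimpse|no rain) = 3:1
post  = virtual_evidence_update(prior, lam)
# post[0] = 0.2*3 / (0.2*3 + 0.8*1) = 0.6/1.4 ≈ 0.4286
```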
That’s not quite what I had in mind, but I can see how my ‘continuously valued’ comment might have thrown you off. A more concrete example might help: consider Example 2 in this paper. It posits three events:
[...]
What I’m saying is that you can just treat z as being an event itself, and do a Bayesian update from the likelihood [...]
Hmmmm. Unfortunately I’m not sure what to say to this one except that in logical induction, there’s not generally a pre-existing z we can update on like that.
Take the example of calibration traders. These guys can be described as moving probabilities up and down to account for the calibration curves so far. But that doesn’t really mean that you “update on the calibration trader” and e.g. move your 90% probabilities down to 89% in response. Instead, what happens is that the system takes a fixed-point, accounting for the calibration traders and also everything else, finding a point where all the various influences balance out. This point becomes the next overall belief distribution.
So the actual update is a fixed-point calculation, which isn’t at all a nice formula such as multiplying together all the forces pushing the probabilities in different directions (finding the fixed point isn’t even a continuous function of those forces).
We can make it into a Bayesian update on 100%-confident evidence by modeling it as virtual evidence, but the virtual evidence is sorta pulled from nowhere, just an arbitrary thing that gets us from the old belief state to the new one. The actual calculation of the new belief state is, as I said, the big fixed point operation.
You can’t even say that we’re updating on all those forces pulling the distribution in different directions, because there is more than one fixed point of those forces. We don’t want to have uncertainty about which of those fixed points we end up in; that would give the wrong thing. So we really have to update on the fixed point itself, which is already the answer to what to update to; not some information which we have pre-existing beliefs about, and for which we can figure out likelihood ratios to determine what to update to.
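A caricature of the shape of this, with everything invented: the update solves for a balance point of the forces rather than multiplying likelihoods. (The real logical-induction computation is far more involved; as I said, it isn’t continuous and can have multiple fixed points, neither of which this bisection toy captures. It’s just to make the “solve, don’t multiply” structure visible.)

```python
def balance_point(forces, lo=0.0, hi=1.0):
    """Find a probability in [lo, hi] where the summed 'forces' cancel,
    by bisection.  A toy stand-in for the fixed-point search described
    above; it assumes the net force is positive at lo and negative at hi,
    which the real construction does not get to assume.
    """
    def net(p):
        return sum(f(p) for f in forces)
    for _ in range(100):
        mid = (lo + hi) / 2
        if net(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Two hypothetical traders: one pushes the probability toward 0.9, while a
# calibration trader pushes (twice as hard) toward the historical hit rate 0.8.
traders = [lambda p: 0.9 - p, lambda p: 2.0 * (0.8 - p)]
p_star = balance_point(traders)
# Balance point: (0.9 - p) + 2*(0.8 - p) = 0  =>  p = 2.5/3 ≈ 0.8333
```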
So that’s my real crux, and any examples with telephone calls and earthquakes etc are merely illustrative for me. (Like I said, I don’t know how to actually motivate any of this stuff except with actual logical uncertainty, and I’m surprised that any philosophers would have become convinced just from other sorts of examples.)
Hmmmm. Unfortunately I’m not sure what to say to this one except that in logical induction, there’s not generally a pre-existing z we can update on like that.
So that’s my real crux, and any examples with telephone calls and earthquakes etc are merely illustrative for me. (Like I said, I don’t know how to actually motivate any of this stuff except with actual logical uncertainty, and I’m surprised that any philosophers would have become convinced just from other sorts of examples.)
I agree that the logical induction case is different, since it’s hard to conceive of likelihoods to begin with. Basically, logical induction doesn’t even include what I would call virtual evidence. But many of the examples you gave do have such a z. I think I agree with your crux, and my main critique here is just of the examples featuring an overly dogmatic Bayesian who refuses to acknowledge the difference between a and z. I won’t belabor the point further.
I’ve thought of another motivating example, BTW. In wartime, your enemy deliberately sends you some verifiably true information about their force dispositions. How should you update on that? You can’t use a Bayesian update, since you don’t actually have a likelihood model available. We can’t even attempt to learn a model from the information, since we can’t be sure it’s representative.
I don’t get this at all! What do you mean?
By model M, I mean an algorithm that generates likelihood functions, so M(H,Z) = P(Z|H).
So any time we talk about a likelihood P(Z|H), it should really read P(Z|H,M). We’ll posit that P(H,M) = P(H)P(M) (i.e. that the model says nothing about our priors), but this isn’t strictly necessary.
E(P(Z|H,M)) will be higher for a well-calibrated model than for a poorly calibrated one, which means we expect P(H,M|Z) to be higher as well. When we then marginalize over the models to get a final posterior on the hypothesis, it will be dominated by the well-calibrated models: P(H|Z) = SUM_i P(H|M_i,Z)P(M_i|Z).
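Here’s a minimal sketch of what I mean, with invented numbers: model A assigns the observed z higher likelihood than the near-uninformative model B, so the joint update shifts weight onto A, and the marginal posterior on H is dominated by A’s verdict.

```python
from itertools import product

def joint_update(prior_h, prior_m, lik):
    """Jointly update hypothesis H and model M on an observation z,
    assuming P(H, M) = P(H) P(M) as in the text; lik[m][h] = P(z | H=h, M=m)."""
    joint = {(h, m): prior_h[h] * prior_m[m] * lik[m][h]
             for h, m in product(prior_h, prior_m)}
    total = sum(joint.values())
    post_h, post_m = {}, {}
    for (h, m), w in joint.items():
        post_h[h] = post_h.get(h, 0.0) + w / total   # marginalize out M
        post_m[m] = post_m.get(m, 0.0) + w / total   # marginalize out H
    return post_h, post_m

prior_h = {'H': 0.5, '~H': 0.5}
prior_m = {'A': 0.5, 'B': 0.5}
lik = {
    'A': {'H': 0.9, '~H': 0.3},   # model A: z is likely, especially under H
    'B': {'H': 0.2, '~H': 0.2},   # model B: z is unlikely either way
}
post_h, post_m = joint_update(prior_h, prior_m, lik)
# post_m['A'] = 0.75: the model that expected z gains weight, and
# post_h['H'] = 0.6875 is mostly A's P(H|A,z) = 0.75, shaded toward B's 0.5.
```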
BTW, I had a chance to read part of the ILA paper. It barely broke my brain at all! I wonder if the trick of enumerating traders and incorporating them over time could be repurposed to a more Bayesianish context, by instead enumerating models M. Like the trading firm in ILA, a meta-Bayesian algorithm could keep introducing new models M_k over time, with the intuition that the calibration of the best model in the set would improve over time, perhaps giving it all those nice anti-Dutch-book properties. Basically, this is a computable Solomonoff induction that slowly approaches completeness in the limit. (I’m pretty sure this is not an original idea; I wouldn’t be surprised if something like this contributed to the ILA itself.)
Of course, it’s pretty unclear how this would work in the logical induction case. This might all be better explained in its own post.
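To gesture at the enumerate-models idea, here’s a toy (everything invented; this is not the actual ILA construction, just the loose analogy): one new model joins the ensemble each step, and Bayesian reweighting lets whichever model scores incoming observations best come to dominate the mixture.

```python
import math

def ensemble_predictions(model_stream, data):
    """Toy meta-Bayesian learner: at step k it admits model_stream[k]
    (if any remain) into its ensemble, scores the incoming observation
    under the current Bayesian mixture, then reweights each model by
    the likelihood it assigned.  Each model maps an observation to the
    probability it gave that observation."""
    models, log_w, preds = [], [], []
    for k, obs in enumerate(data):
        if k < len(model_stream):
            models.append(model_stream[k])
            log_w.append(0.0)      # late entrants start at full weight -- crude
        ws = [math.exp(l) for l in log_w]
        tot = sum(ws)
        preds.append(sum(w / tot * m(obs) for w, m in zip(ws, models)))
        log_w = [l + math.log(max(m(obs), 1e-12)) for l, m in zip(log_w, models)]
    return preds

# A dull first model, then a sharper one introduced at step 1.
stream = [lambda o: 0.5, lambda o: 0.9 if o == 1 else 0.1]
preds = ensemble_predictions(stream, [1] * 20)
# preds[0] == 0.5; by the end the mixture has shifted near the sharp model's 0.9.
```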