Typo: in the first full paragraph of page 2, I assume you mean the agent will one-box, not two-box.
And I’m not sure the final algorithm necessarily one-boxes even if the logical uncertainty engine thinks the predictor’s (stronger) axioms are probably consistent. I think there might be a spurious counterfactual where the conditional utilities view the agent two-boxing as evidence that the predictor’s axioms must be inconsistent. Is there a clean proof that the algorithm does the correct thing in this case?
Typo: in the first full paragraph of page 2, I assume you mean the agent will one-box, not two-box.
Yes, thanks for the correction. I’d fix it, but I don’t think it’s possible to edit a PDF in Google Drive, and it’s not worth re-uploading and posting a new link for a typo.
And I’m not sure the final algorithm necessarily one-boxes even if the logical uncertainty engine thinks the predictor’s (stronger) axioms are probably consistent. I think there might be a spurious counterfactual where the conditional utilities view the agent two-boxing as evidence that the predictor’s axioms must be inconsistent. Is there a clean proof that the algorithm does the correct thing in this case?
I don’t have such a proof. I mentioned that as a possible concern at the end of the second-last paragraph of the section on the predictor having stronger logic and more computing power. Reconsidering though, this seems like a more serious concern than I initially imagined. It seems this will behave reasonably only when the agent does not trust itself too much, which would have terrible consequences for problems involving sequential decision-making.
Ideally, we’d want to replace the conditional expected value function with something of a more counterfactual nature to avoid these sorts of issues, but I don’t have a coherent way of specifying what that would even mean.
I think you mean “a spurious counterfactual where the conditional utilities view the agent one-boxing as evidence that the predictor’s axioms must be inconsistent”? That is, the agent correctly believes that the predictor’s axioms are likely to be consistent but also thinks that they would be inconsistent if it one-boxed, so it two-boxes?
[Edit: this isn’t actually a spurious counterfactual.] The agent might reason “If I two-box, then either it’s because I do something stupid (we can’t rule this out for Löbian reasons, but we should be able to assign it arbitrarily low probability), or, much more likely, the predictor’s reasoning is inconsistent. An inconsistent predictor would put $1M in box B no matter what my action is, so I can get $1,001,000 by two-boxing in this scenario. I am sufficiently confident in this model that my expected payoff conditional on me two-boxing is greater than $1M, whereas I can’t possibly get more than $1M if I one-box. Therefore I should two-box.” (Of course, this only happens if the predictor is implemented in such a way that it puts $1M in box B if it is inconsistent.) If the agent reasons this way, it would be wrong to trust itself with high probability, but we’d want the agent to be able to trust itself with high probability without being wrong.
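To make the arithmetic in that reasoning concrete, here is a minimal sketch of the comparison the agent would be making. It is only an illustration under stated assumptions, not the algorithm from the paper: the $1,000 in box A is inferred from the $1,001,000 figure, and I assume that a consistent predictor correctly predicts two-boxing and leaves box B empty. The names `eu_two_box`, `prefers_two_boxing`, and `p_inconsistent` are hypothetical.

```python
# Payoffs in dollars. BOX_A = 1,000 is inferred from the $1,001,000 figure;
# it is an assumption, not something stated in the thread.
BOX_A = 1_000
BOX_B = 1_000_000


def eu_two_box(p_inconsistent: float) -> float:
    """Expected payoff conditional on two-boxing, under the agent's model.

    p_inconsistent is the agent's probability, conditional on two-boxing,
    that the predictor's reasoning is inconsistent (so box B is full).
    Assumption: otherwise the predictor is consistent, predicts the
    two-boxing correctly, and leaves box B empty.
    """
    full_box_payoff = BOX_A + BOX_B   # inconsistent predictor: $1,001,000
    empty_box_payoff = BOX_A          # agent "did something stupid": $1,000
    return (p_inconsistent * full_box_payoff
            + (1 - p_inconsistent) * empty_box_payoff)


def prefers_two_boxing(p_inconsistent: float) -> bool:
    """True if conditional EU of two-boxing beats the best one-boxing payoff."""
    one_box_upper_bound = BOX_B  # one-boxing can never pay more than $1M
    return eu_two_box(p_inconsistent) > one_box_upper_bound


# eu_two_box(p) simplifies to BOX_A + p * BOX_B, so two-boxing wins
# exactly when p exceeds this threshold.
THRESHOLD = (BOX_B - BOX_A) / BOX_B
```

Under these assumptions the agent two-boxes only if it puts more than 99.9% conditional probability on the predictor being inconsistent, which is the sense in which it has to trust itself "too much" for the problem to arise.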
Nice!