We want an alternative to optimization which is robust to misspecified utility functions. A Bayesian approach might introduce a probability distribution over possible utility functions, and maximize expected utility with respect to that uncertainty. This doesn’t do much to increase our confidence in the outcome; we’ve only pushed the problem back to correctly specifying the distribution over utility functions, and problems from over-optimizing a misspecified function seem just about as likely. Certainly we get no new formal guarantees.
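For concreteness, the Bayesian proposal amounts to acting by something like

$$a^* \;=\; \arg\max_a \; \mathbb{E}_{U \sim p}\big[U(a)\big],$$

where $p$ is the assumed distribution over candidate utility functions (the notation here is mine, not the original’s); the worry is that $p$ is just as easy to misspecify as a single $U$ was.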
So, instead, we model the situation by supposing that an adversary has some bounded amount of power to deceive us about what our true utility function is. The adversary might concentrate all of this power on one point we’ll be very mistaken about (perhaps making a very bad idea look very good), spread it out across a number of possibilities (making many good ideas look a little worse), or do something in between. Under this assumption, a policy which randomizes over actions somewhat, rather than taking the max-expected-utility action, is effective, as sketched below. This gives us some solid guarantees against utility misspecification, unlike the naive Bayesian approach. There’s still more to be desired, but this is a clear improvement.
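A minimal sketch of such a randomized policy, in the spirit of a quantilizer: rank actions by their estimated utility, keep only the top q-fraction as measured by some base distribution, and sample from that fraction instead of taking the argmax. The function names and the NumPy-based implementation are illustrative assumptions, not something from the original discussion.

```python
import numpy as np

def quantilize(actions, estimated_utility, base_probs, q, rng=None):
    """Sample an action from the top q-fraction (measured by base_probs)
    of actions ranked by estimated_utility, rather than taking the argmax.

    The point of randomizing: if an adversary's total power to distort
    the utility estimates is bounded in expectation under base_probs,
    then the expected harm of this policy is at most that bound divided
    by q -- a guarantee pure maximization does not have.
    """
    rng = np.random.default_rng() if rng is None else rng
    ranked = np.argsort([-estimated_utility(a) for a in actions])  # best first
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += base_probs[i]
        if mass >= q:
            break
    # Renormalize the base distribution over the kept actions and sample.
    p = np.array([base_probs[i] for i in kept])
    return actions[kept[rng.choice(len(kept), p=p / p.sum())]]
```

With q = 1 this is just sampling from the base distribution, and as q shrinks it approaches pure maximization; the bound above degrades as 1/q, which is the formal sense in which “randomizing somewhat” buys robustness against a bounded adversary.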
It’s not clear to me that this actually gains anything. I’d expect that adequately parameterizing a class of adversaries to defend against isn’t much easier than adequately parameterizing a class of utility functions to be uncertain over.
Calibration means that the beliefs can be treated as frequencies
Does logical induction have a calibration result? I know it has a self-trust result that basically says it believes itself to be calibrated, but I’m not aware of a result saying that logical inductors actually are calibrated. For that matter, I’m not even sure how such a result would be precisely stated. [Edit: It turns out there are calibration results, in section 4.3 of the logical induction paper.]
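For concreteness about the frequency reading of calibration: among the claims assigned probability near p, roughly a fraction p should turn out true. A minimal empirical version of that check might look like the sketch below; the binning scheme and names are my own illustrative choices, and this is the ordinary finite-sample notion, not the precise asymptotic statement from the logical induction paper.

```python
import numpy as np

def calibration_table(probs, outcomes, n_bins=10):
    """Group predictions by stated probability and compare each group's
    average stated probability with the observed frequency of true
    outcomes.  A calibrated predictor has these two numbers close in
    every bin that contains enough predictions."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs >= lo) & (probs < hi)
        if hi == 1.0:                     # include p == 1.0 in the last bin
            mask = (probs >= lo) & (probs <= hi)
        if mask.any():
            rows.append({
                "bin": (float(lo), float(hi)),
                "mean_stated_prob": float(probs[mask].mean()),
                "observed_frequency": float(outcomes[mask].mean()),
                "count": int(mask.sum()),
            })
    return rows
```

The hard part the comment is pointing at is turning this finite-sample check into a precise theorem about a logical inductor’s limiting prices; per the edit above, Section 4.3 of the logical induction paper is where such results live.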