Rohin Shah comments on Policy Alignment

Rohin Shah 14 Jul 2018 17:47 UTC
LW: 1 AF: 1
Separately, I still don’t understand the counterfactual mugging case. (Disclaimer: I haven’t gone through any math around counterfactual mugging.) It seems really strange that if the human were certain about the digit, they wouldn’t pay up, but if the human is uncertain about the digit and is certain that the AI knows the digit, then the human would not want the AI to intervene. But possibly it’s not worth getting into this detail.
Omega will put either $10 or $1000 in a box. Our AI can press a button on the box to get either all or half of the money inside. Omega puts in $1000 if it predicts that our AI will take half the money; otherwise, it puts in $10.
We suppose that, since there is a short proof of exactly what Omega does, it is already present in the mathematical database included in the AI’s prior.
If the AI is a value-learning agent, it will take all the money, since it already knows how much money there is—taking less money just has a lower expected utility. So, it will get only $10 from Omega.
If the AI is a policy-approval agent, it will think about which option has the higher expected value according to the human’s beliefs: taking half, or taking it all. It’s quite possible in this case that it takes half the money.
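To make the contrast in the quoted setup concrete, here is a minimal sketch of the two ways of scoring the actions. It assumes Omega's prediction is perfectly accurate, which the setup suggests but does not state outright, and all function names are hypothetical.

```python
# Dollar amounts come from the quoted setup; names are illustrative only.

def box_contents(predicted_action):
    """Omega puts in $1000 if it predicts 'half', otherwise $10."""
    return 1000 if predicted_action == "half" else 10

def payout(action, contents):
    """The button yields either half or all of the money in the box."""
    return contents // 2 if action == "half" else contents

def policy_value(action):
    # Assumption: Omega predicts accurately, so contents track the policy.
    return payout(action, box_contents(action))

def updated_value(action, known_contents):
    # The agent already knows the contents and treats them as fixed.
    return payout(action, known_contents)

policy_values = {a: policy_value(a) for a in ("half", "all")}
updated_values = {a: updated_value(a, 10) for a in ("half", "all")}

print(policy_values)   # {'half': 500, 'all': 10}
print(updated_values)  # {'half': 5, 'all': 10}
```

Scored at the level of policies, taking half wins ($500 versus $10). Scored after updating on a known $10 box, taking everything wins ($10 versus $5), which is the behavior attributed to the value-learning agent.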
I think assuming that you have access to the proof of what Omega does means that you have already determined your own behavior. Presumably, “what Omega does” depends on your own policy, so if you have a proof about what Omega does, that proof also determines your action, and there is nothing left for the agent to consider.
To be clear, I think it’s reasonable to consider AIs that try to figure out proofs of “what Omega does”, but if that’s taken to be _part of the prior_, then it seems you no longer have the chance to (acausally) influence what Omega does. And if it’s not part of the prior, then I think a value-learning agent with a good decision theory can get the $500.
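As a rough illustration of the point about the proof determining the action, consider a toy joint distribution over (action, box contents). The 50/50 uncertainty about the agent's own action and the perfectly accurate Omega are both assumptions made for the sketch, and the names are hypothetical.

```python
from fractions import Fraction

# Hypothetical prior: the agent is 50/50 about its own action, and an
# accurate Omega makes the box contents track that action exactly.
prior = {
    ("half", 1000): Fraction(1, 2),
    ("all", 10): Fraction(1, 2),
}

def condition_on_contents(joint, contents):
    """Keep only outcomes with the given contents, then renormalize."""
    kept = {k: p for k, p in joint.items() if k[1] == contents}
    total = sum(kept.values())
    return {k: p / total for k, p in kept.items()}

# Treating "the box holds $10" as already known in the prior:
posterior = condition_on_contents(prior, 10)
print(posterior)  # {('all', 10): Fraction(1, 1)}
```

Under these assumptions, once the contents are fixed in the prior, only one action keeps nonzero probability, so there is no remaining choice through which to influence Omega.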
- abramdemski 18 Jul 2018 22:17 UTC
  LW: 2 AF: 1
  I think assuming that you have access to the proof of what Omega does means that you have already determined your own behavior.
  You may not recognize it as such, especially if Omega is using a different axiom system than you. So, you can still be ignorant of what you’ll do while knowing what Omega’s prediction of you is. This makes it impossible for your probability distribution to treat the two as correlated anymore.
  but if that’s taken to be _part of the prior_, then it seems you no longer have the chance to (acausally) influence what Omega does
  Yeah, that’s the problem here.
  And if it’s not part of the prior, then I think a value-learning agent with a good decision theory can get the $500.
  Only if the agent takes that one proof out of the prior, but still has enough structure in the prior to see how the decision problem plays out. This is the problem of constructing a thin prior. You can (more or less) solve any decision problem by making the agent sufficiently updateless, but you run up against the problem of making it too updateless, at which point it behaves in absurd ways (lacking enough structure to even understand the consequences of policies correctly).
  Hence the intuition that the correct prior to be updateless with respect to is the human one (which is, essentially, the main point of the post).