If I wanted that, I could just set it up to follow orders to the best of its understanding, and then order it around. The whole point is to make use of the fact that it’s smarter than I am and can achieve outcomes I can’t foresee in ways I can’t think up.
The AI here can do things which you wouldn’t think up.
For example, it could have more computational power than you to search for plans which maximize expected utility according to your probability and utility functions. Then, it could tell you the answer, if you’re the kind of person who likes to be told those kinds of answers (i.e., if this doesn’t violate your sense of autonomy/self-determination).
Or, if there is any algorithm P′ whose beliefs you trust more than your own, or would trust more than your own if certain conditions held (conditions the AI can itself check), then the AI can maximize the expected value of your utility function under P′ rather than under your own beliefs, since you would prefer that.
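To make that concrete (my notation, just restating the above): writing $P_H$ and $U_H$ for your probability and utility functions, the first case has the AI search for

$$a^* = \arg\max_a \; \mathbb{E}_{P_H}\left[\, U_H \mid a \,\right],$$

and the second case swaps in the trusted $P'$:

$$a^* = \arg\max_a \; \mathbb{E}_{P'}\left[\, U_H \mid a \,\right].$$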
For example, it could have more computational power than you to search for plans which maximize expected utility according to your probability and utility functions. Then, it could tell you the answer
Would it, though? It’s not evaluating actions by my future probutility; otherwise it would wirehead me. It’s evaluating actions by my present probutility. So the answer seems to depend on whether we allow “tell me the right answer” as a primitive action, or whether it is evaluated as “tell me [String],” which has low probutility.
But of course, if “tell me the right answer” is primitive, how do we stop “do the right thing” from being primitive, which lands us right back in the hot water of strong optimization of ‘utility’ that this proposal was supposed to prevent? So I think it should evaluate the specific output, which has low probability under the human’s distribution, and therefore not tell you.
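Here is a toy sketch of the distinction I have in mind (all names and numbers are made up for illustration, not taken from the post):

```python
# Toy model of "evaluate actions by the human's *present* probutility"
# (hypothetical names and numbers, for illustration only).

def present_probutility(action, outcomes, human_prob, human_utility):
    """Expected utility of `action` under the human's *current* beliefs:
    sum over outcomes of P_human(outcome | action) * U_human(outcome)."""
    return sum(human_prob(o, action) * human_utility(o) for o in outcomes)

outcomes = ["plan_succeeds", "plan_fails"]
human_utility = {"plan_succeeds": 1.0, "plan_fails": 0.0}.get

def human_prob(outcome, action):
    # Treated as a primitive, "tell me the right answer" looks great to the human...
    if action == "tell me the right answer":
        return 0.9 if outcome == "plan_succeeds" else 0.1
    # ...but any *specific* unfamiliar string currently looks unpromising to them.
    if action.startswith("tell me: "):
        return 0.1 if outcome == "plan_succeeds" else 0.9
    return 0.5

print(present_probutility("tell me the right answer", outcomes, human_prob, human_utility))      # 0.9
print(present_probutility("tell me: <specific plan text>", outcomes, human_prob, human_utility))  # 0.1
```

If only the concrete output can be scored, the specific string gets low present probutility and the AI stays quiet; if the primitive is allowed, we are back to strong optimization.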
I’ll try and write up a proof that it can do what I think it can.