“Ignoring issues of irrationality or bounded rationality, what an agent wants out of a helper agent is that the helper agent does preferred things.”
I don’t want a “helper agent” to do what I think I’d prefer it to do. I mean, I REALLY don’t want that or anything like that.
If I wanted that, I could just set it up to follow orders to the best of its understanding, and then order it around. The whole point is to make use of the fact that it’s smarter than I am and can achieve outcomes I can’t foresee in ways I can’t think up.
What I intuitively want it to do is what makes me happiest with the state of the world after it’s done it. That particular formulation may get hairy with cases where its actions alter my preferences, but just abandoning every possible improvement in favor of my pre-existing guesses about desirable actions isn’t a satisfactory answer.
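To make the contrast a bit more concrete (this is my notation, not the post’s): if $s_a$ is the state of the world after the helper takes action $a$, the intuitive target is roughly

$$a^{*} = \arg\max_{a} \; u_{\text{after}}(s_a),$$

where $u_{\text{after}}$ is how I evaluate the world once the action has been taken. The hairy part is that $u_{\text{after}}$ itself can depend on $a$ whenever the action changes my preferences; but either way, this is not the same rule as “take whichever action I currently rank highest.”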
“If I wanted that, I could just set it up to follow orders to the best of its understanding, and then order it around. The whole point is to make use of the fact that it’s smarter than I am and can achieve outcomes I can’t foresee in ways I can’t think up.”
The AI here can do things which you wouldn’t think up.
For example, it could have more computational power than you to search for plans which maximize expected utility according to your probability and utility functions. Then, it could tell you the answer, if you’re the kind of person who likes to be told those kinds of answers (i.e., if this doesn’t violate your sense of autonomy/self-determination).
Or, if there is any algorithm P′ whose beliefs you trust more than your own, or would trust more than your own if certain conditions held (which the AI can itself check), then the AI can optimize the expected value of your utility function under P′ rather than under your own beliefs, since you would prefer that.
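In notation (mine, not the post’s), with $P_H$ and $u_H$ the human’s probability and utility functions and $P'$ the more-trusted algorithm, the two selection rules are roughly

$$a^{*} = \arg\max_{a} \; \mathbb{E}_{P_H}\!\left[u_H \mid a\right] \qquad \text{or} \qquad a^{*} = \arg\max_{a} \; \mathbb{E}_{P'}\!\left[u_H \mid a\right],$$

with the second used only when the trust condition holds.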
“For example, it could have more computational power than you to search for plans which maximize expected utility according to your probability and utility functions. Then, it could tell you the answer …”
Would it, though? It’s not evaluating actions on my future probutility; otherwise it would wirehead me. It’s evaluating actions on my present probutility. So now the answer seems to depend on whether we allow “tell me the right answer” as a primitive action, or whether it is evaluated as “tell me [String],” which has low probutility.
But of course, if “tell me the right answer” is primitive, how do we stop “do the right thing” from being primitive, which lands us right back in the hot water of the strong optimization of ‘utility’ that this proposal was supposed to prevent? So I think it should evaluate the specific output, which has low probability under the human’s distribution, and therefore not tell you.
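Here is a toy numerical sketch of why the concrete output loses under present probutility. Everything in it (the plan names, the credences, the utilities) is an illustrative assumption of mine, not anything from the proposal:

```python
# Toy illustration: score each concrete output by the human's *current*
# probability-weighted utility ("present probutility"), not by what the
# human would believe after the fact.

# Human's current credence that each concrete plan actually works.
human_prob_works = {
    "familiar mediocre plan": 0.90,
    "clever unfamiliar plan": 0.05,  # the AI knows it works; the human currently doubts it
}

# Human's utility if the plan does work (0 if it fails, for simplicity).
human_utility_if_works = {
    "familiar mediocre plan": 10.0,
    "clever unfamiliar plan": 100.0,
}

def present_probutility(plan: str) -> float:
    """Expected utility of outputting this plan, under the human's current beliefs."""
    return human_prob_works[plan] * human_utility_if_works[plan]

for plan in human_prob_works:
    print(plan, present_probutility(plan))
# familiar mediocre plan 9.0
# clever unfamiliar plan 5.0
```

On these numbers, the specific string “tell me [clever unfamiliar plan]” scores worse than the familiar plan, so an agent maximizing present probutility would not say it, even though by the AI’s own lights it is the better plan.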
I’ll try to write up a proof that it can do what I think it can.