This requires, however, that it not be too hard to find a model that has a notion of the human’s short-term preferences (as opposed to their long-term preferences) and is also willing to correct that notion based on feedback.
I’m curious if you have a better understanding of “short-term preferences” than I do. I’m not sure how to define it, but from Paul’s earlier writings I guess it includes things like “gain resources” and “keep me in control”. But a human might have a really bad/wrong understanding of what constitutes “resources” (e.g., I may not realize that something is a really valuable resource for achieving my long-term goals, so getting more of it isn’t part of my short-term preferences) and of what constitutes “control” (if I listen to some argument on the Internet or from my AI and have my mind changed by it, maybe I won’t be in control anymore but don’t realize this). So it’s hard for me to see how having an AI optimize for my short-term preferences will lead to reaching my long-term goals, especially in a competitive environment with other AIs around.
So I would be interested to see:

1. a definition of “short-term preferences”
2. a (verifiable) mechanism by which an AI/model could learn just the short-term preferences of a human, or distinguish between short-term and long-term preferences after learning both
3. an explanation of why optimizing for humans’ own understanding of short-term preferences is good enough for avoiding x-risk
Another way to explain my doubt about 3: if other AIs are optimizing using a superhuman understanding of which short-term preferences will lead to long-term goals, and my AI is only optimizing using a human understanding of that, how is my AI going to be competitive?
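To make the worry concrete, here is a toy sketch (purely illustrative; the “short-term = first H steps” cutoff, the discount rate, and the trajectories are all made-up assumptions on my part, not anything from Paul’s writing) of how optimizing a human’s short-term preferences can diverge from their long-term goals when a resource only pays off later:

```python
# Toy sketch, not a proposal: suppose "short-term preferences" were formalized
# as caring only about the first H steps of a trajectory, while "long-term
# goals" are a discounted sum over the whole trajectory. All numbers below
# are made up for illustration.

from typing import Sequence

def short_term_value(rewards: Sequence[float], horizon: int) -> float:
    """Value if the human only cares about the first `horizon` steps."""
    return sum(rewards[:horizon])

def long_term_value(rewards: Sequence[float], discount: float = 0.99) -> float:
    """Discounted value over the whole trajectory."""
    return sum(r * discount ** t for t, r in enumerate(rewards))

# Trajectory A pays off immediately; trajectory B spends the early steps
# acquiring a resource the human doesn't recognize as valuable, which then
# pays off for a long time afterwards.
trajectory_a = [1.0, 1.0, 1.0] + [0.0] * 47
trajectory_b = [0.0, 0.0, 0.0] + [0.5] * 47

H = 3
print(short_term_value(trajectory_a, H), short_term_value(trajectory_b, H))  # 3.0 vs 0.0
print(long_term_value(trajectory_a), long_term_value(trajectory_b))          # ~2.97 vs ~18.3
```

In this toy setup, trajectory B is far better by the long-term measure but looks worthless under the short-term one, and that is exactly the kind of gap I’d expect a less-constrained competing AI to exploit.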
ETA: See also Strategic implications of AIs’ ability to coordinate at low cost, which describes another way that a corrigible AI may not be competitive. Would be interested in your thoughts on that topic as well.