This requires, however, that it not be too hard to find a model that has a notion of the human’s short-term preferences (as opposed to their long-term preferences) and is also willing to correct that notion based on feedback.
I’m curious if you have a better understanding of “short-term preferences” than I do. I’m not sure how to define it, but from Paul’s earlier writings I guess it includes things like “gain resources” and “keep me in control”. But a human might have a really bad/wrong understanding of what constitutes “resources” (e.g., I may not realize that something is a really valuable resource for achieving my long-term goals, so getting more of it isn’t part of my short-term preferences) and of what constitutes “control” (if I listen to some argument on the Internet or from my AI and have my mind changed by it, maybe I won’t be in control anymore but don’t realize this). So it’s hard for me to see how having an AI optimize for my short-term preferences will lead to reaching my long-term goals, especially in a competitive environment with other AIs around.
So I would be interested to see:

1. a definition of “short-term preferences”
2. a (verifiable) mechanism by which an AI/model could learn just the short-term preferences of a human, or distinguish between short-term and long-term preferences after learning both
3. an explanation of why optimizing for humans’ own understanding of short-term preferences is good enough for avoiding x-risk
Another way to explain my doubt about 3: if other AIs are optimizing using a superhuman understanding of which short-term preferences will lead to long-term goals, and my AI is only optimizing using a human understanding of that, how is my AI going to be competitive?
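To make the worry concrete, here is a toy sketch (purely illustrative; the “short-term = first H steps” cutoff, the discount rate, and the trajectories are all made-up assumptions on my part, not anything from Paul’s writing) of how optimizing a human’s short-term preferences can diverge from their long-term goals when a resource only pays off later:

```python
# Toy sketch, not a proposal: suppose "short-term preferences" were formalized
# as caring only about the first H steps of a trajectory, while "long-term
# goals" are a discounted sum over the whole trajectory. All numbers below
# are made up for illustration.

from typing import Sequence

def short_term_value(rewards: Sequence[float], horizon: int) -> float:
    """Value if the human only cares about the first `horizon` steps."""
    return sum(rewards[:horizon])

def long_term_value(rewards: Sequence[float], discount: float = 0.99) -> float:
    """Discounted value over the whole trajectory."""
    return sum(r * discount ** t for t, r in enumerate(rewards))

# Trajectory A pays off immediately; trajectory B spends the early steps
# acquiring a resource the human doesn't recognize as valuable, which then
# pays off for a long time afterwards.
trajectory_a = [1.0, 1.0, 1.0] + [0.0] * 47
trajectory_b = [0.0, 0.0, 0.0] + [0.5] * 47

H = 3
print(short_term_value(trajectory_a, H), short_term_value(trajectory_b, H))  # 3.0 vs 0.0
print(long_term_value(trajectory_a), long_term_value(trajectory_b))          # ~2.97 vs ~18.3
```

In this toy setup, trajectory B is far better by the long-term measure but looks worthless under the short-term one, and that is exactly the kind of gap I’d expect a less-constrained competing AI to exploit.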
ETA: See also Strategic implications of AIs’ ability to coordinate at low cost, which describes another way that a corrigible AI may not be competitive. Would be interested in your thoughts on that topic as well.