I find it one of the more troubling outstanding issues with a number of proposals for AI alignment.
Which proposals? AFAIK Paul’s latest proposal no longer calls for imitating humans in a broad sense (i.e., including behavior that requires planning), but only imitating a small subset of the human policy which hopefully can be learned exactly correctly. See this comment where he wrote:
Are you perhaps assuming that we can max out regret during training for the agents that have to be trained with human involvement, but not necessarily for the higher level agents?
Yeah, I’m relatively optimistic that it’s possible to learn enough from humans that the lower level agent remains universal (+ aligned etc.) on arbitrary distributions. This would probably be the case if you managed to consistently break queries down into simpler pieces until arriving at very simple queries.
ETA: Oh, but the same kind of problems you’re pointing out here would still apply at the higher-level distillation steps. I think the idea there is for an (amplified) overseer to look inside the imitator / distilled agent during training to push it away from doing anything malign/incorrigible (as Jessica also mentioned). Here is a post where Paul talked about this.
See the clarifying note in the OP. I don’t think this is about imitating humans, per se.
The more general framing I’d use is WRT “safety via myopia” (something I’ve been working on in the past year). There is an intuition that supervised learning (e.g. via SGD, as is common practice in current ML) is quite safe, because it doesn’t have any built-in incentive to influence the world (resulting in instrumental goals); it just seeks to yield good performance on the training data, learning in a myopic sense to improve its performance on the present input. I think this intuition has some validity, but also might lead to a false sense of confidence that such systems are safe, when in fact they may end up behaving as if they *do* seek to influence the world, depending on the task they are trained on (ETA: and other details of the learning algorithm, e.g. outer-loop optimization and model choice).
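To make the “myopia” intuition concrete, here is a minimal sketch of an ordinary supervised SGD step (assuming PyTorch; the model and data are placeholder toys, not anything from a specific proposal). The point is just that the loss and gradient depend only on the current batch, so nothing in the update rule explicitly rewards the model for influencing future inputs.

```python
# Minimal sketch (assumes PyTorch): a standard supervised SGD step.
# The objective is purely per-batch, so the update contains no term
# that depends on how the model's outputs affect future inputs --
# that is the sense in which the training signal is "myopic".
import torch
import torch.nn as nn

model = nn.Linear(10, 2)          # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def train_step(x, y):
    opt.zero_grad()
    loss = loss_fn(model(x), y)   # depends only on the current (x, y)
    loss.backward()
    opt.step()
    return loss.item()

# Example usage with random placeholder data:
x = torch.randn(32, 10)
y = torch.randint(0, 2, (32,))
train_step(x, y)
```

As the ETA above notes, this only describes the update rule; whether the learned model itself ends up behaving myopically still depends on the task it is trained on and other details of the learning setup (e.g. outer-loop optimization and model choice).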
See the clarifying note in the OP. I don’t think this is about imitating humans, per se.
Yes, I realized that after I wrote my original comment, so I added the “ETA” part.
I think this intuition has some validity, but also might lead to a false sense of confidence that such systems are safe, when in fact they may end up behaving as if they do seek to influence the world, depending on the task they are trained on (ETA: and other details of the learning algorithm, e.g. outer-loop optimization and model choice).
I think this makes sense, and at least some people have also realized this and reacted appropriately within their agendas (see the “ETA” part of my earlier comment). It also seems good that you’re calling it out as a general issue. I’d still suggest giving some examples of AI alignment proposals where people haven’t realized this, to help illustrate your point.