See the clarifying note in the OP. I don’t think this is about imitating humans, per se.
The more general framing I’d use is in terms of “safety via myopia” (something I’ve been working on over the past year). There is an intuition that supervised learning (e.g. via SGD, as is common practice in current ML) is quite safe, because it has no built-in incentive to influence the world (the kind of incentive that gives rise to instrumental goals); it just seeks good performance on the training data, learning in a myopic sense to improve its performance on the present input. I think this intuition has some validity, but it might also lead to a false sense of confidence that such systems are safe, when in fact they may end up behaving as if they *do* seek to influence the world, depending on the task they are trained on (ETA: and other details of the learning algorithm, e.g. outer-loop optimization and model choice).
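To make the outer-loop caveat concrete, here is a minimal toy sketch (my own illustration, not something from the OP or any particular proposal; all names and the setup are assumptions): each candidate model is trained with a purely myopic SGD loop, but the outer loop then selects among the trained models using a score that depends on how a model’s outputs feed back into its future inputs, i.e. on influencing its “world”.

```python
# Minimal sketch, assuming a toy scalar model: myopic inner training
# plus a non-myopic outer-loop selection criterion.
import numpy as np

rng = np.random.default_rng(0)

def sgd_train(w, data, lr=0.1):
    """Myopic inner loop: each update only tries to reduce the loss
    on the current example; nothing here refers to future inputs."""
    for x, y in data:
        grad = 2 * (w * x - y) * x   # d/dw of (w*x - y)^2
        w = w - lr * grad
    return w

def deployment_score(w, horizon=10):
    """Illustrative non-myopic outer criterion: the model's outputs
    feed back into the inputs it sees later (it 'influences the world')."""
    x, score = 1.0, 0.0
    for _ in range(horizon):
        pred = w * x
        x = 0.5 * x + 0.5 * pred     # output shapes the next input
        score += -abs(pred - 1.0)    # reward steering toward a target
    return score

# Outer-loop optimization: keep whichever trained model does best under
# the non-myopic deployment score, even though each was trained myopically.
candidates = []
for _ in range(20):
    data = [(rng.normal(), rng.normal()) for _ in range(50)]
    candidates.append(sgd_train(rng.normal(), data))

best = max(candidates, key=deployment_score)
print("selected weight:", best)
```

The point of the sketch is only that the selection pressure toward world-influencing behavior can live entirely in the outer loop, even when the inner training objective is myopic.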
Yes, I realized that after I wrote my original comment, so I added the “ETA” part.
I think this makes sense, and at least some people have also realized this and reacted appropriately within their agendas (see the “ETA” part of my earlier comment). It also seems good that you’re calling it out as a general issue. I’d still suggest giving some examples of AI alignment proposals where people haven’t realized this, to help illustrate your point.