By “just thinking about IRL”, do you mean “just thinking about the robot using IRL to learn what humans want”? ’Coz that isn’t alignment.
‘But potentially a problem with more abstract cashings-out of the idea “learn human values and then want that”’ is what I’m talking about, yes. But it also seems to be what you’re talking about in your last paragraph.
“Human wants cookie” is not a full enough understanding of what the human really wants, and under what conditions, to take intelligent actions to help the human. A robot that learned only that would act like a paper-clipper, but with cookies. It isn’t clear whether a robot that hasn’t resolved the de dicto / de re / de se distinction in what the human wants will be able to do more good than harm in trying to satisfy human desires, nor what will happen if a robot learns that humans are using de se justifications.
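To make the de dicto / de re / de se point concrete, here is a minimal sketch in Python. It is entirely my own illustration, not anything from an actual IRL pipeline; the `State` fields and reward functions are hypothetical stand-ins for whatever representation the robot actually learns. The point is just that the same learned sentence, “human wants cookie”, supports three reward functions that come apart under optimization pressure.

```python
# A minimal sketch (my own illustration, not a real IRL system) of how the
# learned statement "human wants cookie" diverges under de dicto, de re,
# and de se readings once it is actually optimized.
from dataclasses import dataclass


@dataclass
class State:
    cookies_in_world: int        # how many cookies exist anywhere
    this_cookie_preserved: bool  # the particular cookie the human pointed at still exists
    human_ate_cookie: bool       # the human themself actually got to eat a cookie


def reward_de_dicto(s: State) -> float:
    # De dicto: "the human wants that there be a cookie" -- any object fitting
    # the description counts, so tiling the world with cookies scores highest.
    return float(s.cookies_in_world)


def reward_de_re(s: State) -> float:
    # De re: the desire is about one particular cookie, so the robot guards
    # that specific object whether or not the human ever benefits from it.
    return 1.0 if s.this_cookie_preserved else 0.0


def reward_de_se(s: State) -> float:
    # De se: the desire is indexical -- what matters is that the human
    # themself ends up eating a cookie.
    return 1.0 if s.human_ate_cookie else 0.0


if __name__ == "__main__":
    # Three outcomes a cookie-optimizing robot might bring about; each looks
    # optimal under exactly one reading and useless or worse under the others.
    tile_world = State(cookies_in_world=10**6, this_cookie_preserved=False, human_ate_cookie=False)
    guard_one  = State(cookies_in_world=1,     this_cookie_preserved=True,  human_ate_cookie=False)
    feed_human = State(cookies_in_world=0,     this_cookie_preserved=False, human_ate_cookie=True)
    for name, s in [("tile_world", tile_world), ("guard_one", guard_one), ("feed_human", feed_human)]:
        print(f"{name:10s}  de dicto={reward_de_dicto(s):>9.0f}  de re={reward_de_re(s):.0f}  de se={reward_de_se(s):.0f}")
```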
Here’s another way of looking at that “nor what will happen if” clause: We’ve been casually tossing about the phrase “learn human values” for a long time, but that isn’t what the people who say it actually want. If an AI learned human values, it would treat humans the way humans treat cattle. But if the AI is instead to learn to desire to help humans satisfy their wants, it isn’t clear that the AI can (A) internalize human values deeply enough to understand and effectively optimize for them, while at the same time (B) keep those values compartmentalized from its own values, which are what make it enjoy helping humans with their problems. To do that, the AI would need to want to propagate and support human values that it disagrees with. It isn’t clear that that’s something a coherent, let’s say “rational”, agent can do.