My take is that AGI learning human preferences/values/needs/desires/goals/etc. is a necessary but not sufficient condition for achieving alignment.
If AI learns our preferences based on our behaviors, it’s going to learn a lot of “bad” things like lying, stealing, and cheating, and other much worse things.
First of all, it’s important to note that just because the AGI learns that humans engage in certain behaviors does not by itself imply that it will have any inclination to engage in those behaviors itself. Secondly, there is a difference between behaviors and preferences. Lying, theft, and murder are all natural things that members of our species engage in, yes, but they are typically only ever used as a means of fulfilling actual preferences.
Preferences, values, etc. are what motivate our actions. Behaviors are reinforced when they tend to satisfy our preferences, stimulate pleasure, alleviate pain, or move us toward our goals and away from our antigoals. It’s those preferences, those motivators of action rather than the actions themselves, that I believe Stuart Russell is talking about.
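To make the distinction concrete, here is a toy sketch in Python (purely illustrative; the observed trajectory, the goal-inference rule, and the “restricted area” constraint are my own made-up assumptions, not anything from Russell): an imitator copies the observed behavior, while a preference-learner infers the goal behind it and pursues that goal by its own, constraint-respecting means.

```python
# Illustrative toy only: contrasting "learn the behavior" with
# "learn the preference behind the behavior".

# A human is observed cutting through a restricted area to reach a goal.
HUMAN_TRAJECTORY = ["enter_restricted_area", "reach_goal"]

def imitate(trajectory):
    """Behavior-level learning: reproduce whatever the human did."""
    return list(trajectory)

def infer_goal(trajectory):
    """Naive preference inference: treat the final step as the actual goal."""
    return trajectory[-1]

def plan(goal, permitted_actions):
    """Pursue the inferred goal using only permitted actions."""
    return permitted_actions + [goal]

if __name__ == "__main__":
    print("imitator:          ", imitate(HUMAN_TRAJECTORY))
    print("preference-learner:", plan(infer_goal(HUMAN_TRAJECTORY),
                                      ["take_long_way_around"]))
```

The shortcut (or the lie, or the theft) is just instrumental; what’s worth learning, and optimizing for, is the underlying goal it was serving.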
An aligned AGI will necessarily have to take our preferences into account, whether to help us achieve them more intelligently than we could on our own or to steer us toward actions that minimize conflict with others’ preferences. It would ideally take on our values (or the coherent extrapolation or aggregation of them) as its own in some way without necessarily taking on our behavioral policies as its own (though of course it would need to learn about those too).
An unaligned AGI would almost certainly learn human values as well, once it’s past a certain level of intelligence, but it would lack the empathy to care about them beyond the need to plan around them as it pursues its own goals.
“…human preferences/values/needs/desires/goals/etc. is a necessary but not sufficient condition for achieving alignment.”
I have to agree with you in this regard and on most of your other points. My concern, however, is that Stuart’s communications give the impression that the preferences approach addresses the problem of AI learning things we consider bad, when in fact it doesn’t.
The model of AI learning our preferences by observing our behavior and then proceeding with uncertainty makes sense to me. However, just as Asimov’s robot characters eventually decide there is a fourth rule that overrides the other three, Stuart’s “Three Principles” model seems incomplete. Preferences do not appear to me, in themselves, to deal with the issue of evil.
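For what it’s worth, here is roughly how I picture that “observe, infer, proceed with uncertainty” model in code. This is a minimal sketch under my own assumptions: the preference hypotheses, the noisy-choice likelihoods, and the confidence threshold are all invented for illustration and are not Stuart’s actual formulation.

```python
# Minimal sketch: an agent keeps a belief over what the human prefers,
# updates it from observed choices, and defers to the human when it is
# not confident enough to act.

# Candidate hypotheses about the human's underlying preference.
HYPOTHESES = ["prefers_apples", "prefers_oranges"]

# Assumed likelihood of each observed choice under each hypothesis
# (a noisily rational human picks the preferred item most of the time).
LIKELIHOOD = {
    "prefers_apples":  {"apple": 0.8, "orange": 0.2},
    "prefers_oranges": {"apple": 0.2, "orange": 0.8},
}

def posterior(observations, prior=None):
    """Bayesian update of the belief over preference hypotheses."""
    belief = dict(prior) if prior else {h: 1 / len(HYPOTHESES) for h in HYPOTHESES}
    for obs in observations:
        for h in HYPOTHESES:
            belief[h] *= LIKELIHOOD[h][obs]
    total = sum(belief.values())
    return {h: p / total for h, p in belief.items()}

def act(belief, confidence_threshold=0.9):
    """Act on the inferred preference only if confident; otherwise defer."""
    best, p = max(belief.items(), key=lambda kv: kv[1])
    if p >= confidence_threshold:
        return f"fetch item for '{best}' (P={p:.2f})"
    return f"ask the human first (best guess '{best}', P={p:.2f})"

if __name__ == "__main__":
    observed = ["apple", "apple", "orange", "apple"]
    belief = posterior(observed)
    print(belief)
    print(act(belief))
```

Note that nothing in this loop distinguishes “good” preferences from “bad” ones; the update treats them identically, which is exactly the gap I’m pointing at.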