Autonomy is a value and can be expressed as part of a utility function, I think. So ambitious value learning should be able to capture it, and an aligned AI based on ambitious value learning would respect someone’s autonomy when they value it themselves. If they don’t, why impose it upon them?
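To make the “part of a utility function” claim concrete, a toy decomposition might look like this (the weight $w_a$ and the autonomy term $A$ are purely illustrative, not anything from an actual proposal):

$$U(s) \;=\; w_a\, A(s) \;+\; \sum_i w_i\, V_i(s),$$

where $A(s)$ measures how far the person’s autonomy is respected in outcome $s$ and the $V_i$ are their other values. An ambitious value learner that inferred $w_a$ correctly would then respect autonomy exactly to the degree the person actually cares about it.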
One could make a similar argument for corrigibility: ambitious value learning would respect our desire for it to behave corrigibly if we actually wanted that, and if we didn’t want that, why impose it?
Corrigibility makes sense as something to ensure in its own right because it is good to have in case the value learning is not doing what it should (or something else is going wrong).
I think respect for autonomy is similarly useful. It helps avoid evil-genie (perverse instantiation) type failures by requiring that we understand what we are asking the AI to do. It helps avoid preference-manipulation problems which value learning approaches might otherwise have, because regardless of how well expected-human-value is optimized by manipulating human preferences, such manipulation usually involves fooling the human, which violates autonomy.
(In cases where humans understand the implications of value manipulation and consent to it, it’s much less concerning—though we still want to make sure the AI isn’t prone to pressure humans into that, and think carefully about whether it is really OK.)
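To make the manipulation worry concrete, here is a minimal toy sketch (the numbers and function names are made up purely for illustration): a pure expected-value maximizer can prefer to manipulate preferences whenever the post-manipulation measurement comes out higher, whereas treating non-deception/consent as a hard constraint rules that option out rather than trading it off.

```python
# Toy sketch: manipulation as a high-scoring but autonomy-violating option.
# All numbers are made up; the point is only the shape of the comparison.

actions = {
    # action name: (human value as measured after the action, involves fooling the human)
    "satisfy_current_preferences": (10.0, False),
    "manipulate_then_satisfy": (15.0, True),  # scores higher on the learned objective
}

def naive_value_maximizer(actions):
    """Pick whatever maximizes measured human value, manipulation included."""
    return max(actions, key=lambda a: actions[a][0])

def autonomy_respecting_chooser(actions):
    """Treat 'fooling the human' as a hard filter, not a term to trade off."""
    permitted = {a: value for a, (value, fools_human) in actions.items() if not fools_human}
    return max(permitted, key=permitted.get)

print(naive_value_maximizer(actions))        # manipulate_then_satisfy
print(autonomy_respecting_chooser(actions))  # satisfy_current_preferences
```

Of course the consent-based version can still be wrong about what counts as “fooling”; the sketch only shows where the constraint sits relative to the objective.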
Is the point here that you expect we can’t solve those problems and therefore need an alternative? The idea doesn’t help with “the difficulties of assuming human rationality”, though, so what problems does it help with?
It’s less an alternative in the sense of avoiding the things which make value learning hard, and more an alternative in the sense of offering a different way to apply the same underlying insights, so that what you end up with is less of a ruthless maximizer.
In other words, it doesn’t avoid the central problems of ambitious value learning (such as “what does it mean for irrational beings to have values?”), but it is a different way to try to put those insights together into a safe system. You might add other safety precautions to an ambitious value learner, such as [ambitious value learning + corrigibility + mild optimization + low impact + transparency]. Consent-based systems could be an alternative to that agglomerated approach, either replacing some of the safety measures or making them less difficult to include by providing a different foundation to build on.
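As a very rough sketch of the contrast (everything here is hypothetical scaffolding rather than a concrete proposal): in the agglomerated approach the extra safety measures enter as penalty terms on top of a learned objective, so a large enough expected gain can in principle outweigh them, whereas in a consent-based system the learned values only rank options the affected people have understood and agreed to.

```python
# Hypothetical scaffolding contrasting the two ways of assembling a safe system.

# (a) Agglomerated approach: learned value plus bolted-on soft penalties.
def agglomerated_score(action, learned_value, impact, opacity,
                       impact_weight=1.0, opacity_weight=1.0):
    # Low impact and transparency enter as penalty terms, so a large enough
    # gain in learned value can in principle outweigh them.
    return (learned_value(action)
            - impact_weight * impact(action)
            - opacity_weight * opacity(action))

# (b) Consent-based foundation: consent gates the option set; it is not
# just another term in the objective.
def consent_based_choice(candidate_actions, learned_value, has_informed_consent):
    permitted = [a for a in candidate_actions if has_informed_consent(a)]
    return max(permitted, key=learned_value) if permitted else None
```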
Is the idea that even trying to do ambitious value learning constitutes violating someone’s autonomy (in other words, someone could have a preference against having ambitious value learning done on them), and that by the time we learn this it would be too late?
I think there are a couple of ways in which this is true.
I mentioned cases where a value-learner might violate privacy in ways humans wouldn’t want, because doing so comes out positive overall in terms of how well the AI can optimize human values. This is somewhat bad, but it isn’t X-risk bad, and it’s not my real concern. I pointed it out because I think it is part of the bigger picture: it provides a good example of the kind of optimization a value-learner is likely to engage in, which we don’t really want.
I think the consent/autonomy idea actually gets close (though maybe not close enough) to something fundamental about safety concerns which follow an “unexpected result of optimizing something reasonable-looking” pattern. As such, it may be better to make it an explicit design feature, rather than trust the system to realize that it should be careful about maintaining human autonomy before it does anything dangerous.
It seems plausible that, interacting with humans over time, a system which respects autonomy at a basic level would converge to different overall behavior than a value-learning system which trades autonomy off against other values. If you actually get ambitious value learning exactly right, that divergence is just bad. But I don’t endorse your “why impose it on them?” argument. Humans could eventually decide to run all-out value-learning optimization (without mild optimization, without low-impact constraints, without hard-coded corrigibility). Preserving human autonomy in the meantime seems important.
One could make a similar argument for corrigibility: ambitious value learning would respect our desire for it to behave corrigibly if we actually wanted that, and if we didn’t want that, why impose it?
There’s a disanalogy in that autonomy is probably a terminal value whereas corrigibility is only an instrumental one. In other words, I don’t want a corrigible AI for the sake of having a corrigible AI, I want one so it will help me reach my other goals. I do (probably) want autonomy, and not only because it would help me reach other goals. So in fact ambitious value learning will not learn to behave corrigibly, I think, because the AI will probably think it has a better way of giving me what I ultimately want.
Oh, I think I see a different way of stating your argument that avoids this disanalogy: we’re not concerned about autonomy as a terminal value here, but as an instrumental one like corrigibility. If ambitious value learning works perfectly, then it would learn autonomy as a terminal value, but we want to implement autonomy-respecting AI mainly because that would help us get what we want in case ambitious value learning fails to work perfectly.
I think I understand the basic idea and motivation now, and I’ll just point out that autonomy-respecting AI seems to share several problems with other non-goal-directed approaches to AI safety.