Bostrom says a human doesn’t try to disable its own goal accretion (though that process alters its values) in part because it is not well described as a utility maximizer (p190, footnote 11). Why assume AI will be so much better described as a utility maximizer that this characteristic will cease to hold?
I can think of a few reasons why it might seem like humans don’t try to disable goal accretion:
*Humans can’t easily perform reliable self-modifications, and as a result usually don’t even consider disabling goal accretion as a possibility.
*When a human believes something strongly enough to want to fix it as a goal, mechanisms kick in to hold it in place that don’t involve consciously adopting ‘disable value accretion’ as a goal: for example, confirmation bias and other cognitive biases, or making costly commitments to join a group of people who share that goal (which makes it harder to give up), etc.
*Cognitive biases lead us to underestimate how much our values have shifted in the past, and to wildly underestimate how they might shift in the future.
*Humans believe that all past value accretion was good, because it led to the present set of values, which are good and right. Similarly, humans believe their values will not change in the future, because those values feel objectively good and right (subjectively objective).
*Our final goals are inaccessible to us, so we don’t really know what we would want to fix as our goals.
*Our actual final goals (if anything like that can be meaningfully specified) include keeping the goal accretion mechanism running.
It seems likely that an AI system which humans understand well enough to design might have fewer of these properties.