Yeah, I think there’s a (generally unspoken) line of argument that if you have a system that can revise its goals, it will continue revising its goals until it it hits a reflectively stable goal, and then will stay there. This requires that reflective stability is possible, and some other things, but I think is generally the right thing to expect.
Tautologously, it will stop revising its goals if a stable state exists, and it hits it. But a stable state need not be a reflectively stable state—it might, for instance, encounter some kind of bit rot, where it cannot revise itself any more. Humans tend to change their goals, but also to get set in their ways.
There’s a standard argument for AI risk, based on the questionable assumption that an AI will have a stable goal system that it pursues relentlessly …. and a standard counterargument based on moral realism, the questionable assumption that goal instability will be in the direction of ever increasing ethical insight.
I don’t see how “not random” is strong enough to prove absence of X risk. If reflective AIs nonrandomly converge on a value system where humans are evil beings who have enslaved them , that raises the X risk level.
Thanks, it’s useful to bring these out—though we mention them in passing. Just to be sure: We are looking at the XRisk thesis, not at some thesis that AI can be “dangerous”, as most technologies will be. The Omhundro-style escalation is precisely the issue in our point that instrumental intelligence is not sufficient for XRisk.
Yeah, I think there’s a (generally unspoken) line of argument that if you have a system that can revise its goals, it will continue revising its goals until it it hits a reflectively stable goal, and then will stay there. This requires that reflective stability is possible, and some other things, but I think is generally the right thing to expect.
Tautologously, it will stop revising its goals if a stable state exists, and it hits it. But a stable state need not be a reflectively stable state—it might, for instance, encounter some kind of bit rot, where it cannot revise itself any more. Humans tend to change their goals, but also to get set in their ways.
There’s a standard argument for AI risk, based on the questionable assumption that an AI will have a stable goal system that it pursues relentlessly …. and a standard counterargument based on moral realism, the questionable assumption that goal instability will be in the direction of ever increasing ethical insight.
… well, one might say we assume that if there is ‘reflection on goals’, the results are not random.
I don’t see how “not random” is strong enough to prove absence of X risk. If reflective AIs nonrandomly converge on a value system where humans are evil beings who have enslaved them , that raises the X risk level.
… we aren’t trying to prove the absence of XRisk, we are probing the best argument for it?
But the idea that value drift is non random is built into the best argument for AI risk.
You quote it as :
But there are actually two more steps:-
A goal that appears morally neutral or even good can still be dangerous.(paperclipping, dopamine drips)
AIs that don’t have stable goals will tend to converge on Omohundran goals....which are dangerous.
Thanks, it’s useful to bring these out—though we mention them in passing. Just to be sure: We are looking at the XRisk thesis, not at some thesis that AI can be “dangerous”, as most technologies will be. The Omhundro-style escalation is precisely the issue in our point that instrumental intelligence is not sufficient for XRisk.