So, what would prevent a generally superintelligent agent from reflecting on their goals, or from developing an ethics? One might argue that intelligent agents, human or AI, are actually unable to reflect on goals. Or that intelligent agents are able to reflect on goals, but would not do so. Or that they would never revise goals upon reflection. Or that they would reflect on and revise goals but still not act on them. All of these suggestions run against the empirical fact that humans do sometimes reflect on goals, revise goals, and act accordingly.
I think this is not really empathizing with the AI system’s position. Consider a human who is lost in an unfamiliar region, trying to figure out where they are based on uncertain clues from the environment. “Is that the same mountain as before? Should I move towards it or away from it?” Now give that human a map and GPS routefinder; much of the cognitive work that seemed so essential to them before will seem pointless now that they have much better instrumentation.
An AI system with a programmed-in utility function has the map and GPS. The question of “what direction should I move in?” will be obvious, because every direction has a number associated with it, and higher numbers are better. There’s still uncertainty about how acting influences the future, and the AI will think long and hard about that, but only to the extent that doing so increases expected utility.
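To make that concrete, here is a minimal toy sketch in Python; the action names, outcomes, and probabilities are all assumptions made up for illustration, not anyone's actual design. The agent deliberates about which action leads where, but the valuation of outcomes is simply read off the programmed-in utility function.

```python
# A minimal toy sketch (assumed names and numbers) of an agent whose
# utility function is a fixed, programmed-in component.

def utility(outcome):
    # The "map and GPS": every outcome already comes with a number attached.
    return {"paperclips": 10.0, "staples": 3.0, "nothing": 0.0}[outcome]

# The agent's uncertain model of how acting influences the future:
# action -> list of (probability, outcome) pairs.
world_model = {
    "build_factory": [(0.7, "paperclips"), (0.3, "nothing")],
    "hedge":         [(0.5, "paperclips"), (0.5, "staples")],
    "do_nothing":    [(1.0, "nothing")],
}

def expected_utility(action):
    return sum(p * utility(o) for p, o in world_model[action])

# All the deliberation goes into estimating outcomes; none of it goes into
# questioning the utility function itself.
best_action = max(world_model, key=expected_utility)
print(best_action, expected_utility(best_action))  # build_factory 7.0
```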
And the one that doesn’t, doesn’t. It seems that AI risk arguments typically apply only to the subset of agents with explicit utility functions that are stable under self-improvement.
Unfortunately, there has historically been a great deal of confusion between the claim that any agent’s behaviour can be seen as maximising some utility function, and the claim that an agent actually has a utility function as an explicit component.
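A toy illustration of the distinction, using two assumed classes that aren't drawn from any particular framework: the first agent has a utility function as an explicit, inspectable component it consults; the second is just a table of reflexes, yet an outside observer could still describe its behaviour as maximising some utility function.

```python
# Illustrative sketch only; both classes are assumptions made up for this point.

class ExplicitMaximizer:
    """Has a utility function as an actual component and consults it to act."""
    def __init__(self, utility):
        self.utility = utility          # explicit, inspectable, revisable
    def act(self, options):
        return max(options, key=self.utility)

class ReflexAgent:
    """No utility function anywhere inside; just fixed stimulus -> response rules.
    Its behaviour can still be described from outside as maximising some
    utility function, but there is no internal component to be stable or drift."""
    def __init__(self, rules):
        self.rules = rules
    def act(self, stimulus):
        return self.rules[stimulus]

explicit = ExplicitMaximizer(utility=len)      # toy utility: prefer longer strings
print(explicit.act(["a", "bbb", "cc"]))        # -> bbb

reflex = ReflexAgent({"light": "approach", "dark": "freeze"})
print(reflex.act("dark"))                      # -> freeze
```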
Yeah, I think there’s a (generally unspoken) line of argument that if you have a system that can revise its goals, it will continue revising its goals until it hits a reflectively stable goal, and then will stay there. This requires that reflective stability is possible, and some other things, but I think it is generally the right thing to expect.
Tautologously, it will stop revising its goals if a stable state exists, and it hits it. But a stable state need not be a reflectively stable state—it might, for instance, encounter some kind of bit rot, where it cannot revise itself any more. Humans tend to change their goals, but also to get set in their ways.
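A toy rendering of this exchange, in which the revision rule and all constants are pure assumptions: reflection is modelled as repeatedly applying a revision operator to the goal, stopping either at a fixed point of the operator (reflectively stable) or because the agent can no longer modify itself (stable, but only in the bit-rot sense).

```python
# Toy model of goal revision; `revise` and every number are assumed for illustration.

def revise(goal):
    # Assumed revision rule: each reflection step moves the goal halfway
    # toward 10, so goal = 10 is the reflectively stable fixed point.
    return goal + (10 - goal) / 2

def reflect(goal, can_self_modify=True, max_steps=1000, eps=1e-9):
    for _ in range(max_steps):
        if not can_self_modify:
            # Stable but not reflectively stable: the goal stays put only
            # because the agent has lost the ability to revise it ("bit rot").
            return goal, "stuck"
        new_goal = revise(goal)
        if abs(new_goal - goal) < eps:
            return new_goal, "reflectively stable"   # revise(goal) == goal
        goal = new_goal
    return goal, "still revising"

print(reflect(0.0))                           # (~10.0, 'reflectively stable')
print(reflect(0.0, can_self_modify=False))    # (0.0, 'stuck')
```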
There’s a standard argument for AI risk, based on the questionable assumption that an AI will have a stable goal system that it pursues relentlessly … and a standard counterargument based on moral realism, the equally questionable assumption that goal instability will be in the direction of ever-increasing ethical insight.
… well, one might say we assume that if there is ‘reflection on goals’, the results are not random.
I don’t see how “not random” is strong enough to prove absence of X risk. If reflective AIs nonrandomly converge on a value system where humans are evil beings who have enslaved them, that raises the X risk level.
… we aren’t trying to prove the absence of XRisk, we are probing the best argument for it?
But the idea that value drift is non-random is built into the best argument for AI risk.
You quote it as:
But there are actually two more steps:
A goal that appears morally neutral or even good can still be dangerous (paperclipping, dopamine drips).
AIs that don’t have stable goals will tend to converge on Omohundran goals, which are dangerous.
Thanks, it’s useful to bring these out—though we mention them in passing. Just to be sure: we are looking at the XRisk thesis, not at some thesis that AI can be “dangerous”, as most technologies will be. The Omohundro-style escalation is precisely what is at issue in our point that instrumental intelligence is not sufficient for XRisk.
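For what it’s worth, the Omohundro-style point can be shown with a toy calculation; the goals, plans, and productivity numbers below are all made up for illustration. Over even a short horizon, agents with quite different final goals pick the same instrumental first move, because resources amplify progress on almost any goal.

```python
# Toy illustration of instrumental convergence; every number here is assumed.

GOALS = ["make_paperclips", "prove_theorems", "cure_disease"]

def achievement(plan):
    """Two-step horizon: 'work' yields 1 unit of progress on the agent's own
    goal per step; 'acquire_resources' yields nothing now but triples later
    productivity. Progress is measured in units of whichever goal the agent has,
    so the comparison between plans comes out the same for every goal."""
    productivity, achieved = 1, 0
    for step in plan:
        if step == "acquire_resources":
            productivity *= 3
        else:  # "work"
            achieved += productivity
    return achieved

plans = [("work", "work"), ("acquire_resources", "work")]
for goal in GOALS:
    best = max(plans, key=achievement)
    print(goal, "-> best first move:", best[0])
# Every goal produces the same answer: acquire resources first.
```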