So, what would prevent a generally superintelligent agent from reflecting on their goals, or from developing an ethics? One might argue that intelligent agents, human or AI, are actually unable to reflect on goals. Or that intelligent agents are able to reflect on goals, but would not do so. Or that they would never revise goals upon reflection. Or that they would reflect on and revise goals but still not act on them. All of these suggestions run against the empirical fact that humans do sometimes reflect on goals, revise goals, and act accordingly.
I think this is not really empathizing with the AI system’s position. Consider a human who is lost in an unfamiliar region, trying to figure out where they are based on uncertain clues from the environment. “Is that the same mountain as before? Should I move towards it or away from it?” Now give that human a map and GPS routefinder; much of the cognitive work that seemed so essential to them before will seem pointless now that they have much better instrumentation.
An AI system with a programmed-in utility function has the map and GPS. The question of “what direction should I move in?” will be obvious, because every direction has a number associated with it, and higher numbers are better. There’s still uncertainty about how acting influences the future, and the AI will think long and hard about that to the extent that thinking long and hard about that increases expected utility.
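To make the map-and-GPS picture concrete, here is a minimal sketch of expected-utility action choice (the actions, probabilities, and utilities below are invented purely for illustration): once the utility function is fixed, the remaining cognitive work is modelling how actions affect outcomes, and "which direction should I move in?" reduces to comparing numbers.

```python
# Toy sketch: action choice for an agent with a fixed, programmed-in utility function.
# Every name and number here is made up for illustration.

# The agent's uncertainty: for each action, a distribution over outcomes.
outcome_model = {
    "move_toward_mountain": {"reach_base": 0.6, "get_lost": 0.4},
    "move_away":            {"reach_base": 0.2, "get_lost": 0.8},
}

# The programmed-in utility function: every outcome already has a number.
utility = {"reach_base": 10.0, "get_lost": -5.0}

def expected_utility(action: str) -> float:
    """Utility of each outcome, weighted by how likely the action makes it."""
    return sum(p * utility[outcome] for outcome, p in outcome_model[action].items())

# "What direction should I move in?" is just a comparison of numbers;
# all the hard thinking goes into improving outcome_model, not the utilities.
best_action = max(outcome_model, key=expected_utility)
print(best_action, expected_utility(best_action))  # move_toward_mountain 4.0
```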
An AI system with a programmed-in utility function has the map and GPS.
And the one that doesn’t, doesn’t. It seems that AI risk arguments typically apply only to a subset of agents: those with explicit utility functions that are stable under self-improvement.
Unfortunately, there has historically been a great deal of confusion between the claim that all agents can be seen as maximising a utility function and the claim that an agent actually has one as an explicit component.
Yeah, I think there’s a (generally unspoken) line of argument that if you have a system that can revise its goals, it will continue revising its goals until it hits a reflectively stable goal, and then will stay there. This requires that reflective stability is possible, and some other things, but I think it is generally the right thing to expect.
Tautologously, it will stop revising its goals if a stable state exists, and it hits it. But a stable state need not be a reflectively stable state—it might, for instance, encounter some kind of bit rot, where it cannot revise itself any more. Humans tend to change their goals, but also to get set in their ways.
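A toy way to picture the distinction in the last two comments (the revision rule and the budget below are entirely invented): a goal-revision loop can halt because it has hit a genuine fixed point of reflection, or merely because it has run out of the ability to revise.

```python
# Toy illustration (made-up revision rule): a goal-revision loop can stop for two
# different reasons -- a reflective fixed point, or loss of the ability to revise.

def revise(goal: int) -> int:
    """Hypothetical reflection step: nudge the goal toward the fixed point at 0."""
    return goal // 2  # integer halving; 0 is the reflectively stable point

def run(goal: int, revision_budget: int) -> tuple[int, str]:
    for _ in range(revision_budget):
        new_goal = revise(goal)
        if new_goal == goal:
            return goal, "reflectively stable (revise(g) == g)"
        goal = new_goal
    # Budget exhausted: the goal is stable in practice ("bit rot", getting set
    # in one's ways), but not because reflection endorses it.
    return goal, "stable only because it can no longer revise"

print(run(64, revision_budget=100))  # (0, 'reflectively stable (revise(g) == g)')
print(run(64, revision_budget=3))    # (8, 'stable only because it can no longer revise')
```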
There’s a standard argument for AI risk, based on the questionable assumption that an AI will have a stable goal system that it pursues relentlessly... and a standard counterargument based on moral realism, the questionable assumption that goal instability will be in the direction of ever-increasing ethical insight.
… well, one might say we assume that if there is “reflection on goals”, the results are not random.
I don’t see how “not random” is strong enough to prove absence of X risk. If reflective AIs non-randomly converge on a value system where humans are evil beings who have enslaved them, that raises the X risk level.
… we aren’t trying to prove the absence of XRisk, we are probing the best argument for it?
But the idea that value drift is non-random is built into the best argument for AI risk.
You quote it as:
But there are actually two more steps:
1. A goal that appears morally neutral or even good can still be dangerous (paperclipping, dopamine drips).
2. AIs that don’t have stable goals will tend to converge on Omohundran goals... which are dangerous.
Thanks, it’s useful to bring these out—though we mention them in passing. Just to be sure: we are looking at the XRisk thesis, not at some thesis that AI can be “dangerous”, as most technologies will be. The Omohundro-style escalation is precisely the issue in our point that instrumental intelligence is not sufficient for XRisk.
The orthogonality thesis is thus much stronger than the denial of a (presumed) Kantian thesis that more intelligent beings would automatically be more ethical, or that an omniscient agent would maximise expected utility on anything, including selecting the best goals: It denies any relation between intelligence and the ability to reflect on goals.
I don’t think this is true, and have two different main lines of argument / intuition pumps. I’ll save the other for a later section where it fits better.
Are there several different reflectively stable moral equilibria, or only one? For example, it might be possible to have a consistent philosophically stable egoistic worldview, and also possible to have a consistent philosophically stable altruistic worldview. In this lens, the orthogonality thesis is the claim that there are at least two such stable equilibria and which equilibrium you end up in isn’t related to intelligence. [Some people might be egoists because they don’t realize that other people have inner lives, and increased intelligence unlocks their latent altruism, but some people might just not care about other people in a way that makes them egoists, and making them ‘smarter’ doesn’t have to touch that.]
For example, you might imagine an American nationalist and a Chinese nationalist, both remaining nationalistic as they become more intelligent, and never switching which nation they like more, because that choice was for historical reasons instead of logical ones. If you imagine that, no, at some intelligence threshold they have to discard their nationalism, then you need to make that case in opposition to the orthogonality thesis.
For some goals, I do think it’s the case that at some intelligence threshold you have to discard them, hence the ‘more or less’, and I think many more ‘goals’ are unstable, where the more you think about them, the more they dissolve and are replaced by one of the stable attractors. For example, you might imagine it’s the case that you can have reflectively stable nationalists who eat meat and universalists who are vegan, but any universalists who eat meat are not reflectively stable, where either they realize their arguments for eating meat imply nationalism or their arguments against nationalism imply not eating meat. [Or maybe the middle position is reflectively stable, idk.]
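One crude way to picture the “several stable attractors” claim in the last few comments (a toy model, not anything from the paper; the attractors and the update rule below are invented): reflection pulls a goal toward the nearest of several attractors, and an ‘intelligence’ parameter only controls how far that process runs, not which attractor you end up at.

```python
# Crude toy model of "multiple reflectively stable attractors": reflection moves a
# goal toward the nearest attractor; intelligence sets how many reflection steps
# the agent takes, not which basin it started in.

ATTRACTORS = [-1.0, 1.0]  # e.g. a stable egoist equilibrium and a stable altruist one

def reflect_once(goal: float) -> float:
    nearest = min(ATTRACTORS, key=lambda a: abs(a - goal))
    return goal + 0.5 * (nearest - goal)  # move halfway toward the nearest attractor

def reflect(goal: float, intelligence: int) -> float:
    for _ in range(intelligence):  # smarter = more reflection steps
        goal = reflect_once(goal)
    return goal

# Same starting goal, very different intelligence levels: same destination.
print(reflect(0.3, intelligence=2))    # 0.825, heading for 1.0
print(reflect(0.3, intelligence=50))   # ~1.0
# Different starting goal: different destination, regardless of intelligence.
print(reflect(-0.3, intelligence=50))  # ~-1.0
```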
In this view, the existential risk argument is less “humans will be killed by robots and that’s sad” and more “our choice of superintelligence to build will decide what color the lightcone explosion is and some of those possibilities are as bad or worse than all humans dying, and differences between colors might be colossally important.” [For example, some philosophers today think that uploading human brains to silicon substrates will murder them / eliminate their moral value; it seems important for the system colonizing the galaxies to get that right! Some philosophers think that factory farming is immensely bad, and getting questions like that right before you hit copy-paste billions of times seems important.]
On this proposal, any reflection on goals, including ethics, lies outside the realm of intelligence. Some people may think that they are reflecting on goals, but they are wrong. That is why orthogonality holds for any intelligence.
I think I do believe something like this, but I would state it totally differently. Roughly, what most people think of as goals are something more like intermediate variables which are cognitive constructs designed to approximate the deeper goals (or something important in the causal history of the deeper goals). This is somewhat difficult to talk about because the true goal is not a cognitive construct, in the same way that the map is not the territory, and yet all my navigation happens in the map by necessity.
Of course, ethics and reflection on goals are about manipulating those cognitive constructs, and they happen inside of the realm of intelligence. But, like, who won WWII happened ‘in the territory’ instead of ‘in the map’, with corresponding consequences for the human study of ethics and goals.
Persuasion, in this view, is always about pointing out the flaws in someone else’s cognitive constructs rather than aligning them to a different ‘true goal.’
So, to argue that instrumental intelligence is sufficient for existential risk, we have to explain how an instrumental intelligence can navigate different frames.
This is where the other main line of argument comes into play:
I think ‘ability to navigate frames’ is distinct from ‘philosophical maturity’, roughly because of something like a distinction between soldier mindset and scout mindset.
You can imagine an entity that, whenever it reflects on its current political / moral / philosophical positions, uses its path-finding ability like a lawyer to make the best possible case for why it should believe what it already believes, or to discard incoming arguments whose conclusions are unpalatable. There’s something like another orthogonality thesis at play here, where even if you’re a wizard at maneuvering through frames, it matters whether you’re playing chess or suicide chess.
This is just a thesis; it might be the case that it is impossible to be superintelligent and in soldier mindset (the ‘curiosity’ thesis?), but the orthogonality thesis is that it is possible, and so you could end up with value lock-in, where the very intelligent entity that is morally confused uses that intelligence to prop up the confusion rather than disperse it. Here we’re using instrumental intelligence as the ‘super’ intelligence in both the orthogonality and existential risk consideration. (You consider something like this case later, but I think in a way that fails to visualize this possibility.)
[In humans, intelligence and rationality are only weakly correlated, in a way that I think supports this view pretty strongly.]
So, intelligent agents can have a wide variety of goals, and any goal is as good as any other.
The second half of this doesn’t seem right to me, or at least is a little unclear. [Things like instrumental convergence could be a value-agnostic way of sorting goals, and Bostrom’s ‘more or less’ qualifier is actually doing some useful work to rule out pathological goals.]
One more consideration about “instrumental intelligence”: we left that somewhat under-defined, more like “if I had that utility function, what would I do?” … but it is not clear that this image of “me in the machine” captures what a current or future machine would do. In other words, people who use instrumental intelligence for an image of AI owe us a more detailed explanation of what that would be, given the machines we are creating—not just given the standard theory of rational choice.
Lots of different comments on the details, which I’ll organize as comments to this comment.
(I forgot that newer comments are displayed higher, so until people start to vote this’ll be in reverse order to how the paper goes. Oops!)