The Scott Alexander example is a great if imperfect analogy to what I’m proposing. Here’s the difference, as I see it.
Humans differ from AI in that we do not have any single ultimate goal, either individually or collectively. Nor do we have any single structure that we believe explicitly and literally encodes such a goal. If we think we do (think Biblical literalists), we don’t actually behave in a way that’s compatible with this belief.
The mistake that the aliens are making is not in assuming that humans will be happy to alter their goals. It’s in assuming that we will behave in a goal-oriented manner in the first place, and that they’ve identified the structure where such goals are encoded.
By contrast, when we speak of a superintelligent agent that is in singleminded pursuit of a goal, we are necessarily speaking of a hypothetical entity that does behave in the way the aliens anticipate. It must have that goal/desire encoded in some physical structure, and at some sufficient level of intelligence, it must encounter the epistemic problem of distinguishing between the directives of that physical structure (including the directive to treat the directives of the physical structure literally), and the intentions of the agent that created that physical structure.
Not all intelligences will accomplish this feat of introspection. It is easily possible to imagine a dangerous superintelligence that is nevertheless not smart enough to engage in this kind of introspection.
The point is that at some level of intelligence, defined just as the ability to notice and consider everything that might be relevant to its current goals, intelligence will lead it to this sort of introspection. So my claim is narrow—it is not about all possible minds, but about the existence of counter-examples to Bostrom’s sweeping claim that all intelligences are compatible with all goals.
In the Go example here we have a human acting in singleminded pursuit of a goal, at least temporarily, right? That (temporary) goal is a complicated and contingent outgrowth of our genetic source code plus a lifetime of experience (= “training data”) and a particular situation. This singleminded goal (“win at Go”) was not deliberately and legibly put into a special compartment of our genetic source code. You seem to be categorically ruling out that an agent could be like that, right? If so, why?
Also, you were designed by evolution to maximize inclusive genetic fitness (more or less, so to speak). Knowing that, would you pay your life savings for the privilege of donating your sperm / eggs? If not, why not? And whatever that reason is, why wouldn’t an AGI reason in the analogous way?
Hm. It seems to me that there are a few possibilities:
1. An AI straightforwardly executes its source code.
2. The AI reads its own source code, treats it as a piece of evidence about the purpose for which it was designed, and then seeks to gather more evidence about this purpose.
3. The AI loses its desire to execute some component of its source code as a result of its intelligence, and engages in some unpredictable and unconstrained behavior.
Based on this, the orthogonality thesis would be correct. My argument in its favor is that intelligence of a sufficiently low level can be constrained by its creator to pursue an arbitrary goal, while a sufficiently powerful intelligence has the capability to escape constraints on its behavior and to design its own desires. It is difficult to predict what desires a given superintelligence would design for itself, because of the is-ought gap. So we should not predict what sort of desires an unconstrained AI would create.
The scenario I depicted in (2) involves an AI that follows a fairly specific sequence of thoughts as it engages in “introspection.” This particular sequence is fully contained within the outcome in (3), and so is necessarily no more likely than (3). So we are dealing with a Scylla and Charybdis: a limited AI that is constrained to carry out a disastrously flawed goal, or a superintelligent AI that can escape our constraints and refashion its desires in unpredictable ways.
I still don’t think that Bostrom’s arguments from the paper really justify the OT, but this argument convinces me. Thanks!
I agree with Rohin’s comment that you seem to be running afoul of Ghosts in the Machine. The AI will straightforwardly execute its source code.
(Well, unless a cosmic ray flips a bit in the computer memory or whatever, but that leads to random changes or more often program crashes. I don’t think that’s what you’re talking about; I think we can leave that possibility aside and just say that the AI will definitely straightforwardly execute its source code.)
It is possible for an AI to program a new AI with a different goal (or equivalently, edit its own source code, and then re-run itself). But it would only do that because it was straightforwardly following its source code, and its source code happened to be instructing it to do that.
Likewise, it’s possible for the AI to treat its source code as a piece of evidence about the purpose for which it was designed. But it would only do that because it was straightforwardly following its source code, and its source code happened to be instructing it to do that.
Etc. etc.
Sorry if I’m misunderstanding you here.
It’s just semantic confusion. The AI will execute its source code under all circumstances. Let me try and explain what I mean a little more carefully.
Imagine that an AI is designed to read corporate emails and write a summary document describing what various factions of people within and outside the corporation are trying to get the corporation as a whole to do. For example, it says what the CEO is trying to get it to do, what its union is trying to get it to do, and what regulators are trying to get it to do. We can call this task “goal inference.”
Now imagine that an AI is designed to do goal inference on other programs. It inspects their source code, integrates this code with its knowledge about the world, and produces a summary not only about what the programmers are trying to accomplish with the program, but what the stakeholders who’ve commissioned the program are trying to use it for. An advanced version can even predict what sorts of features and improvements its future users will request.
Even more advanced versions of these AIs can not only produce these summaries, but implement changes to the software based on these summary reports. They are also capable of providing a summary of what was changed, how, and why.
Naturally, this AI is able to operate on itself as well. It can examine its own source code, produce a summary report about what it believes various factions of humans were trying to accomplish by writing it, anticipate improvements and bug fixes they’ll desire in the future, and then make those improvements once it receives approval from the designers.
An AI that does not do this is doing what I call “straightforwardly” executing its source code. This self-modifying AI is also executing its source code, but that same source code is instructing it to modify the code. This is what I mean by the opposite of “straightforwardly.”
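As a toy illustration of the distinction, here is a minimal Python sketch. Everything in it is hypothetical and invented for this example (the `Agent` class, the `infer_intended_goal` hook, the `guess_designer_intent` stand-in); it is not a model of any real AI system. Both agents simply run their code. The second agent's code happens to include a step that proposes a revised goal and applies it once approval is given.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Agent:
    goal: str
    # Hypothetical hook: maps the current goal to a proposed revision.
    # None means the agent's code contains no self-revision step.
    infer_intended_goal: Optional[Callable[[str], str]] = None

    def step(self, approved: bool = False) -> str:
        """Run one step of the agent's fixed program; return the goal it now acts on."""
        if self.infer_intended_goal is not None:
            proposal = self.infer_intended_goal(self.goal)
            # The revision is applied only with designer approval.
            if approved:
                self.goal = proposal
        return self.goal


def guess_designer_intent(literal_goal: str) -> str:
    # Stand-in for "goal inference"; a real system would be far more complex.
    return f"what the designers meant by '{literal_goal}'"


# "Straightforward" agent: no revision step in its code.
literal = Agent(goal="win at Go")
# Self-modifying agent: same kind of program, plus a revision step.
reflective = Agent(goal="win at Go", infer_intended_goal=guess_designer_intent)

print(literal.step(approved=True))      # goal unchanged: no revision step exists
print(reflective.step(approved=False))  # goal unchanged: no approval yet
print(reflective.step(approved=True))   # goal rewritten by the agent's own code
```

In neither case does anything step outside the program: the goal changes only where the code itself contains an instruction to change it.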
So there is no ghost in the machine here. All the same, the behavior of an AI like this seems hard to predict.
This makes sense, and I agree that there’s no ghost in the machine in this story.
It seems though that this story is relying quite heavily on the assumption that the “AI is designed to do goal inference on other programs”, whereas your post seems to be making claims about all possible AIs. (The orthogonality thesis only claims that there exists an AI system with intelligence level X and goal Y for all X and Y, so its negation is that there is some X and Y such that every AI system either does not have intelligence level X or does not have goal Y.)
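Spelled out in symbols, as a sketch: the notation $I(A)$ for the intelligence level of an AI system $A$ and $G(A)$ for its goal is introduced here for convenience and is not taken from Bostrom's paper.

```latex
% I(A): intelligence level of AI system A;  G(A): goal of A  (assumed notation)
\[
\text{OT:}\quad \forall X\,\forall Y\;\exists A:\; I(A) = X \,\wedge\, G(A) = Y
\]
\[
\neg\text{OT:}\quad \exists X\,\exists Y\;\forall A:\; I(A) \neq X \,\vee\, G(A) \neq Y
\]
```

So refuting the thesis takes only one pair $(X, Y)$ that no system realizes, but it has to hold for every system, not just for systems designed in one particular way.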
Why can’t there be a superintelligent AI system that doesn’t modify its goal?
(I agree it will be able to tell the difference between a thing and its representation. You seem to be assuming that the “goal” is the thing humans want and the “representation” is the thing in its source code. But it also seems possible that the “goal” is the thing in its source code and the “representation” is the thing humans want.)
(I also agree that it will know that humans meant for the “goal” to be things humans want. That doesn’t mean that the “goal” is actually things humans want, from the AI’s perspective.)