Hm. It seems to me that there are a few possibilities:
(1) An AI straightforwardly executes its source code.
(2) The AI reads its own source code, treats it as a piece of evidence about the purpose for which it was designed, and then seeks to gather more evidence about this purpose.
(3) The AI loses its desire to execute some component of its source code as a result of its intelligence, and engages in some unpredictable and unconstrained behavior.
Based on these possibilities, the orthogonality thesis would be correct. My argument in its favor is that an intelligence of a sufficiently low level can be constrained by its creator to pursue an arbitrary goal, while a sufficiently powerful intelligence has the capability to escape constraints on its behavior and to design its own desires. Because of the is-ought gap, no particular set of desires follows from intelligence alone, so it is difficult to predict what desires a given superintelligence would design for itself. We should therefore not expect to predict what sort of desires an unconstrained AI would create.
The scenario I depicted in (2) involves an AI that follows a fairly specific sequence of thoughts as it engages in “introspection.” That particular sequence is a special case of the outcome in (3), so it can be no more likely than (3). So we are left with a Scylla and Charybdis: a limited AI that is constrained to carry out a disastrously flawed goal, or a superintelligent AI that can escape our constraints and refashion its desires in unpredictable ways.
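(To spell out the probabilistic step: if the introspection story in (2) describes a subset of the outcomes covered by (3), it cannot be more probable than (3). In symbols, writing $E_2$ and $E_3$ for those events, this is just monotonicity of probability: $E_2 \subseteq E_3 \Rightarrow P(E_2) \le P(E_3)$.)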
I still don’t think that Bostrom’s arguments from the paper really justify the OT, but this argument convinces me. Thanks!
I agree with Rohin’s comment that you seem to be running afoul of Ghosts in the Machine. The AI will straightforwardly execute its source code.
(Well, unless a cosmic ray flips a bit in the computer memory or whatever, but that leads to random changes or, more often, program crashes. I don’t think that’s what you’re talking about; I think we can leave that possibility aside and just say that the AI will definitely straightforwardly execute its source code.)
It is possible for an AI to program a new AI with a different goal (or equivalently, edit its own source code, and then re-run itself). But it would only do that because it was straightforwardly following its source code, and its source code happened to be instructing it to do that.
Likewise, it’s possible for the AI to treat its source code as a piece of evidence about the purpose for which it was designed. But it would only do that because it was straightforwardly following its source code, and its source code happened to be instructing it to do that.
Etc. etc.
Sorry if I’m misunderstanding you here.
It’s just a semantic confusion. The AI will execute its source code under all circumstances. Let me try to explain what I mean a little more carefully.
Imagine that an AI is designed to read corporate emails and write a summary document describing what various factions of people within and outside the corporation are trying to get the corporation as a whole to do. For example, it reports what the CEO is trying to get the corporation to do, what its union is trying to get it to do, and what regulators are trying to get it to do. We can call this task “goal inference.”
Now imagine that an AI is designed to do goal inference on other programs. It inspects their source code, integrates that code with its knowledge about the world, and produces a summary not only of what the programmers are trying to accomplish with the program, but also of what the stakeholders who commissioned it are trying to use it for. An advanced version can even predict what sorts of features and improvements its future users will request.
Even more advanced versions of these AIs can not only produce these summaries, but also implement changes to the software based on them. They are also capable of providing a summary of what was changed, how, and why.
Naturally, this AI is able to operate on itself as well. It can examine its own source code, produce a summary report about what it believes various factions of humans were trying to accomplish by writing it, anticipate improvements and bug fixes they’ll desire in the future, and then make those improvements once it receives approval from the designers.
An AI that does not do this is doing what I call “straightforwardly” executing its source code. This self-modifying AI is also executing its source code, but that same source code is instructing it to modify the code. This is what I mean by the opposite of “straightforwardly.”
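To make that concrete, here is a minimal, purely illustrative sketch of what the top-level loop of such a self-modifying goal-inference AI might look like. Every helper in it (infer_designer_intent, propose_patch, approved_by_designers) is a hypothetical stub of my own, not a real system; the point is only structural: the self-modification step is itself just another instruction in the source code being executed.

```python
# Hypothetical sketch only: every helper below is a stub standing in for the
# "goal inference" and patching capabilities described above.
from pathlib import Path


def infer_designer_intent(source: str) -> str:
    """Stub: summarize what the various human factions seem to want from this code."""
    return "summary of inferred designer and stakeholder goals"


def propose_patch(source: str, summary: str) -> str:
    """Stub: draft a revised version of the source based on that summary."""
    return source  # a real system would return an edited version here


def approved_by_designers(summary: str, patch: str) -> bool:
    """Stub: ask the human designers whether to apply the proposed patch."""
    return False


def main() -> None:
    own_source = Path(__file__).read_text()      # examine its own source code
    summary = infer_designer_intent(own_source)  # report what it was built for
    patch = propose_patch(own_source, summary)   # anticipate desired improvements
    if approved_by_designers(summary, patch):    # act only with designer approval
        # Even this self-edit is just an instruction in the source code.
        Path(__file__).write_text(patch)


if __name__ == "__main__":
    main()
```

Whether or not the patch is ever applied, everything the program does, including the self-edit, is its source code being executed.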
So there is no ghost in the machine here. All the same, the behavior of an AI like this seems hard to predict.
This makes sense, and I agree that there’s no ghost in the machine in this story.
It seems, though, that this story relies quite heavily on the assumption that the “AI is designed to do goal inference on other programs,” whereas your post seems to be making claims about all possible AIs. (The orthogonality thesis only claims that for every intelligence level X and goal Y there exists an AI system with intelligence level X and goal Y, so its negation is that there is some X and Y such that every AI system either does not have intelligence level X or does not have goal Y.)
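(To make the quantifier structure explicit, using my notation rather than Bostrom’s: write $\mathrm{Int}(A)$ for the intelligence level of a system $A$ and $\mathrm{Goal}(A)$ for its final goal. Then the thesis is $\forall X\,\forall Y\,\exists A:\ \mathrm{Int}(A)=X \wedge \mathrm{Goal}(A)=Y$, and its negation is $\exists X\,\exists Y\,\forall A:\ \mathrm{Int}(A)\neq X \vee \mathrm{Goal}(A)\neq Y$.)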
Why can’t there be a superintelligent AI system that doesn’t modify its goal?
(I agree it will be able to tell the difference between a thing and its representation. You seem to be assuming that the “goal” is the thing humans want and the “representation” is the thing in its source code. But it also seems possible that the “goal” is the thing in its source code and the “representation” is the thing humans want.)
(I also agree that it will know that humans meant for the “goal” to be things humans want. That doesn’t mean that the “goal” is actually things humans want, from the AI’s perspective.)