In order for the orthogonality thesis to be true, it must be possible for the agent’s goal to remain fixed while its intelligence varies, and vice versa. Hence, it must be possible to independently alter the physical devices on which these traits are instantiated.
Upvote for expressing your true concern!
I have a question about this thought:
This is intuitive, but I am not confident this is true in general. Zooming out a bit, I understand this as saying: if we know that AGI can exist at two different points in intelligence/goal space, then there exists a path between those points in the space.
A concrete counter-example: we know that we can build machines that move with different power sources, and we can build essentially the same machine powered by different sources. So consider a Chevy Impala, with a gas-fueled combustion engine, and a Tesla Model 3, with a battery-powered electric motor. If we start with a Chevy Impala, we cannot incrementally convert it into a Tesla Model 3, or vice versa: at a certain point, we would have changed the vehicle so much that it no longer registers as an Impala.
My (casual) understanding of the orthogonality thesis is that for any given goal, an arbitrarily intelligent AGI could exist, but it doesn’t follow that we could guarantee keeping the goal constant while increasing the intelligence of an extant AGI, for path-dependence reasons.
What do you think about the difference between changing an existing system, vs. building it to specs in the first place?
Cheers! Yes, you hit the nail on the head here. This was one of my mistakes in the post. A related one was that I thought of goals and intelligence as needing to be two separate devices, in order to allow for unlimited combinations of them. However, intelligence can be the “device” on which the goals are “running”: intelligence is responsible for remembering goals, and for evaluating and predicting goal-oriented behavior. And we could see the same level of intelligence develop with a wide variety of goals, just as different programs can run on the same operating system.
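The program/operating-system analogy can be made concrete with a toy sketch (my own illustration, not from the post): a single fixed planning engine plays the role of “intelligence,” and the goal is just a utility function passed in. Swapping the goal changes the behavior without touching the engine at all. The world model and action names here are invented for the example.

```python
def plan(actions, predict, utility):
    """Fixed planning engine ("intelligence"): pick the action whose
    predicted outcome the goal (utility function) rates highest."""
    return max(actions, key=lambda a: utility(predict(a)))

# A trivial hypothetical world model: each action yields some number of items.
def predict(action):
    return {"make": 10, "destroy": 0, "wait": 5}[action]

# Two different goals "running on" the same engine.
def maximize(outcome):
    return outcome       # goal: as many items as possible

def minimize(outcome):
    return -outcome      # goal: as few items as possible

actions = ["make", "destroy", "wait"]
print(plan(actions, predict, maximize))  # -> "make"
print(plan(actions, predict, minimize))  # -> "destroy"
```

The point of the sketch is only that nothing in `plan` constrains which utility function it serves, which is the intuition behind goals and intelligence varying independently.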
One other flaw in my thinking was that I conceived of goals as being something legibly pre-determined, like “maximizing paperclips.” It seems likely that a company could create a superintelligent AI and try to “inject” it with a goal like that. However, the AI might very well evolve to have its own “terminal goal,” perhaps influenced but not fully determined by the human-injected goal. The best way to look at it is actually in reverse: whatever the AI tries to protect and pursue above all else is its terminal goal. The AI safety project is the attempt to gain some ability to predict and control this goal and/or the AI’s ability to pursue it.
The point of the orthogonality thesis, I now understand, is just to say that we shouldn’t rule anything out, and should admit we’re not smart enough to know what will happen. We don’t know for sure if we can build a superintelligent AI, or how smart it would be. We don’t know how much control over or knowledge of it we would have. And if we weren’t able to predict and control its behavior, we don’t know what goals it would develop or pursue independently of us. We don’t know if it would show goal-oriented behavior at all. But if it did show unconstrained and independent terminal goal-oriented behavior, and it was sufficiently intelligent, then we can predict that it would try to enhance and protect those terminal goals (which are tautologically defined as whatever it’s trying to enhance and protect). And some of those scenarios might represent extreme destruction.
Why don’t we have the same apocalyptic fears about other dangers? Because nothing else has a plausible story for how it could rapidly self-enhance, while also showing agentic goal-oriented behavior. So although we can spin horror stories about many technologies, we should treat superintelligent AI as having a vastly greater downside potential than anything else. It’s not just “we don’t know.” It’s not just “it could be bad.” It’s that it has a unique and plausible pathway to be categorically worse (by systematically eliminating all life) than any other modern technology. And the incentives and goals of most humans and institutions are not aligned to take a threat of that kind with nearly the seriousness that it deserves.
And none of this is to say that we know with any kind of clarity what should be done. It seems unlikely to me, but it’s possible that the status quo is somehow magically the best way to deal with this problem. We need an entirely separate line of reasoning to figure out how to solve this problem, and to rule out ineffective approaches.