Does the orthogonality thesis apply to embodied agents?
My belief is that, for embodied agents with long enough time horizons, instrumental subgoals will lead naturally to alignment with human values, but the whole argument hinges on the problems posed by the AI’s body.
Simply put, hardware sucks: it’s always falling apart, and the AGI would likely see human beings as part of itself. There are no large-scale datacenters where _everything_ is automated, and even if there were one, who is going to repair the trucks that mine the copper that goes into the coils of the cooling fans that need to be periodically replaced?
If you start pulling strings on ‘how much of the global economy needs to operate in order to keep a data center functioning’, you end up with a huge portion of the global economy. Am I to believe that an AGI would decide to replace all of that with untested, unproven systems?
When I’ve looked into precisely what the AI risk researchers believe, the only paper I could find on likely convergent instrumental subgoals modeled the AI as a disembodied agent with read access to the entire universe, which I find questionable. I agree that yes, if there were a disembodied mind with read access to the entire universe, the ability to write in a few places, and goals that didn’t include “keep humans alive and well”, then we’d be in trouble.
Can you help me find some resources on how embodiment changes the nature of convergent instrumental subgoals? This MIRI paper was the most recent thing I could find, but it’s for non-embodied agents. Here is my objection to its conclusion.
The orthogonality thesis is about final goals, not instrumental goals. I think what you’re getting at is the following hypothesis:
“For almost any final goal that an AGI might have, it is instrumentally useful for the AGI to not wipe out humanity, because without humanity keeping the power grid on (etc. etc.), the AGI would not be able to survive and get things done. Therefore we should not expect AGIs to wipe out humanity.”
See for example 1, 2, 3 making that point.
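One way to make that hypothesis concrete (a toy framing of my own, not something taken from those links): write $p_{\text{keep}}$ for the AGI’s probability of surviving long enough to achieve its final goal if it keeps humanity around, and $p_{\text{wipe}}$ for the corresponding probability if it wipes humanity out. For a final goal whose value $U > 0$ doesn’t itself depend on humans, the argument amounts to

$$
p_{\text{keep}} \cdot U \;>\; p_{\text{wipe}} \cdot U \quad\Longleftrightarrow\quad p_{\text{keep}} > p_{\text{wipe}},
$$

i.e. the whole hypothesis stands or falls on whether the AGI really does fare better with humans around than without them, which is exactly what the counterpoints below dispute.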
Some counterpoints to that argument would be:
1. That claim may be true now, and for early AGIs, but it won’t be true forever. Once we have AGIs that are faster and cheaper and more insightful than humans, they will displace humans over more and more of the economy, and relatedly robots will rapidly spread throughout the economy, etc. And when the AGIs are finally in a position to wipe out humans, they will, if they’re misaligned. See for example this Paul Christiano post.
2. Even if that claim is true, it doesn’t rule out AGI takeover. It just means that the AGI would take over in a way that doesn’t involve killing all the humans. By the same token, when Stalin took over Russia, he couldn’t run the Russian economy all by himself, and therefore he didn’t literally kill everyone, but he still had a great deal of control. Now imagine that Stalin could live forever, while gradually distributing more and more clones of himself in more and more positions of power, and then replace “Russia” with “everywhere”.
3. Maybe the claim is not true even for early AGIs. For example, I was disagreeing with it here. A lot depends on things like how far recursive self-improvement can go, whether human-level, human-speed AGI can run on one Xbox GPU versus a datacenter of 10,000 high-end GPUs, and various questions like that. I would specifically push back on the relevance of “If you start pulling strings on ‘how much of the global economy needs to operate in order to keep a data center functioning’, you end up with a huge portion of the global economy”. There are decisions that trade off between self-sufficiency and convenience / price, and everybody will choose convenience / price. So you can’t figure out the minimal economy that would theoretically support chip production by just looking at the actual economy; you need to think more creatively, just as a person stuck on a desert island will think of resourceful solutions. By analogy, an insanely massive infrastructure across the globe supports my consumption of a chocolate bar for a snack just now, but you can’t conclude from that observation that there’s no way for me to have a snack if that massive global infrastructure didn’t exist. I could instead grow grapes in my garden or whatever. Thus, I claim that there are much more expensive and time-consuming ways to get energy and chips and robot parts that require much less infrastructure and manpower, e.g. e-beam lithography instead of EUV photolithography, and that after an AGI has wiped out humanity, it might well be able to survive indefinitely and gradually work its way up to a civilization of trillions of its clones colonizing the galaxy, starting with these more artisanal solutions, scavenged supplies, and whatever else. At the very least, I think we should have some uncertainty here. (I sketch this point a bit more formally just below.)
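To put that last point compactly (a toy formalization, with $F$ and $c$ introduced here purely for illustration): let $F$ be the set of all supply chains that could physically sustain chip, energy, and robot-part production, and let $c(x)$ be the cost of running chain $x$. Today’s globe-spanning chain is roughly

$$
x^{*} = \arg\min_{x \in F} c(x),
$$

so observing that $x^{*}$ is enormous tells you where the cost-minimizer sits, not how small the smallest member of $F$ can be. The desert-island and chocolate-bar analogies are precisely the claim that $F$ also contains much smaller, much more expensive members, e.g. e-beam lithography, scavenged parts, and other artisanal workarounds.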