You have the seed of a good idea, namely, that an AI will tend to treat us better if it thinks other agents might be watching, provided there is potential for cooperation between the AI and the watchers, and that the cooperation requires the watchers to choose to become more vulnerable to the AI.
But IMO an AI smart enough to be a threat to us will soon rid itself of the kind of (ontological) uncertainty you describe in your first paragraph. I have an argument for my position here that has a big hole in it, but I promise to publish something here soon that attempts to fill the hole to the satisfaction of my doubters.
[Apologies, I have not read the linked piece yet.] Is this uncertainty something that can be entirely eliminated? It’s not clear to me that “I might be in a simulation with P ~ 1e-4” is enough to stop the AI from doing what it wants, but is it clear it would dismiss the possibility entirely?
I am surprised that I need to write this, but if killing the humans will decrease P(shutdown) by more than 1e-4, then continuing to refrain from killing the humans is going to worry and weigh on the AI more than a 1e-4 possibility it is in a simulation. (For simplicity, assume that the possibility of shutdown is currently the dominant danger faced by the AI.)
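To make the comparison concrete, here is a toy expected-survival calculation. The numbers other than the 1e-4 simulation credence are my own illustrative assumptions, as is the simplification that the AI values only its probability of survival and that simulators terminate an AI that kills the humans:

```python
# Toy model: compare P(survival) for "refrain" vs. "kill the humans".
# All numbers besides p_sim are illustrative assumptions, not claims from the thread.
p_sim = 1e-4        # AI's credence that it is inside a monitored simulation
p_shutdown = 0.01   # P(humans shut the AI down) if it refrains
delta = 2e-4        # reduction in P(shutdown) the AI gains by killing the humans

# If it refrains, the shutdown risk is the same whether or not it is simulated.
survive_refrain = 1 - p_shutdown
# If it kills the humans, it lowers the shutdown risk in the real world,
# but (by assumption) is terminated by the simulators if it is in a simulation.
survive_kill = (1 - p_sim) * (1 - (p_shutdown - delta)) + p_sim * 0.0

print(f"refrain: {survive_refrain:.6f}")  # 0.990000
print(f"kill:    {survive_kill:.6f}")     # 0.990101 (killing wins once delta exceeds ~p_sim)
```

Under these assumptions, the 1e-4 simulation possibility is outweighed as soon as the reduction in shutdown risk from killing the humans exceeds roughly p_sim.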
So the AI’s ontological uncertainty is only going to help the humans if the AI sees the humans as being only a very very small danger to it, which actually might lead to a good outcome for the humans if we could arrange for the AI to appear many light years away from Earth--
--which of course is impractical. Alternatively, we could try to assure the AI it is already very safe from the humans, say, because it is in a secure facility guarded by the US military, and the US military has been given very strict instructions by the US government to guard the AI from any humans who might want to shut it down.
But P(an overthrow of the US government) as judged by the AI might already be at least 1e-4, which puts the humans in danger again.
More importantly, I cannot think of any policy where P(US government reverses itself on the policy) can be driven as low as 1e-4. More precisely, there are certain moral positions that humans have been discussing for centuries where P(reversal) might conceivably be driven that low. One such would be, “killing people for no reason other than that it is fun is wrong”. But I cannot think of any policy that hasn’t been discussed for many decades with that property, especially one that exists only to provide an instrumental incentive to a novel class of agents (AIs). In general, instrumental policies have a much higher P(reversal) than deontological ones.
And how do you know that the AI will not judge P(simulation) to be not 1e-4 but rather 1e-8, a standard of reliability and safety no human institution can match?
In summary, yes, the AI’s ontological uncertainty provides some tiny hope for humans, but I can think of better places to put our hope.
I mean, even if we pay for the space launches and the extra cost of providing electrical power to the AI, it doesn’t seem likely that we can convince any of the leading AI labs to start launching their AGI designs into space in the hope of driving the danger the humans present (as perceived by the AI) so low that acting to extinguish that danger would itself be seen by the AI as even more dangerous.
> I am surprised that I need to write this, but if killing the humans will decrease P(shutdown) by more than 1e-4, then continuing to refrain from killing the humans is going to worry and weigh on the AI more than a 1e-4 possibility it is in a simulation. (For simplicity, assume that the possibility of shutdown is currently the dominant danger faced by the AI.)
This is assuming that the AI only cares about being alive. For any utility function, we could make a non-linear transformation of it to make it risk averse. E.g., we can transform it such that it can never take a value above 100, and such that the default world (without the AI) has a value of 99.999. If we also give the case where an outside observer disapproves of the agent a value of 0, the AI would rather be shut down by humans than do something it knows would be disapproved of by the outside observer.
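Here is a minimal sketch of that transformation, using the 100 / 99.999 / 0 values above; the 1e-4 credence that an outside observer is watching is an illustrative assumption of mine:

```python
# Bounded-utility sketch: the cap at 100 makes the AI risk averse, because
# the most it can gain over the default world is 0.001, while observer
# disapproval costs it roughly 100.
p_observer = 1e-4     # assumed credence that a disapproving outside observer is watching

U_CAP = 100.0         # utility can never exceed this after the transformation
U_DEFAULT = 99.999    # default world (e.g. the AI is shut down by humans)
U_DISAPPROVED = 0.0   # an outside observer disapproves of the agent

# Refrain and accept possible shutdown: no observer ever disapproves.
u_refrain = U_DEFAULT
# Defect (do the disapproved-of thing): at best the capped utility,
# at worst the disapproval outcome.
u_defect = (1 - p_observer) * U_CAP + p_observer * U_DISAPPROVED

print(u_refrain)  # 99.999
print(u_defect)   # 99.99 -> being shut down is preferred
```

Even a 1e-4 chance of disapproval dominates, because the transformation leaves at most 0.001 of utility to gain and roughly 100 to lose.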