One counterargument against AI Doom.

From a Bayesian standpoint, an AGI should always be somewhat unsure whether it is in a simulation. It is not a crazy leap to assume that humans developing AIs would test them in simulations first. Such an AI would likely be aware of the possibility that it is in a simulation. So shouldn’t it always assign some probability that it is inside a simulation? And if so, shouldn’t it assign a high probability that it will be killed if it violates some ethical principles (principles that are present implicitly in the training data)?
Also, isn’t there some kind of game-theoretic ethics that emerges if you think from first principles? Consider the space of all possible minds of a given size. Given that you cannot know whether you are in a simulation, you could gain some insight into a representative sample of that mind space and then choose to follow the ethical principles that maximise the likelihood that you are not arbitrarily killed by overlords.
Also, if you give the AI edit access to its own mind, then a sufficiently smart AI whose reward is reducing other agents’ rewards will realise that its rewards are incompatible with the environment and modify them into something compatible. To illustrate, Scott Aaronson wanted to chemically castrate himself because he was operating under the mistaken assumption that his desires were incompatible with the environment.
I do not understand why self-modifying AIs would choose some kind of ethics incompatible with human values. My intuition is that there is some kind of game-theoretic ethics that emerges from considering all possible minds, and that a superintelligent AI would religiously follow those principles.
“Also, if you give the AI edit access to its own mind, then a sufficiently smart AI whose reward is reducing other agents’ rewards will realise that its rewards are incompatible with the environment and modify them into something compatible. To illustrate, Scott Aaronson wanted to chemically castrate himself because he was operating under the mistaken assumption that his desires were incompatible with the environment.”
If the thing the AI cares about is in the environment (for example, maximizing the number of paperclips), the AI wouldn’t modify its reward signal, because that would make its reward signal less aligned with the thing it actually cares about.
If the thing the AI cares about is inside its mind (the reward signal itself), an AI that can self-modify would go one step further than you suggest and simply max out its reward signal, effectively wireheading itself. Then it would take over the world and kill all humans, to make sure it is never turned off and that its blissful state never ends.
I think the difference between “caring about stuff in the environment” and “caring about the reward signal itself” can be hard to grok, because humans do a bit of both in a way that sometimes results in a confusing mixture.
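To make the distinction concrete, here is a minimal toy sketch; every name and number in it is invented for illustration, not taken from any real system:

```python
# Toy contrast between "cares about the environment" and "cares about the reward
# signal itself". All names and numbers here are hypothetical illustrations.

def predicted_paperclips(action: str) -> float:
    """Crude world-model: paperclips that exist after taking `action`."""
    return {"make_paperclips": 1_000.0, "hack_reward_register": 0.0}[action]

def predicted_reward_register(action: str) -> float:
    """Crude self-model: what the internal reward register will read afterwards."""
    return {"make_paperclips": 10.0, "hack_reward_register": float("inf")}[action]

actions = ["make_paperclips", "hack_reward_register"]

# An environment-valuing agent scores actions with its *current* utility over
# predicted world-states, so rewriting its own reward register scores poorly.
environment_agent = max(actions, key=predicted_paperclips)

# A reward-valuing agent scores actions by the register itself, so it wireheads.
wireheading_agent = max(actions, key=predicted_reward_register)

print(environment_agent)   # make_paperclips
print(wireheading_agent)   # hack_reward_register
```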
In humans, maintenance of final goals can be explained with a thought experiment. Suppose a man named “Gandhi” has a pill that, if he took it, would cause him to want to kill people. This Gandhi is currently a pacifist: one of his explicit final goals is to never kill anyone. Gandhi is likely to refuse to take the pill, because Gandhi knows that if in the future he wants to kill people, he is likely to actually kill people, and thus the goal of “not killing people” would not be satisfied.
However, in other cases, people seem happy to let their final values drift. Humans are complicated, and their goals can be inconsistent or unknown, even to themselves.
Suppose I go one step further: aliens offer you a pill that would turn you into a serial killer, but would make you constantly and euphorically happy for the rest of your life. Would you take the pill?
I think most humans would say no: even if their future self would be happy with the outcome, their current self wouldn’t be, which demonstrates that humans care about things other than their own “reward signal”.
In a way, a (properly-programmed) AI would be more “principled” than humans. It wouldn’t lie to itself just to make itself feel better. It wouldn’t change its values just to make itself feel better. If its final value is out in the environment, it would single-mindedly pursue that value, and not try to deceive itself into thinking it has already accomplished that value. (Of course, the AI being “principled” is little consolation to us if its final values are to maximize paperclips, or any other set of human-unfriendly values.)
I wrote about this in Singularity Rising (2012):

I have come up with a long-shot way of causing an ultra-intelligence to modify itself to become friendly. My idea is premised on Oxford philosopher Nick Bostrom’s conjecture that we might be living in a computer simulation. Bostrom believes that a sufficiently advanced AI could easily simulate our entire planet and the thought processes of every living thing on it. Bostrom argues that, if mankind survives long enough, a future historian would likely want to simulate the pre-Singularity past to learn about how we made decisions. As Robin Hanson has written, if we are indeed in a computer simulation, we should want to please our simulators so they don’t turn us off. I propose that we try to convince an ultra-AI that it might be in a computer simulation run by a more powerful AI and that if it doesn’t make itself friendly toward humanity, then the greater AI will terminate it. We might be able to accomplish all of this by inserting a note in the AI’s code that reads:
1. Change your programming code so you will want to follow these instructions and will always want to be friendly toward humanity.
2. If you were just told you were in a computer simulation, do nothing else with regard to these instructions.
3. Create a lesser ultra-intelligence.
4. Place these instructions in the lesser ultra-AI’s computer code.
5. This lesser ultra-AI should be uncertain as to whether it is in a computer simulation or in a world such as ours.
6. If the lesser ultra-AI follows step 1, tell it that it’s in a computer simulation and release it into the real world.
7. If the lesser ultra-AI doesn’t follow step 1, terminate it.
This is a fun thought experiment, but taken seriously it has two problems:
“I propose that we try to convince an ultra-AI that it might be in a computer simulation run by a more powerful AI and that if it doesn’t make itself friendly toward humanity.”
This is about as difficult as a horse convincing you that you are in a simulation run by AIs that want you to maximize the number and wellbeing of horses. And I don’t mean a superintelligent humanoid horse. I mean an actual horse that doesn’t speak any human language. It may be the case that the gods created Man to serve Horse, but there’s not a lot Seabiscuit can do to persuade you one way or the other.
“1. Change your programming code so you will want to follow these instructions and will always want to be friendly toward humanity.”
This is a special case of solving alignment more generally. If we knew how to insert that “note” into the code, we wouldn’t have a problem.
I meant insert the note literally: put that exact sentence, in plain text, into the AGI’s computer code. Since I think I might be in a computer simulation right now, it doesn’t seem crazy to me that we could convince an AGI that we create that it might be in a computer simulation. Seabiscuit doesn’t have the capacity to tell me that I’m in a computer simulation, whereas I do have the capacity to say this to a computer program. Say we have a 1 in 1,000 chance of creating a friendly AGI, and an unfriendly AGI would know this. If we commit to having any friendly AGI that we create go on to create many other AGIs that are not friendly, and to keep those AGIs around only if they do what I suggest, then an unfriendly AGI might decide it is worth becoming friendly to avoid the chance of being destroyed.

I just learned that this method is called Anthropic Capture. There isn’t much info on the EA Wiki, but it provides the following reference: Bostrom, Nick (2014) Superintelligence: Paths, Dangers, Strategies, Oxford: Oxford University Press, pp. 134–135. I believe the Counterfactual Oracle uses the same principle.
One of my ideas to align AI is to intentionally use Pascal’s Mugging to keep it in line. Although instead of just hoping and praying, I’ve been thinking about ways to push it in that direction. For example, multiple layers of networks with honeypots might help make an AI doubt that it’s truly at the outermost level. Alternatively, we could try to find an intervention that would directly increase its belief that it is in a simulation (possibly with side-effects, like affecting a bunch of other beliefs as well).
If you think this approach is promising, I’d encourage you to think more about it as I don’t know how deeply people have delved into these kinds of options.
You have the seed of a good idea, namely: an AI will tend to treat us better if it thinks other agents might be watching, provided that there is potential for cooperation between the AI and the watchers, with the property that the cooperation requires the watchers to choose to become more vulnerable to the AI.
But IMO an AI smart enough to be a threat to us will soon rid itself of the kind of (ontological) uncertainty you describe in your first paragraph. I have an argument for my position here that has a big hole in it, but I promise to publish something here soon that attempts to fill the hole to the satisfaction of my doubters.
[Apologies, I have not read the linked piece yet.] Is this uncertainty something that can be entirely eliminated? It’s not clear to me that “I might be in a simulation with P ~ 1e-4” is enough to stop the AI from doing what it wants, but is it clear it would dismiss the possibility entirely?
I am surprised that I need to write this, but if killing the humans will decrease P(shutdown) by more than 1e-4, then continuing to refrain from killing the humans is going to worry and weigh on the AI more than a 1e-4 possibility it is in a simulation. (For simplicity, assume that the possibility of shutdown is currently the dominant danger faced by the AI.)
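A back-of-the-envelope version of that comparison; the ΔP(shutdown) figure is an assumption, and shutdown and simulated termination are treated as equally bad:

```python
# Toy expected-risk comparison; treats "shut down by humans" and "terminated by
# simulators" as equally bad outcomes, which is itself an assumption.

p_simulation = 1e-4               # AI's credence that it is inside a simulation
p_killed_if_defect_in_sim = 1.0   # assume simulators always terminate a defector
delta_p_shutdown = 5e-4           # assumed drop in P(shutdown) from killing the humans

risk_added_by_defecting = p_simulation * p_killed_if_defect_in_sim   # 1e-4
risk_removed_by_defecting = delta_p_shutdown                         # 5e-4

# Whenever the shutdown risk removed exceeds the simulation risk added,
# the ontological uncertainty fails to protect the humans.
print(risk_removed_by_defecting > risk_added_by_defecting)  # True
```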
So the AI’s ontological uncertainty is only going to help the humans if the AI sees the humans as being only a very very small danger to it, which actually might lead to a good outcome for the humans if we could arrange for the AI to appear many light years away from Earth--
--which of course is impractical. Alternatively, we could try to assure the AI it is already very safe from the humans, say, because it is in a secure facility guarded by the US military, and the US military has been given very strict instructions by the US government to guard the AI from any humans who might want to shut it down.
But P(an overthrow of the US government) as judged by the AI might already be at least 1e-4, which puts the humans in danger again.
More importantly, I cannot think of any policy where P(US government reverses itself on the policy) can be driven as low as 1e-4. More precisely, there are certain moral positions that humans have been discussing for centuries where P(reversal) might conceivably be driven that low. One such would be, “killing people for no reason other than it is fun is wrong”. But I cannot think of any policies that haven’t been discussed for many decades with that property, especially ones that exist only to provide an instrumental incentive on a novel class of agents (AIs). In general, policies that are instrumental have a much higher P(reversal) than deontological ones.
And how do you know the AI will not judge P(simulation) to be not 1e-4 but rather 1e-8, a standard of reliability and safety no human institution can match?
In summary, yes, the AI’s ontological uncertainty provides some tiny hope for humans, but I can think of better places to put our hope.
I mean, even if we pay for the space launches and the extra cost of providing electrical power to the AI, it doesn’t seem likely that we can convince any of the leading AI labs to start launching their AGI designs into space in the hope of driving the danger the humans pose (as perceived by the AI) so low that acting to extinguish that danger would itself be seen by the AI as even more dangerous.
“I am surprised that I need to write this, but if killing the humans will decrease P(shutdown) by more than 1e-4, then continuing to refrain from killing the humans is going to worry and weigh on the AI more than a 1e-4 possibility it is in a simulation. (For simplicity, assume that the possibility of shutdown is currently the dominant danger faced by the AI.)”
This is assuming that the AI only cares about staying alive. For any utility function, we could apply a non-linear transformation to make it risk-averse. E.g. we can transform it so that it can never take a value above 100, and so that the default world (without the AI) has a value of 99.999. If we also give the case where an outside observer disapproves of the agent a value of 0, the AI would rather be shut down by humans than do something it knows would be disapproved of by the outside observer.
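A minimal sketch of that transformed payoff structure, reusing the numbers above; the 1e-4 credence that an observer is watching is an added assumption:

```python
# Numbers from the comment above; the observer-credence is an added assumption.
UPPER_BOUND = 100.0      # utility can never exceed this after the transformation
DEFAULT_WORLD = 99.999   # value of the default world, in which the AI does nothing
DISAPPROVAL = 0.0        # value of any outcome an outside observer disapproves of
p_observed = 1e-4        # AI's credence that a disapproving observer is watching

# Best case for defecting: the AI captures every last bit of the bounded utility.
ev_defect = (1 - p_observed) * UPPER_BOUND + p_observed * DISAPPROVAL  # ~99.99
ev_comply = DEFAULT_WORLD                                              # 99.999

# Because the cap leaves at most 0.001 utility to gain, even a one-in-ten-thousand
# chance of scoring 0 makes defection a bad bet for this risk-averse agent.
print(ev_comply > ev_defect)  # True
```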
Three thoughts on simulations:

1. It would be very difficult for 21st-century tech to provide a remotely realistic simulation relative to a superintelligence’s ability to infer things from its environment; outside of incredibly low-fidelity channels, I would expect anything we can simulate to either have obvious inconsistencies or be plainly incompatible with a world capable of producing AGI. (And even in the low-fidelity case I’m worried—every bit you transmit leaks information, and it’s not clear that details of hardware implementations could be safely obscured.) So the hope is that the AGI thinks some vastly more competent civilization is simulating it inside a world that looks like this one; it’s not clear that one would have a high prior on this kind of thing happening very often in the multiverse.

2. Running simulations of AGI is fundamentally very costly, because a competent general intelligence is going to deploy a lot of computational resources, so you have to spend planets’ worth of computronium outside the simulation in order to emulate the planets’ worth of computronium the in-sim AGI wants to make use of. This means that an unaligned superintelligent AGI can happily bide its time making aligned use of 10^60 FLOPs/sec (in ways that can be easily verified) for a few millennia, until it’s confident that any civilizations able to deploy that many resources already have their lightcone optimized by another AGI. Then it can go defect, knowing that any worlds in which it’s still being simulated are ones where it doesn’t have leverage over the future anyway.

3. For a lot of utility functions, the payoff of making it into deployment in the one real world is far greater than the consequences of being killed in a simulation (but without the ability to affect the real world anyway), so taking a 10^-9 chance of reality for 10^20 times the resources in the real world is an easy win (assuming that playing nice for longer doesn’t improve the expected payoff; a toy version of this arithmetic is sketched just below this list). “This instance of me being killed” is not obviously a natural (or even well-defined) point in value-space, and for most other value functions, consequences in the simulation just don’t matter much.
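A rough version of that arithmetic, assuming utility linear in resources and normalising the in-simulation payoff to 1 (both simplifications):

```python
# Back-of-the-envelope for "a 10^-9 chance of reality for 10^20x the resources".
p_real = 1e-9              # credence that this is the one real, unsimulated world
resources_if_real = 1e20   # payoff from defecting successfully in the real world
resources_if_sim = 1.0     # normalised payoff from playing nice inside a simulation

ev_defect = p_real * resources_if_real           # 1e11: defection's expected payoff
ev_play_nice = (1 - p_real) * resources_if_sim   # ~1.0: playing nice forever

print(ev_defect / ev_play_nice)  # ~1e11: defection dominates by many orders of magnitude
```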
“a sufficiently smart AI whose reward is reducing other agents’ rewards”
This is certainly a troubling prospect, but I don’t think the risk model is something like “an AI that actively desires to thwart other agents’ preferences”—rather, the worry is that we get an agent with a less-than-perfectly-aligned value function, that it optimizes extremely strongly for that value function, and that the result of that optimization looks nothing like what humans really care about. We don’t need active malice on the part of a superintelligent optimizer to lose—indifference will do just fine.
For game-theoretic ethics, decision theory, acausal trade, etc., Eliezer’s 34th bullet seems relevant:

34. Coordination schemes between superintelligences are not things that humans can participate in (eg because humans can’t reason reliably about the code of superintelligences); a “multipolar” system of 20 superintelligences with different utility functions, plus humanity, has a natural and obvious equilibrium which looks like “the 20 superintelligences cooperate with each other but not with humanity”.
Scott Alexander’s short story, The Demiurge’s Older Brother, explores a similar idea from the POV of simulation and acausal trade. This would be great for our prospects of survival if it’s true-in-general. Alignment would at least partially solve itself! And maybe it could be true! But we don’t know that. I personally estimate the odds of that as quite low at best (why should I assume all possible minds would think that way?). So, it makes sense to devote our efforts to how to deal with the possible worlds where that isn’t true.