I have some concerns about an impact measure proposed here. I’m interested in working on impact measures, and these seem like very serious concerns to me, so it would be helpful to see what others think about them. I asked Stuart, one of the authors, about these concerns, but he said he was too busy to work on addressing them.
First, I’ll give a basic description of the impact measure. Have your AI be turned on by some sort of stochastic process that may or may not result in the AI being turned on. For example, consider sending a photon through a semi-silvered mirror. Let X be the statement, “The AI was turned on”. For a wide range of variables V_i, try to make the probability distribution P(V_i|X) as similar as possible to the probability distribution P(V_i|¬X).
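To make the comparison concrete, here is a minimal sketch of how one might score the difference between P(V_i|X) and P(V_i|¬X), using total variation distance summed over the monitored variables. This is purely my own illustration; the function names and toy numbers are assumptions, not anything from the original post, and a real proposal would need to specify how the AI estimates these conditional distributions in the first place.

```python
# Illustrative sketch only: score how different the world looks conditional
# on X ("the AI was turned on") versus not-X, summed over monitored variables.

def total_variation(p, q):
    """Total variation distance between two discrete distributions,
    each given as a dict mapping outcomes to probabilities."""
    outcomes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in outcomes)

def impact(dist_given_X, dist_given_notX):
    """Sum the per-variable distance between P(V_i | X) and P(V_i | not-X).
    Both arguments map variable names to distribution dicts."""
    return sum(
        total_variation(dist_given_X[v], dist_given_notX[v])
        for v in dist_given_X
    )

# Toy example: one variable whose distribution barely shifts between branches.
p_given_X    = {"economy": {"boom": 0.50, "bust": 0.50}}
p_given_notX = {"economy": {"boom": 0.49, "bust": 0.51}}
print(impact(p_given_X, p_given_notX))  # ~0.01 -> low measured impact
```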
Here are my concerns:
It’s not clear to me that we can find any verifiably stochastic process. Some interpretations of quantum physics say that there are latent variables that determine the outcomes of seemingly stochastic processes. In that case, merely knowing that the AI was or was not turned on may have implications about those latent variables, which could in turn have large effects on the world.
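To spell the worry out in symbols (my own notation; λ stands for the hypothetical latent variables), conditioning on X also updates beliefs about λ:

$$P(V_i \mid X) = \sum_{\lambda} P(V_i \mid \lambda, X)\,P(\lambda \mid X), \qquad P(V_i \mid \neg X) = \sum_{\lambda} P(V_i \mid \lambda, \neg X)\,P(\lambda \mid \neg X).$$

If $P(\lambda \mid X) \neq P(\lambda \mid \neg X)$, the two conditional distributions over V_i can differ before the AI has taken any action at all, and the AI may take drastic actions to cancel out that difference.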
Further, I can’t find a way to define X and not-X that works, even given a truly stochastic process. You could try to specify X as “the photon makes it through the semi-silvered mirror” and not-X as the negation of this. But conditioning on this and nothing else is insufficient. To see why, note that X and not-X don’t specify anything about the mechanics of how the photon and mirror interact. For example, X as specified above doesn’t rule out universes in which there is no quantum physics as we know it, and the mirror and photon interact in some completely different way that makes the photon more likely to pass through than it would in our universe. Conditioning on X would therefore make it more likely that the AI is in such a universe, which could have large effects on the values of the variables.
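As a purely illustrative calculation (my numbers, not the post’s): suppose the AI’s prior puts probability $10^{-6}$ on an “alternative physics” hypothesis under which the photon passes with probability 0.9 rather than 0.5. Then

$$P(\text{alt} \mid X) = \frac{0.9 \cdot 10^{-6}}{0.9 \cdot 10^{-6} + 0.5\,(1 - 10^{-6})} \approx 1.8 \times 10^{-6},$$

so observing X nearly doubles the probability the AI assigns to that hypothesis, and $P(V_i \mid X)$ inherits that shift.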
You could try to get around this by also including in X and not-X mechanics describing how the photon and semi-silvered mirror interact. The issue is, we don’t actually know exactly how the photon and semi-silvered mirror interact. We have models of quantum physics that have extremely high predictive accuracy in practice, but we still don’t have a theory of everything. So we could include in X and not-X our current understanding of quantum physics, but that understanding might be somewhat wrong.
This could potentially be dangerous. Suppose the AI has invented some technology that, according to the actual, correct theory of everything, would destroy the world, but that, according to our slightly wrong model of quantum physics, would do nothing in particular. Then the AI, with X and not-X conditioning on our current model of quantum physics, would see no problem with creating the technology.
Further still, there’s the question of what to do if the AI in fact doesn’t turn on. You could decide to just turn on an identical AI, and keep trying to turn on AIs until one does turn on. But then the world would be pretty much identical whether X or not-X turns out to be true, because the other AI would do the same thing this one would have.
You could instead turn on some AI that uses the same reduced-impact technique but is different from the current AI in some way that significantly affects the world. However, if you do this, then I think we might arrive at a very bad equilibrium in the AIs’ behavior. Specifically, suppose all the AIs decide to destroy the world if they’re turned on. Then if one of the AIs isn’t turned on, some other AI will be turned on, which would also lead to the world being destroyed. If all the destroyed worlds look pretty much the same, then an AI deciding to destroy the world would have low impact according to your definition.
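To make the bad equilibrium concrete, here is a toy calculation, entirely my own illustration with made-up numbers: if both branches end with the world destroyed, any distance between the two conditional distributions comes out near zero.

```python
# Toy illustration of the bad equilibrium: if every AI in the chain destroys
# the world when turned on, the conditional distributions over the state of
# the world are nearly identical, so the measured "impact" is close to zero.
p_given_X    = {"destroyed": 0.99, "intact": 0.01}  # this AI turns on and destroys the world
p_given_notX = {"destroyed": 0.99, "intact": 0.01}  # a later AI does the same thing

tv = 0.5 * sum(abs(p_given_X[o] - p_given_notX[o]) for o in p_given_X)
print(tv)  # 0.0 -> "low impact" despite the world being destroyed in both branches
```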
So suppose instead that if the AI doesn’t turn on, we rely on some other AI safety technique. If that technique successfully makes a safe AI, then we had no need to create the reduced-impact AI you describe, because the other design would have worked fine. If the alternative technique in fact doesn’t work, and the result of using it is that our world gets destroyed, then conditioning on not-X implies a destroyed world. So an AI that tries to make the world conditional on X similar to the world conditional on not-X would again act to destroy the world.
Luke responded to one of my points, saying, “Yes. [what to do if the AI doesn’t get turned on] needs to be handled with care. A time-delay might be sufficient for this, though.” My response was:
My concern with time-delays is that the AI might make some sort of time-bomb-esque device. For example, suppose the AI knows that if it weren’t turned on, another AI like it would have been turned on three months later. Then both AIs could decide to make a “time bomb” set to go off four months after the first AI was scheduled to potentially turn on. When it goes off, the “time bomb” might destroy the world, convert it into a mass of AI-controlled nanotechnology, or do some other bad thing. This way, neither AI would actually change the world relative to the case in which it was never turned on.