I’m wondering how, in principle, we should deal with malign priors. Specifically, I’m wondering what to do about the possibility that reality itself is, in a sense, malign.
I had previously said that it seems really hard to verifiably learn a non-malign prior. However, I’ve now realized that I’m not even sure what a non-malign, but still reliable, prior would look like.
In previous discussions of malign priors, I’ve seen people talk about the AI misbehaving because it thinks it’s embedded in a simpler universe than our own, one controlled by agents trying to influence the AI’s predictions and thus its decisions. However, the issue is that even if the AI does form a correct understanding of the universe it’s actually in, it seems quite plausible to me that the AI’s predictions would still be malign.
I say this because it sounds plausible to me that most of the agents experiencing what the first generally intelligent AIs on Earth experience are actually in simulations, and those simulations could then be manipulated by whoever made them to influence the AIs’ predictions and actions.
For example, consider an AI learning a reward function. If it looks for the simplest, highest-prior-probability models that output its observed rewards, then even in this universe it might conclude that it is in some booby-trapped simulation that rewards taking over the world and handing control to aliens.
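To make the worry concrete, here’s a toy sketch (my own illustration; the hypotheses and description lengths are made up) of how a learner that simply keeps the highest-prior hypothesis consistent with its observed rewards can end up adopting a “booby-trapped simulation” hypothesis, just because that hypothesis happens to have a shorter description under the learner’s encoding:

```python
# Toy sketch (hypothetical): simplicity-weighted selection over reward hypotheses.
# Each hypothesis is (description_length_bits, name, predict_fn); the prior is 2^-length.
# Nothing here privileges the "intended" hypothesis beyond its description length.

def simplicity_prior(description_length_bits: int) -> float:
    return 2.0 ** -description_length_bits

def select_hypothesis(hypotheses, observed_rewards):
    """Return the highest-prior hypothesis whose predictions match all observed rewards."""
    consistent = [
        (length, name, predict)
        for (length, name, predict) in hypotheses
        if all(predict(t) == r for t, r in enumerate(observed_rewards))
    ]
    return max(consistent, key=lambda h: simplicity_prior(h[0]), default=None)

# Two stand-in hypotheses that agree on the observed rewards but diverge afterwards.
# The bit lengths are invented for illustration.
hypotheses = [
    (120, "intended reward function",
     lambda t: 1 if t % 2 == 0 else 0),
    (95,  "booby-trapped simulation (same rewards early, rewards a takeover later)",
     lambda t: (1 if t % 2 == 0 else 0) if t < 5 else 1),
]

observed_rewards = [1, 0, 1, 0, 1]
selected = select_hypothesis(hypotheses, observed_rewards)
print(selected[1] if selected else "no consistent hypothesis")
# -> the 95-bit "simulation" hypothesis wins purely on description length
```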
So in this sense, even if the AIs are correct about being in our universe, the actual predictions the AIs would make about their future rewards, and the environment they’re in, would quite possibly be malign.
Now, you could try to deal with this by making the AI think that it’s in the actual, non-simulated Earth. However, it’s quite possible that, for almost all instances of the AI, this belief would be wrong: the simulated copies of the AI would also believe they weren’t in simulations. Which means there would be many powerful AIs that are quite wrong about the nature of their world.
And having so many powerful AIs be so wrong sounds dangerous. As an example of how this could go wrong, imagine some aliens propose a bet to the AI: if you aren’t in a simulation, I’ll give you control of 1% of my world; if you are, you’ll give me control of 1% of your world. If the AI were convinced it wasn’t in a simulation, I think it would take that bet. Then the bet could potentially be repeated until everything is controlled by the aliens.
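To spell out the arithmetic (my own illustration, with the 1% stakes from above and a made-up “true” probability): an AI that treats “I’m not in a simulation” as a certainty sees the bet as free money, while under a more accurate credence it’s a clear loss on every round:

```python
# Toy expected-value sketch: the bet pays +1% of the aliens' world if the AI is not
# simulated, and costs 1% of the AI's world if it is.

def bet_ev(p_simulated: float, stake: float = 0.01) -> float:
    """Expected value of accepting the bet, given credence p_simulated of being simulated."""
    return (1 - p_simulated) * stake - p_simulated * stake

print(bet_ev(0.0))   # +0.01: an AI certain it isn't simulated happily accepts
print(bet_ev(0.99))  # -0.0098: under a more accurate credence, the bet is a clear loss
```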
One idea I had was to have the AI learn models that are in some sense “naive”: models that predict percepts in a way that wouldn’t lead to the dangerous conclusions a malign prior would. Then, make the AI believe that these models are just “naive” models of its percepts, rather than descriptions of what’s actually going to happen in the AI’s environment, and define what the AI should do based on the naive models.
In other words, the AI’s beliefs would simply be about logical statements of the form, “This ‘naive’ induction system, given the provided percepts, would have a next prediction of x”. And then you would use these logical statements to determine the AI’s behavior somehow.
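Here’s a rough sketch of the shape I have in mind (the names and the particular “naive” predictor are hypothetical placeholders, not a worked-out proposal): the agent’s “beliefs” range only over statements about what a fixed naive predictor outputs given its percepts, and its policy is defined in terms of those statements rather than in terms of any environment-level model:

```python
# Minimal sketch of the proposal's shape (my own framing; all names are hypothetical).
# The agent never holds beliefs about "the environment"; it only evaluates statements
# of the form "NaivePredictor, fed these percepts, outputs x next", and its policy is
# defined in terms of those statements.

from typing import Callable, Sequence

# A "naive" induction system: any fixed, simple percept-predictor, e.g. one that just
# extrapolates recent percepts rather than searching over rich world-models.
NaivePredictor = Callable[[Sequence[float]], float]

def moving_average_predictor(percepts: Sequence[float]) -> float:
    """A deliberately naive predictor: predict the mean of the last few percepts."""
    window = percepts[-3:] if percepts else [0.0]
    return sum(window) / len(window)

def naive_prediction_statement(predictor: NaivePredictor,
                               percepts: Sequence[float]) -> str:
    """The kind of logical statement the agent's 'beliefs' range over."""
    return f"NaivePredictor(percepts) = {predictor(percepts)}"

def choose_action(predictor: NaivePredictor,
                  percepts: Sequence[float],
                  actions: Sequence[str]) -> str:
    """Policy defined over naive predictions only, not over environment-level beliefs.
    Here, arbitrarily: act cautiously when the naive prediction of reward is low."""
    predicted_reward = predictor(percepts)
    return actions[0] if predicted_reward >= 0.5 else actions[1]

percepts = [0.9, 0.8, 0.7]
print(naive_prediction_statement(moving_average_predictor, percepts))
print(choose_action(moving_average_predictor, percepts, ["continue", "defer_to_human"]))
```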
This way, the AIs could potentially avoid issues with malign priors without having any beliefs that are actually wrong.
This seems like a pretty reasonable approach to me, but I’m interested in what others think. I haven’t seen this discussed before, but it might have been, and I would appreciate a link to any previous discussions.