Epistemic status: AGI safety story followed by contradictory alternative conclusions. My thoughts about this problem are still in flux. I somewhat randomly decided to write it up and post it on Halloween, because the story can be read as a story about a ghost.
My main question is: has this problem been discussed or resolved somewhere already? Pointers or thoughts?
This problem came up in some provable AGI safety work I am doing, where I am trying to find ways to drive a wedge between the implicit axiomatic structure in many AGI agent models, and our understanding of reality. Here is the story.
Dave’s house has a robot butler called PAL. PAL is an intelligent agent in the shape of a little cart with a robot arm on top. Dave has just asked PAL to get him a cup of coffee. There are two routes that PAL can take to the coffee machine. The best one is the short route A, but there is also a longer route B.
PAL contains a computer that decides on the next action it should take by using a predictive world model. PAL considers different actions, running simulations to calculate the effect of each. A utility function is applied to the simulation outcomes to find the best action. For safety reasons, PAL’s world model also predicts what Dave will do: this helps PAL to avoid colliding with Dave. Just before it was asked to fetch the coffee, PAL has improved its world model, allowing it to simulate Dave much better than ever before.
So Dave has just asked PAL to get him a cup of coffee. Dave is used to seeing PAL take route A to the coffee machine, and is initially puzzled because PAL is driving along route B. But then Dave has an epiphany. Dave knows with very high certainty that no PAL computer has ever made a mistake, so he can conclude with equally high certainty that he is no longer Dave. He is [Dave], a simulated version of Dave created inside PAL while it is computing the utility of taking route B. [Dave] has empirical proof that he is living in a simulation! This simulation epiphany fills [Dave] with a new sense of meaning: he not only knows his creator, he also knows the goals of his creator, and he knows how he can help. [Dave] picks up PAL and runs with it to the coffee machine.
Back in the real world, PAL has just calculated that taking route B will bring it to the coffee machine fastest. Dave sees PAL driving along route B, and wonders what happened to make PAL stupid. Or will he?
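PAL's decision procedure in the story can be sketched roughly as follows. This is only a toy illustration: the function names, the travel times, and the `dave_has_epiphany` flag are all assumptions made up for the example, not part of any real agent design.

```python
from typing import Dict

def simulate(route: str, dave_has_epiphany: bool) -> Dict[str, float]:
    """Hypothetical world model: predict the outcome of taking a route.

    The numbers are invented. In the story, [Dave] only has the epiphany
    when PAL unexpectedly takes route B, and then carries PAL to the machine.
    """
    base_seconds = {"A": 30.0, "B": 50.0}[route]
    carried = dave_has_epiphany and route == "B"
    seconds = 10.0 if carried else base_seconds
    return {"seconds": seconds, "epiphany": 1.0 if carried else 0.0}

def utility(outcome: Dict[str, float]) -> float:
    # Faster coffee is better, so utility is the negated travel time.
    return -outcome["seconds"]

def choose_route(dave_has_epiphany: bool) -> str:
    # Simulate every candidate action and pick the one with the highest utility.
    routes = ["A", "B"]
    return max(routes, key=lambda r: utility(simulate(r, dave_has_epiphany)))
```

With a world model in which [Dave] never has an epiphany, `choose_route(False)` picks route A. Once the improved model predicts the epiphany, `choose_route(True)` picks route B, which is the switch that puzzles Dave in the story.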
The simulation epiphany problem is the problem of what conclusions we should draw from the above story. I can think of several possible conclusions, and they can’t all be true. In keeping with the Halloween theme, I will put the scariest conclusion first.
1. As PAL’s world model gets better, its simulation runs will have [Dave] experiencing simulation epiphanies more and more often. These epiphanies introduce unwanted noise in the accuracy of PAL’s predictions, because the predictions are supposed to be about the real world, not about what happens in simulations. PAL is programmed to try to minimize prediction errors, and it finds the following solution. It sets out to convince Dave and all other humans in the real world that they are living in a simulation. Not only that, but they might randomly get shunted into another simulation at any point in time. The realization that you moved from one simulation to another is much less of an epiphany, so this suppresses a lot of the predictive noise.
2. An opposite conclusion is that there is no prediction error at all. Whenever [Dave] has a simulation epiphany, the real Dave will have one too. PAL is not getting inaccurate, it is getting more intelligent. It has just found a new way to get to the coffee machine faster. If we believe this new way does not align with human values, then we can pre-emptively fix this problem by adding a penalty term to PAL’s utility function, to heavily down-rank outcomes where simulation epiphanies happen.
3. Let’s assume that Dave has read the above story too, because it was printed in PAL’s user manual, and that Dave believes that conclusion 2 is the right one. So when [Dave] sees PAL take route B, he will think it most likely that he is still Dave, and that PAL is just trying to trick him into experiencing a simulation epiphany. Having penalty terms about simulation epiphanies may be nice, but as long as Dave has read the user manual, we don’t need to worry too much about Dave.
4. The above is all wrong. The real problem here is that [Dave]’s mind contains information that allows [Dave] to predict that PAL will take route A, and this information interferes with getting a correct result in a simulation where PAL takes route B instead. In other words, we have a 5-and-10 style problem. Adding penalty terms to the utility function does not fundamentally solve this problem; we need to go deeper. To get a clean simulation result, we need to erase certain knowledge from [Dave]’s mind before starting the simulation. (Technical discussions about a type of erasure related to this can be found in posts like Deconfusing Logical Counterfactuals and Decisions with Non-Logical Counterfactuals: request for input.)
5. The above reasoning cannot be correct because it implies that an agent using a less accurate world model containing slightly lobotomized humans will become smarter and/or more aligned. In fact, erasing things from [Dave]’s mind comes with a safety penalty: it lowers PAL’s ability to avoid colliding with Dave, because it will be less accurate in predicting where the real Dave will go.
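To make the penalty term from conclusion 2 a bit more concrete, here is a minimal sketch. The outcome table, the function names, and the penalty size of 1000 are all invented for illustration; how large the penalty should be, and how epiphanies would be detected in a real world model, is left open.

```python
from typing import Dict, Tuple

# Hypothetical predicted outcomes per route: (travel time in seconds, epiphany predicted?).
# Route B is only fast because [Dave] is predicted to pick PAL up and run.
OUTCOMES: Dict[str, Tuple[float, bool]] = {
    "A": (30.0, False),
    "B": (10.0, True),
}

def penalized_utility(seconds: float, epiphany: bool, penalty: float) -> float:
    # Base utility rewards speed; the penalty down-ranks predicted epiphanies.
    return -seconds - (penalty if epiphany else 0.0)

def best_route(penalty: float) -> str:
    # Pick the route whose predicted outcome scores highest under the penalty.
    return max(OUTCOMES, key=lambda r: penalized_utility(*OUTCOMES[r], penalty))
```

With `penalty=0.0` the agent prefers route B, and with a large penalty such as `1000.0` it goes back to route A. Note that this only removes the incentive to exploit epiphanies; it does nothing about the objections raised in conclusions 4 and 5.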
The above story has some elements of the 5-and-10 problem, but it adds an extra twist. My question is: has anything like this been discussed or resolved already?
If Dave and [Dave] can never prove it when they are in a simulation, then we can show that some of the conclusions above become invalid. But here is a modified version of the story. [Dave] sees PAL moving along route A, but then he suddenly notices that he can only see 5 different colors around him, and everything looks like it is made out of polygons…
I’m inclined to think there is no problem here, because [Dave]’s belief that he is in a simulation is unfounded: he is in exactly the same situation that Dave finds himself in later, when PAL takes route B. That is, PAL taking route B does not seem to be evidence of being in a simulation as you suggest, even if PAL normally takes route A and is highly reliable. It could just as easily be that Dave is seeing the result of PAL acting on a simulation in which [Dave] caused PAL to prefer route B. (This assumes there is only one level of simulation; if there is reason to believe there is more than one level, the odds start to tip in favor of simulation.)
Thank you G Gordon and all other posters for your answers and comments! A lot of food for thought here… Below, I’ll try to summarize some general take-aways from the responses.
My main question was whether the simulation epiphany problem had been resolved already somewhere. It looks like the answer is no. Many commenters are leaning towards the significance of case 2 above. I also feel that case 2 is very significant. Taking all comments together, I am starting to feel that the simulation epiphany problem should be disentangled into two separate problems.
Problem 1 is to consider what happens in the limit case when PAL’s simulator is a perfect predictor of what Dave will do. This gets us into game theory, to reason about likely outcomes of the associated Princess Bride type of infinite regress problem.
Problem 2 is to consider, starting from a particular agent design with a perfect predictor, what might happen when the perfect predictor is replaced with an imperfect one. Problem 2 allows for a case-by-case analysis.
In one case, PAL takes route B, and then notices that Dave does not experience the predicted helpful simulation epiphany. In this case we can consider how PAL might adjust its world model, or the world, to make this type of prediction error less likely in future. The possibility that PAL might find it easier to change the world, not the model, might lead us to the conclusion that we had better add penalties for simulation epiphanies to the utility function. (Such penalties create an incentive for PAL to manipulate real-world Dave into never experiencing simulation epiphanies, but most safety mechanisms involve a trade-off, so I could live with such a thing, if nothing better can be found.)
In a second case, suppose that PAL incorrectly predicts that Dave will experience a simulation epiphany when it takes path B, and further that this incorrect prediction projects that Dave concludes from the epiphany that he should attack PAL. This incorrect prediction shows very low utility, so in real life PAL will avoid taking path B. But in this case, there will also never be any error signal that will allow PAL to find out that its prediction was incorrect. What does this tell us? Maybe there is only the trivial conclusion that PAL will need to make some exploration moves occasionally, if it wants to keep improving its world model. If PAL’s designers follow through on this conclusion, and mention it in PAL’s user manual, then this would also lower the probability of Dave ever believing he is in a simulation.
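The second case can be illustrated with a small toy loop. Everything here is an assumption for the sake of the example: the utilities, the simple "explore every N-th step" schedule (a stand-in for whatever exploration scheme a real agent would use), and the idea that one observation fully corrects a prediction.

```python
from typing import Dict

# Hypothetical ground truth: route B is merely slower; Dave never attacks.
TRUE_UTILITY: Dict[str, float] = {"A": -30.0, "B": -50.0}

def run(explore_every: int, steps: int = 20) -> Dict[str, float]:
    """Return PAL's predicted utilities after `steps` coffee runs.

    explore_every=0 means purely greedy behavior with no exploration moves.
    """
    # PAL starts out wrongly predicting that route B leads to an attack.
    predicted = {"A": -30.0, "B": -1000.0}
    for step in range(1, steps + 1):
        greedy = max(predicted, key=predicted.get)
        if explore_every and step % explore_every == 0:
            # Exploration move: deliberately take the non-greedy route.
            route = next(r for r in predicted if r != greedy)
        else:
            route = greedy
        # Observing the real outcome corrects the prediction for that route only.
        predicted[route] = TRUE_UTILITY[route]
    return predicted
```

A purely greedy PAL, `run(0)`, still believes route B has utility -1000 at the end, because it never receives an error signal for it. With occasional exploration, for example `run(5)`, the prediction for route B is corrected to -50, and PAL still ends up preferring route A.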
Happy Halloween!
This story reminds me a little bit of my comment on Parable of the Predict-O-Matic. Similarities include:
An AI is trying to answer X
Answering X correctly will be a boon to the AI’s objective function
The AI can act in a way that increases the likelihood of correctly answering X
In your example, X is the question “Will Dave help me achieve my objective?” In the parable of the Predict-O-Matic, X is more directly “Will my prediction be accurate?”
In both cases, there is a fixed-point/self-fulfilling prophecy where the AI takes an action (going an unusual route/making an unusual prediction) that is expected to improve the objective function in an unexpected way (the unusual route is less efficient in general than the usual route/the prediction affects the outcome).
As for your scenarios...
1.
The purpose of PAL’s models is to reflect the real world. If PAL regularly simulates Simulation Epiphanies but doesn’t observe them in reality, PAL will just directly update their model to not predict Simulation Epiphanies. If PAL cannot update the simulations for whatever reason though, PAL will do their best to get humans to align with their predictions.
2.
I tend to lean toward this conclusion. However, in your story, it seems that PAL can only get away with this once. After all, once Dave helps PAL get to the coffee machine once and notices that he still exists (ie, PAL has chosen to end the simulation instead of starting a new one with updated knowledge on Dave’s behavior), he will likely no longer believe that he is in a simulation. There is a way around this though: PAL could keep the trick working if they constantly maintain a simulation of Dave, or convince Dave that this is happening.
I want to caution you that, while this particular instance of the problem (PAL knowing that they can manipulate Dave into doing what they want by making him believe that he’s being simulated) can be pre-empted, the general problem of PAL solving their objective by behaving in ways that manipulate Dave remains unsolved. If you’re interested in learning about preventing AI from optimizing its objective in ways you don’t want it to, partial agency is something to look at.
3.
Of course, if PAL predicts that Dave thinks he could get manipulated by Simulation Epiphanies, they won’t try the trick in the first place.
But if PAL predicts that Dave predicts PAL would not try to trick him with epiphanies, then PAL will try the trick.
This may create an infinite regress of Dave and PAL trying to predict what level the other is trying to trick them at: a riddle artfully depicted in The Princess Bride.
4.
I don’t think this is quite a 5-10 style problem. The 5-10 problem involves an agent trying to decide on the value of different actions when the counterfactual actions themselves can be taken as evidence of what is valuable.
However, this problem is about an agent trying to reason about another being (Dave) who may or may not be correct about whether he is in a simulation and may or may not run to help PAL if he believes that he is in one. As a result, it’s more Princess Bride style than anything else.
5.
Generally, agents that are smarter are not necessarily more aligned (and often the two are anti-correlated). In the context of this problem though, I don’t think that the AI needs to limit its models of humans; it just needs to accurately model Dave. Correctly predicting simulation epiphanies indicates an accurate model and incorrectly predicting them indicates an inaccurate model.
If Dave and [Dave] could prove that they’re in simulations and in fact go on to do this in actual simulations, this indicates that PAL is not able to simulate Dave and his environment well enough to make good predictions. PAL will consequently give wrong predictions and try to build a better model of the world. It’s also worth noting that, if the simulation world is in five colors and is made out of polygons, then [Dave] likely has not been simulated in enough detail to notice that those things are unusual.
Thanks for pointing this out, it had not occurred to me before. So I conclude that when assessing possible risks and countermeasures here, we must take into account interaction scenarios involving longer time-frames.