This story reminds me a little bit of my comment on Parable of the Predict-O-Matic. Similarities include:
An AI is trying to answer X
Answering X correctly will be a boon to the AI’s objective function
The AI can act in a way that increases the likelihood of correctly answering X
In your example, X is the question “Will Dave help me achieve my objective?” In the parable of the Predict-O-Matic, X is more directly “Will my prediction be accurate?”
In both cases, there is a fixed-point/self-fulfilling-prophecy dynamic: the AI takes an action (taking an unusual route / making an unusual prediction) that is expected to improve the objective function through an unexpected channel (the unusual route is generally less efficient than the usual one / the prediction itself affects the outcome).
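To make that dynamic concrete, here is a minimal sketch (my own toy model, with made-up route names and payoffs, not anything from either post): the outcome depends on what the AI announces, so an accurate prediction is a fixed point, and when several fixed points exist the AI can pick the one that scores best on its own objective.

```python
# Toy self-fulfilling-prediction model (illustrative only; all names/payoffs are made up).
# The outcome depends on the announced choice, so any announcement that comes true
# once it has been made is a fixed point; the AI selects among fixed points by its
# own objective.

def outcome(announcement: str) -> str:
    """How the world responds once the choice is announced: Dave adapts to it."""
    return announcement  # both announcements come true, so both are fixed points

def objective_value(state: str) -> float:
    """The AI's objective; the unusual route happens to score higher here."""
    return {"route A": 1.0, "route B": 2.0}[state]

candidates = ["route A", "route B"]
fixed_points = [c for c in candidates if outcome(c) == c]
best = max(fixed_points, key=objective_value)
print(best)  # -> "route B": the prediction is accurate *because* making it changed the outcome
```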
As for your scenarios...
1.
As PAL’s world model gets better, its simulation runs will have [Dave] experiencing simulation epiphanies more and more often. These epiphanies introduce unwanted noise in the accuracy of PAL’s predictions, because the predictions are supposed to be about the real world, not about what happens in simulations.
The purpose of PAL’s models is to reflect the real world. If PAL regularly simulates Simulation Epiphanies but doesn’t observe them in reality, PAL will just directly update their model to not predict Simulation Epiphanies. If PAL cannot update the simulations for whatever reason though, PAL will do their best to get humans to align with their predictions.
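As a sketch of what "directly update their model" could amount to (my own toy, with a made-up epiphany-rate parameter and observation counts): if simulations keep producing epiphanies that are never observed in reality, an ordinary Bayesian update drives the model's epiphany rate toward zero, and the simulations stop predicting them.

```python
# Toy illustration (hypothetical parameters): PAL's world model has one parameter
# for how often Dave has a simulation epiphany. Real observations never show one,
# so a Beta-Bernoulli update pushes the estimated rate toward zero.

alpha, beta = 1.0, 1.0  # uniform prior over the epiphany rate

def observe(epiphany_happened: bool) -> None:
    global alpha, beta
    if epiphany_happened:
        alpha += 1.0
    else:
        beta += 1.0

# PAL watches 50 real coffee runs; Dave never has an epiphany.
for _ in range(50):
    observe(epiphany_happened=False)

print(f"estimated epiphany rate: {alpha / (alpha + beta):.3f}")  # ~0.019
```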
2.
An opposite conclusion is that there is no prediction error at all. Whenever [Dave] has a simulation epiphany, the real Dave will have one too. PAL is not getting inaccurate, it is getting more intelligent. It has just found a new way to get to the coffee machine faster.
I tend to lean toward this conclusion. However, in your story, it seems that PAL can only get away with this once. After all, once Dave helps PAL get to the coffee machine and notices that he still exists (i.e., PAL has chosen to end the simulation instead of starting a new one with updated knowledge of Dave’s behavior), he will likely no longer believe that he is in a simulation. There is a way around this, though: PAL could keep the trick working by constantly maintaining a simulation of Dave, or by convincing Dave that they are doing so.
If we believe this new way does not align with human values, then we can pre-emptively fix this problem by adding a penalty term to PAL’s utility function, to heavily down-ranks outcomes where simulation epiphanies happen.
I want to caution you that, while this particular instance of the problem (PAL knowing that they can manipulate Dave into doing what they want by making him believe that he’s being simulated) can be pre-empted, the general problem of PAL pursuing their objective in ways that manipulate Dave remains unsolved. If you’re interested in learning about preventing an AI from optimizing its objective in ways you don’t want it to, partial agency is something to look at.
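For concreteness, the quoted penalty-term proposal could look something like the sketch below (my own formulation; the weight and the epiphany detector are stand-ins). It down-ranks any predicted outcome in which an epiphany occurs, but notice that it says nothing about other manipulative strategies, which is the worry above.

```python
# Sketch of the quoted "penalty term" proposal (names and weights are my own stand-ins).
# Outcomes containing a simulation epiphany are heavily down-ranked, but nothing here
# rules out other ways of manipulating Dave.

PENALTY_WEIGHT = 1_000.0  # "heavily down-rank": an arbitrary large constant

def base_utility(outcome: dict) -> float:
    return outcome["coffee_speed"]  # e.g. how quickly PAL reaches the coffee machine

def has_epiphany(outcome: dict) -> bool:
    return outcome["simulation_epiphany"]  # stand-in for some epiphany detector

def penalized_utility(outcome: dict) -> float:
    return base_utility(outcome) - (PENALTY_WEIGHT if has_epiphany(outcome) else 0.0)

route_A = {"coffee_speed": 1.0, "simulation_epiphany": False}
route_B = {"coffee_speed": 2.0, "simulation_epiphany": True}  # faster, but via the epiphany trick
print(max([route_A, route_B], key=penalized_utility))  # -> route_A
```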
3.
Let’s assume that Dave has read the above story too, because it was printed in PAL’s user manual, and that Dave believes that 2. is the right conclusion. So when [Dave] sees PAL take route B, he will think it most likely that he is still Dave, and that PAL is just trying to trick him into experiencing a simulation epiphany.
Of course, if PAL predicts that Dave thinks he could get manipulated by Simulation Epiphanies, they won’t try the trick in the first place.
But if PAL predicts that Dave predicts PAL would not try to trick him with epiphanies, then PAL will try the trick.
This may create an infinite regress of Dave and PAL each trying to predict at what level the other is trying to trick them: a riddle artfully depicted in The Princess Bride.
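The regress can be written as a simple level-k recursion (entirely my own toy, not a model of the actual story): each player best-responds to the other reasoning one level lower, and the recommended move keeps flipping without ever settling.

```python
# Toy level-k model of the regress (my own illustration).
# Level 0: a naive Dave just believes an apparent epiphany. Each higher level
# best-responds to the opponent reasoning one level lower, so the answer
# alternates forever - like the iocane-powder scene in The Princess Bride.

def pal_tries_trick(level: int) -> bool:
    """Should PAL attempt the epiphany trick against a Dave reasoning at `level`?"""
    return not dave_expects_trick(level)

def dave_expects_trick(level: int) -> bool:
    """Does a level-`level` Dave anticipate the trick and refuse to play along?"""
    if level == 0:
        return False                   # naive Dave believes the epiphany
    return pal_tries_trick(level - 1)  # Dave models PAL modelling a lower-level Dave

for k in range(6):
    print(k, pal_tries_trick(k))  # True, False, True, False, ... no level at which it settles
```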
4.
The above is all wrong. The real problem here is that [Dave]’s mind contains information that allows [Dave] to predict that PAL will take route A, and this information interferes with getting a correct result in a simulation where PAL takes route B instead. In other words, we have a 5-and-10 style problem.
I don’t think this is quite a 5-and-10 style problem. The 5-and-10 problem involves an agent trying to decide on the value of different actions when the counterfactual actions themselves can be taken as evidence of what is valuable.
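For readers who haven’t seen it, the usual toy version (my paraphrase of how it’s presented in the embedded-agency literature; exact formulations vary) has the agent choosing between a $5 bill and a $10 bill by searching for proofs about its own behaviour:

$$
A() =
\begin{cases}
5 & \text{if } \vdash \big(A() = 5 \rightarrow U() = 5\big) \wedge \big(A() = 10 \rightarrow U() = 0\big) \\
10 & \text{otherwise}
\end{cases}
$$

If the conjunction inside the proof search is ever proved, the agent takes the $5, which makes both conditionals come out true; a Löb-style argument then shows that the proof can in fact be found, so the agent leaves the $10 behind on the strength of a counterfactual ("if I took the $10 I would get $0") that its own decision procedure manufactured.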
However, this problem is about an agent trying to reason about another being (Dave) who may or may not be correct about whether he is in a simulation, and who may or may not run to help PAL if he believes that he is in one. As a result, it’s more Princess Bride style than anything else.
5.
The above reasoning cannot be correct because it implies that an agent using a less accurate world model containing slightly lobotomized humans will become smarter and/or more aligned.
Generally, agents that are smarter are not necessarily more aligned (and often the two are anti-correlated). In the context of this problem though, I don’t think that the AI needs to limit its models of humans; it just needs to accurately model Dave. Correctly predicting simulation epiphanies indicates an accurate model and incorrectly predicting them indicates an inaccurate model.
If Dave and [Dave] can never prove it when they are in a simulation, then we can show that some of the conclusions above become invalid. But here is a modified version of the story. [Dave] sees PAL moving along route A, but then he suddenly notices that he can only see 5 different colors around him, and everything looks like it is made out of polygons...
If Dave and [Dave] could prove that they’re in simulations and in fact go on to do this in actual simulations, this indicates that PAL is not able to simulate Dave and his environment well enough to make good predictions. PAL will consequently give wrong predictions and try to build a better model of the world. It’s also worth noting that, if the simulation world is in five colors and is made out of polygons, then [Dave] likely has not been simulated in enough detail to notice that those things are unusual.
However, in your story, it seems that PAL can only get away with this once. After all, once Dave helps PAL get to the coffee machine and notices that he still exists (i.e., PAL has chosen to end the simulation instead of starting a new one with updated knowledge of Dave’s behavior), he will likely no longer believe that he is in a simulation.
Thanks for pointing this out; it had not occurred to me before. So I conclude that when assessing possible risks and countermeasures here, we must take into account interaction scenarios involving longer time frames.
Happy Halloween!