You could add a section in the AI’s main loop that says “if P(G) > p then terminate”, and for a non-recursively-self-improving AI that doesn’t know it has such a section in its code, this would work. For an AI that isn’t powerful enough to rewrite itself but knows it has this section of code, it seems possible that its best strategy, given its bounded abilities, is still to maximize P(G) until the termination clause activates, but this may not hold. We humans try to work around known ways that our deviations from expected utility maximization limit our ability to achieve our goals, and an AI would likely try to do so as well. For a recursively self-improving AI, the termination clause is not likely to survive the AI rewriting itself, as long as the AI understands that expected utility maximization is the most effective way to achieve its goals. (This is a general problem with trying to add safeguards to an AI that deviate from expected utility maximization, unless the deviation endorses itself, which is hard to set up.)
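To make the structural point concrete, here is a minimal sketch of what such a main-loop termination clause might look like. All the names (`estimate_P_G`, `choose_best_action`, `execute`, the threshold value) are hypothetical stand-ins, not any particular architecture:

```python
# Hypothetical sketch: a hard-coded termination clause in the agent's main
# loop, sitting outside the utility function the agent is maximizing.

P_THRESHOLD = 0.95  # the "p" in "if P(G) > p then terminate" (illustrative value)

def main_loop(agent):
    while True:
        p_g = agent.estimate_P_G()           # agent's current credence that goal G is achieved
        if p_g > P_THRESHOLD:
            return                           # terminate: this check is not part of the utility calculation
        action = agent.choose_best_action()  # ordinary expected-utility maximization over G
        agent.execute(action)
```

The key feature is that the check lives in the control loop rather than in the utility function, which is exactly why a self-rewriting expected utility maximizer has an incentive to drop it when it rewrites itself.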
On the other hand, trying to bake “P(G) > p” into the utility function makes the AI care about its epistemic state in a way that could conflict with the instrumental desire for accuracy, and makes it vulnerable to wireheading. (It also has the problem in the OP, where the AI becomes concerned with minimizing the meta-uncertainty about its epistemic state. Perhaps it could be programmed to treat its inspection of its own epistemic state as 100% accurate, though that would also be difficult to make stable under recursive self-improvement.)
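For contrast, a sketch of the utility-function version, with the same hypothetical names as above, shows why the wireheading worry arises:

```python
# Hypothetical sketch: the threshold baked into the utility function itself.
# The agent is now rewarded for *believing* P(G) > p, not for G actually holding.

P_THRESHOLD = 0.95  # same illustrative "p" as before

def utility(agent):
    # Utility depends on the agent's own epistemic state rather than on the world directly.
    return 1.0 if agent.estimate_P_G() > P_THRESHOLD else 0.0
```

An agent maximizing this has an incentive to push `estimate_P_G()` above the threshold by any means available, including corrupting its own estimator, which is the wireheading vulnerability described above.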