One high-level concern. If I’m reading this right (and please let me know if I’m not), then this is talking about handling counterfactuals by spawning a copy of the AI with a different utility function.
Just on general principles, spawning a copy of the AI with a different utility function seems really, really dangerous. The new copy would be motivated to trade off anything at all in the no-thermodynamic-miracle scenario in order to increase utility in the event of a thermodynamic miracle. In particular, if the AI were boxed (as we might expect for counterfactual processing) then it would presumably try to get out of that box.
Here’s my explanation of it. Let me know if this helps with your concerns at all:
Imagine we have an AI design we want to test. Call this AI C3PO, and let its utility function be U(A), where A is a world-state from the set of all possible world-states. Let the super-unlikely-event-happening-at-the-specified-time described in the post be w, such that w = true if it happens and w = false if it doesn’t. Then let A* be a world-state from the subset of world-states in which w = true. Basically, A* is A given that w happened (this is how we simulate a “false belief”: we only allow the AI to consider worlds in which w = true). Finally, let C be a constant.
The proposal is that we create a variant of C3PO, C3PO* that has the utility function:
U*(A) = P(!w) * C + P(w) * (U(A*))
If the AI is boxed such that it cannot affect the probability of w occurring and it cannot modify its own utility function, then maximizing U* is exactly the same as maximizing U once event w has occurred (i.e. with the false belief w). In this way, we are able to perfectly simulate C3PO to find out what it would do if w were true, but we don’t actually have to convince it that w is true.
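To make the equivalence concrete, here’s a minimal toy sketch in Python (the actions, payoffs, and probability are made-up numbers for illustration, not anything from the post). It just checks that an agent maximizing U* picks the same action as an agent maximizing U conditional on w, so long as its actions can’t change P(w):

```python
# Toy sketch, not the post's proposal verbatim: two made-up actions, a fixed
# probability for the unlikely event w, and a constant C for the !w branch.

P_W = 0.001   # probability of the unlikely event w; assumed unaffected by the agent's actions
C = 0.0       # constant utility awarded on the !w branch of U*

# Each action's utility under C3PO's original U, split by whether w happens.
actions = {
    "action_1": {"u_if_w": 10.0, "u_if_not_w": 0.0},
    "action_2": {"u_if_w": 2.0,  "u_if_not_w": 100.0},
}

def expected_u(a):
    """E[U] for the unmodified C3PO; the likely !w branch dominates."""
    return (1 - P_W) * actions[a]["u_if_not_w"] + P_W * actions[a]["u_if_w"]

def expected_u_star(a):
    """E[U*] = P(!w) * C + P(w) * E[U | w]; the !w term is action-independent."""
    return (1 - P_W) * C + P_W * actions[a]["u_if_w"]

def expected_u_given_w(a):
    """E[U | w]: what C3PO would maximize if it simply believed w were true."""
    return actions[a]["u_if_w"]

print(max(actions, key=expected_u))          # action_2: plain C3PO mostly ignores w
print(max(actions, key=expected_u_star))     # action_1: C3PO* acts as if w had happened
print(max(actions, key=expected_u_given_w))  # action_1: C3PO under the false belief w

assert max(actions, key=expected_u_star) == max(actions, key=expected_u_given_w)
```

Note that plain C3PO, maximizing E[U] directly, would pick action_2, since the overwhelmingly likely !w branch dominates; C3PO* instead picks exactly what C3PO would pick if it took w for granted, which is the point of the construction.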
Right, that much makes sense. The problem is the “perfectly simulate C3PO” part toward the end. If we really want to see what it would do, then we need a perfect simulation of the environment in addition to C3PO itself. Any imperfection, and C3PO* might realize it’s in a simulated environment. All else equal, once C3PO* knows it’s in a simulated environment, it would presumably try to get out. Since its utility function is different from C3PO’s, it would sometimes be motivated to undermine C3PO (or us, if we’re the ones running the simulation).
Just remember that this isn’t a boxing setup. This is just a way of seeing what an AI will do under a false belief. From what I can tell, the concerns you brought up about it trying to get out aren’t any different between the scenario in which we simulate C3PO* and the one in which we simulate C3PO. The problem of making a simulation indistinguishable from reality is a separate issue.