I was just looking back through 2019 posts, and I think there’s some interesting crosstalk between this post and an insight I recently had (summarized here).
In general, utility maximizers have the form “maximize E[u(X)|blah]”, where u is the utility function and X is a (tuple of) random variables in the agent’s world-model. Implication: utility is a function of random variables in a world model, not a function of world-states. This creates ontology problems because variables in a world-model need not correspond to anything in the real world. For instance, some people earnestly believe in ghosts; ghosts are variables in their world model, and their utility function can depend on how happy the ghosts are.
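A toy sketch of this point (my own illustration, not from the post; the variable and action names are hypothetical): the maximizer computes E[u(X)] entirely inside its world model, so u can depend on a latent variable like "ghosts_happy" whether or not anything real corresponds to it.

```python
def u(x):
    # Utility depends on a latent variable in the model ("ghosts_happy"),
    # which need not correspond to anything in the real world.
    return 1.0 if x["ghosts_happy"] else 0.0

def world_model(action):
    # The agent's world model: a distribution over model variables given an action.
    if action == "hold_seance":
        return [({"ghosts_happy": True}, 0.9), ({"ghosts_happy": False}, 0.1)]
    return [({"ghosts_happy": True}, 0.2), ({"ghosts_happy": False}, 0.8)]

def expected_utility(action):
    # "maximize E[u(X) | action]" -- evaluated purely over the model's variables.
    return sum(p * u(x) for x, p in world_model(action))

best = max(["hold_seance", "do_nothing"], key=expected_utility)
print(best, expected_utility(best))  # hold_seance 0.9
```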
If we accept the “attainable utility” formulation of impact, this brings up some tricky issues. Do we want to conserve attainable values of E[u(X)], or of u(X) directly? The former leads directly to deceit: if there are actions a human can take which will make them think that u(X) is high, then AU is high under the E[u(X)] formulation, even if there is actually nothing corresponding to u(X). (Example: a human has available actions which will make them think that many ghosts are happy, even though there are no actual ghosts, leading to high AU under the E[u(X)] formulation.) On the other hand, if we try to make attainable values u(X) high directly, then there’s a question of what that even means when there’s no real-world thing corresponding to X. What actions in the real world do or do not conserve the attainable levels of happiness of ghosts?
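To make the gap concrete, here is a minimal sketch with hypothetical numbers: a "deceive" action raises the human's believed u(X) without raising the actual u(X), so attainable utility looks high under the E[u(X)] formulation but not under the direct one (and the direct one is only well-defined to the extent X corresponds to something real).

```python
def believed_utility(action):
    # What the human will think u(X) is after the action.
    return {"deceive": 0.9, "genuinely_help": 0.8, "noop": 0.3}[action]

def actual_utility(action):
    # What u(X) "really" is -- ill-defined (here set to 0 for deceit about ghosts)
    # when nothing in the world corresponds to X. That's the ontology problem.
    return {"deceive": 0.0, "genuinely_help": 0.8, "noop": 0.3}[action]

actions = ["deceive", "genuinely_help", "noop"]
attainable_believed = max(believed_utility(a) for a in actions)  # 0.9, achieved via deceit
attainable_actual = max(actual_utility(a) for a in actions)      # 0.8
print(attainable_believed, attainable_actual)
```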
Right. Another E[u(X)] problem would be: the smart AI realizes that if the dumber human keeps thinking, they’ll realize they’re about to drive off a cliff, which would negatively impact their attainable utility estimate. Therefore, distract them.
I forgot to mention this in the sequence, but as you say—the formalisms aren’t quite right enough to use as an explicit objective due to confusions about adjacent areas of agency. AUP-the-method attempts to get around that by penalizing catastrophically disempowering behavior, such that the low-impact AI doesn’t obstruct our ability to get what we want (even though it isn’t going out of its way to empower us, either). We’d be trying to make the agent impact/de facto non-obstructive, even though it isn’t going to be intent non-obstructive.
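For readers who haven’t seen the method: here is a rough sketch of an AUP-style penalty as I understand it (my paraphrase, not a formula from this comment; the numbers and action names are hypothetical). Actions are penalized in proportion to how much they shift attainable utility for auxiliary reward functions relative to doing nothing, which is what makes catastrophically disempowering actions expensive.

```python
def aup_reward(action, primary_reward, aux_q, noop="noop", lam=1.0):
    """aux_q: dict mapping auxiliary-reward name -> {action: Q-value}."""
    # Penalize shifts in attainable utility for the auxiliary goals, relative to no-op.
    penalty = sum(abs(q[action] - q[noop]) for q in aux_q.values()) / len(aux_q)
    return primary_reward - lam * penalty

aux_q = {"aux_1": {"noop": 0.50, "disable_off_switch": 0.95, "fetch_coffee": 0.52}}
print(aup_reward("disable_off_switch", primary_reward=1.0, aux_q=aux_q, lam=5.0))  # -1.25
print(aup_reward("fetch_coffee", primary_reward=0.8, aux_q=aux_q, lam=5.0))        # 0.7
```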