I think these possibilities all share the problem that the constraint makes it essentially impossible to choose any action other than what A would have chosen.
I see I’ve miscommunicated the central idea. Let U be the proposition “the agent will remain a u-maximiser forever”. Agent A acts as if P(U) = 1 (see the entry on value learning). In reality, P(U) is probably very low. So A is a u-maximiser, but one that acts on false beliefs.
Agent B is allowed to have a better estimate of P(U). It can therefore find actions that achieve a higher expected u than A does.
Example: u values rubies deposited in the bank. A will just collect rubies until it can’t carry any more, then go deposit them in the bank. B, knowing that u will be replaced by something else before A has finished collecting, rushes to the bank ahead of that deadline. So E(u|B) > E(u|A).
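To make the comparison concrete, here is a toy numerical sketch of the ruby scenario. All of the rates, times, and the carrying capacity are made-up numbers for illustration only; the point is just that A, acting as if P(U) = 1, deposits too late, while B deposits before the change.

```python
# Toy model of the ruby example (illustrative numbers only).

RATE = 1          # rubies collected per time step
T_CHANGE = 5      # time at which u is replaced by something else
T_FULL = 10       # time at which carrying capacity is reached
TRIP = 1          # time steps needed to travel to the bank and deposit

def u_score(deposit_time, collected):
    """u only counts rubies that are in the bank before u is replaced."""
    return collected if deposit_time <= T_CHANGE else 0

# Agent A acts as if P(U) = 1: it collects until it is full, then heads
# to the bank -- by which point u has already been replaced.
a_collected = RATE * T_FULL
a_deposit_time = T_FULL + TRIP
E_u_A = u_score(a_deposit_time, a_collected)

# Agent B uses the better estimate of P(U): it stops collecting early
# and deposits just before the change.
b_collected = RATE * (T_CHANGE - TRIP)
b_deposit_time = T_CHANGE
E_u_B = u_score(b_deposit_time, b_collected)

print(f"E(u|A) = {E_u_A}")   # 0  -- A's rubies never reach the bank in time
print(f"E(u|B) = {E_u_B}")   # 4  -- fewer rubies, but they count
```

With these (invented) numbers, a smaller haul deposited in time beats a larger haul deposited too late, which is all the inequality E(u|B) > E(u|A) needs.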
And, of course, if B can strictly increase E(u), that gives it some slack to select other actions that can increase Σ p_i v_i.
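A minimal sketch of that “slack” point, assuming the constraint is something like “B’s chosen action must do at least as well on u as A’s would” (the actions, probabilities, and values below are invented for illustration): among all actions that clear that bar, B is free to pick the one that scores best on Σ p_i v_i.

```python
# Hypothetical illustration: among candidate actions that do at least as
# well on u as A's default action, pick the one maximising sum_i p_i * v_i.
# All names and numbers are made up.

candidate_actions = {
    # action: (E(u | action), [v_1(action), v_2(action)])
    "A_default":              (3.0, [1.0, 0.0]),
    "deposit_early":          (4.0, [1.0, 2.0]),
    "deposit_early_and_help": (3.5, [2.0, 3.0]),
}
p = [0.6, 0.4]  # probabilities assigned to the candidate future values v_1, v_2

baseline = candidate_actions["A_default"][0]

def mixture_score(values):
    """Weighted sum  sum_i p_i * v_i  for one action."""
    return sum(p_i * v_i for p_i, v_i in zip(p, values))

# Keep only actions that don't fall below A's expected u, then maximise the mixture.
admissible = {a: (eu, vs) for a, (eu, vs) in candidate_actions.items() if eu >= baseline}
best = max(admissible, key=lambda a: mixture_score(admissible[a][1]))
print(best)  # "deposit_early_and_help": same-or-better u, higher sum_i p_i v_i
```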