So “updateless” is a reasonable term to apply to BoMAI, but it’s not an updateless decision theorist in your sense (if I understand correctly). An updateless decision theorist picks a policy that has the best consequences, without making the assumption that its choice of policy affects the world only through the actions it picks. It considers the possibility that another agent will be able to perfectly simulate it, so if it picks policy 1 at the start, the other agent will simulate it following policy 1, and if it picks policy 2, the other agent will simulate it following policy 2. Since this effect isn’t mediated by the actual choice of action, updatelessness ends up having consequences.
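To make that contrast concrete, here is a minimal sketch of counterfactual mugging, a standard example I’m bringing in for illustration (it isn’t tied to BoMAI): the payout in one branch depends on a simulation of the agent’s policy rather than on any action the agent actually takes there, so the policy-level (updateless) evaluation and the in-the-moment causal evaluation come apart.

```python
# Counterfactual mugging (illustrative example, my numbers):
# Omega flips a fair coin. On tails it asks the agent to pay 100; on heads it
# pays 10,000 iff simulating the agent's *policy* says the agent would pay on tails.
# The policy choice affects the heads branch through the simulation, not through
# any action the agent actually takes in that branch.

def policy_value(pays_on_tails: bool) -> float:
    """Expected value of committing to a policy before the coin flip (updateless view)."""
    heads = 10_000 if pays_on_tails else 0   # payout driven by the simulated policy
    tails = -100 if pays_on_tails else 0     # cost of actually paying on tails
    return 0.5 * heads + 0.5 * tails

def cdt_value_at_tails(pays_now: bool) -> float:
    """Causal value of the action once the agent is already at the tails node:
    the heads-branch payout was fixed by the simulation, so paying only costs money."""
    return -100 if pays_now else 0

print(policy_value(True), policy_value(False))              # 4950.0 vs 0.0 -> commit to paying
print(cdt_value_at_tails(True), cdt_value_at_tails(False))  # -100 vs 0 -> refuse on the spot
```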
If an agent picks an expectimax policy under the assumption that the only way this choice impacts the environment is through the actions it takes (which BoMAI assumes), then it’s isomorphic whether it computes the ν̂^(i)-expectimax as it goes, or all at once at the beginning. The policy computed at the beginning will include contingencies for whatever midway-through-the-episode position the agent might land in, and as for what to do at that point, it’s the same calculation being run. And this calculation is CDT.
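Here is a toy sketch of that isomorphism, under assumptions I’m adding for illustration (a single known environment standing in for ν̂^(i), a two-step episode, two actions, two observations): the value of the best policy chosen all at once at the start equals the value obtained by rerunning the expectimax calculation at every history.

```python
import itertools

# Toy episodic environment with known dynamics (standing in for the belief ν̂^(i)).
# A "history" is a tuple of (action, observation) pairs within the episode.
ACTIONS = [0, 1]
OBS = [0, 1]
HORIZON = 2

def trans_prob(history, action, obs):
    # Arbitrary fixed dynamics for the sketch.
    base = 0.7 if action == 0 else 0.4
    p1 = base if len(history) == 0 else 1 - base
    return p1 if obs == 1 else 1 - p1

def reward(history, action, obs):
    return obs + (0.5 if action == 1 else 0.0)

def value_of_policy(policy, history=()):
    """Expected return of a fixed policy (dict: history -> action) chosen at the start."""
    if len(history) == HORIZON:
        return 0.0
    a = policy[history]
    return sum(trans_prob(history, a, o) *
               (reward(history, a, o) + value_of_policy(policy, history + ((a, o),)))
               for o in OBS)

def expectimax_value(history=()):
    """Value of picking the expectimax action afresh at each history (the per-step calculation)."""
    if len(history) == HORIZON:
        return 0.0
    return max(sum(trans_prob(history, a, o) *
                   (reward(history, a, o) + expectimax_value(history + ((a, o),)))
                   for o in OBS)
               for a in ACTIONS)

def all_histories(history=()):
    yield history
    if len(history) < HORIZON:
        for a in ACTIONS:
            for o in OBS:
                yield from all_histories(history + ((a, o),))

# Enumerate every deterministic policy and pick the best one "all at once at the beginning".
histories = [h for h in all_histories() if len(h) < HORIZON]
best_upfront = max(
    (dict(zip(histories, choice)) for choice in itertools.product(ACTIONS, repeat=len(histories))),
    key=value_of_policy,
)

print(value_of_policy(best_upfront))  # value of the policy chosen upfront
print(expectimax_value())             # value of computing expectimax as it goes
# The two agree: when the policy only affects the world through the actions it takes,
# choosing the policy at the start and choosing actions step by step is the same calculation.
```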
I guess this means (and I’ve never thought about this before, so this could easily be wrong) that under the assumption that a policy’s effect on the world is screened off by which actions it takes, CDT is reflectively stable.
(And yes, you could just give one reward, which ends the episode.)