BoMAI is a causal decision theorist.
My concern is that since CDT is not reflectively stable, it may have incentives to create non-CDT agents in order to fulfill instrumental goals.
If I understand correctly, it’s actually updateless within an episode, and the current episode is the only thing it cares about, so I don’t see how it would fail to be reflectively stable. Plus, even if it had an incentive to create a non-CDT agent, it would have to do so by outputting some message to the operator, and the operator wouldn’t be able to create a non-CDT agent without leaving the room, which would end the episode. (I guess it could hack the operator’s mind and create a non-CDT agent within it, but at that point it might as well just make the operator give it max rewards.)
With the correction that it is updateless and CDT (see here), I agree with the rest of this.
It’s actually updateless within an episode, right, because it finds the optimal policy at the beginning of an episode and then uses that same policy throughout? I think this means there’s no reason for the operator to enter a reward at every timestep; instead we could let them just input a total reward at any point in the episode (or they could enter any number of rewards, with the optimization based on the last reward entered). This would remove the incentive for the AI to keep the operator in the room as long as possible, and would partially address item 4 in my comment.
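To spell out the incentive difference, here’s a tiny sketch of the two return definitions (my own framing and names, not anything from the paper):

```python
# Hypothetical comparison of the two reward schemes discussed above.

def episode_return_per_timestep(rewards):
    # A reward entered at every timestep: the return grows with episode
    # length, so keeping the operator in the room longer can pay off.
    return sum(rewards)

def episode_return_last_entered(rewards):
    # Only the last reward entered counts: prolonging the episode adds
    # nothing by itself, so the length-based incentive disappears.
    return rewards[-1] if rewards else 0.0

print(episode_return_per_timestep([0.3, 0.3, 0.3]))   # 0.9 -- longer is better
print(episode_return_last_entered([0.3, 0.3, 0.3]))   # 0.3 -- length is irrelevant
```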
So “updateless” is a reasonable term to apply to BoMAI, but it’s not an updateless decision theorist in your sense (if I understand correctly). An updateless decision theorist picks the policy with the best consequences, without assuming that its choice of policy affects the world only through the actions it picks. It considers the possibility that another agent will be able to perfectly simulate it, so that if it picks policy 1 at the start, the other agent will simulate it following policy 1, and if it picks policy 2, the other agent will simulate it following policy 2. Since this is an effect that isn’t mediated by the actual choice of action, updatelessness ends up having consequences.
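To make that concrete, here’s a toy Newcomb-style sketch (my own construction with hypothetical names, not anything from the paper) where the policy affects the world through a simulator rather than through the action actually taken:

```python
# A predictor simulates the agent's policy and fills box B accordingly, so the
# box contents depend on the policy itself, not on the action eventually taken.

def payoff(policy):
    box_b = 1_000_000 if policy() == "one-box" else 0   # effect of the policy
    action = policy()                                    # effect of the action
    return box_b + (1_000 if action == "two-box" else 0)

# Choosing at the policy level (the updateless move): rank whole policies by
# their consequences, including the simulator's response.
policies = {"one-box": lambda: "one-box", "two-box": lambda: "two-box"}
best = max(policies, key=lambda name: payoff(policies[name]))
print(best, payoff(policies[best]))   # one-box 1000000

# A CDT-style evaluation holds the box contents fixed when ranking actions, so
# it prefers two-boxing; the policy-mediated effect is invisible to it.
```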
If an agent picks an expectimax policy under the assumption that the only way this choice impacts the environment is through the actions it takes (which BoMAI assumes), then it’s isomorphic whether it computes the $\hat{\nu}^{(i)}$-expectimax action as it goes, or all at once at the beginning. The policy computed at the beginning will include contingencies for whatever midway-through-the-episode position the agent might land in, and as for what to do at that point, it’s the same calculation being run. And that calculation is CDT.
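Here’s a toy sketch of that equivalence (a small hypothetical model of my own, not the actual BoMAI construction): the policy “computed all at once at the beginning” is just a table of the same expectimax choices that mid-episode recomputation would produce.

```python
# Toy check: planning the whole contingent policy at the start of the episode
# agrees with recomputing the expectimax action as the episode goes, when the
# fixed model `nu` is only affected by the actions taken.

ACTIONS = [0, 1]
HORIZON = 3

def nu(history, action):
    # Hypothetical fixed world model: list of ((observation, reward), prob)
    # pairs, given the history of (action, observation, reward) triples so far.
    p = 0.7 if action == 1 else 0.4
    return [((1, 1.0), p), ((0, 0.0), 1.0 - p)]

def q_value(history, action):
    # Expected reward-to-go from taking `action` now and acting optimally after.
    return sum(prob * (r + value(history + ((action, o, r),)))
               for (o, r), prob in nu(history, action))

def value(history):
    if len(history) == HORIZON:
        return 0.0
    return max(q_value(history, a) for a in ACTIONS)

def greedy_action(history):
    # The expectimax choice computed "as it goes", from the current position.
    return max(ACTIONS, key=lambda a: q_value(history, a))

def plan_at_start(history=()):
    # The expectimax policy computed "all at once at the beginning": a full
    # contingent plan covering every position the agent might land in.
    if len(history) == HORIZON:
        return {}
    policy = {history: greedy_action(history)}
    for a in ACTIONS:
        for (o, r), _ in nu(history, a):
            policy.update(plan_at_start(history + ((a, o, r),)))
    return policy

# The upfront policy and the mid-episode recomputation prescribe the same
# action at every reachable position -- it is literally the same calculation.
upfront = plan_at_start()
assert all(upfront[h] == greedy_action(h) for h in upfront)
```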
I guess this means (and I’ve never thought about this before, so this could easily be wrong) that under the assumption that a policy’s effect on the world is screened off by which actions it takes, CDT is reflectively stable.
(And yes, you could just give one reward, which ends the episode.)