Are there UDT-ish concerns with breaking isolation of episodes? For example, if the AI receives a low reward at the beginning of episode 117, does it have an incentive to manipulate the external world to make episode 117 happen many times somehow, with most of these times giving it a higher reward? For another example, can the AI at episode 117 realize that it’s in a game theory situation with the AI at episodes 116 and 118 and trade rewards with them acausally, leading to long-term goal directed behavior?
BoMAI is a causal decision theorist.
My concern is that since CDT is not reflectively stable, it may have incentives to create non-CDT agents in order to fulfill instrumental goals.
If I understand correctly, it’s actually updateless within an episode, and that’s the only thing it cares about, so I don’t see how it would not be reflectively stable. Plus, even if it had an incentive to create a non-CDT agent, it would have to do that by outputting some message to the operator, and the operator wouldn’t have the ability to create a non-CDT agent without leaving the room, which would end the episode. (I guess it could hack the operator’s mind and create a non-CDT agent within it, but at that point it might as well just make the operator give it max rewards.)
With the correction that it is updateless and CDT (see here), I agree with the rest of this.
It’s actually updateless within an episode, right, because it finds the optimal policy at the beginning of an episode and then the same policy is used throughout the episode? I think this means there’s no reason for the operator to enter a reward at every timestep; instead, we could let them just input a total reward at any time in the episode (or they could enter any number of rewards, and the optimization is based on the last reward entered). This would remove the incentive for the AI to keep the operator in the room as long as possible, and would partially address item 4 in my comment.
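To make the proposal concrete, here is a minimal sketch (my own, not anything from the paper) of the two objectives being compared; only the second removes the incentive to prolong the operator’s stay:

```python
# Hypothetical sketch: `rewards` is the list of rewards the operator has
# entered so far in the current episode.

def summed_return(rewards):
    # Current setup: a reward is entered every timestep and the agent
    # optimizes the sum, so a longer visit can only add to the total.
    return sum(rewards)

def last_entered_return(rewards):
    # Proposed variant: only the last reward entered counts, so keeping the
    # operator in the room longer does not by itself earn more reward.
    return rewards[-1] if rewards else 0.0
```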
So “updateless” is a reasonable term to apply to BoMAI, but it’s not an updateless decision theorist in your sense (if I understand correctly). An updateless decision theorist picks a policy that has the best consequences, without making the assumption that its choice of policy affects the world only through the actions it picks. It considers the possibility that another agent will be able to perfectly simulate it, so if it picks policy 1 at the start, the other agent will simulate it following policy 1, and if it picks policy 2, the other agent will simulate it following policy 2. Since this is an effect that isn’t mediated by the actual choice of action, updatelessness ends up having consequences.
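For concreteness, here is a toy Newcomb-style illustration of that distinction (my own sketch, not from the post): the predictor’s behavior depends on the agent’s policy via simulation, an effect that isn’t mediated by the action actually taken.

```python
# Toy Newcomb-like payoff: a predictor fills the opaque box iff it simulated
# the agent's policy as one-boxing; two-boxing always adds a small visible bonus.

def payoff(predicted_policy, action):
    opaque = 100 if predicted_policy == "one-box" else 0
    visible = 10 if action == "two-box" else 0
    return opaque + visible

# Policy-level (updateless) evaluation: choosing a policy also fixes what the
# predictor's simulation saw.
for policy in ["one-box", "two-box"]:
    print(policy, payoff(predicted_policy=policy, action=policy))
# one-box 100
# two-box 10

# Action-level evaluation with the prediction held fixed (the "choice only
# matters through the action" assumption): two-boxing always looks better.
for prediction in ["one-box", "two-box"]:
    for action in ["one-box", "two-box"]:
        print(prediction, action, payoff(prediction, action))
```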
If an agent picks an expectimax policy under the assumption that the only way this choice impacts the environment is through the actions it takes (which BoMAI assumes), then it’s isomorphic whether it computes the $\hat{\nu}^{(i)}$-expectimax as it goes, or all at once at the beginning. The policy at the beginning will include contingencies for whatever midway-through-the-episode position the agent might land in, and as for what to do at that point, it’s the same calculation being run. And this calculation is CDT.
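A minimal sketch of that equivalence (my own toy code; `nu` here is a stand-in world-model, not the paper’s $\hat{\nu}^{(i)}$): re-running the same expectimax calculation from a midway-through-the-episode history yields the action that the start-of-episode policy already prescribed for that history.

```python
# Expectimax over a fixed world-model. `nu(history)` returns a dict
# action -> [(probability, observation, reward), ...] for the next timestep.

def expectimax(nu, history, horizon):
    """Return (value, best_action) for the remaining `horizon` timesteps."""
    if horizon == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for action, outcomes in nu(history).items():
        value = sum(
            p * (r + expectimax(nu, history + ((action, obs, r),), horizon - 1)[0])
            for p, obs, r in outcomes
        )
        if value > best_value:
            best_value, best_action = value, action
    return best_value, best_action

def toy_nu(history):
    # A trivial world-model: action "a" always pays 1, action "b" always pays 0.
    return {"a": [(1.0, "obs", 1.0)], "b": [(1.0, "obs", 0.0)]}

# Planned at the start of a 3-step episode:
print(expectimax(toy_nu, (), 3))                      # (3.0, 'a')
# Re-planned after one step has actually happened:
print(expectimax(toy_nu, (("a", "obs", 1.0),), 2))    # (2.0, 'a')
```

Because the world-model’s predictions depend only on the history of actions and percepts, when or where this calculation is run doesn’t change its answer.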
I guess this means (and I’ve never thought about this before, so this could easily be wrong) that under the assumption that a policy’s effect on the world is screened off by which actions it takes, CDT is reflectively stable.
(And yes, you could just give one reward, which ends the episode.)
Regarding “does it have an incentive to manipulate the external world to make episode 117 happen many times somehow”: for any given world-model, episode 117 is just a string of actions on the input tape, and observations and rewards on the output tape (positions (m+1)*117 through (m+1)*118 − 1, if you care). In none of these world-models, under any actions that it considers, does “episode 117 happen twice.”
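Just to spell out the parenthetical indexing (a hypothetical helper, not anything in the paper), with m + 1 timesteps per episode:

```python
# Episode i occupies a fixed, contiguous block of positions on the tapes.

def episode_positions(i, m):
    # Positions (m+1)*i through (m+1)*(i+1) - 1, inclusive.
    return range((m + 1) * i, (m + 1) * (i + 1))

print(list(episode_positions(117, m=2)))  # [351, 352, 353]
```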
Yes, episode 117 happens only once in the world-model; and suppose the agent cares only about episode 117 in the “current execution”. The concern still holds: the agent might write a malign output that would result in additional invocations of itself in which episode 117 ends with the agent getting a high reward. Note that the agent does not care about the other executions of itself. The only purpose of the malign output is to increase the probability that the “current execution” is one that ends with the agent receiving a high reward.
Okay, so I think you could construct a world-model that reflects this sort of reasoning, where it associates reward with the reward provided to a randomly sampled instance of its algorithm in the world. But the “malign output that would result in additional invocations of itself” would require the operator to leave the room, so this has the same form as, for example, ν†. At this point, I think we’re no longer considering anything that sounds like “episode 117 happening twice,” but that’s fine. Also, just as a side note: this world-model would get ruled out if the rewards/observations provided to the two separate instances ever diverge.
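To illustrate that side note (my own toy framing, with deterministic predictions for simplicity): in the Bayesian mixture, a world-model that generates this episode’s percepts by copying some other instance of the agent loses all posterior weight the first time its prediction diverges from what is actually observed.

```python
# Toy posterior update over deterministic world-models: a model is eliminated
# as soon as it predicts a percept that differs from the one actually received.

def update_weights(weights, predictions, actual_percept):
    posterior = {
        model: (w if predictions[model] == actual_percept else 0.0)
        for model, w in weights.items()
    }
    total = sum(posterior.values())
    return {m: w / total for m, w in posterior.items()} if total else posterior

weights = {"honest": 0.5, "copies-other-instance": 0.5}
predictions = {"honest": ("obs", 1.0), "copies-other-instance": ("obs", 0.0)}
print(update_weights(weights, predictions, actual_percept=("obs", 1.0)))
# {'honest': 1.0, 'copies-other-instance': 0.0}
```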