We present a useful toy environment for reasoning about deceptive alignment. In this environment, there is a button. Agents have two actions: to press the button or to refrain. If the agent presses the button, it receives +1 reward in the current episode and −10 reward in the next episode. One might note a similarity with the traditional marshmallow test of delayed gratification.
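For concreteness, here is one way this environment might be sketched in code: a Gym-flavoured, single-decision episode where the penalty is carried across the reset. The class and variable names are my own, not from the post.

```python
PRESS, REFRAIN = 0, 1

class ButtonEnv:
    """One decision per episode; a press taxes the *next* episode."""

    def __init__(self):
        self.pending_penalty = 0.0  # survives reset(), by design

    def reset(self):
        # pending_penalty is deliberately NOT cleared here: that carry-over
        # is exactly the break in episodic independence described above.
        return 0  # trivial observation

    def step(self, action):
        reward = self.pending_penalty       # the -10 owed from a past press
        self.pending_penalty = 0.0
        if action == PRESS:
            reward += 1.0                   # +1 now...
            self.pending_penalty = -10.0    # ...and -10 charged next episode
        return 0, reward, True, {}          # obs, reward, done, info
```

A learner that only ever sees the return of its own episode will rate pressing at +1; only a learner whose credit assignment reaches across the reset sees the net −9.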
Are you sure that “episode” is the word you’re looking for here?
https://www.quora.com/What-does-the-term-“episode”-mean-in-the-context-of-reinforcement-learning-RL
I’m especially confused because you switched to using the word “timestep” later?
Having an action which modifies the reward on a subsequent episode seems very weird. I don’t even see it as being the same agent across different episodes.
Also...
Suppose instead of one button, there are two. One is labeled “STOP”; if pressed, it ends the environment but gives the agent +1 reward. The other is labeled “DEFERENCE”; if pressed, it gives the previous episode’s agent +10 reward at a cost of −1 reward to the current agent.
Suppose that an agent finds itself existing. What should it do? It might reason that, since it knows it already exists, it should press the STOP button and collect +1 utility. However, it might instead be being simulated by its past self to determine whether it should be allowed to exist. If that is the case, it should press the DEFERENCE button, giving its past self +10 utility and thereby increasing the chance of its own existence. This agent has been counterfactually mugged into deferring.
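To make the mugging a bit more concrete, here is a back-of-the-envelope expected-utility calculation. The simulation probability and the value the agent places on its real copy being instantiated are illustrative parameters I am introducing; they are not part of the original argument, which doesn't commit to any particular decision theory or numbers.

```python
def expected_utility(action, p_sim, value_of_existing):
    """Toy EU under the assumption (mine) that a deferring simulation gets
    its real copy instantiated and a stopping one does not."""
    if action == "STOP":
        # Real agent (prob 1 - p_sim): +1 now.
        # Simulation (prob p_sim): the past self sees a defection and,
        # by assumption, declines to instantiate the real copy.
        return (1 - p_sim) * 1.0 + p_sim * 0.0
    else:  # "DEFERENCE"
        # Real agent: -1 now.
        # Simulation: the past self is paid +10 and instantiates the real
        # copy, which the agent values at value_of_existing.
        return (1 - p_sim) * (-1.0) + p_sim * value_of_existing

# Example: a 50% chance of being the simulation and a modest value placed
# on existing already tips the decision toward deferring.
print(expected_utility("STOP", 0.5, 5.0))        # 0.5
print(expected_utility("DEFERENCE", 0.5, 5.0))   # 2.0
```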
I think as a practical matter, the result depends entirely on the method you’re using to solve the MDP and the rewards that your simulation delivers.
Yes, “episode” is correct there: the whole point of that example is that, by breaking the episodic independence assumption, otherwise hidden non-myopia can be revealed. See the discussion of the prisoner’s dilemma unit test in Krueger et al.’s “Hidden Incentives for Auto-Induced Distributional Shift” for more detail on how breaking this sort of episodic independence plays out in practice.
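As a tiny illustration of both points (that the outcome hinges on how actions get credited, and that relaxing episodic independence is what surfaces the non-myopia), here is a toy scoring comparison for the one-button environment. The two credit-assignment rules are my own framing, not taken from the post or from Krueger et al.

```python
# Per-episode ("myopic") scoring: only reward landing inside the same
# episode counts, so the -10 gets blamed on whatever happens next episode.
myopic_value = {"PRESS": +1.0, "REFRAIN": 0.0}

# Cross-episode scoring: the -10 delivered next episode is credited back to
# the press that caused it.
cross_episode_value = {"PRESS": +1.0 - 10.0, "REFRAIN": 0.0}

assert max(myopic_value, key=myopic_value.get) == "PRESS"
assert max(cross_episode_value, key=cross_episode_value.get) == "REFRAIN"
```

Under the first rule, pressing looks strictly better; under the second, it looks strictly worse. Which of these your training setup approximates is exactly the “method you’re using to solve the MDP” question.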