I second the suggestion to state what is being proved before proving it.
One important note is that CDT spectacularly fails this property. Namely, consider a game of matching pennies against a powerful predictor. Since the environment takes actions as input, it's possible to recompute what would have happened had a different action been plugged in. The CDT agent that keeps losing will learn to randomize between actions, since it keeps seeing that the action it didn't take would have done better. So it eventually reaches a state where it predicts the reward from "pick heads" and from "pick tails" is 0.5 each (because there's a 50% chance it doesn't pick that action, in which case that action would have won), yet it predicts the reward from "I take an action" is 0, violating this assumption.
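A minimal simulation sketch of this failure mode (everything here is a hypothetical toy model, not anyone's actual agent design): the agent randomizes uniformly, the predictor is perfect, and the CDT-style value estimate for each action is formed by plugging that action into the environment with the predictor's move held fixed.

```python
import random

random.seed(0)
rounds = 10_000
counterfactual = {"heads": [], "tails": []}  # CDT value samples per action
actual = []  # rewards the agent really receives

for _ in range(rounds):
    action = random.choice(["heads", "tails"])
    predictor = action  # a perfect predictor matches the agent's actual pick
    # matching pennies (agent as the mismatcher): win (reward 1) iff mismatch
    actual.append(0 if predictor == action else 1)
    # CDT-style recomputation: hold the predictor's move fixed and plug in
    # each action, recording what the environment would have paid out
    for a in ("heads", "tails"):
        counterfactual[a].append(0 if predictor == a else 1)

avg = lambda xs: sum(xs) / len(xs)
# each action's counterfactual value is near 0.5, but the realized reward is 0
print(round(avg(counterfactual["heads"]), 1),
      round(avg(counterfactual["tails"]), 1),
      avg(actual))
```

The counterfactual estimate for "heads" averages roughly 0.5, because on the half of rounds where the agent picked tails, the fixed predictor move makes "heads" look like a win, while the reward actually received is 0 on every round.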
Note, however, that ordinary Bayes-optimal RL works perfectly (assuming there are no traps in the prior or paranoia is otherwise avoided), since it would believe that taking a certain action causes the predictor to make the optimal response. This is similar to RL one-boxing in the repeated Newcomb’s problem.
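By contrast, here is a toy sketch of the repeated-Newcomb point (again purely illustrative, with hypothetical names and payoffs): an ordinary model-free learner that updates only on rewards it actually observes, with no counterfactual plug-ins, quickly learns to one-box against a perfect predictor.

```python
import random

random.seed(0)
q = {"one-box": 0.0, "two-box": 0.0}  # running average of observed reward
n = {"one-box": 0, "two-box": 0}

for _ in range(5_000):
    # epsilon-greedy over observed returns only
    if random.random() < 0.1:
        action = random.choice(list(q))
    else:
        action = max(q, key=q.get)
    # perfect predictor: the opaque box is full iff the agent one-boxes
    opaque = 1_000_000 if action == "one-box" else 0
    reward = opaque if action == "one-box" else opaque + 1_000
    n[action] += 1
    q[action] += (reward - q[action]) / n[action]  # incremental mean update

print(max(q, key=q.get))  # the learned policy one-boxes
```

Because it only ever sees that one-boxing pays 1,000,000 and two-boxing pays 1,000, the learner behaves as if its action causes the predictor's response, which is exactly what serves it well here.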