Imagine we only care about the reward we get next turn. How many goals choose Candy over Wait? Well, it’s 50-50 – since we randomly choose a number between 0 and 1 for each state, both states have an equal chance of being maximal.
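(As a quick sanity check of that 50-50 figure, here is a minimal Monte Carlo sketch in Python; the state names and the unif(0,1) reward draws come from the setup above, and everything else is just illustration.)

```python
import random

# Sample many reward functions: each draws an independent unif(0,1) reward
# for Candy and for Wait!. A myopic (next-turn-only) agent picks whichever
# state has the higher sampled reward.
trials = 100_000
candy_wins = sum(random.random() > random.random() for _ in range(trials))
print(candy_wins / trials)  # roughly 0.5, i.e. the 50-50 split
```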
I got a little confused at the introduction of Wait!, but I think I understand it now. So, to check my understanding, and for the benefit of others, some notes:
- the agent gets a reward for the Wait! state, just like the other states
- for terminal states (the three non-Wait! states), the agent stays in that state, and keeps getting the same reward for all future time steps
- so, when comparing Candy vs Wait! + Chocolate, the rewards after three turns would be (R_candy + γ * R_candy + γ^2 * R_candy) vs (R_wait + γ * R_chocolate + γ^2 * R_chocolate)
(I had at first assumed the agent got no reward for Wait!, and also failed to realize that the agent keeps getting the reward for the terminal state indefinitely, and so thought it was just about comparing different one-time rewards.)
Yes. The full expansions (with no limit on the time horizon) are
$\frac{r_{\text{Candy}}}{1-\gamma}$, $r_{\text{Wait!}} + \frac{\gamma\, r_{\text{Chocolate}}}{1-\gamma}$, and $r_{\text{Wait!}} + \frac{\gamma\, r_{\text{Hug}}}{1-\gamma}$, where $r_{\text{Candy}}, r_{\text{Wait!}}, r_{\text{Chocolate}}, r_{\text{Hug}} \sim \text{unif}(0,1)$.
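(For anyone who wants to see those expansions in action, here is a rough Python sketch. It assumes the agent simply picks whichever of the three expansions is largest, and the choice of γ = 0.9 is an arbitrary illustration, not something from the post.)

```python
import random

def preference_fractions(gamma: float, trials: int = 100_000) -> dict:
    """Estimate how often each path maximizes the discounted return when
    r_Candy, r_Wait!, r_Chocolate, r_Hug are drawn i.i.d. from unif(0,1)."""
    counts = {"Candy": 0, "Wait!->Chocolate": 0, "Wait!->Hug": 0}
    for _ in range(trials):
        r_candy, r_wait, r_choc, r_hug = (random.random() for _ in range(4))
        values = {
            "Candy": r_candy / (1 - gamma),
            "Wait!->Chocolate": r_wait + gamma * r_choc / (1 - gamma),
            "Wait!->Hug": r_wait + gamma * r_hug / (1 - gamma),
        }
        counts[max(values, key=values.get)] += 1
    return {path: n / trials for path, n in counts.items()}

print(preference_fractions(gamma=0.9))
```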