Imagine we only care about the reward we get next turn. How many goals choose Candy over Wait? Well, it’s 50-50 – since we randomly choose a number between 0 and 1 for each state, both states have an equal chance of being maximal.
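(As a quick sanity check of that 50-50 figure, here is a minimal Monte Carlo sketch in Python; the state names and the unif(0,1) reward draws come from the setup above, and everything else is just illustration.)

```python
import random

# Sample many reward functions: each draws an independent unif(0,1) reward
# for Candy and for Wait!. A myopic (next-turn-only) agent picks whichever
# state has the higher sampled reward.
trials = 100_000
candy_wins = sum(random.random() > random.random() for _ in range(trials))
print(candy_wins / trials)  # roughly 0.5, i.e. the 50-50 split
```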
I got a little confused at the introduction of Wait!, but I think I understand it now. So, to check my understanding, and for the benefit of others, some notes:
- the agent gets a reward for the Wait! state, just like the other states
- for terminal states (the three non-Wait! states), the agent stays in that state, and keeps getting the same reward for all future time steps
- so, when comparing Candy vs Wait! + Chocolate, the rewards after three turns would be (R_candy + γ * R_candy + γ^2 * R_candy) vs (R_wait + γ * R_chocolate + γ^2 * R_chocolate)
(I had at first assumed the agent got no reward for Wait!, and also failed to realize that the agent keeps getting the reward for the terminal state indefinitely, and so thought it was just about comparing different one-time rewards.)
Yes. The full expansions (with no limit on the time horizon) are
$\frac{r_{\text{Candy}}}{1-\gamma}$, $r_{\text{Wait!}} + \frac{\gamma\, r_{\text{Chocolate}}}{1-\gamma}$, and $r_{\text{Wait!}} + \frac{\gamma\, r_{\text{Hug}}}{1-\gamma}$, where $r_{\text{Candy}}, r_{\text{Wait!}}, r_{\text{Chocolate}}, r_{\text{Hug}} \sim \text{unif}(0,1)$.
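(For anyone who wants to see those expansions in action, here is a rough Python sketch. It assumes the agent simply picks whichever of the three expansions is largest, and the choice of γ = 0.9 is an arbitrary illustration, not something from the post.)

```python
import random

def preference_fractions(gamma: float, trials: int = 100_000) -> dict:
    """Estimate how often each path maximizes the discounted return when
    r_Candy, r_Wait!, r_Chocolate, r_Hug are drawn i.i.d. from unif(0,1)."""
    counts = {"Candy": 0, "Wait!->Chocolate": 0, "Wait!->Hug": 0}
    for _ in range(trials):
        r_candy, r_wait, r_choc, r_hug = (random.random() for _ in range(4))
        values = {
            "Candy": r_candy / (1 - gamma),
            "Wait!->Chocolate": r_wait + gamma * r_choc / (1 - gamma),
            "Wait!->Hug": r_wait + gamma * r_hug / (1 - gamma),
        }
        counts[max(values, key=values.get)] += 1
    return {path: n / trials for path, n in counts.items()}

print(preference_fractions(gamma=0.9))
```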