But you still need online access to our MDP (i.e. reward function and transition function), don’t you?
Yep, that’s right! This was what I meant by “the agent starts acting in its environment” in the description of an ODT. So to be clear, during each timestep in the online phase, the ODT looks at a partial trajectory
$g_1, o_1, a_1, \ldots, g_{t-1}, o_{t-1}, a_{t-1}, g_t, o_t$
of rewards-to-go, observations, and actions; then selects an action $a_t$ conditional on this partial trajectory; and then the environment provides a new reward $r_t$ (so that $g_{t+1} = g_t - r_t$) and a new observation $o_{t+1}$. Does that make sense?
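If it helps, here's a rough sketch of that online loop in Python. The `model.get_action` interface, the gym-style `env.step`, and the variable names are just illustrative assumptions, not the actual ODT implementation:

```python
# Illustrative sketch of the online interaction loop described above.
# `model` stands in for an ODT-style policy and `env` for a gym-style MDP;
# `model.get_action` is a hypothetical interface, not the real ODT API.

def online_rollout(model, env, target_return):
    obs = env.reset()
    g = target_return                      # initial return-to-go g_1
    returns_to_go, observations, actions = [g], [obs], []

    done = False
    while not done:
        # Select a_t conditioned on the partial trajectory
        # g_1, o_1, a_1, ..., g_t, o_t
        action = model.get_action(returns_to_go, observations, actions)
        actions.append(action)

        # The environment provides r_t and o_{t+1}
        obs, reward, done, _ = env.step(action)

        # Update the return-to-go: g_{t+1} = g_t - r_t
        g = g - reward
        returns_to_go.append(g)
        observations.append(obs)

    return returns_to_go, observations, actions
```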
Thanks for the reference to the Levine paper! I might have more to say after I get a chance to look at it more closely.