But you still need online access to our MDP (i.e. reward function and transition function), don’t you?
Yep, that’s right! This was what I meant by “the agent starts acting in its environment” in the description of an ODT. So to be clear, during each timestep in the online phase, the ODT looks at a partial trajectory
$g_1, o_1, a_1, \ldots, g_{t-1}, o_{t-1}, a_{t-1}, g_t, o_t$
of rewards-to-go, observations, and actions; then selects an action $a_t$ conditional on this partial trajectory; and then the environment provides a new reward $r_t$ (so that $g_{t+1} = g_t - r_t$) and a new observation $o_{t+1}$. Does that make sense?
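If it helps, here's a rough sketch of that online loop in Python. The `model.get_action` interface, the gym-style `env.step`, and the variable names are just illustrative assumptions, not the actual ODT implementation:

```python
# Illustrative sketch of the online interaction loop described above.
# `model` stands in for an ODT-style policy and `env` for a gym-style MDP;
# `model.get_action` is a hypothetical interface, not the real ODT API.

def online_rollout(model, env, target_return):
    obs = env.reset()
    g = target_return                      # initial return-to-go g_1
    returns_to_go, observations, actions = [g], [obs], []

    done = False
    while not done:
        # Select a_t conditioned on the partial trajectory
        # g_1, o_1, a_1, ..., g_t, o_t
        action = model.get_action(returns_to_go, observations, actions)
        actions.append(action)

        # The environment provides r_t and o_{t+1}
        obs, reward, done, _ = env.step(action)

        # Update the return-to-go: g_{t+1} = g_t - r_t
        g = g - reward
        returns_to_go.append(g)
        observations.append(obs)

    return returns_to_go, observations, actions
```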
Thanks for the reference to the Levine paper! I might have more to say after I get a chance to look at it more closely.