I really liked the post, and the agenda of improving safety through generative modelling is close to my heart.
we begin an online phase of its training: the agent starts acting in its environment and generating new task completions, which are recorded and fed back into the decision transformer as new training data
But you still need online access to our MDP (i.e. reward function and transition function), don’t you? And it’s access to the MDP that drives novelty and improvement. If you were just sampling whole trajectories from the model (asking the model itself to simulate the reward function and transition model) and feeding them back into the model, you shouldn’t expect any change (on average). Your gradient updates will cancel out; that’s a consequence of the expected-grad-log-prob lemma, $\mathbb{E}_{x \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(x)] = 0$.
It gets more nuanced when you account for doing ancestral sampling, but that adds problems rather than solving them:
https://arxiv.org/abs/2110.10819
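To make the cancellation point concrete, here’s a minimal numerical sketch (mine, not from the post): with a small softmax policy, averaging $\nabla_\theta \log \pi_\theta(x)$ over samples drawn from $\pi_\theta$ itself gives a mean gradient that shrinks to zero as the sample count grows.

```python
# Minimal numerical check of the expected-grad-log-prob lemma:
# E_{x ~ pi_theta}[ grad_theta log pi_theta(x) ] = 0.
# A categorical "policy" over 4 actions, parameterized by logits theta.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=4)                  # arbitrary logits
pi = np.exp(theta) / np.exp(theta).sum()    # softmax policy

# grad_theta log pi_theta(x) for a softmax is onehot(x) - pi
n = 1_000_000
xs = rng.choice(4, size=n, p=pi)            # sample from the policy itself
grads = np.eye(4)[xs] - pi                  # per-sample score vectors

print(grads.mean(axis=0))                   # ~ [0, 0, 0, 0]: updates cancel out on average
```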
On the other hand, in their follow-up work on instruction following, OpenAI claimed they used little online data (from fine-tuned policies): https://arxiv.org/abs/2203.02155
It would be interesting to figure out a way to factorize the policy in (a) over timesteps, i.e. produce distributions $\pi(\cdot), \pi(\cdot \mid \tau_1), \pi(\cdot \mid \tau_1 \tau_2), \ldots, \pi(\cdot \mid \tau_1 \ldots \tau_{T-1})$ over actions conditional on partial trajectories.
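To spell out the factorization I have in mind (my notation; assuming the policy in (a) is a distribution over whole trajectories built from discrete chunks $\tau_1, \ldots, \tau_T$), the per-timestep conditionals would be the chain-rule factors obtained by marginalizing out the future and conditioning on the past:

$$\pi(\tau_1 \ldots \tau_T) \;=\; \prod_{t=1}^{T} \pi(\tau_t \mid \tau_1 \ldots \tau_{t-1}), \qquad \pi(\tau_t \mid \tau_1 \ldots \tau_{t-1}) \;=\; \frac{\sum_{\tau_{t+1}, \ldots, \tau_T} \pi(\tau_1 \ldots \tau_T)}{\sum_{\tau_t, \ldots, \tau_T} \pi(\tau_1 \ldots \tau_T)},$$

with the distribution over actions obtained by further marginalizing out any non-action components of $\tau_t$.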
Levine derives that in his control-as-inference tutorial paper (section 2.3). Your expected exponential total reward is pretty close. Note that it acts a bit like an (exponentiated) Q function for your policy: it gives you the expected exp-reward after taking action $\tau_t$ in state $\tau_{<t}$ and following $\pi$ thereafter. The exponential works like a soft argmax, so it gives you something like soft Q-learning, but not quite: the argmax is also over the environment dynamics, not only over the policy. So it causes an optimism bias: your agent effectively assumes an optimal next state will be sampled for it every time, however unlikely that would be. The rest of Levine’s paper deals with that.
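For reference, the backup that falls out of the backward messages in that section is, roughly (modulo Levine’s notation),

$$V(s_t) \;=\; \log \sum_{a_t} \exp Q(s_t, a_t), \qquad Q(s_t, a_t) \;=\; r(s_t, a_t) + \log \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)}\!\left[\exp V(s_{t+1})\right].$$

The log-sum-exp over actions is the intended soft maximization; the log-expected-exp over next states is the optimism bias, since it soft-maxes over dynamics the agent doesn’t control (a standard backup would use $r + \mathbb{E}[V(s_{t+1})]$ instead).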
But you still need online access to our MDP (i.e. reward function and transition function), don’t you?
Yep, that’s right! This was what I meant by “the agent starts acting in its environment” in the description of an ODT. So to be clear, during each timestep in the online phase, the ODT looks at a partial trajectory
$$g_1, o_1, a_1, \ldots, g_{t-1}, o_{t-1}, a_{t-1}, g_t, o_t$$
of rewards-to-go, observations, and actions; then selects an action $a_t$ conditional on this partial trajectory; and then the environment provides a new reward $r_t$ (so that $g_{t+1} = g_t - r_t$) and observation $o_{t+1}$. Does that make sense?
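In case pseudocode is easier to follow than prose, here’s a schematic sketch of that loop (the `DummyEnv` and `act` stand-ins below are mine, not the actual ODT interfaces):

```python
import random

class DummyEnv:
    """Toy stand-in for the real environment."""
    def reset(self):
        self.t = 0
        return 0.0                               # initial observation o_1
    def step(self, action):
        self.t += 1
        reward = random.random()                 # reward r_t from the environment
        done = self.t >= 10
        return float(self.t), reward, done       # o_{t+1}, r_t, done

def act(partial_trajectory):
    """Stand-in for the ODT: choose a_t conditioned on g_1, o_1, a_1, ..., g_t, o_t."""
    return random.choice([0, 1])

def online_rollout(env, target_return):
    g, o = target_return, env.reset()            # g_1, o_1
    traj = [g, o]
    done = False
    while not done:
        a = act(traj)                            # a_t, conditional on the partial trajectory
        o_next, r, done = env.step(a)            # environment supplies r_t and o_{t+1}
        traj.append(a)
        g = g - r                                # g_{t+1} = g_t - r_t
        o = o_next
        traj.extend([g, o])                      # append g_{t+1}, o_{t+1}
    return traj                                  # recorded and fed back as new training data

print(online_rollout(DummyEnv(), target_return=5.0))
```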
Thanks for the reference to the Levine paper! I might have more to say after I get a chance to look at it more closely.