I really liked the post, and the agenda of improving safety through generative modelling is close to my heart.
we begin an online phase of its training: the agent starts acting in its environment and generating new task completions, which are recorded and fed back into the decision transformer as new training data
But you still need online access to our MDP (i.e. reward function and transition function), don’t you? And it’s access to the MDP that drives novelty and improvement. If you were just sampling whole trajectories from the model (asking the model itself to simulate the reward function and transition model) and feeding them back into the model, you shouldn’t expect any change (on average). Your gradient updates will cancel out; that’s a consequence of the expected-grad-log-prob lemma, $\mathbb{E}_{x \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(x)] = 0$.
It gets more nuanced when you account for doing ancestral sampling, but that adds problems rather than solving them:
https://arxiv.org/abs/2110.10819
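To make the cancellation point concrete, here’s a minimal numerical sketch (mine, not from the post): with a small softmax policy, averaging $\nabla_\theta \log \pi_\theta(x)$ over samples drawn from $\pi_\theta$ itself gives a mean gradient that shrinks to zero as the sample count grows.

```python
# Minimal numerical check of the expected-grad-log-prob lemma:
# E_{x ~ pi_theta}[ grad_theta log pi_theta(x) ] = 0.
# A categorical "policy" over 4 actions, parameterized by logits theta.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=4)                  # arbitrary logits
pi = np.exp(theta) / np.exp(theta).sum()    # softmax policy

# grad_theta log pi_theta(x) for a softmax is onehot(x) - pi
n = 1_000_000
xs = rng.choice(4, size=n, p=pi)            # sample from the policy itself
grads = np.eye(4)[xs] - pi                  # per-sample score vectors

print(grads.mean(axis=0))                   # ~ [0, 0, 0, 0]: updates cancel out on average
```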
On the other hand, in their follow-up work on instruction following, OpenAI claimed they used little online data (from fine-tuned policies): https://arxiv.org/abs/2203.02155
It would be interesting to figure out a way to factorize the policy in (a) over timesteps, i.e. produce distributions $\pi(\cdot), \pi(\cdot \mid \tau_1), \pi(\cdot \mid \tau_1 \tau_2), \ldots, \pi(\cdot \mid \tau_1 \ldots \tau_{T-1})$ over actions conditional on partial trajectories.
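To spell out the factorization I have in mind (my notation; assuming the policy in (a) is a distribution over whole trajectories built from discrete chunks $\tau_1, \ldots, \tau_T$), the per-timestep conditionals would be the chain-rule factors obtained by marginalizing out the future and conditioning on the past:

$$\pi(\tau_1 \ldots \tau_T) \;=\; \prod_{t=1}^{T} \pi(\tau_t \mid \tau_1 \ldots \tau_{t-1}), \qquad \pi(\tau_t \mid \tau_1 \ldots \tau_{t-1}) \;=\; \frac{\sum_{\tau_{t+1}, \ldots, \tau_T} \pi(\tau_1 \ldots \tau_T)}{\sum_{\tau_t, \ldots, \tau_T} \pi(\tau_1 \ldots \tau_T)},$$

with the distribution over actions obtained by further marginalizing out any non-action components of $\tau_t$.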
Levine derives that in his control-as-inference tutorial paper (section 2.3). Your expected exponential total reward is pretty close. Note that it acts a bit like an (exponentiated) Q function for your policy: it gives you the expected exp-reward after taking action $\tau_t$ in state $\tau_{<t}$ and following $\pi$ thereafter. The exponential works like a soft argmax, so it gives you something like soft Q-learning, but not quite: the argmax is also over the environment dynamics, not only over the policy. So it causes an optimism bias: your agent effectively assumes an optimal next state will be sampled for it every time, however unlikely that would be. The rest of Levine’s paper deals with that.
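For reference, the backup that falls out of the backward messages in that section is, roughly (modulo Levine’s notation),

$$V(s_t) \;=\; \log \sum_{a_t} \exp Q(s_t, a_t), \qquad Q(s_t, a_t) \;=\; r(s_t, a_t) + \log \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)}\!\left[\exp V(s_{t+1})\right].$$

The log-sum-exp over actions is the intended soft maximization; the log-expected-exp over next states is the optimism bias, since it soft-maxes over dynamics the agent doesn’t control (a standard backup would use $r + \mathbb{E}[V(s_{t+1})]$ instead).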
But you still need online access to our MDP (i.e. reward function and transition function), don’t you?
Yep, that’s right! This was what I meant by “the agent starts acting in its environment” in the description of an ODT. So to be clear, during each timestep in the online phase, the ODT looks at a partial trajectory
$$g_1, o_1, a_1, \ldots, g_{t-1}, o_{t-1}, a_{t-1}, g_t, o_t$$
of rewards-to-go, observations, and actions; then selects an action $a_t$ conditional on this partial trajectory; and then the environment provides a new reward $r_t$ (so that $g_{t+1} = g_t - r_t$) and observation $o_{t+1}$. Does that make sense?
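In case pseudocode is easier to follow than prose, here’s a schematic sketch of that loop (the `DummyEnv` and `act` stand-ins below are mine, not the actual ODT interfaces):

```python
import random

class DummyEnv:
    """Toy stand-in for the real environment."""
    def reset(self):
        self.t = 0
        return 0.0                               # initial observation o_1
    def step(self, action):
        self.t += 1
        reward = random.random()                 # reward r_t from the environment
        done = self.t >= 10
        return float(self.t), reward, done       # o_{t+1}, r_t, done

def act(partial_trajectory):
    """Stand-in for the ODT: choose a_t conditioned on g_1, o_1, a_1, ..., g_t, o_t."""
    return random.choice([0, 1])

def online_rollout(env, target_return):
    g, o = target_return, env.reset()            # g_1, o_1
    traj = [g, o]
    done = False
    while not done:
        a = act(traj)                            # a_t, conditional on the partial trajectory
        o_next, r, done = env.step(a)            # environment supplies r_t and o_{t+1}
        traj.append(a)
        g = g - r                                # g_{t+1} = g_t - r_t
        o = o_next
        traj.extend([g, o])                      # append g_{t+1}, o_{t+1}
    return traj                                  # recorded and fed back as new training data

print(online_rollout(DummyEnv(), target_return=5.0))
```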
Thanks for the reference to the Levine paper! I might have more to say after I get a chance to look at it more closely.