You mean stuff like model-predictive control and planning? You can use backprop to do gradient ascent over a sequence of actions if you have a differentiable environment and/or reward model. This also has a lot of applications to image CNNs: reversing GANs to encode an image for editing, optimizing an input to maximize a particular class (like maximally ‘dog’ or ‘NSFW’ images), etc. I cover some of the uses and history in https://www.gwern.net/Faces#reversing-stylegan-to-control-modify-images
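To make that concrete, here's a rough sketch of what "backprop through a plan" can look like in PyTorch; the dynamics and reward networks are untrained placeholders standing in for whatever differentiable models you actually have, and all shapes are made up:

```python
import torch

state_dim, action_dim, horizon = 8, 2, 20

# Hypothetical pretrained models: (s_t, a_t) -> s_{t+1} and (s_t, a_t) -> reward.
dynamics = torch.nn.Sequential(
    torch.nn.Linear(state_dim + action_dim, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, state_dim))
reward = torch.nn.Sequential(
    torch.nn.Linear(state_dim + action_dim, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 1))

s0 = torch.zeros(state_dim)
actions = torch.zeros(horizon, action_dim, requires_grad=True)  # the plan being optimized
opt = torch.optim.Adam([actions], lr=0.05)

for step in range(200):
    opt.zero_grad()
    s, total_reward = s0, 0.0
    for t in range(horizon):
        sa = torch.cat([s, actions[t]])
        total_reward = total_reward + reward(sa).squeeze()
        s = dynamics(sa)                  # roll the plan forward through the model
    (-total_reward).backward()            # ascend reward = descend its negation
    opt.step()                            # only the action sequence gets updated
```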
My most recent suggestion in this vein was about OA/Christiano’s preference learning, using gradient ascent directly on trajectories/strings, which avoids explicit sampling and rating in an environment.
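That suggestion isn't tied to a particular implementation, but one way to sketch "gradient ascent directly on strings" against a learned reward model is to relax the discrete tokens into a softmax distribution and optimize that; every module and shape below is a made-up placeholder, not the OA setup:

```python
import torch

vocab, seq_len, emb_dim = 1000, 32, 64

embedding = torch.nn.Embedding(vocab, emb_dim)        # stand-in token embeddings
reward_model = torch.nn.Sequential(                   # stand-in learned preference/reward model
    torch.nn.Flatten(), torch.nn.Linear(seq_len * emb_dim, 1))

logits = torch.zeros(seq_len, vocab, requires_grad=True)  # the "soft string" being optimized
opt = torch.optim.Adam([logits], lr=0.1)

for step in range(100):
    opt.zero_grad()
    probs = torch.softmax(logits, dim=-1)         # relax discrete tokens to distributions
    soft_embs = probs @ embedding.weight          # expected embedding at each position
    score = reward_model(soft_embs.unsqueeze(0))  # reward model's rating of the soft string
    (-score.squeeze()).backward()                 # gradient ascent on the rating
    opt.step()

best_tokens = logits.argmax(dim=-1)               # discretize at the end
```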
Hmmm… not sure if this is exactly what I want. I’d prefer not to assume too much about the environment dynamics. Not sure if this is related to what you’re talking about, but one possibility might be to do model-based planning with an explicit reward function while assuming little about the environment dynamics: learn everything needed for model-based planning in a model-free way (as MuZero does), except for the reward function, which you supply explicitly.
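Something like the following sketch, maybe: a MuZero-style learned encoder and latent dynamics model, but with a hand-written reward plugged into the planner instead of a learned one. All the networks, shapes, and the toy reward here are placeholders for illustration:

```python
import torch

latent_dim, action_dim, obs_dim, horizon, n_candidates = 16, 4, 8, 10, 256

encode = torch.nn.Linear(obs_dim, latent_dim)                     # obs -> latent (learned)
dynamics = torch.nn.Linear(latent_dim + action_dim, latent_dim)   # latent transition (learned)
decode = torch.nn.Linear(latent_dim, obs_dim)                     # latent -> predicted obs (learned)

def explicit_reward(pred_obs):
    # The explicit, hand-specified reward: e.g. keep the first observation dimension near 1.
    return -(pred_obs[..., 0] - 1.0) ** 2

obs = torch.zeros(obs_dim)

# Random-shooting planner: sample candidate action sequences, roll them out in latent
# space with the learned dynamics, score them with the *explicit* reward, keep the best.
candidates = torch.randn(n_candidates, horizon, action_dim)
returns = torch.zeros(n_candidates)
with torch.no_grad():
    z = encode(obs).expand(n_candidates, latent_dim)
    for t in range(horizon):
        z = dynamics(torch.cat([z, candidates[:, t]], dim=-1))
        returns += explicit_reward(decode(z))
best_plan = candidates[returns.argmax()]
```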