I am quite confused. I wonder if we agree on the substance but not on the wording; perhaps it’s worthwhile talking this through.
I follow your argument, and it is what I had in mind when I was responding to you earlier. If approximating π*(o_t) within the constraints requires computing f(o_t), then any policy that approximates π* must compute f(o_t). (Assuming appropriate constraints that preclude the policy from being a lookup table precomputed by SGD; not sure if that’s what you meant by “other similar”, though this may be trickier to do formally than we take it to be).
My point is that for f = ‘learning’, I can’t see how anything I would call learning could meaningfully happen inside a single timestep. ‘Learning’, to my mind, implies some non-ephemeral change, and any lasting change has to feed into the agent’s next state, by which point SGD would have had its chance to make the same change.
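To make that restriction concrete, here is a minimal sketch (the `policy_step` signature and the recurrent-state interface are my own illustrative assumptions, not anything established above): whatever the policy computes within a single call is discarded when the call returns, and the only persistent channel is the state it hands to the next timestep, which is also exactly where SGD gets its chance to act via the parameters.

```python
# Minimal sketch (illustrative only): a policy as a pure function of
# (params, state, obs). Anything computed inside one call is ephemeral
# unless it is written into the returned state; that state is what the
# next timestep sees, and SGD has had a chance to update `params` by then.
from typing import NamedTuple, Tuple

class AgentState(NamedTuple):
    memory: float  # whatever the agent chooses to carry forward

def policy_step(params: float, state: AgentState, obs: float) -> Tuple[float, AgentState]:
    scratch = params * obs                               # ephemeral: gone after this call
    action = scratch + state.memory
    next_state = AgentState(memory=state.memory + obs)   # the only lasting change
    return action, next_state

# One rollout step; any "learning" must survive inside next_state.
action, next_state = policy_step(params=0.5, state=AgentState(memory=0.0), obs=1.0)
```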
Could you give an example of what you mean (this is partly why I wanted to taboo ‘learning’)? Or, could you give an example of a task that would require learning in this way? (Note the within-timestep restriction; without it, I grant you that there are tasks that require learning).
could you give an example of a task that would require learning in this way? (Note the within-timestep restriction; without that I grant you that there are tasks that require learning)
How about language modeling? I think the task of predicting what a human will say next given some prompt requires learning in a pretty meaningful way: the model has to learn from the prompt what the human is trying to do and then do that.
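To make “learning from the prompt” concrete, here is a small made-up illustration (the ‘blicket’ rule and the prompt below are hypothetical, not something from this thread): the rule the model needs exists only inside the prompt, so it cannot have been memorised during training and has to be extracted within the forward pass.

```python
# Hypothetical in-context learning prompt: the rule ("blicket" reverses a
# word) is defined only by the examples in the prompt itself, never in the
# training data, so the model must infer it within a single forward pass.
prompt = (
    "blicket(cat) = tac\n"
    "blicket(dog) = god\n"
    "blicket(bird) = "
)
# A model that continues this with "drib" has, in the sense discussed above,
# learned the blicket rule from the prompt.
print(prompt)
```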
Good point—I think I wasn’t thinking deeply enough about language modelling. I certainly agree that the model has to learn in the colloquial sense, especially if it’s doing something really impressive that isn’t well-explained by interpolating on dataset examples—I’m imagining giving GPT-X some new mathematical definitions and asking it to make novel proofs.
I think my confusion was rooted in the fact that you were replying to a section that dealt specifically with learning an inner RL algorithm, and the above sense of ‘learning’ is a bit different from that one. ‘Learning’ in your sense can be required for a task without requiring an inner RL algorithm; or at least, whether it does isn’t clear to me a priori.