Narrow reinforcement learning: As A takes actions in the world, we give it a dense reward signal based on how reasonable we judge its choices to be (perhaps we directly reward state-action pairs themselves rather than outcomes in the world, as in TAMER). A optimizes for the expected sum of its future rewards.
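A minimal sketch of what that setup could look like (purely illustrative; `environment`, `policy`, and `human_judgment` are hypothetical placeholders, not any real library's API). The point it illustrates is that the per-step reward scores the (state, action) pair itself, in the spirit of TAMER, rather than an outcome A later brings about:

```python
def discounted_return(environment, policy, human_judgment, gamma=0.99, max_steps=100):
    """Roll out A's policy and reward each (state, action) pair directly.

    Placeholders (assumptions, not a real API):
      environment.reset() -> state
      environment.step(action) -> (next_state, done)
      policy(state) -> action
      human_judgment(state, action) -> float, "how reasonable does this choice look?"
    A is trained elsewhere to maximize the expected value of this quantity.
    """
    state = environment.reset()
    total, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = policy(state)
        # Dense reward: the choice itself is scored by a human (or a learned
        # model of human judgment), not by later outcomes in the world.
        total += discount * human_judgment(state, action)
        discount *= gamma
        state, done = environment.step(action)
        if done:
            break
    return total
```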
Wouldn’t it try to bring about states in which some action is particularly reasonable? Like the villain from that story who brings about a public threat in order to be seen defeating it.
Potentially; it depends on the time horizon and on how the rewards are calculated.
The most natural reward for the state transition (s0, s1) is just V(s1) - V(s0) (where V is some intuitive “human value function,” i.e. ask a human “how good does state s seem?”). This reward function wouldn’t have that problem.
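Concretely, that reward might be computed like this (a sketch; `human_value` is a stand-in for asking a human "how good does state s seem?"):

```python
def transition_reward(s0, s1, human_value):
    """Reward for the transition (s0, s1): V(s1) - V(s0), where
    human_value is a placeholder for the intuitive human value function."""
    return human_value(s1) - human_value(s0)
```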
Maximizing the sum of those state-value differences just maximizes state value again, which is exactly what narrow reinforcement learning was meant to get away from.
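Spelling out the step behind this objection: over an episode s_0, s_1, …, s_T the per-transition rewards telescope,

$$\sum_{t=0}^{T-1}\bigl(V(s_{t+1}) - V(s_t)\bigr) = V(s_T) - V(s_0),$$

so maximizing their undiscounted sum over a long horizon is the same as maximizing the human-judged value of the final state.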
The goal of narrow reinforcement learning is to get something-like-human-level behavior using human-level oversight. Optimizing the human value function over short time horizons seems like a fine approach to me.
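As a sketch of what "short time horizons" could mean here (purely illustrative; `candidate_actions`, `simulate`, and `human_value` are assumed placeholders, not any particular library): the agent only ever evaluates states a few steps ahead, so the judgments it is optimizing stay at a level a human overseer can actually make.

```python
def short_horizon_action(state, candidate_actions, simulate, human_value, horizon=3):
    """Choose the action whose short lookahead reaches the state a human rates best.

    Placeholders (assumptions):
      candidate_actions(state) -> iterable of actions
      simulate(state, action)  -> predicted next state
      human_value(state)       -> float, "how good does this state seem?"
    Evaluation never looks more than `horizon` steps ahead.
    """
    def best_value(s, depth):
        if depth == 0:
            return human_value(s)
        return max(best_value(simulate(s, a), depth - 1) for a in candidate_actions(s))

    return max(candidate_actions(state),
               key=lambda a: best_value(simulate(state, a), horizon - 1))
```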
The difference from broad reinforcement learning is that you aren't trying to evaluate actions you can't understand by looking at the consequences you can observe.