Potentially, it depends on the time horizon and on how the rewards are calculated.
The most natural reward for the state transition (s0, s1) is just V(s1) - V(s0) (where V is some intuitive “human value function,” i.e. ask a human “how good does state s seem?”). This reward function wouldn’t have that problem.
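As a minimal sketch of that reward (in Python, with `human_value` as a hypothetical placeholder for asking a human "how good does state s seem?", and made-up numbers purely for illustration):

```python
# Illustrative sketch of the transition reward V(s1) - V(s0).
# `human_value` is a hypothetical stand-in for a human's judgment of a state,
# not a real API; the toy numbers below are made up.

def human_value(state: str) -> float:
    """Stand-in for asking a human 'how good does state s seem?'"""
    toy_judgments = {"s0": 0.2, "s1": 0.7}
    return toy_judgments[state]

def transition_reward(s0: str, s1: str) -> float:
    """Reward for the transition (s0, s1): the change in human-judged value."""
    return human_value(s1) - human_value(s0)

print(transition_reward("s0", "s1"))  # 0.5 with the toy judgments above
```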
Maximizing the sum of the differences in state value just maximizes state value again, which is exactly what narrow reinforcement learning was supposed to get away from.
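To spell out the telescoping (using the notation above, for a trajectory s_0, s_1, ..., s_T):

$$\sum_{t=0}^{T-1}\big(V(s_{t+1}) - V(s_t)\big) = V(s_T) - V(s_0),$$

so over a long horizon the agent is once again just maximizing V of wherever it ends up.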
The goal of narrow reinforcement learning is to get something-like-human-level behavior using human-level oversight. Optimizing the human value function over short time horizons seems like a fine approach to me.
The difference with broad reinforcement learning is that you aren’t trying to evaluate actions you can’t understand by looking at the consequences you can observe.