Narrow reinforcement learning: As A takes actions in the world, we give it a dense reward signal based on how reasonable we judge its choices to be (perhaps we directly reward state-action pairs themselves rather than outcomes in the world, as in TAMER). A optimizes for the expected sum of its future rewards.
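A minimal sketch of what that setup could look like (purely illustrative; `environment`, `policy`, and `human_judgment` are hypothetical placeholders, not any real library's API). The point it illustrates is that the per-step reward scores the (state, action) pair itself, in the spirit of TAMER, rather than an outcome A later brings about:

```python
def discounted_return(environment, policy, human_judgment, gamma=0.99, max_steps=100):
    """Roll out A's policy and reward each (state, action) pair directly.

    Placeholders (assumptions, not a real API):
      environment.reset() -> state
      environment.step(action) -> (next_state, done)
      policy(state) -> action
      human_judgment(state, action) -> float, "how reasonable does this choice look?"
    A is trained elsewhere to maximize the expected value of this quantity.
    """
    state = environment.reset()
    total, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = policy(state)
        # Dense reward: the choice itself is scored by a human (or a learned
        # model of human judgment), not by later outcomes in the world.
        total += discount * human_judgment(state, action)
        discount *= gamma
        state, done = environment.step(action)
        if done:
            break
    return total
```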
Wouldn’t it try to bring about states in which some action is particularly reasonable? Like the villain from that story who brings about a public threat in order to be seen defeating it.
Potentially; it depends on the time horizon and on how the rewards are calculated.
The most natural reward for the state transition (s0, s1) is just V(s1) - V(s0) (where V is some intuitive “human value function,” i.e. ask a human “how good does state s seem?”). This reward function wouldn’t have that problem.
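Concretely, that reward might be computed like this (a sketch; `human_value` is a stand-in for asking a human "how good does state s seem?"):

```python
def transition_reward(s0, s1, human_value):
    """Reward for the transition (s0, s1): V(s1) - V(s0), where
    human_value is a placeholder for the intuitive human value function."""
    return human_value(s1) - human_value(s0)
```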
Maximizing the sum of those state-value differences just maximizes state value again, which is exactly what narrow reinforcement learning was meant to get away from.
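Spelling out the step behind this objection: over an episode s_0, s_1, …, s_T the per-transition rewards telescope,

$$\sum_{t=0}^{T-1}\bigl(V(s_{t+1}) - V(s_t)\bigr) = V(s_T) - V(s_0),$$

so maximizing their undiscounted sum over a long horizon is the same as maximizing the human-judged value of the final state.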
The goal of narrow reinforcement learning is to get something-like-human-level behavior using human-level oversight. Optimizing the human value function over short time horizons seems like a fine approach to me.
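As a sketch of what "short time horizons" could mean here (purely illustrative; `candidate_actions`, `simulate`, and `human_value` are assumed placeholders, not any particular library): the agent only ever evaluates states a few steps ahead, so the judgments it is optimizing stay at a level a human overseer can actually make.

```python
def short_horizon_action(state, candidate_actions, simulate, human_value, horizon=3):
    """Choose the action whose short lookahead reaches the state a human rates best.

    Placeholders (assumptions):
      candidate_actions(state) -> iterable of actions
      simulate(state, action)  -> predicted next state
      human_value(state)       -> float, "how good does this state seem?"
    Evaluation never looks more than `horizon` steps ahead.
    """
    def best_value(s, depth):
        if depth == 0:
            return human_value(s)
        return max(best_value(simulate(s, a), depth - 1) for a in candidate_actions(s))

    return max(candidate_actions(state),
               key=lambda a: best_value(simulate(state, a), horizon - 1))
```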
The difference from broad reinforcement learning is that you aren't trying to evaluate actions you can't understand by looking at the consequences you can observe.