Potentially, it depends on the time horizon and on how the rewards are calculated.
The most natural reward for the state transition (s0, s1) is just V(s1) - V(s0) (where V is some intuitive “human value function,” i.e. ask a human “how good does state s seem?”). This reward function wouldn’t have that problem.
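As a minimal sketch of that reward (in Python, with `human_value` as a hypothetical placeholder for asking a human "how good does state s seem?", and made-up numbers purely for illustration):

```python
# Illustrative sketch of the transition reward V(s1) - V(s0).
# `human_value` is a hypothetical stand-in for a human's judgment of a state,
# not a real API; the toy numbers below are made up.

def human_value(state: str) -> float:
    """Stand-in for asking a human 'how good does state s seem?'"""
    toy_judgments = {"s0": 0.2, "s1": 0.7}
    return toy_judgments[state]

def transition_reward(s0: str, s1: str) -> float:
    """Reward for the transition (s0, s1): the change in human-judged value."""
    return human_value(s1) - human_value(s0)

print(transition_reward("s0", "s1"))  # 0.5 with the toy judgments above
```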
Maximizing the sum of the differences in state value just maximizes state value again, which is exactly what narrow reinforcement learning was supposed to get away from.
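To spell out the telescoping (using the notation above, for a trajectory s_0, s_1, ..., s_T):

$$\sum_{t=0}^{T-1}\big(V(s_{t+1}) - V(s_t)\big) = V(s_T) - V(s_0),$$

so over a long horizon the agent is once again just maximizing V of wherever it ends up.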
The goal of narrow reinforcement learning is to get something-like-human-level behavior using human-level oversight. Optimizing the human value function over short time horizons seems like a fine approach to me.
The difference with broad reinforcement learning is that you aren’t trying to evaluate actions you can’t understand by looking at the consequences you can observe.