Theoretical predictions for when reward is maximized on the training distribution. I’m a fan of Laidlaw et al.’s recent Bridging RL Theory and Practice with the Effective Horizon:

Deep reinforcement learning works impressively in some environments and fails catastrophically in others. Ideally, RL theory should be able to provide an understanding of why this is, i.e. bounds predictive of practical performance. Unfortunately, current theory does not quite have this ability...

[We introduce] a new complexity measure that we call the effective horizon, which roughly corresponds to how many steps of lookahead search are needed in order to identify the next optimal action when leaf nodes are evaluated with random rollouts. Using BRIDGE, we show that the effective horizon-based bounds are more closely reflective of the empirical performance of PPO and DQN than prior sample complexity bounds across four metrics. We also show that, unlike existing bounds, the effective horizon can predict the effects of using reward shaping or a pre-trained exploration policy.
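To make the quoted definition a bit more concrete, here’s a minimal sketch of k-step lookahead with random rollouts at the leaves. This is my own illustration, not the paper’s GORP algorithm or its code; the deterministic-simulator interface (`env.actions`, `env.step`) and all names below are assumptions for the sketch.

```python
import random

# Assumed (hypothetical) simulator interface, not the paper's API:
#   env.actions(state)      -> list of actions available in `state`
#   env.step(state, action) -> (next_state, reward, done), deterministic

def random_rollout_return(env, state, max_steps, gamma=1.0, n_rollouts=100):
    """Monte Carlo estimate of the uniformly-random policy's return from `state`."""
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, disc = state, 0.0, 1.0
        for _ in range(max_steps):
            a = random.choice(env.actions(s))
            s, r, done = env.step(s, a)
            ret += disc * r
            disc *= gamma
            if done:
                break
        total += ret
    return total / n_rollouts

def lookahead_action(env, state, k, horizon, gamma=1.0):
    """Greedy action after k steps of exhaustive lookahead, with leaf nodes
    evaluated by random rollouts for the remaining horizon. The effective
    horizon is (roughly) the smallest k for which acting this way at every
    state is already optimal."""
    def q(s, a, depth):
        s2, r, done = env.step(s, a)
        if done:
            return r
        if depth == k:  # leaf: evaluate the rest of the episode with random rollouts
            return r + gamma * random_rollout_return(env, s2, horizon - depth, gamma)
        return r + gamma * max(q(s2, a2, depth + 1) for a2 in env.actions(s2))

    return max(env.actions(state), key=lambda a: q(state, a, 1))
```

Roughly, environments where a small k suffices are the ones that are “easy to explore into,” which is what makes the next point appealing.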
One of my favorite parts is that it helps formalize this idea of “which parts of the state space are easy to explore into.” That informal notion has been important for my thinking about RL.
At a first pass, I expect reward hacking to happen insofar as it’s easy to explore into reward misspecifications. So in some situations, the reward can be understood as being optimized against, and those situations might be well-characterized by the effective horizon. EG the boat racing example.
In other situations, though, you won’t reach optimality due to exploration issues, and I’d instead consider what behaviors and circuits will be reinforced by the experienced rewards. EG RLHF on LLMs probably has this characteristic, due to its high effective horizon (is my guess).
I think these results help characterize when it’s appropriate to use the “optimization target” frame (i.e. when the effective horizon is low, expect literal optimization of the reward on the training distribution), versus the reinforcement frame.