Goodhart’s law seems to suggest that errors in utility or reward function specification are necessarily bad, in the sense that an optimal policy for the incorrect reward function would achieve low return according to the true reward. But how strong is this effect?
Suppose the reward function were only slightly wrong. Can the resulting policy be arbitrarily bad according to the true reward or is it only slightly worse? It turns out the answer is “only slightly worse” (for the appropriate definition of “slightly wrong”).
Definitions
Consider a Markov Decision Process (MDP) M=(S,A,T,R∗) where
S is the set of states,
A is the set of actions,
T:S×A×S→R are the conditional transition probabilities, and
R∗:S×A→R is the reward function. (Note: “reward” is standard terminology for MDPs but it’s fine to think of this as “utility”)
A policy π:S×A→R is a mapping from states to distributions over actions, with π(s,a)=Pr(a|s).
Any given policy π induces a distribution σ(π) over states in this MDP. If we are concerned about average reward we can take σ(π) to be the stationary distribution or, if the environment is episodic, we can take σ(π) to be the distribution of states visited during the episode. The exact definition is not particularly important for us.
Define the return of policy π according to reward function R to be
$$G(\pi, R) = \mathbb{E}_{\pi}\left[R(s,a)\right] = \sum_{s \in S} \sum_{a \in A} \sigma_s(\pi)\,\pi(s,a)\,R(s,a)$$
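To make the notation concrete, here is a minimal numpy sketch of this computation for a finite MDP, assuming σ(π), π, and R are already available as arrays (the names sigma, policy, and reward are purely illustrative):

```python
import numpy as np

def expected_return(sigma, policy, reward):
    """G(pi, R): expected reward under the state distribution sigma(pi).

    sigma:  shape (S,)    -- state distribution induced by the policy
    policy: shape (S, A)  -- policy[s, a] = Pr(a | s)
    reward: shape (S, A)  -- reward[s, a] = R(s, a)
    """
    # sum_s sum_a sigma_s(pi) * pi(s, a) * R(s, a)
    return float(np.einsum("s,sa,sa->", sigma, policy, reward))

# Tiny example: 2 states, 2 actions.
sigma = np.array([0.5, 0.5])
policy = np.array([[1.0, 0.0],
                   [0.2, 0.8]])
reward = np.array([[1.0, 0.0],
                   [0.0, 2.0]])
print(expected_return(sigma, policy, reward))  # 0.5*1.0 + 0.5*(0.2*0 + 0.8*2) = 1.3
```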
Goodhart Regret
Suppose we have an approximate reward signal ^R and we use it to specify a policy ^π. How bad is ^π according to the true reward R∗?
More specifically, what is the regret of using ^π compared to the optimal policy π∗? Formally,

$$\mathrm{Regret}(\hat{\pi}) = G(\pi^*, R^*) - G(\hat{\pi}, R^*)$$

We can expand this as

$$\mathrm{Regret}(\hat{\pi}) = \left[G(\pi^*, R^*) - G(\pi^*, \hat{R})\right] + \left[G(\pi^*, \hat{R}) - G(\hat{\pi}, \hat{R})\right] + \left[G(\hat{\pi}, \hat{R}) - G(\hat{\pi}, R^*)\right]$$
Let ϵ≥0. Then Regret(^π)≤3ϵ if the following conditions are satisfied by ^R and ^π:
1. G(π∗,R∗)−G(π∗,^R)≤ϵ
2. G(π∗,^R)−G(^π,^R)≤ϵ
3. G(^π,^R)−G(^π,R∗)≤ϵ
Condition 2 says that ^π is not much worse than π∗ when measured against ^R. That is what we expect if we designed ^π to be specifically good at ^R, so condition 2 is just a formalization of the notion that ^π is tailored to ^R.
Conditions 1 and 3 compare a fixed policy against two different reward functions. In general, for a policy π and reward functions R and R′,

$$G(\pi, R) - G(\pi, R') = \mathbb{E}_{\pi}\left[R(s,a) - R'(s,a)\right]$$
Result: Uniformly Bounded Error
Assume that we have a reward approximation ^R with uniformly bounded error. That is, ∀s∈S, ∀a∈A, |R∗(s,a)−^R(s,a)|<ϵ. Take ^π=argmaxπG(π,^R).
Then Regret(^π)<2ϵ. (Condition 2 holds with bound 0 in this case, since ^π is optimal for ^R.)
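As a sanity check on this result, here is a small numerical illustration. It assumes a one-step episodic setting in which the state distribution ρ is fixed and does not depend on the policy (so σ(π)=ρ for every π); everything in the snippet is an illustrative sketch rather than part of the original argument:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, eps = 10, 4, 0.1

rho = np.full(S, 1.0 / S)                          # fixed state distribution (one-step episodes)
R_star = rng.uniform(0.0, 1.0, size=(S, A))        # true reward
R_hat = R_star + rng.uniform(-eps, eps, (S, A))    # |R* - R_hat| < eps everywhere

def greedy(R):
    """Deterministic policy that maximizes R in every state."""
    pi = np.zeros((S, A))
    pi[np.arange(S), R.argmax(axis=1)] = 1.0
    return pi

def G(pi, R):
    return float(np.einsum("s,sa,sa->", rho, pi, R))

pi_star, pi_hat = greedy(R_star), greedy(R_hat)
regret = G(pi_star, R_star) - G(pi_hat, R_star)
print(regret, regret < 2 * eps)   # the regret always stays below 2*eps
```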
Result: One-sided Error Bounds
A uniform bound on the error is a stronger condition than we really need. The conditions on ^R can be re-written:
1. Eπ∗[R∗(s,a)−^R(s,a)]≤ϵ; ^R does not substantially underestimate the reward in the regions of state-space that are frequently visited by π∗.
3. E^π[R∗(s,a)−^R(s,a)]≥−ϵ; ^R does not substantially overestimate the reward in the regions of state-space that are frequently visited by ^π.
In other words, it doesn’t matter if the reward estimate is too low for states that π∗ doesn’t want to visit anyway. This tells us that, absent more information, we should prefer to bias our reward approximation low. We do need to be careful not to overestimate where ^π does visit, which is made difficult by the fact that condition 2 pushes ^π to visit the states that ^R assigns high reward.
If we don’t know what π∗ is then it might be difficult to ensure we don’t underestimate the reward over σ(π∗). We probably have access to ^π so we might expect to have better reward approximations over σ(^π) and therefore have an easier time satisfying condition 3 than 1.
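Reusing the same kind of toy one-step setup, the snippet below illustrates why only one-sided, on-distribution accuracy matters: an estimate that is wildly pessimistic away from the optimal policy’s state-action distribution, but accurate on it, still gives zero regret even though its uniform error is enormous. (Again, a sketch under the same fixed-state-distribution assumption, not part of the original post.)

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 10, 4
rho = np.full(S, 1.0 / S)
R_star = rng.uniform(0.0, 1.0, size=(S, A))

def greedy(R):
    pi = np.zeros((S, A))
    pi[np.arange(S), R.argmax(axis=1)] = 1.0
    return pi

def G(pi, R):
    return float(np.einsum("s,sa,sa->", rho, pi, R))

# Accurate on the (s, a) pairs the optimal policy actually takes,
# wildly pessimistic everywhere else.
best_a = R_star.argmax(axis=1)
R_pess = np.full((S, A), -10.0)
R_pess[np.arange(S), best_a] = R_star[np.arange(S), best_a]

pi_star, pi_pess = greedy(R_star), greedy(R_pess)
print(np.abs(R_star - R_pess).max())              # uniform error is ~10, far larger than any small eps
print(G(pi_star, R_star) - G(pi_pess, R_star))    # yet the regret is 0.0
```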
Non-Optimal π∗
The bound on G(π∗,R∗)−G(^π,R∗) does not actually require π∗ to be an optimal policy for R∗. We can take π∗ to be any policy and if we can satisfy the 3 bounds then ^π will be not much worse than π∗ (and could be better). Using a known policy for π∗ likely makes it easier to satisfy condition 1, which is an expectation over the state-action distribution of π∗.
Example: Human Reward Learning
Since π∗ need not be optimal, we can take π∗ to be the policy of a human and try to learn our reward / utility function. Note: this is a very flawed proposal, its purpose is to demonstrate how one might think about using these bounds in reward learning, not to be a concrete example of safe reward learning.
Suppose that there is some ideal reward function R∗ that we’d like to approximate. We don’t know R∗ in general but imagine we can evaluate it for state-action pairs that are performed by a human. Let DH={(si,ai,ri=R∗(si,ai))}Ni=1 be a collection of human demonstrations with labelled reward.
Consider the following algorithm:
1. Fit ^R to DH so that it does not underestimate.
2. Set ^π=argmaxπG(π,^R).
3. Collect a batch ^D={(sj,aj)}Mj=1 of demonstrations from ^π (no reward labels).
4. Augment DH with some additional human demonstrations.
5. Modify ^R to assign lower reward to all of ^D while not underestimating on DH
6. Repeat from 2.
Assuming ^R is sufficiently expressive, this algorithm pushes ^R and ^π towards satisfying all three conditions: ^R does not underestimate on the distribution of human-visited states, ^R is otherwise as low as possible on states ^π visits, and ^π is optimal for ^R. If these are all met, the resulting policy would be no worse than the human at maximizing R∗ and possibly much better.
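To make the loop concrete, here is a toy tabular sketch of it, again in the one-step setting with a fixed state distribution. The uniformly random “human”, the pessimistic initialization, and the crude push-down update are all illustrative choices, not a faithful or safe implementation of the proposal:

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, N, M, ROUNDS = 20, 5, 100, 100, 10

rho = np.full(S, 1.0 / S)                      # fixed state distribution
R_star = rng.uniform(0.0, 1.0, size=(S, A))    # true (hidden) reward
pi_human = np.full((S, A), 1.0 / A)            # a mediocre "human": uniformly random

def G(pi, R):
    return float(np.einsum("s,sa,sa->", rho, pi, R))

def greedy(R):
    pi = np.zeros((S, A))
    pi[np.arange(S), R.argmax(axis=1)] = 1.0
    return pi

def rollouts(pi, n):
    s = rng.choice(S, size=n, p=rho)
    a = np.array([rng.choice(A, p=pi[si]) for si in s])
    return s, a

# Step 1: fit R_hat so it does not underestimate on the labelled human data,
# keeping it pessimistic everywhere else.
R_hat = np.full((S, A), -1.0)
s_h, a_h = rollouts(pi_human, N)
R_hat[s_h, a_h] = np.maximum(R_hat[s_h, a_h], R_star[s_h, a_h])

for _ in range(ROUNDS):
    pi_hat = greedy(R_hat)                         # step 2: optimize against R_hat
    s_d, a_d = rollouts(pi_hat, M)                 # step 3: unlabelled rollouts of pi_hat
    s_new, a_new = rollouts(pi_human, N)           # step 4: more labelled human data
    s_h, a_h = np.concatenate([s_h, s_new]), np.concatenate([a_h, a_new])
    R_hat[s_d, a_d] -= 0.5                         # step 5: push R_hat down where pi_hat goes...
    R_hat[s_h, a_h] = np.maximum(R_hat[s_h, a_h], R_star[s_h, a_h])  # ...without underestimating on D_H

print(G(pi_human, R_star), G(greedy(R_hat), R_star))  # the learned policy should match or beat the human
```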
Comments and Takeaways
Goodhart’s law bites hardest in the sparse reward setting, where approximating the reward amounts to deciding what to value rather than how much to value it. Consider a state space that is the real line, and suppose the true reward function is R∗(s,a)=1[s=1], where 1[⋅] is the indicator function.
If we estimate
$$\hat{R}_1(s,a) = \mathbf{1}[s = 0.999]$$
then Goodhart’s law applies and we can expect that a policy optimized for ^R1 will have high regret on R∗.
On the other hand, if we estimate
$$\hat{R}_2(s,a) = 0.999 \cdot \mathbf{1}[s=1] + 0.001 \cdot \mathbf{1}[s=0.999] - 10 \cdot \mathbf{1}[s=2]$$
then the regret bounds apply and a policy optimized for ^R2 will do very well on R∗.
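A crude numerical version of this example, treating the policy as simply choosing which state to land in (an illustrative simplification of the real-line MDP):

```python
import numpy as np

states = np.array([0.999, 1.0, 2.0])

def R_star(s):  return 1.0 * (s == 1.0)
def R_hat1(s):  return 1.0 * (s == 0.999)
def R_hat2(s):  return 0.999 * (s == 1.0) + 0.001 * (s == 0.999) - 10.0 * (s == 2.0)

for name, R in [("R_hat1", R_hat1), ("R_hat2", R_hat2)]:
    s_chosen = states[np.argmax(R(states))]   # "policy" optimized for the proxy reward
    print(name, "->", R_star(s_chosen))       # true reward achieved: 0.0 for R_hat1, 1.0 for R_hat2
```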
Assigning high value to the wrong things is bad.
Assigning the wrong value to the right things is not too bad.
Throwing a large amount of optimization power against an incorrect utility function is not always bad.