Goodhart isn’t about cases where the proxy approximates the “true” goal; it’s about cases where the approximation just completely breaks down in some regimes.
I agree with your general sentiment but how does that translate formally to the math framework of the paper? Or in other words, where does their formulation diverge from reality?
Perhaps it’s in how they define the occupancy function over explicit (state, action) pairs. It seems like the occupancy measure doesn’t actually weight by state probability correctly, which seems odd; so you could have two reward functions that seem arbitrarily aligned (dot product close to 1), but only because they agree on the vast volume of highly improbable states, and not on the tiny manifold of likely states.
Moreover, in reality the state space is essentially infinite and everything must operate in a highly compressed model space for generalization regardless. So even if the ‘true’ unknown utility function can be defined over the (potentially infinite) state space, any practical reward proxy cannot: it is a function of some limited low-dimensional encoding of the state space, and the mapping from that encoding to the full state space is highly nonlinear and complex. We can’t realistically expand that function to the true full state space, nor expect the linearity to translate into the compressed model space. Tiny changes in the model space can translate to arbitrary jumps in the full state space, updates to the model compression function (ontology shifts) can shift everything around, etc.
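To make the “agree on the improbable bulk, disagree on the likely manifold” worry concrete, here is a toy numerical sketch (my own construction with made-up sizes and distributions, not anything from the paper): two reward vectors that look almost perfectly aligned under uniform weighting while being anti-aligned on the states that actually get visited.

```python
# Toy sketch (not from the paper): two reward functions over a finite state space
# that agree on the vast majority of (improbable) states but disagree on the few
# states an agent actually tends to visit. All sizes and numbers here are made up.
import numpy as np

rng = np.random.default_rng(0)

n_states = 100_000          # stand-in for a "full" state space
n_likely = 100              # the small manifold of states that actually occur

# True reward and proxy reward: identical on the improbable bulk,
# anti-correlated on the likely manifold.
r_true = rng.normal(size=n_states)
r_proxy = r_true.copy()
likely = rng.choice(n_states, size=n_likely, replace=False)
r_proxy[likely] = -r_true[likely]

# A state distribution that puts essentially all of its mass on the likely manifold.
p = np.full(n_states, 1e-9)
p[likely] = 1.0
p /= p.sum()

def cosine(a, b, w):
    """Cosine similarity of a and b under the probability weighting w."""
    inner = np.sum(w * a * b)
    return inner / np.sqrt(np.sum(w * a * a) * np.sum(w * b * b))

uniform = np.full(n_states, 1.0 / n_states)
print("similarity under uniform weighting:  ", cosine(r_true, r_proxy, uniform))  # close to +1
print("similarity under realistic weighting:", cosine(r_true, r_proxy, p))        # close to -1
```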
So here’s a thing that I think John is pointing at, with a bit more math?:
The divergence is in the distance function.
- In the paper, we define the distance between rewards as the angle between reward vectors.
- So what we sort of do is look at the “dot product”, i.e., look at E[R1(S,A)⋅R2(S,A)] for true and proxy rewards R1 and R2, with states/actions sampled according to a uniform distribution. I give justification as to why this is a natural way to define distance in a separate comment.
But the issue here is that this isn’t the distribution of the actions/states we might see in practice. E[R1(S,A)⋅R2(S,A)] might be very high if states/actions are instead weighted by drawing them from a distribution induced by a certain policy (e.g., the policy of “killing lots of snakes without doing anything sneaky to game the reward” in the examples, I think?). But then as people optimize, the policy changes and this number goes down. A uniform distribution is actually likely quite far from any state/action distribution we would see in practice.
In other words the way we formally define reward distance here will often not match how “close” two reward functions seem, and lots of cases of “Goodharting” are cases where two reward functions just seem close on a particular state/action distribution but aren’t close according to our distance metric.
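A toy sketch of that dynamic (again my own construction with arbitrary numbers, not the paper's setup): the uniform-weighted similarity between the true and proxy rewards stays fixed, but as a policy applies more optimization pressure to the proxy, the induced state distribution concentrates on the few states where the two disagree, so the expected true reward and the policy-weighted agreement eventually collapse.

```python
# Sketch of "the number goes down as people optimize" (toy model, not the paper's):
# the induced state distribution is modeled as a softmax over the proxy reward,
# with beta playing the role of optimization pressure.
import numpy as np

rng = np.random.default_rng(1)
n_states = 10_000

r_true = rng.normal(size=n_states)
r_proxy = r_true + 0.1 * rng.normal(size=n_states)   # a proxy that tracks the true reward...
r_proxy[:20] += 5.0                                   # ...except on a few "reward hacking" states
r_true[:20] -= 5.0

def weighted_cos(a, b, w):
    """Cosine similarity between a and b under the probability weights w."""
    return np.sum(w * a * b) / np.sqrt(np.sum(w * a * a) * np.sum(w * b * b))

uniform = np.full(n_states, 1.0 / n_states)
print("uniform-weighted similarity (never changes):", weighted_cos(r_true, r_proxy, uniform))

for beta in [0.0, 0.5, 1.0, 2.0, 4.0]:
    # State distribution induced by a policy optimizing the proxy with pressure beta.
    d = np.exp(beta * (r_proxy - r_proxy.max()))
    d /= d.sum()
    print(f"beta={beta}: E_d[true reward] = {np.sum(d * r_true):+.2f}, "
          f"policy-weighted similarity = {weighted_cos(r_true, r_proxy, d):+.2f}")
```

With these made-up numbers the expected true reward tends to rise a little before collapsing, which is the over-optimization pattern being described, while the uniform-weighted number never moves.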
This makes the results of the paper primarily useful for working towards training regimes where we optimize the proxy and can approximate distance, which is described in Appendix F of the paper. This is because as we optimize the proxy it will start to generalize, and then problems with over-optimization as described in the paper are going to start mattering a lot more.
So more concretely, this is work towards some sort of RLHF training regime that “provably” avoids Goodharting. The main issue is that a lot of the numbers we’re using are quite hard to approximate.
Thanks, that clarifies somewhat, but I guess I’ll need to read the paper. I’m still a bit confused about the justification for a uniform distribution.
> with states/actions sampled according to a uniform distribution. I give justification as to why this is a very natural way to define distance in a separate comment.
A uniform distribution actually seems like a very weird choice here.
Defining utility functions over full world states seems fine (even if not practical at larger scale), and defining alignment as dot products over full trajectory/state-space utility functions also seems fine, but only if using true expected utility (i.e., the actual Bayesian posterior distribution over states). That of course can get arbitrarily complex.
But it also seems necessary: for one to say that two utility functions are truly ‘close’, that closeness must cash out to closeness of (perhaps normalized) expected utilities given the true distribution of future trajectories.
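One way to write down the contrast being drawn here, with notation that is mine rather than the paper's: the paper's distance is an angle taken under the uniform measure over (state, action) pairs, whereas the notion gestured at above is the same kind of dot product taken under the true (Bayesian posterior) distribution over future trajectories.

```latex
% The paper's notion, as described upthread: angle between reward vectors,
% i.e. a dot product taken under the uniform measure over (s, a).
d_{\mathrm{unif}}(R_1, R_2)
  = \arccos \frac{\mathbb{E}_{(S,A)\sim\mathrm{Unif}}\!\left[R_1(S,A)\,R_2(S,A)\right]}
                 {\sqrt{\mathbb{E}_{\mathrm{Unif}}\!\left[R_1^2\right]\,
                        \mathbb{E}_{\mathrm{Unif}}\!\left[R_2^2\right]}}

% The notion gestured at here: the same dot product, but taken under the true
% (Bayesian posterior) distribution P over future trajectories \tau.
d_{P}(U_1, U_2)
  = \arccos \frac{\mathbb{E}_{\tau\sim P}\!\left[U_1(\tau)\,U_2(\tau)\right]}
                 {\sqrt{\mathbb{E}_{P}\!\left[U_1^2\right]\,
                        \mathbb{E}_{P}\!\left[U_2^2\right]}}
```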
Do you see a relation between the early stopping criteria and regularization/generalization of the proxy reward?
The reason we’re using a uniform distribution is that it follows naturally from the math, but maybe an intuitive explanation is the following: the reason it seems weird is that most realistic distributions are only going to sample from a small number of states/actions, whereas the uniform distribution more or less encodes that the reward functions are similar across most states/actions. So it’s encoding something about generalization.
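A sketch of one sense in which the uniform measure “encodes generalization” (my framing and made-up quantities, not a claim from the paper): a small uniform-measure angle bounds the normalized disagreement between two rewards under any state distribution whose density ratio to uniform is bounded, whereas closeness measured on one narrow distribution says nothing about states off that distribution.

```python
# Toy check of a simple bound: if the (uniform-measure) cosine between two
# normalized reward vectors is 1 - eps, then ||u1 - u2||^2 = 2*eps, so any
# distribution d with density at most kappa/n per state satisfies
#   E_d[(u1 - u2)^2] <= kappa * 2 * eps / n.
import numpy as np

rng = np.random.default_rng(2)
n_states = 50_000

r1 = rng.normal(size=n_states)
r2 = r1 + 0.05 * rng.normal(size=n_states)    # small angle to r1 under the uniform measure

u1, u2 = r1 / np.linalg.norm(r1), r2 / np.linalg.norm(r2)
cos_unif = float(u1 @ u2)
eps = 1.0 - cos_unif

kappa = 100.0                                  # allowed density ratio relative to uniform
bound = kappa * 2.0 * eps / n_states

# Adversarially choose the worst allowed d: pile the permitted mass onto the
# states where u1 and u2 disagree the most.
worst = np.argsort(-(u1 - u2) ** 2)[: int(n_states / kappa)]
d = np.zeros(n_states)
d[worst] = kappa / n_states
realized = float(np.sum(d * (u1 - u2) ** 2))

print(f"uniform cosine = {cos_unif:.6f}")
print(f"worst-case disagreement at density ratio {kappa:.0f}: {realized:.2e} (bound {bound:.2e})")
```

By contrast, the first toy sketch above shows that agreement measured only on a narrow distribution puts no limit at all on how far apart the rewards are elsewhere.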