This paper gives a mathematical model of when Goodharting will occur. To summarize: if
(1) a human has some collection s_1, …, s_n of things which she values,
(2) a robot has access to a proxy utility function which takes into account some strict subset of those things, and
(3) the robot can freely vary how much of each of s_1, …, s_n there is in the world, subject only to resource constraints that make the s_i trade off against each other,
then when the robot optimizes for its proxy utility, it will minimize all the s_i’s which its proxy utility function doesn’t take into account. If you impose a further condition which ensures that you can’t get too much utility by maximizing only a strict subset of the s_i’s (e.g. assuming diminishing marginal returns), then the optimum found by the robot will be suboptimal for the human’s true utility function.
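To make the setup concrete, here’s a minimal sketch of the model (the choice of n = 4 valued things, a proxy that covers only the first two, and square-root utilities standing in for diminishing returns are all my own illustrative assumptions, not the paper’s):

```python
import numpy as np

n, k, budget = 4, 2, 10.0                 # n valued things; the proxy only sees the first k

def true_utility(s):
    return np.sum(np.sqrt(s))             # diminishing marginal returns in every s_i

# With square-root utilities and a fixed budget, the best move is to split the
# budget evenly across whichever s_i you are actually optimizing for.
robot_alloc = np.zeros(n)
robot_alloc[:k] = budget / k              # the proxy ignores s_3, s_4, so they get driven to 0
human_alloc = np.full(n, budget / n)      # the human's true optimum spreads the budget out

print("robot's allocation:", robot_alloc)                                           # [5. 5. 0. 0.]
print("true utility at the robot's optimum:", round(true_utility(robot_alloc), 2))  # ~4.47
print("true utility at the human's optimum:", round(true_utility(human_alloc), 2))  # ~6.32
```

The robot’s optimum zeroes out the valued things the proxy omits, and the diminishing-returns assumption is what makes its allocation strictly worse for the true utility.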
That said, I wasn’t super-impressed by this paper—the above is pretty obvious and the mathematical model doesn’t elucidate anything, IMO.
Moreover, I think this model doesn’t interact much with the skeptical take about whether Goodhart’s Law implies doom in practice. Namely, here are some things I believe about the world which this model doesn’t take into account:
(1) Lots of the things we value are correlated with each other over “realistically attainable” distributions of world states. Or in other words, for many pairs s_i, s_j of things we care about, it is hard (concretely, requires a very capable AI) to increase the amount of s_i without also increasing the amount of s_j.
(2) The utility functions of future AIs will be learned from humans in such a way that as the capabilities of AI systems increase, so will their ability to model human preferences.
If (1) is true, then for each given capabilities level, there is some room for error for our proxy utility functions (within which an agent at that capabilities level won’t be able to decouple our proxy utility function from our true utility function); this permissible error margin shrinks with increasing capabilities. If you buy (2), then you might additionally think that the actual error margin between learned proxy utility functions and our true utility function will shrink more rapidly than the permissible error margin as AI capabilities grow. (Whether or not you actually do believe that value learning will beat capabilities in this race probably depends on a whole lot of other empirical beliefs, or so it seems to me.)
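Here’s a toy sketch of that picture (the distributions and numbers below are purely illustrative assumptions, not anything from the paper): reachable world states are mostly drawn from a “natural” distribution where s_1 and s_2 move together, plus a rare “engineered” tail where s_1 is pushed high at s_2’s expense, and capability is modeled as how many candidate states the agent can search over.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_states(n_states):
    """Mostly 'natural' states where s_1 and s_2 are correlated, plus a rare
    'engineered' tail where s_1 is high and s_2 is sacrificed."""
    natural = rng.multivariate_normal([0, 0], [[1.0, 0.9], [0.9, 1.0]], size=n_states)
    engineered = np.column_stack([rng.normal(5.0, 0.5, n_states), np.full(n_states, -10.0)])
    mask = rng.random(n_states) < 0.01          # engineered states are rare / hard to reach
    return np.where(mask[:, None], engineered, natural)

def optimize(capability):
    states = sample_states(capability)          # capability = how many states the agent can consider
    best = states[np.argmax(states[:, 0])]      # proxy utility: s_1 only
    return best[0] + best[1]                    # true utility: s_1 + s_2

for capability in [10, 100, 10_000]:
    runs = [optimize(capability) for _ in range(200)]
    print(f"capability {capability:>6}: mean true utility {np.mean(runs):+.2f}")
```

A weak optimizer mostly lands in the correlated region, so pushing the proxy up also pushes true utility up; a strong enough optimizer reliably finds the decoupled states, and mean true utility goes negative. How much proxy error is tolerable thus depends on how hard the proxy is being optimized.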
This thread (which you might have already seen) has some good discussion about whether Goodharting will be a big problem in practice.
I actually don’t think that model is general enough. Like, I think Goodharting is just a fact about any control system that acts on observations.
Suppose we have a simple control system with output X and a governor G. G takes a measurement m(X) (an observation) of X. So long as m(X) is not error-free (and I think we can agree that no real-world system can be actually error-free), then X = m(X) + ϵ for some error term ϵ. Since G uses m(X) to regulate the system to change X, we now have error influencing the value of X. Now, applying the standard reasoning for Goodhart, in the limit of optimization pressure (i.e. G regulating the value of X for long enough), ϵ comes to dominate the value of X.
This is a bit handwavy, but I’m pretty sure it’s true, which means that in theory any attempt to optimize for anything will, under enough optimization pressure, become dominated by error, whether the target is human values or something else. The only interesting question is whether we can control the error enough, either through better measurement or less optimization pressure, to get enough signal to be happy with the output.
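To make the loop concrete, here’s a tiny numerical sketch (the linear governor and the particular gain and noise numbers are just illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
setpoint, gain, noise_std = 0.0, 0.5, 0.1

X = 10.0                                  # start far from the setpoint
history = []
for step in range(200):
    m = X + rng.normal(0, noise_std)      # m(X): the observation, with error ϵ
    X -= gain * (m - setpoint)            # G regulates using only the measurement
    history.append(X)

# Early on, X is dominated by its true distance from the setpoint; once that has
# been optimized away, what remains in X is driven almost entirely by the
# measurement error (its spread is on the order of noise_std).
print("std of X over the last 50 steps:", np.std(history[-50:]).round(3))
```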
Hmm, I’m not sure I understand—it doesn’t seem to me like noisy observations ought to pose a big problem to control systems in general.
For example, suppose we want to minimize the number of mosquitos in the U.S., and we have access to noisy estimates of mosquito counts in each county. This may result in us allocating resources slightly inefficiently (e.g. overspending on counties that have fewer mosquitos than we think), but we’ll still be doing approximately the right thing, and mosquito counts will go down. In particular, I don’t see a sense in which the error “comes to dominate” the thing we’re optimizing.
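Here’s a rough numerical version of what I mean (the county counts, the noise level, and the exponential die-off model are all made-up assumptions, just to show the shape of the effect):

```python
import numpy as np

rng = np.random.default_rng(1)
true_counts = rng.uniform(1e5, 1e6, size=50)              # 50 hypothetical counties
noisy_counts = true_counts * rng.normal(1.0, 0.2, 50)     # ~20% measurement noise

def remaining_mosquitos(counts_used_for_allocation, budget=25.0):
    # Allocate spraying effort proportionally to the counts we believe, then
    # assume mosquito numbers in each county fall off exponentially with effort.
    effort = budget * counts_used_for_allocation / counts_used_for_allocation.sum()
    return np.sum(true_counts * np.exp(-effort))

print(f"no spraying:           {true_counts.sum():.3e}")
print(f"allocate on truth:     {remaining_mosquitos(true_counts):.3e}")
print(f"allocate on estimates: {remaining_mosquitos(noisy_counts):.3e}")
```

The noisy allocation comes out only marginally worse than the true-count allocation; the measurement error shifts resources around a bit rather than dominating the outcome.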
One concern which does make sense to me (and I’m not sure if I’m steelmanning your point or just saying something completely different) is that under extreme optimization pressure, measurements might become decoupled from the thing they’re supposed to measure. In the mosquito example, this would look like us bribing the surveyors to report artificially low mosquito counts instead of actually trying to affect real-world mosquito counts.
If this is your primary concern regarding Goodhart’s Law, then I agree the model above doesn’t obviously capture it. I guess it’s more precisely a model of proxy misspecification.
“Error” here is all sources of error, not just error in the measurement equipment. So bribing surveyors is a kind of error in my model.
Can you explain where there is an error term in AlphaGo, or where an error term might appear in a hypothetical model similar to AlphaGo but trained for much longer with many more parameters and much more compute?
AlphaGo is fairly constrained in what it’s designed to optimize for, but it still has the standard failure mode of “things we forgot to encode”. So, for example, AlphaGo could suffer the error of instrumental power-grabbing in order to get better at winning Go, because we misspecified what we asked it to measure. This is a kind of failure introduced into the system by humans failing to make m(X) adequately evaluate X as we intended: we cared about winning Go games while also minimizing side effects, but maybe when we constructed m(X) we forgot about minimizing side effects.