I’m fairly pessimistic about our ability to build aligned AI. My take is roughly that it’s theoretically impossible, and at best we might build AI that is aligned well enough that we don’t lose. I haven’t written anything up that really summarizes or proves this, though.
My take comes from two facts:
(1) Goodharting is robust. That is, the mechanism behind Goodharting seems impossible to overcome; it’s simply a fact of any control system.
(2) It’s impossible to perfectly infer the inner experience (and thus the values) of another being without making normative assumptions.
Stuart Armstrong has made a case for (2) with his no free lunch theorem. I’ve not seen anyone formally make the case for (1), though.
Is this something worth trying to prove? That Goodharting is unavoidable and at most we can try to contain its effects?
I’m many years out from doing math full time, so I’m not sure I could produce a rigorous proof of it. But this is something people do sometimes disagree about (arguing that Goodharting can be overcome), and I think most of those discussions don’t get very precise about what that would mean.
This paper gives a mathematical model of when Goodharting will occur. To summarize: if
(1) a human has some collection $s_1, \ldots, s_n$ of things which she values,
(2) a robot has access to a proxy utility function which takes into account some strict subset of those things, and
(3) the robot can freely vary how much of $s_1, \ldots, s_n$ there is in the world, subject only to resource constraints that make the $s_i$ trade off against each other,
then when the robot optimizes for its proxy utility, it will minimize all the $s_i$ which its proxy utility function doesn’t take into account. If you impose a further condition which ensures that you can’t get too much utility by maximizing only a strict subset of the $s_i$ (e.g. assuming diminishing marginal returns), then the optimum found by the robot will be suboptimal for the human’s true utility function.
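As a minimal numerical sketch of that setup (my own toy example, not taken from the paper): suppose the human’s true utility has diminishing returns in three goods, the robot’s proxy only counts the first two, and a fixed budget makes the goods trade off against each other.

```python
import numpy as np

# Toy sketch (my own example, not from the paper): three goods s1, s2, s3,
# true utility = sqrt(s1) + sqrt(s2) + sqrt(s3) (diminishing returns),
# a proxy utility that omits s3, and a fixed budget B that the goods share.
B = 9.0

def true_utility(s):
    return float(np.sum(np.sqrt(s)))

# With square-root utilities, the optimum splits the budget equally among
# whichever goods the utility function actually counts.
proxy_optimum = np.array([B / 2, B / 2, 0.0])    # robot's allocation: s3 driven to zero
human_optimum = np.array([B / 3, B / 3, B / 3])  # allocation the human would prefer

print("robot's allocation:", proxy_optimum, "-> true utility", round(true_utility(proxy_optimum), 2))
print("human's allocation:", human_optimum, "-> true utility", round(true_utility(human_optimum), 2))
# The robot maximizes its proxy, drives the omitted good s3 to zero, and ends
# up strictly worse under the human's true utility (about 4.24 vs. 5.20).
```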
That said, I wasn’t super-impressed by this paper—the above is pretty obvious and the mathematical model doesn’t elucidate anything, IMO.
Moreover, I think this model doesn’t interact much with the skeptical take about whether Goodhart’s Law implies doom in practice. Namely, here are some things I believe about the world which this model doesn’t take into account:
(1) Lots of the things we value are correlated with each other over “realistically attainable” distributions of world states. Or in other words, for many pairs $s_i, s_j$ of things we care about, it is hard (concretely, requires a very capable AI) to increase the amount of $s_i$ without also increasing the amount of $s_j$.
(2) The utility functions of future AIs will be learned from humans in such a way that as the capabilities of AI systems increase, so will their ability to model human preferences.
If (1) is true, then for each given capabilities level, there is some room for error for our proxy utility functions (within which an agent at that capabilities level won’t be able to decouple our proxy utility function from our true utility function); this permissible error margin shrinks with increasing capabilities. If you buy (2), then you might additionally think that the actual error margin between learned proxy utility functions and our true utility function will shrink more rapidly than the permissible error margin as AI capabilities grow. (Whether or not you actually do believe that value learning will beat capabilities in this race probably depends on a whole lot of other empirical beliefs, or so it seems to me.)
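To state the race a bit more precisely (this is my own framing, and the symbols are hypothetical): write $\varepsilon_{\mathrm{perm}}(c)$ for the largest proxy error that an agent at capability level $c$ cannot exploit to decouple the proxy from our true utility, and $\varepsilon_{\mathrm{learn}}(c)$ for the error of the proxy actually learned at that capability level. Point (1) says $\varepsilon_{\mathrm{perm}}(c)$ is positive but shrinks as $c$ grows; the optimistic reading of (2) is that $\varepsilon_{\mathrm{learn}}(c) < \varepsilon_{\mathrm{perm}}(c)$ at every capability level we actually pass through, i.e. value learning stays ahead of capabilities in the race.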
This thread (which you might have already seen) has some good discussion about whether Goodharting will be a big problem in practice.
I actually don’t think that model is general enough. Like, I think Goodharting is just a fact of any control system that acts on observations.
Suppose we have a simple control system with output $X$ and a governor $G$. $G$ takes a measurement $m(X)$ (an observation) of $X$. So long as $m(X)$ is not error free (and I think we can agree that no real-world system can actually be error free), we have $m(X) = X + \epsilon$ for some error term $\epsilon$. Since $G$ uses $m(X)$ to regulate the system and change $X$, the error now influences the value of $X$. Applying the standard reasoning for Goodhart: in the limit of optimization pressure (i.e. $G$ regulating the value of $X$ for long enough), $\epsilon$ comes to dominate the value of $X$.
This is a bit handwavy, but I’m pretty sure it’s true, which means that in theory any attempt to optimize for anything will, under enough optimization pressure, become dominated by error, whether the target is human values or something else. The only interesting question is whether we can control the error enough, either through better measurement or less optimization pressure, to get enough signal to be happy with the output.
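One hedged way to gesture at this numerically (my own toy construction, not a proof of the claim): have a governor select, out of $N$ candidate states, the one whose noisy measurement is highest. As $N$ (a crude stand-in for optimization pressure) grows, the measurement it acts on is increasingly explained by the error term rather than by the underlying value.

```python
import numpy as np

# Toy sketch (my own construction): true value x of a candidate state is
# bounded (uniform on [0, 1]) while measurement error eps is unbounded
# (standard normal). The governor picks, out of N candidates, the one whose
# measurement m = x + eps is largest; N stands in for optimization pressure.
rng = np.random.default_rng(0)

def selection_under_pressure(n_candidates, trials=2000):
    true_vals, errors = [], []
    for _ in range(trials):
        x = rng.uniform(0.0, 1.0, n_candidates)    # bounded true values
        eps = rng.normal(0.0, 1.0, n_candidates)   # unbounded measurement error
        best = np.argmax(x + eps)                  # governor optimizes the measurement
        true_vals.append(x[best])
        errors.append(eps[best])
    return np.mean(true_vals), np.mean(errors)

for n in [1, 10, 100, 10_000]:
    x_sel, eps_sel = selection_under_pressure(n)
    print(f"N = {n:6d}: selected true value = {x_sel:.2f}, selected error = {eps_sel:.2f}")

# As N grows, the selected state's true value plateaus near its ceiling of 1
# while the selected error keeps growing, so the score the governor sees is
# increasingly error rather than signal. (Strictly, this shows the measurement
# being dominated by error; whether the underlying X is also ruined depends on
# further assumptions.)
```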
Hmm, I’m not sure I understand—it doesn’t seem to me like noisy observations ought to pose a big problem to control systems in general.
For example, suppose we want to minimize the number of mosquitos in the U.S., and we have access to noisy estimates of mosquito counts in each county. This may result in us allocating resources slightly inefficiently (e.g. overspending resources on counties that have fewer mosquitos than we think), but we’ll still always be doing approximately the correct thing, and mosquito counts will go down. In particular, I don’t see a sense in which the error “comes to dominate” the thing we’re optimizing.
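A quick sketch of that intuition (my own toy numbers, nothing more): allocate a fixed amount of spraying effort to the counties with the highest estimated counts, and compare noisy estimates against exact counts.

```python
import numpy as np

# Toy illustration with made-up numbers: spray the k counties with the highest
# estimated mosquito counts, splitting a fixed budget equally among them;
# effort beyond a county's actual count is wasted.
rng = np.random.default_rng(0)

counts = rng.uniform(1_000, 100_000, size=50)      # true counts per county
noisy = counts * rng.lognormal(0.0, 0.3, size=50)  # noisy survey estimates
budget = 1_000_000                                 # total mosquitos we can remove

def remaining_after_spraying(estimates, k=10):
    targets = np.argsort(estimates)[-k:]           # counties we choose to spray
    removed = np.minimum(counts[targets], budget / k)
    return counts.sum() - removed.sum()

print("initial total:              ", int(counts.sum()))
print("remaining (noisy estimates):", int(remaining_after_spraying(noisy)))
print("remaining (exact counts):   ", int(remaining_after_spraying(counts)))
# The noisy estimates occasionally target the wrong counties, so the reduction
# is a bit smaller than with perfect information, but optimization still pushes
# counts down; the noise costs efficiency rather than reversing the direction.
```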
One concern which does make sense to me (and I’m not sure if I’m steelmanning your point or just saying something completely different) is that under extreme optimization pressure, measurements might become decoupled from the thing they’re supposed to measure. In the mosquito example, this would look like us bribing the surveyors to report artificially low mosquito counts instead of actually trying to affect real-world mosquito counts.
If this is your primary concern regarding Goodhart’s Law, then I agree the model above doesn’t obviously capture it. I guess it’s more precisely a model of proxy misspecification.
“Error” here is all sources of error, not just error in the measurement equipment. So bribing surveyors is a kind of error in my model.
Can you explain where there is an error term in AlphaGo, or where an error term might appear in a hypothetical model similar to AlphaGo trained much longer with many more parameters and more computational resources?
AlphaGo is fairly constrained in what it’s designed to optimize for, but it still has the standard failure mode of “things we forgot to encode”. So, for example, AlphaGo could suffer the error of instrumental power-grabbing in order to get better at winning Go, because we misspecified what we asked it to measure. This is a kind of failure introduced into the system by humans failing to make $m(X)$ adequately evaluate $X$ as we intended: we cared about winning Go games while also minimizing side effects, but maybe when we constructed $m(X)$ we forgot about minimizing side effects.
At least one person here disagrees with you on Goodharting. (I do.)
You’ve written before on this site, if I recall correctly, that Eliezer’s 2004 CEV proposal is unworkable because of Goodharting. I am granting myself the luxury of not bothering to look up your previous statement because you can contradict me if my recollection is incorrect.
I believe that the CEV proposal is probably achievable by humans if those humans had enough time and enough resources (money, talent, protection from meddling) and that if it is not achievable, it is because of reasons other than Goodhart’s law.
(Sadly, an unaligned superintelligence is much easier for humans living in 2022 to create than a CEV-aligned superintelligence is, so we are probably all going to die IMHO.)
Perhaps before discussing the CEV proposal we should discuss a simpler question, namely, whether you believe that Goodharting inevitably ruins the plans of any group setting out intentionally to create a superintelligent paperclip maximizer.
Another simple goal we might discuss is a superintelligence (SI) whose goal is to shove as much matter as possible into a black hole, or an SI that “shuts itself off” within 3 months of its launch, where “shuts itself off” means it stops trying to survive or to affect reality in any way.
The reason Eliezer’s 2004 “coherent extrapolated volition” (CEV) proposal is immune to Goodharting is probably that immunity to Goodharting was one of the main criteria for its creation. I.e., Eliezer came up with it through a process of looking for a design immune to Goodharting. It may very well be that all other published proposals for aligning superintelligent AI are vulnerable to Goodharting.
Goodhart’s law basically says that if we put too much optimization pressure on criterion X, then as a side effect, the optimization process drives criteria Y and Z, which we also care about, higher or lower than we consider reasonable. But that doesn’t apply when criterion X is “everything we value” or “the reflective equilibrium of everything we value”.
The problem, of course, is that although the CEV plan is probably within human capabilities to implement (and IMHO Scott Garrabrant’s work is probably a step forward), unaligned AI is significantly easier to implement, so it will likely arrive first.