This is only true for the kind of things humans typically care about; this is not true for utility functions in general. That’s the extra info we have.
While I generally agree that there can be utility functions that aren’t subject to Goodhart, I don’t think this strictly pertains to humans. I expect that when the vast majority of agents (human or not) use scientific methods to develop a proxy for the thing they want to optimize, they will find that proxy breaks down under intense optimization:
-proxies are learned in a particular environment, in which they work to predict the utility function
-aggressively optimizing anything enough will usually change the environment dramatically
-so aggressively optimizing a given proxy will eventually violate the assumptions under which the proxy was created
-once the assumptions that justified the proxy’s design no longer hold, optimizing it further is akin to acting randomly, and that much can be achieved by the “do nothing” policy without the added expenditure of resources
-when the world is in a state where agentic actions have already increased the value of a utility function, behaving randomly seems more likely to reduce that utility function than to increase it, in the same way that randomness tends to push worlds toward states of higher entropy rather than lower ones
The last point is somewhat hand-wavy, since we can have a utility function like “maximize entropy” that yields many proxies which don’t get Goodhart’d (in the sense of optimization making things worse, rather than merely not making them better). Still, “Goodhart’s Law applies to agents with utility functions of relatively low entropy” is much more generic than “Goodhart’s Law applies to humans.” I’m also not sure how helpful that is: even if we know that we should stop optimizing at some point, what metric do we actually use to decide when to stop?
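This last point can be illustrated with a small simulation (my own toy sketch, not from the discussion; the “utility” here is just an arbitrary ordered configuration): starting from a state with unusually high utility, random perturbations almost always push utility back down, the same way they push entropy up.

```python
import random

random.seed(0)
N = 100
state = [0] * N                 # an unusually ordered, high-utility state
utility = lambda s: s.count(0)  # utility = how ordered the state is

before = utility(state)         # 100: far above a random state's ~50
for _ in range(1000):
    i = random.randrange(N)
    state[i] ^= random.getrandbits(1)  # a random, goalless perturbation
after = utility(state)

print(before, "->", after)  # utility drifts toward ~50, well below the start
```

A random policy has no information about which states score well, so from an unusually high-scoring state almost every direction is downhill.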
The explanation is a bit simpler than this. The agent has one goal, and we have other goals. It gains power to best complete its goal by taking power away from us. Therefore, any universe where we have an effective maximizer of something misspecified is a universe where we’re no longer able to get what we want. That’s why instrumental convergence is so bad.
-------------------------------------Part 1: I Respond to Your Actual Comment----------------------------------------
The explanation is a bit simpler than this. The agent has one goal, and we have other goals. It gains power to best complete its goal by taking power away from us.
This adversarial issue can be true, and is critical, but I don’t think it’s what Stuart was pointing to in his post or his reply.
I don’t think this explanation is in conflict with mine. Much of my explanation (i.e., “optimizing a proxy too aggressively will invalidate the assumptions that the proxy was built on”) is focused on explaining why we expect proxies to become misspecified. In the context of AGI, this isn’t that important, because we have such low confidence in our ability to specify our values anyway. However, this model is more general and can help explain why we expect to make many mistakes when trying to specify our values:
Because our values have never been tested in the kinds of universes that aggressive optimization might produce, our proxies will fail to account for as-yet-unmeasured factors in the things we care about.
You also mention power. I think this is a subset of the vague entropy thing I was being hand-wavy about, because:
1. A relatively low-entropy universe is a necessary but not sufficient condition for humans having power. Thus, humans having power (and humans existing) implies that the universe has relatively low entropy.
2. This implies that acting randomly will tend to lessen human power rather than increase it (since randomness tends to increase entropy).
I think this entropy-ish thing is also why Stuart makes his point that Goodhart applies to humans and not in general: it’s only because of the unique state humans are in (existing in a low-entropy universe, having an unusually large amount of power) that Goodhart tends to affect us.
Actually, I think I have a more precise description of the entropy-ish thing now. Goodhart’s Law isn’t driven by entropy; Goodhart’s Law is driven by trying to optimize a utility function that already has an unusually high value relative to what you’d expect from your universe. Entropy just happens to be a reasonable proxy for it sometimes.
---Part 2: Goodhart’s Law in a Simple, Semi-Realistic, Non-Adversarial Linear Optimization Problem-----
So, writing the response above gave me a bunch of ideas. To lead with, it’s worth noting that problems like this can happen in non-adversarial contexts too.
Example:
Say you’re in a world where your utility function is U=x+2y but you’ve only ever existed in environments where y=10 and x varies between 1 and 4 (currently, x=2). As a result, you decide to optimize the proxy V=x because you have no idea that y matters.
At first, this optimization works great. You were initially at U=x+2y=2+2⋅10=22. But you’ve progressively increased x more than it’s ever been increased before, all the way up to x=10. Your utility function is now U=x+2y=10+2⋅10=30; your proxy is V=10; and you’re much happier now than you’ve ever been before.
However, unbeknownst to you, both x and y draw from the same resource pool and so are constrained by the relation x+y≤20. You continue optimizing your proxy all the way up to x=20, which (inadvertently) causes y to drop to y=0. Your proxy now outputs V=20, but your utility function outputs U=x+2y=20+2⋅0=20, which is lower than its initial value of 22. Despite initially improving your utility by 8 (peaking at x=10, y=10), the optimization process ultimately leaves it two below its starting value once the resource pool limitation begins to bind. Also note that, if the resource pool had not been underexploited to begin with (i.e., if x+y=20 initially rather than x+y=2+10=12), the optimization process would have immediately begun to reduce utility by trading off the more valuable y for the less valuable x.
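The trajectory above can be reproduced in a few lines (a toy sketch of the example as stated, not anything from the original discussion):

```python
# Toy model from the example: true utility U = x + 2y, proxy V = x,
# and a shared resource pool x + y <= 20. Start at x = 2, y = 10.
def true_utility(x, y):
    return x + 2 * y

x, y, cap = 2, 10, 20
history = [(x, y, true_utility(x, y))]
while x < 20:
    x += 1               # the only thing the proxy optimizer ever does
    y = min(y, cap - x)  # once the pool binds, y gets crowded out
    history.append((x, y, true_utility(x, y)))

peak = max(history, key=lambda t: t[2])
print("start U =", history[0][2])               # 22
print("peak  U =", peak[2], "at x =", peak[0])  # 30 at x = 10
print("final U =", history[-1][2])              # 20, below the starting 22
```

The proxy V = x rises monotonically the whole time, while the true utility rises only until the constraint binds at x = 10 and falls thereafter.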
In short, Goodhart’s Law isn’t just adversarial. It can also arise in circumstances where:
1. Two desirable things compete for the same resource pool
2. One of the things is more desirable than the other
3. The amount of the more desirable thing has never varied, so no one has noticed that they become less happy when it decreases or happier when it increases
In this scenario, any proxy will necessarily fail at some point, because the more desirable thing gets traded away for the less desirable one. The specific point at which optimization starts to fail depends both on how much more desirable the good thing is and on how limited the pool of resources is.
This is why I don’t think we can use knowledge of Goodhart’s Law alone to prevent Goodhart’s Law. In the above example, even knowing the functional form of the utility function (a sum of linear terms), the exact number of factors missed by the proxy (one factor: y), and the functional form of the constraint (a linear less-than-or-equal-to) won’t tell us when to stop optimizing, because:
If the resource pool has already been completely exploited, we shouldn’t even start optimizing, since doing so trades away a good thing we don’t know about for a less good thing
If the resource pool is unlimited, we should never stop optimizing, because we can always get more of a good thing
If the factor missed by the proxy is less desirable than the one the proxy tracks, we should also never stop optimizing, because the trade-off exchanges an unnoticed, less valuable thing for the more valuable one
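These three regimes can be checked directly in a small extension of the toy model (a hypothetical helper of my own; y stands for the factor the proxy misses):

```python
def best_x(a, b, y0, x0, cap=None, search_max=50):
    """Best amount of x to optimize to, when true utility is U = a*x + b*y
    and (if cap is set) x and y share a resource pool x + y <= cap."""
    hi = cap if cap is not None else search_max
    def U(x):
        # Raising x past the pool's slack crowds out y (but y can't go negative).
        y = y0 if cap is None else max(0, min(y0, cap - x))
        return a * x + b * y
    return max(range(x0, hi + 1), key=U)

# Pool already fully exploited (x0 + y0 = cap): don't optimize at all.
print(best_x(1, 2, 18, 2, cap=20))  # 2
# Unlimited pool: never stop (the answer is whatever bound we search to).
print(best_x(1, 2, 10, 2))          # 50
# The missed factor y is *less* valuable than x: also never stop.
print(best_x(2, 1, 10, 2, cap=20))  # 20, i.e. claim the whole pool
```

The right stopping point ranges from “immediately” to “never” depending on parameters the proxy optimizer does not observe, which is the point being made above.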
And this is just a toy example; real cases, where we know neither the utility function’s functional form nor the constraints, are even more difficult.
Of course, if you notice that the optimization process is making you unhappy, you can stop it early and avert the negative effects yourself. Unfortunately, noticing that the process is making you unhappy requires access to your utility function (or at least to a better proxy for it than the one you’re trying to optimize). By nature, this access cannot be given to an AI, and that is a big problem.
Let me clarify the distinction I’m trying to point at:
First, Goodhart’s law applies to us when we’re optimizing a goal for ourselves, but we don’t know the exact goal. For example, if I’m trying to make myself happy, I might find a proxy of dancing, even though dancing isn’t literally the global optimum. This uses up time I could have used on the actual best solution. This can be bad, but it doesn’t seem that bad. I’m pretty corrigible to myself.
Second, Goodhart’s law applies to other agents who are instructed to maximize some proxy of what we want. This is bad. If the agent is maximizing the proxy, then it’s ensuring it’s best able to maximize the proxy, which means it’s incentivized to stop us from doing things (unless the proxy specifically includes that; such a safeguard is itself vulnerable to misspecification, unless it is somehow designed more intelligently than the standard reward-maximization model). The agent is pursuing the proxy from its own perspective, not from ours.
I think this entropy-ish thing is also why Stuart makes his point that Goodhart applies to humans and not in general: it’s only because of the unique state humans are in (existing in a low-entropy universe, having an unusually large amount of power) that Goodhart tends to affect us.
Actually, I think I have a more precise description of the entropy-ish thing now. Goodhart’s Law isn’t driven by entropy; Goodhart’s Law is driven by trying to optimize a utility function that already has an unusually high value relative to what you’d expect from your universe. Entropy just happens to be a reasonable proxy for it sometimes.
I don’t think the initial value has much to do with what you label the “AIS version” of Goodhart (nor does the complexity of human values in particular). Imagine we had a reward function that gave one point of reward for each cone detecting red, with reward dispensed once per second. Imagine that the universe is presently low-value: for whatever reason, red stimulation is hard to find. Goodhart’s law still applies to agents we build to ensure we can see red forever, but it doesn’t apply to us directly—we presumably deduce our true reward function and no longer rely on proxies to maximize it.
The reason it applies to agents we build is that not only do we have to encode the reward function, we also have to point to people! This does not have a short description length. With a hard maximizer, a single misstep means the agent is now showing itself red, or something.
How proxies interact is worth considering, but (IMO) it’s far from the main reason for Goodhart’s law being really, really bad in the context of AI safety.
Oh, I see where you’re coming from now. I’ll admit that, when I made my earlier post, I forgot about the full implications of instrumental convergence. Specifically, the part where:
Maximizing X minimizes all not-X insofar as they both compete for the same resource pool.
Even if your resources are unusually low relative to where you’re positioned in the universe, an AI will still take them away from you. Optimizing one utility function doesn’t just randomly affect the optimization of other utility functions; they are anti-correlated in general.
I really gotta re-read Goodhart’s Taxonomy for a fourth time...
Well, they’re anti-correlated across different agents. But from the same agent’s perspective, they may still be able to maximize their own red-seeing, or even human red-seeing—they just won’t. (This will be in the next part of my sequence on impact).
Well, they’re anti-correlated across different agents. But from the same agent’s perspective, they may still be able to maximize their own red-seeing, or even human red-seeing—they just won’t
Just making sure I can parse this… When I say they’re anti-correlated, I mean that the policy of maximizing X is akin to a policy of minimizing not-X, to the extent that X and not-X will at some point compete for the same instrumental resources. I will agree that an agent maximizing X who possesses many instrumental resources could use them to accomplish not-X (and, in this sense, the agent doesn’t perceive X and not-X as anti-correlated); and I’ll also agree that an agent optimizing X and another optimizing not-X will compete for instrumental resources and view those goals as anti-correlated.
they may still be able to maximize their own red-seeing, or even human red-seeing—they just won’t
I think some of this is a matter of semantics, but I think I agree with this. There are two different definitions of the word “able” here:
Able #1: The extent to which it is possible for an agent to achieve X across all possible universes we think we might reside in
Able #2: The extent to which it is possible for an agent to achieve X in a counterfactual where the agent has the goal of achieving X
I think you’re using Able #2 (which makes sense—it’s how the word is used colloquially). I tend to use Able #1 (because I read a lot about determinism when I was younger). I might be wrong about this, though, because you made a similar distinction between physical capability and anticipated possibility in Gears of Impact:
People have a natural sense of what they “could” do. If you’re sad, it still feels like you “could” do a ton of work anyways. It doesn’t feel physically impossible.
...
Imagine suddenly becoming not-sad. Now you “could” work when you’re sad, and you “could” work when you’re not-sad, so if AU just compared the things you “could” do, you wouldn’t feel impact here.
I think you’re using Able #2 (which makes sense—it’s how the word is used colloquially). I tend to use Able #1 (because I read a lot about determinism when I was younger). I might be wrong about this, though, because you made a similar distinction between physical capability and anticipated possibility in Gears of Impact:
I am using #2, but I’m aware that there’s a separate #1 meaning (and thank you for distinguishing between them so clearly, here!).
I just wanted to add that, technically speaking, there are two levels of Goodhart’s Law worth discussing here:
1. Goodhart’s Law as traditionally defined: “When a measure becomes a target, it ceases to be a good measure.” AKA: proxies of utility functions, when optimized too aggressively, stop being good proxies for those utility functions.
2. Goodhart’s Law as we deal with it in AI Safety: when a measure becomes a target, it actively causes you to miss the target. AKA: proxies of utility functions, when optimized too aggressively, reduce the values of those utility functions below where they were originally.
The traditional Goodhart’s Law strikes me as pretty general over a broad range of agents trying to optimize things.
The AI Safety version strikes me as pretty common too. But it’s contingent on the agent’s relationship with the universe in a way that the traditional version is not (i.e., the agent already having an unusually high utility value relative to what you’d expect from the universe it’s in).
Yep, those are the two levels I mentioned :-)
But I like your phrasing.
[Retracted my other reply due to math errors]