Let me clarify the distinction I’m trying to point at:
First, Goodhart’s law applies to us when we’re optimizing a goal for ourselves, but we don’t know the exact goal. For example, if I’m trying to make myself happy, I might find a proxy of dancing, even though dancing isn’t literally the global optimum. This uses up time I could have used on the actual best solution. This can be bad, but it doesn’t seem that bad. I’m pretty corrigible to myself.
Second, Goodhart’s law applies to other agents who are instructed to maximize some proxy of what we want. This is bad. If it’s maximizing the proxy, then it’s ensuring it’s most able to maximize the proxy, which means it’s incentivized to stop us from doing things (unless the proxy specifically includes that—which safeguard is also vulnerable to misspecification; or is somehow otherwise more intelligently designed than the standard reward-maximization model). The agent is pursuing the proxy from its own perspective, not from ours.
I think this entropyish thing is also why Stuart makes his point that Goodhart applies to humans and not in general: it’s only because of the unique state humans are in (existing in a low-entropy universe, having an unusually large amount of power) that Goodhart tends to affect us.
Actually, I think I have a more precise description of the entropyish thing now. Goodhart’s Law isn’t driven by entropy; Goodhart’s Law is driven by trying to optimize a utility function that already has an unusually high value relative to what you’d expect from your universe. Entropy just happens to be a reasonable proxy for it sometimes.
I don’t think the initial value has much to do with what you label the “AIS version” of Goodhart (neither does the complexity of human values in particular). Imagine we had a reward function that gave one point of reward for each cone detecting red; reward is dispensed once per second. Imagine that the universe is presently low-value; for whatever reason, red stimulation is hard to find. Goodhart’s law still applies to agents we build to ensure we can see red forever, but it doesn’t apply to us directly: we presumably deduce our true reward function, and no longer rely on proxies to maximize it.
The reason it applies to agents we build is that not only do we have to encode the reward function, but we also have to point to people! This does not have a short description length. With respect to hard maximizers, a single misstep means the agent is now showing itself red, or something.
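To make the "pointing" problem concrete, here is a minimal toy sketch, not anyone's actual proposal. Everything in it is a made-up assumption for illustration: the `World` fields, the three actions, and their numbers. It only shows that a hard maximizer given a reward that points at its own sensors instead of at human cones picks a very different action.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class World:
    """Hypothetical outcome of an action (all numbers are invented)."""
    human_red_cones: int    # human cones detecting red after the action
    agent_red_sensors: int  # agent's own sensors detecting red after the action


# Hypothetical action set: each action leads to one outcome.
ACTIONS = {
    "do_nothing": World(human_red_cones=5, agent_red_sensors=5),
    "paint_room_red": World(human_red_cones=80, agent_red_sensors=10),
    "show_self_red": World(human_red_cones=0, agent_red_sensors=100),
}


def intended_reward(world: World) -> int:
    """One point per human cone detecting red (what we meant)."""
    return world.human_red_cones


def mispointed_reward(world: World) -> int:
    """Same counting rule, but the pointer lands on the agent, not on people."""
    return world.agent_red_sensors


def hard_maximize(reward_fn) -> str:
    """Pick the single action with the highest reward, with no hedging."""
    return max(ACTIONS, key=lambda name: reward_fn(ACTIONS[name]))


print(hard_maximize(intended_reward))
print(hard_maximize(mispointed_reward))
```

The counting rule is identical in both reward functions; only the referent changes, and that alone is enough to flip the chosen action in this toy.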
How proxies interact is worth considering, but (IMO) it’s far from the main reason for Goodhart’s law being really, really bad in the context of AI safety.
Oh I see where you’re coming from now. I’ll admit that, when I made my earlier post, I forgot about the full implications of instrumental convergence. Specifically, the part where:
Maximizing X minimizes all Not X insofar as they both compete for the same resource pool.
Even if your resources are unusually low relative to where you’re positioned in the universe, an AI will still take that away from you. Optimizing one utility function doesn’t just randomly affect the optimization of other utility functions; they are anti-correlated in general.
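A toy illustration of the resource-competition claim, under loud assumptions: a single fixed `BUDGET`, square-root utilities, and a grid search are all hypothetical choices made only for this sketch. It covers just the case where X and Not X draw on the same pool.

```python
import math

BUDGET = 10.0  # hypothetical shared resource pool


def utility_x(resources_for_x: float) -> float:
    """Assumed utility for X: increasing in the resources X receives."""
    return math.sqrt(resources_for_x)


def utility_not_x(resources_for_x: float) -> float:
    """Assumed utility for Not X: it gets whatever X leaves behind."""
    return math.sqrt(BUDGET - resources_for_x)


def best_allocation_for_x(steps: int = 100) -> float:
    """Grid-search the allocation that maximizes X alone."""
    candidates = [BUDGET * i / steps for i in range(steps + 1)]
    return max(candidates, key=utility_x)


allocation = best_allocation_for_x()
print(allocation, utility_x(allocation), utility_not_x(allocation))
```

In this setup the X-maximizer takes the whole pool, so Not X ends at its minimum. Nothing here says the maximizer dislikes Not X; the anti-correlation comes entirely from the shared budget.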
Well, they’re anti-correlated across different agents. But from the same agent’s perspective, they may still be able to maximize their own red-seeing, or even human red-seeing—they just won’t. (This will be in the next part of my sequence on impact).
Well, they’re anti-correlated across different agents. But from the same agent’s perspective, they may still be able to maximize their own red-seeing, or even human red-seeing—they just won’t
Just making sure I can parse this… When I say that they’re anti-correlated, I mean that the policy of maximizing X is akin to the policy of minimizing not X to the extent that X and not X will at some point compete for the same instrumental resources. I will agree with the statement that an agent maximizing X who possesses many instrumental resources can use them to accomplish not X (and, in this sense, the agent doesn’t perceive X and not X as anti-correlated); and I’ll also agree that an agent optimizing X and another optimizing not X will be competitive for instrumental resources and view those things as anti-correlated.
they may still be able to maximize their own red-seeing, or even human red-seeing—they just won’t
I think some of this is a matter of semantics but I think I agree with this. There are also two different definitions of the word able here:
Able #1 : The extent to which it is possible for an agent to achieve X across all possible universes we think we might reside in
Able #2 : The extent to which it is possible for an agent to achieve X in a counterfactual where the agent has a goal of achieving X
I think you’re using Able #2 (which makes sense—it’s how the word is used colloquially). I tend to use Able #1 (because I read a lot about determinism when I was younger). I might be wrong about this though because you made a similar distinction between physical capability and anticipated possibility like this in Gears of Impact:
People have a natural sense of what they “could” do. If you’re sad, it still feels like you “could” do a ton of work anyways. It doesn’t feel physically impossible.
...
Imagine suddenly becoming not-sad. Now you “could” work when you’re sad, and you “could” work when you’re not-sad, so if AU just compared the things you “could” do, you wouldn’t feel impact here.
I think you’re using Able #2 (which makes sense—it’s how the word is used colloquially). I tend to use Able #1 (because I read a lot about determinism when I was younger). I might be wrong about this though because you made a similar distinction between physical capability and anticipated possibility like this in Gears of Impact:
I am using #2, but I’m aware that there’s a separate #1 meaning (and thank you for distinguishing between them so clearly, here!).
I really gotta re-read Goodhart’s Taxonomy for a fourth time...