I find this way of formalizing Goodhart weird. Is there a standard formalization of it, or is this your invention? I’ll explain what I think is weird.
You define U and V such that you can calculate U − V to find W, but this appears to me to skip right past the most pernicious bit of Goodhart, which is that U is only knowable via a measurement (not necessarily a measure), such that I would say V = μ(U) for some “measuring” function μ:U→R, and the problem is that μ(U) is correlated with but different from U, since there may not even be a way to compare μ(U) against U directly.
To make it concrete with an example, suppose U is “beauty as defined by Gordon”. We don’t, at least as of yet, have a way to find U directly, and maybe we never will. So supposing we don’t, if we want to answer questions like “would Gordon find this beautiful?” and “what painting would Gordon most like?”, we need a measurement of U we can work with, developed by, say, using IRL to discover a “beauty function” that describes U, such that we could say how beautiful I would think something is. But we would be hard pressed to be precise about how far off the beauty function is from my sense of beauty, because we only have a very gross measure of the difference: compare how beautiful the beauty function and I think some finite set of things are (finite because I’m a bounded, embedded agent who is never going to get to see all things, even if the beauty function somehow could). And even as we do this, we are still getting a measurement of my internal sense of beauty rather than my internal sense of beauty itself, because we are asking me to say how beautiful I think something is rather than directly observing my sense of beauty. This is much of why I expect that Goodhart is extremely robust.
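To put that distinction in runnable form, here is a toy sketch (my own construction; true_beauty and measured_beauty are just hypothetical stand-ins for U and μ(U)) of how optimizing a correlated-but-different measurement comes apart from optimizing the thing itself:

```python
# Toy sketch: U is a latent value we never observe directly; all we get is mu(U),
# a correlated but distinct measurement. Optimizing the measurement is not the
# same as optimizing U. The specific functions are arbitrary illustrations.
import random

random.seed(0)

def true_beauty(x):
    # Stand-in for U: the actual, unobservable sense of beauty.
    return -(x - 3.0) ** 2

def measured_beauty(x):
    # Stand-in for mu(U): a learned "beauty function" -- correlated with U,
    # but with a systematic bias and some noise.
    return -(x - 3.0) ** 2 + 0.8 * x + random.gauss(0, 0.5)

candidates = [random.uniform(0, 10) for _ in range(10_000)]

best_by_measure = max(candidates, key=measured_beauty)
best_by_truth = max(candidates, key=true_beauty)

print("picked by the measurement:", round(best_by_measure, 2),
      "-> true value:", round(true_beauty(best_by_measure), 2))
print("picked by the true U:     ", round(best_by_truth, 2),
      "-> true value:", round(true_beauty(best_by_truth), 2))
```

The point is not these particular functions, just that the argmax of the measurement need not be anywhere near the argmax of U.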
I think you want to differentiate between different mechanisms for Goodhart’s law. The categorization that Scott Garrabrant put together, and I worked with him on refining, is here: https://arxiv.org/abs/1803.04585
Given that, I see several different things going on.
First, if I read the post correctly, Stuart is discussing regressional Goodhart, in this case the general issue of what Greg Lewis called “The Tails Come Apart”. This occurs whether or not the true value function is known. (As a historical note, this is a broader and, as Scott pointed out, more fundamentally unavoidable claim than either what Goodhart meant or what Campbell was referring to.)
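To illustrate just the regressional case with a toy simulation (my own sketch; the proxy here is nothing more exotic than V = U + independent noise):

```python
# Regressional Goodhart / "The Tails Come Apart": even when the proxy V is just
# U plus independent noise, the items scoring highest on V are not the items
# scoring highest on U -- the top of the proxy distribution regresses to the mean.
import random
import statistics

random.seed(1)

N = 100_000
U = [random.gauss(0, 1) for _ in range(N)]
V = [u + random.gauss(0, 1) for u in U]          # proxy = true value + noise

top_by_proxy = sorted(range(N), key=lambda i: V[i], reverse=True)[:100]  # top 0.1% by V
top_by_truth = sorted(range(N), key=lambda i: U[i], reverse=True)[:100]  # top 0.1% by U

print("mean U among the proxy's top picks:", round(statistics.mean(U[i] for i in top_by_proxy), 2))
print("mean U among the actual top picks: ", round(statistics.mean(U[i] for i in top_by_truth), 2))
```

Even with a perfectly honest, unbiased proxy, selecting hard on V gives you noticeably less U than selecting on U itself.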
Second, the divergence you point to in your example, between “a measurement of my internal sense of beauty” and “my internal sense of beauty itself”, is a second Goodhart effect, which is (at least) a causal one, where repeated queries change the estimates due to psychological biases, etc. In that case, there’s also a nasty potential adversarial Goodhart issue, if the AI gets to make the queries and exploits those biases.
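A crude toy model of that causal mechanism (my construction, not anything in the post; the numbers are arbitrary), where the act of querying nudges the quantity being measured:

```python
# Causal-flavored toy: each question slightly shifts the respondent's latent
# preference (anchoring / demand effects), so repeated measurements drift away
# from the quantity they were meant to estimate.
import random

random.seed(2)

latent_preference = 5.0   # the thing we wanted to measure
reports = []
for query in range(50):
    answer = latent_preference + random.gauss(0, 0.2)
    reports.append(answer)
    # Asking the question itself nudges the latent preference a little each time.
    latent_preference += 0.05 * (answer - latent_preference) + 0.02

print("first report:", round(reports[0], 2))
print("last report: ", round(reports[-1], 2))
print("latent preference after 50 queries:", round(latent_preference, 2))
```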
Alternatively, if the initial sample of “your internal sense of beauty” is a fixed sample, there is a sampling and inference issue for the preferences of embedded agents: inferring a continuous, potentially unbounded function from a finite sample. That’s an important and fundamental issue, but it’s only partially about, in this case, extremal Goodhart. It’s also a more general issue about inferring preferences, i.e. learning is hard and this is learning.
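A small sketch of that sampling-and-inference failure mode (again mine, illustrating the shape of the problem rather than any particular inference method): fit a proxy on a finite sample drawn from a limited range, then let an optimizer search far outside that range.

```python
# Extremal-Goodhart-flavored toy: a proxy fit to a finite, bounded sample is fine
# where the data lives and wrong exactly where an optimizer goes looking.
import random

random.seed(3)

def true_value(x):
    # Unknown-to-the-learner true preference, bounded above.
    return -(x ** 2) / 10 + x

# Finite sample from a limited range of experience.
xs = [random.uniform(0, 4) for _ in range(30)]
ys = [true_value(x) + random.gauss(0, 0.1) for x in xs]

# Fit a simple linear proxy by least squares (closed form, no libraries).
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

def proxy(x):
    return slope * x + intercept

# Optimize the proxy over a much wider range than the data covered.
grid = [i / 10 for i in range(0, 201)]          # 0 .. 20
best = max(grid, key=proxy)

print("proxy's favorite point:  ", best)
print("proxy's prediction there:", round(proxy(best), 2))
print("actual value there:      ", round(true_value(best), 2))
```

The fit is perfectly serviceable on the data it saw; the optimizer’s favorite point is exactly where the extrapolation is worst.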
Even with your stated sense of beauty, knowing “this measure can be manipulated in extreme circumstances” is much better than nothing.
And we probably know quite a bit more; I’ll continue this investigation, adding more information.