I think you want to differentiate between different mechanisms for Goodhart’s law. The categorization that Scott Garrabrant put together, and that I worked with him on refining, is here: https://arxiv.org/abs/1803.04585
Given that, I see several different things going on.
First, if I read the post correctly, Stuart is discussing regressional Goodhart, in this case the general issue of what Greg Lewis called “The Tails Come Apart”. This occurs whether or not the true value function is known. (As a historical note, this is a broader and, as Scott pointed out, a more fundamentally unavoidable claim than either what Goodhart meant, or what Campbell was referring to.)
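To make the regressional case concrete, here is a minimal simulation sketch. It assumes a toy model that is not in the original post: each option has a normally distributed true value, and the proxy is that value plus independent normal noise. The function name, parameters, and distributions are all illustrative choices.

```python
import random

def simulate_regressional_goodhart(n=10000, noise_sd=1.0, top_k=10, seed=0):
    """Select the top_k options by proxy and compare proxy vs. true value.

    Assumed toy model: true value ~ N(0, 1), proxy = true value + N(0, noise_sd).
    """
    rng = random.Random(seed)
    true_values = [rng.gauss(0.0, 1.0) for _ in range(n)]
    proxies = [v + rng.gauss(0.0, noise_sd) for v in true_values]

    # Optimize on the proxy: keep the indices of the top_k proxy scores.
    selected = sorted(range(n), key=lambda i: proxies[i], reverse=True)[:top_k]

    mean_proxy = sum(proxies[i] for i in selected) / top_k
    mean_true = sum(true_values[i] for i in selected) / top_k
    return mean_proxy, mean_true

mean_proxy, mean_true = simulate_regressional_goodhart()
print(f"mean proxy of selected options: {mean_proxy:.2f}")
print(f"mean true value of selected options: {mean_true:.2f}")
```

In this toy setup the selected options still have high true value, but noticeably less than their proxy scores suggest, because selecting on the proxy also selects for favorable noise. Nothing here depends on being wrong about the value function itself, only on the proxy being an imperfect correlate.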
Second, your example of “a measurement of my internal sense of beauty rather than my internal sense of beauty itself” points to a potential divergence between the two, which is a second Goodhart effect. It is (at least) a causal one, where repeated queries change the estimates due to psychological biases, etc. In that case, there’s also a nasty potential adversarial Goodhart issue, if the AI gets to make the queries and exploits those biases.
Alternatively, if the initial sample of “your internal sense of beauty” is a fixed sample, there is a sampling and inference issue for the preferences of embedded agents: inferring a continuous, potentially unbounded function from a finite sample. That’s an important and fundamental issue, but it’s only partially about, in this case, extremal Goodhart. It’s also a more general issue about inferring preferences, i.e., learning is hard and this is learning.
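As a small illustration of the finite-sample problem, the sketch below builds two hypothetical candidate functions that agree exactly on every sampled point yet diverge far outside the sampled range. The sample points and both functions are made up for illustration and are not taken from the post.

```python
# Hypothetical finite sample of queried points.
SAMPLE_POINTS = [0, 1, 2, 3]

def hypothesis_a(x):
    """One candidate value function: simply linear."""
    return x

def hypothesis_b(x):
    """A second candidate: the added term is zero at every sampled point."""
    return x + x * (x - 1) * (x - 2) * (x - 3)

# Both candidates fit the sample perfectly.
agree_on_sample = all(hypothesis_a(x) == hypothesis_b(x) for x in SAMPLE_POINTS)

# Far from the sample, they disagree by a large margin.
gap_far_from_sample = hypothesis_b(10) - hypothesis_a(10)

print(agree_on_sample)
print(gap_far_from_sample)
```

No amount of fitting to the fixed sample distinguishes the two candidates, so an optimizer pushed toward extreme inputs has to rely on whatever prior picked one over the other.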