When I talk about shard theory, people often seem to shrug and go “well, you still need to get the values perfect else Goodhart; I don’t see how this ‘value shard’ thing helps.”
I realize you are summarizing a general vibe from multiple people, but I want to note that this is not what I said. The most relevant piece from my comment is:
I don’t buy this as stated; just as “you have a literally perfect overseer” seems theoretically possible but unrealistic, so too does “you instill the direct goal literally exactly correctly”. Presumably one of these works better in practice than the other, but it’s not obvious to me which one it is.
In other words: Goodhart is a problem with values-execution, and it is not clear which of values-execution and grader-optimization degrades more gracefully. In particular, I don’t think you need to get the values perfect. I just also don’t think you need to get the grader perfect in grader-optimization paradigms, and am uncertain about which one ends up being better.
I understand this to mean “Goodhart is and historically has been about how an agent with different values can do bad things.” I think this isn’t true. Goodhart concepts were coined within the grader-optimization/argmax/global-objective-optimization frame:
Throughout the post, I will use V to refer to the true goal and use U to refer to a proxy for that goal which was observed to correlate with V and which is being optimized in some way.
This cleanly maps onto the grader-optimization case, where U is the grader and V is some supposed imaginary “true set of goals” (which I’m quite dubious of, actually).
This doesn’t cleanly map onto the value shard case. The AI’s shards cannot be U, because they aren’t being optimized. The shards do the optimizing.
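To make the structural point concrete, here’s a minimal sketch in Python (toy code; the names are mine and purely illustrative). The only thing it’s meant to show is where the evaluation sits relative to the search:

```python
# Toy structural sketch of the mapping; hypothetical names, not anyone's actual API.

def grader_optimization(plan_space, grader):
    """The grader plays the role of U: it is the thing being optimized.
    Any gap between the grader and the 'true' V is itself a target of the search."""
    return max(plan_space, key=grader)


class ShardAgent:
    """The shard is not U, because nothing optimizes against it.
    It is part of the agent: it shapes which plans get generated and picks among them."""

    def __init__(self, value_shard, propose):
        self.value_shard = value_shard   # the shard's contextual evaluation of a plan
        self.propose = propose           # shard-influenced plan generation

    def act(self, n_candidates=8):
        candidates = [self.propose() for _ in range(n_candidates)]
        return max(candidates, key=self.value_shard)
```

In the first function, the grader is the search’s target, so every place where the grader diverges from V is something the search can find and exploit. In the second, the shard’s evaluation only ever meets the handful of plans the agent itself generated; nothing is hunting for its mistakes.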
So now we run into a different regime of problems AFAICT, and I push back against calling this “Goodhart.” For one, extremal Goodhart has a “global” character, where imperfections in U get blown up in the exponentially-sized plan space where many adversarial inputs lurk. Saying “value shards are vulnerable to Goodhart” makes me anticipate the wrong things. It makes me anticipate that if the shards are “wrong” (whatever that means) in some arcane situation, the agent will do a bad thing by exploiting the error. As explained in this post, that’s just not how values work, but it is how grader-optimizers often work.
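To illustrate that “global” character, here’s a toy simulation (all distributions and numbers invented for illustration, not a claim about any particular system): the same imperfect evaluation U is mostly harmless when it only judges a few self-generated candidates, but argmax over a huge plan space reliably digs up exactly the inputs on which U is most wrong.

```python
import random

random.seed(0)

# V: what we "really" want from a plan.  U: an imperfect evaluation of it.
# A tiny fraction of plans are adversarial inputs on which U wildly overrates the plan.
def sample_plan():
    v = random.gauss(0, 1)
    err = 100.0 if random.random() < 1e-4 else random.gauss(0, 0.1)
    return {"v": v, "u": v + err}

# Grader-optimization: argmax U over a huge plan space.  With a million plans,
# adversarial inputs are almost certainly present, and the search finds one,
# so the chosen plan's U score says almost nothing about its V.
big_space = [sample_plan() for _ in range(1_000_000)]
best = max(big_space, key=lambda p: p["u"])
print(best["u"], best["v"])   # U around 100+, V unremarkable

# The same imperfect evaluation applied to a handful of self-generated candidates:
# the adversarial tail essentially never gets sampled, so U and V stay close.
few = [sample_plan() for _ in range(10)]
best = max(few, key=lambda p: p["u"])
print(best["u"], best["v"])   # both small, and close to each other
```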
While it’s worth considering what value-perturbations are tolerable versus what grader-perturbations are tolerable, I don’t think it makes sense to describe both risk profiles with “Goodhart problems.”
it is not clear which of values-execution and grader-optimization degrades more gracefully. In particular, I don’t think you need to get the values perfect. I just also don’t think you need to get the grader perfect in grader-optimization paradigms, and am uncertain about which one ends up being better.
I changed “perfect” to “robust” throughout the text. Values do not have to be “robust” against an adversary’s optimization, in order for the agent to reliably e.g. make diamonds. The grader does have to be robust against the actor, in order to force the actor to choose an intended plan.