… and even then, that argument would only work if at least one shard exactly matches what we intended. If all of the shards are imperfect proxies, then most likely there are actions which can Goodhart all of them simultaneously. (After all, proxy failure is not something we’d expect to be uncorrelated—conditions which cause one proxy to fail probably cause many to fail in similar ways.)
I read this as “the activations and bidding behaviors of the shards will themselves be imperfect, so you get the usual ‘Goodhart’ problem where highly rated plans are systematically bad and not what you wanted.” I disagree with the conclusion, at least for many kinds of “imperfections.”
Below is one shot at instantiating the failure mode you’re describing. I wrote this story so as to (hopefully) contain the relevant elements. This isn’t meant as a “slam dunk case closed”, but hopefully something which helps you understand how I’m thinking about the issue and why I don’t anticipate “and then the shards get Goodharted.”
Example shard-Goodharting scenario. The AI bids for plans which it thinks lead to diamonds, except that the subcircuit of the policy network which computes the relevant diamond abstraction is only a “proxy” for a reliable diamond abstraction. Unknown to the AI until the end of its training, that subcircuit (for some reason) activates very strongly for plans which lead to certain diamond-shaped formations of bacteria on the third Tuesday of the year.
Then this shard can be “goodharted” by actions which create these bacteria diamonds at that time. There’s a question, though, of whether the AI will actually consider such a plan (so that it then actually bids on that plan, which is rated spuriously highly from our perspective). The AI knows, abstractly, that considering this plan would lead it to bid for that plan. But it seems to me that, since generating that plan is reflectively predicted not to lead to diamonds (nor does it activate the specific bidding-behavior edge case the agent abstractly knows about), the agent doesn’t pursue that plan.
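To make the contrast concrete, here is a toy sketch of the scenario above. It is only an illustration under strong assumptions, not a model of how real shards work: the `Plan` fields, the `shard_bid` function, the spike value of 1000, and the rule "only consider plans reflectively predicted to yield diamonds" are all hypothetical stand-ins. In particular, the filter in `values_steered_agent` is assumed rather than derived.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    predicted_diamonds: float   # agent's reflective prediction of diamonds produced
    triggers_edge_case: bool    # hits the spurious "bacteria diamond" activation


def shard_bid(plan: Plan) -> float:
    """Hypothetical imperfect diamond-shard: usually tracks predicted diamonds,
    but spikes on the edge case (an assumed failure mode, value chosen arbitrarily)."""
    return 1000.0 if plan.triggers_edge_case else plan.predicted_diamonds


PLANS = [
    Plan("mine diamonds", 10.0, False),
    Plan("build diamond factory", 50.0, False),
    Plan("grow bacteria diamonds on the third Tuesday", 0.0, True),
]


def grader_optimizer(plans):
    """Evaluates every plan in full and picks the highest bid.
    Here the shard's output is itself the optimization target."""
    return max(plans, key=shard_bid)


def values_steered_agent(plans):
    """Only fully evaluates plans it reflectively predicts will lead to diamonds.
    ASSUMPTION: the agent can make this prediction without running the
    edge-case-prone evaluation on the plan itself."""
    considered = [p for p in plans if p.predicted_diamonds > 0]
    return max(considered, key=shard_bid, default=None)


print(grader_optimizer(PLANS).name)
print(values_steered_agent(PLANS).name)
```

In this toy setup, the exhaustive search lands on the bacteria plan because it searches directly over the shard's outputs, while the values-steered agent never fully evaluates that plan and so settles on the factory. The sketch does not show that real agents would have such a filter; it only shows what the claim amounts to if they do.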
Nonrobust decision-influences can be OK. A candy-shard contextually influences decision-making. Many policies lead to acquiring lots of candy; the decision-influences don’t have to be “globally robust” or “perfect.”
Values steer optimization; they are not optimized against. The value shards aren’t getting optimized hard. The value shards are the things which optimize hard, by wielding the rest of the agent’s cognition (e.g. the world model, the general-purpose planning API).
Since values are not the optimization target of the agent with those values, the values don’t have to be adversarially robust.
Since values steer cognition, reflective agents try to avoid adversarial inputs to their own values. In self-reflective agents which can think about their own thinking, values steer e.g. what plans get considered next. Therefore, these agents convergently avoid adversarial inputs to their currently activated values (e.g. learning), because adversarial inputs would impede fulfillment of those values (e.g. lead to less learning).
This raises the question: “So what is an ‘adversarial input’ to the values, then? What intensional rule governs the kinds of high-scoring plans which internal reasoning will decide not to evaluate in full?” I haven’t answered that question yet on an intensional basis, but it seems tractable.
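This was one of the main ideas I discussed in Alignment allows “nonrobust” decision-influences and doesn’t require robust grading, which is the source of the four summary points above (beginning with “Nonrobust decision-influences can be OK”).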