If there’s 50% probability on a paperclip-maximizing utility function and 50% on a staple-maximizing one, there’s not really any optimization pressure toward satisfying both.
As you say, there’s no reason to make 50% of the universe into paperclips; that’s just not what 50% probability on paperclips means.
It could be that there’s a sorta-paperclip-sorta-staple (let’s say ‘stapleclip’ for short), which the AGI will be motivated to find in order to get a moderately high rating according to both hypotheses.
However, it could be that trying to be both a paperclip and a staple at the same time reduces overall efficiency. Maybe the most efficient nanometer-scale stapleclip is significantly larger than the most efficient paperclip or staple, as a result of having to represent the critical features of both paperclips and staples. In this case, the AGI will prefer to gamble, tiling the universe with whichever pure design is most efficient and giving no consideration at all to the other hypothesis.
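To put toy numbers on the gamble (linear utilities and the specific figures here are illustrative stand-ins, not anything established above): suppose a unit of matter can become one optimal paperclip, one optimal staple, or a stapleclip that scores $r \le 1$ under each of the two utility functions. Then

$$\mathbb{E}U(\text{all paperclips}) = 0.5 \cdot 1 + 0.5 \cdot 0 = 0.5, \qquad \mathbb{E}U(\text{all stapleclips}) = 0.5 \cdot r + 0.5 \cdot r = r,$$

so the pure gamble beats the compromise exactly when $r < 0.5$: 50/50 uncertainty only favors the stapleclip if being both at once costs less than half the per-unit value.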
That’s the essence of my concern: uncertainty between possibilities does not particularly push toward jointly maximizing the possibilities. At least, not without further assumptions.
That’s all basically right, but if we’re sticking to causal Goodhart, the “without further assumptions” may be where we differ. I think that if the uncertainty is over causal structures, then optimizing under the “correct” structure is more likely to increase all of the metrics than optimizing under most of the alternatives.
(I’m uncertain exactly how to set this up, but) it would be interesting to explore this over causal graphs, where a system has control over a random subset of nodes and a metric correlated with the unobservable goal is chosen. In most cases, I’d expect that to lead to causal Goodhart quickly. But if the set of nodes that could serve as the metric includes some that directly cause the goal, and others that can be intercepted to create causal Goodhart, then uncertainty over the metric should lead to less causal Goodharting, since targeting the actual cause should improve all of the correlated metrics, while the reverse is not true. A rough sketch of one possible setup is below.
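I’m not sure this matches the setup you have in mind, but here is a minimal sketch of one way to pose it: a random linear-Gaussian SEM over a DAG, with the last node as the unobservable goal, candidate metrics drawn from nodes correlated with the goal, and an agent that nudges whichever controllable node most raises its metric (or the average over candidate metrics, when it is uncertain which one counts). All of the specific modeling choices (linear effects, additive unit interventions, effect-size scoring rather than full planning, the 0.3 correlation cutoff) are stand-ins of my own.

```python
import numpy as np

rng = np.random.default_rng(0)

def total_effect(W, src, dst):
    """Effect on node dst of an intervention that adds +1 to node src, in a
    linear SEM whose weight matrix has W[i, j] = direct effect of i on j.
    (I - W)^-1 = I + W + W^2 + ... sums edge-weight products over all paths."""
    n = W.shape[0]
    return np.linalg.inv(np.eye(n) - W)[src, dst]

def correlations_with(W, target):
    """Correlation of every node with `target` when x = W^T x + eps, eps ~ N(0, I)."""
    n = W.shape[0]
    A = np.linalg.inv(np.eye(n) - W.T)          # x = A @ eps
    cov = A @ A.T
    sd = np.sqrt(np.diag(cov))
    return cov[:, target] / (sd * sd[target])

def run_trial(n=8, uncertain=False):
    # Random DAG: strictly upper-triangular weights, ~40% edge density.
    W = np.triu(rng.normal(0.0, 1.0, (n, n)), k=1) * (rng.random((n, n)) < 0.4)
    goal = n - 1                                 # unobservable goal: the last node
    corr = correlations_with(W, goal)
    # Candidate metrics: non-goal nodes noticeably correlated with the goal.
    candidates = [i for i in range(goal) if abs(corr[i]) > 0.3]
    if len(candidates) < 2:
        return None
    controllable = rng.choice(goal, size=goal // 2, replace=False)
    metrics = candidates if uncertain else [int(rng.choice(candidates))]

    # The agent pushes the controllable node with the largest (average) effect
    # on its metric(s), in whichever direction raises them.
    def metric_score(c):
        return np.mean([total_effect(W, c, m) for m in metrics])

    best = max(controllable, key=lambda c: abs(metric_score(c)))
    direction = 1.0 if metric_score(best) >= 0 else -1.0
    return direction * total_effect(W, best, goal)   # what actually happened to the goal

for uncertain in (False, True):
    outcomes = [r for _ in range(2000)
                if (r := run_trial(uncertain=uncertain)) is not None]
    print(f"uncertain metric={uncertain}: "
          f"mean goal change {np.mean(outcomes):+.3f}, "
          f"goal unmoved in {np.mean(np.isclose(outcomes, 0.0)):.0%} of trials")
```

The comparison to run would be whether the uncertain-metric trials move the goal more on average and hit the “metric up, goal unmoved” outcome less often, which is roughly the prediction above; I haven’t verified that this toy model actually bears it out.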