See my much shorter and less developed note to a similar effect: https://www.lesswrong.com/posts/QJwnPRBBvgaeFeiLR/uncertainty-versus-fuzziness-versus-extrapolation-desiderata#kZmpMGYGfwGKQwfZs — and I agree that regressional and extremal Goodhart cannot be fixed purely with his solution.
I will, however, defend some of Stuart’s suggestions as they relate to causal Goodhart in a non-adversarial setting (I’m also avoiding the can of worms of game theory). In that setting, both randomization AND mixtures of multiple metrics can address Goodhart-like failures, albeit in different ways. I had been thinking about this in the context of policy (https://mpra.ub.uni-muenchen.de/90649/) rather than AI alignment, but some of the arguments still apply. (One critical argument that doesn’t fully carry over is that “good enough” mitigation raises the cognitive cost of cheating to the point where pursuing the true goal is cheaper. I also noted in the paper that satisficing is useful for limiting misalignment due to metrics, and quantilization seems like one promising approach to satisficing for AGI; see the sketch below.)
The argument for causal Goodhart is that randomization and mixed utilities are both effective in mitigating the causal-structure errors that lead to causal Goodhart in the one-party case. That’s because the failure occurs when uncertainty or mistakes about causal structure lead to choosing metrics that are correlated with the goal rather than causes of it. However, if even some significant fraction or probability weight of the metric is causally connected to the goal in ways that cannot be gamed, that can greatly mitigate this class of failure.
To apply this logic to human utility more concretely: if we mistakenly think that endorphins in the brain are 100% of human goals, an AGI might want to tile the universe with rats on happy drugs, or the moral equivalent. If we instead assign that only 50% weight, or a 50% probability of being the scored outcome, and the other half goes to something that requires a different way of creating what we actually think of as happiness / life satisfaction, the optimum does not simply shift to 50% of the universe tiled with rat brains. This is because the alternative class of hedonium will involve a non-trivial amount of endorphins as well; as long as other solutions produce anywhere close to as many endorphins, they will be preferred. (In this case, admittedly, we got the endorphin goal so wrong that 50% of the universe tiled in rats on drugs is still likely—bad enough utility functions can’t be fixed with either randomization or weighting. But if a causal mistake can be fixed with a probabilistic solution, it seems likely it can also be fixed with a weighting solution, and vice versa.)
If there’s 50% on a paperclip-maximizing utility function and 50% on staples, there’s not really any optimization pressure toward satisfying both.
As you say, there’s no reason to make 50% of the universe into paperclips; that’s just not what 50% probability on paperclips means.
It could be that there’s a sorta-paperclip-sorta-staple (let’s say ‘stapleclip’ for short), which the AGI will be motivated to find in order to get a moderately high rating according to both strategies.
However, it could be that trying to be both a paperclip and a staple at the same time reduces overall efficiency. Maybe the most efficient nanometer-scale stapleclip is significantly larger than the most efficient paperclip or staple, as a result of having to represent the critical features of both paperclips and staples. In this case, the AGI will prefer to gamble, tiling the universe with whatever is most efficient, and giving no consideration at all to the other hypothesis.
That’s the essence of my concern: uncertainty between possibilities does not particularly push toward jointly maximizing the possibilities. At least, not without further assumptions.
That’s all basically right, but if we’re sticking to causal Goodhart, the “without further assumptions” may be where we differ. I think that if the uncertainty is over causal structures, acting on the “correct” structure is more likely to increase all of the metrics than acting on most of the alternatives.
(I’m uncertain how to do this, but) it would be interesting to explore this over causal graphs, where a system has control over a random subset of nodes, and a metric correlated with the unobservable goal is chosen. In most cases, I’d expect that to lead to causal Goodhart quickly; but if the set of nodes potentially used for the metric includes some that directly cause the goal, and others that can be intercepted to create causal Goodhart, then uncertainty over the metric should lead to less causal Goodharting, since targeting the actual cause should improve the correlated metrics, while the reverse is not true.