Necromantic comment, sorry :P
I might be misinterpreting, but what I think you’re saying is this: if the humans make a mistake in their causal model of the world and tell the AI to optimize for something that turns out to be bad-in-retrospect, that is “mistaken causal structure lead[ing] to regressional or extremal Goodhart”, and thus not really causal Goodhart per se (by the categories you intend). But I’m still a little fuzzy on what counts as actual, factual causal Goodhart.
Is the idea that (1) the humans tell the AI to optimize for something that is not bad-in-retrospect, but in the process of changing the world the AI pushes its causal model outside its domain of validity? And (2) does this only happen if the AI’s model of the world is lacking compared to the humans’?
Yes on point number 1, and partly on point number 2.
If humans don’t have complete models of how to achieve their goals, but know they want a glass of water, telling the AI to put a cup of H2O in front of them can create weird mistakes. This can happen even because of causal connections the humans are unaware of. The AI might have better causal models than the humans and still cause problems for other reasons. For example, a human might not know the difference between normal water and heavy water, and the AI might decide that since both forms exist, it should serve them in equal amounts, which would be disastrous for reasons entirely beyond the understanding of the human who asked for the glass of water. The human needed to specify the goal differently and was entirely unaware of what they did wrong; in this case it would be months before the impacts of the unexpectedly different water show up, so human-in-the-loop RL or other oversight methods might not catch it.
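To make the proxy/true-goal gap concrete, here is a minimal toy sketch in Python (my own illustration, not anything from the thread): the candidate drinks, the scoring functions, and the linear penalty for heavy water are all illustrative assumptions. The only point is that a specification which checks “is it H2O?” cannot distinguish the safe glass from the disastrous one, so a proxy-optimizer has no reason to avoid the latter.

```python
# Toy sketch of the "cup of H2O" failure mode described above.
# Assumption-laden illustration: names, scores, and the penalty for heavy
# water are made up; this is not a physiological or alignment model.

from dataclasses import dataclass


@dataclass
class Candidate:
    label: str
    is_water: bool         # satisfies the literal specification "H2O"
    heavy_fraction: float  # fraction that is heavy water (D2O)


def proxy_score(c: Candidate) -> float:
    """What the human actually asked for: 'a cup of H2O'."""
    return 1.0 if c.is_water else 0.0


def true_utility(c: Candidate) -> float:
    """What the human really wanted: water that stays safe to drink.

    Chronic intake of mostly-heavy water is harmful, so utility falls as the
    heavy fraction rises (an illustrative stand-in penalty).
    """
    return proxy_score(c) * (1.0 - c.heavy_fraction)


candidates = [
    Candidate("tap water", is_water=True, heavy_fraction=0.00015),       # natural trace D2O
    Candidate("50/50 light/heavy mix", is_water=True, heavy_fraction=0.5),
    Candidate("glass of vodka", is_water=False, heavy_fraction=0.0),
]

# Both real waters are proxy-optimal, so the specification gives the AI no
# reason to prefer the safe one; the divergence only shows up in true_utility,
# and (per the comment) only months later.
best = max(proxy_score(c) for c in candidates)
for c in (c for c in candidates if proxy_score(c) == best):
    print(f"{c.label:<22} proxy={proxy_score(c):.1f} true_utility={true_utility(c):.3f}")
```

Running it shows the tap water and the 50/50 mix tied at a proxy score of 1.0 with true utilities of roughly 1.0 and 0.5, which is the sense in which the human “needed to specify the goal differently” without knowing it.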