Brainstorming approaches to working with causal Goodhart:
1. Low-impact measures that include the change in the causal structure of the world. It might be possible to form a measure like this that doesn't depend on recovering the true causal structure at any point (i.e., minimizing the difference between the predicted causal structure in state A and in state B, even if both predictions are wrong). See the first sketch after this list.
2. Figure out how to elicit human models of causal structure, provide the human's model of causal structure along with the metric, and have the AI use this information to figure out whether it's violating the assumptions the human made.
3. Causal transparency: have the AI explain the causal structure of how its plans will influence the proxy. This might allow a human to figure out whether the plan will cause the proxy to diverge from the goal. E.g., the true goal is happiness, the proxy is a happiness score as measured by an online psychological questionnaire, and the AI's plan says it will influence the proxy by hacking into the online questionnaire. You don't need to understand how the AI plans to hack into the server to understand that the plan will cause the proxy to diverge from the goal. (A toy check combining 2 and 3 is sketched after the first example below.)
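A minimal sketch of the first idea, assuming the agent has some structure-prediction model available. The `predict_graph` callable is a hypothetical stand-in for that model, and the structural Hamming distance is just one possible way to compare the two predicted graphs:

```python
import numpy as np
from typing import Callable

def causal_impact_penalty(
    state_a: np.ndarray,
    state_b: np.ndarray,
    predict_graph: Callable[[np.ndarray], np.ndarray],
) -> int:
    """Structural Hamming distance between the causal graphs predicted for two states.

    `predict_graph` maps a state to a 0/1 adjacency matrix over a fixed variable set.
    Both predictions can be wrong; the penalty only measures how much the predicted
    structure changes between state A and state B.
    """
    g_a = predict_graph(state_a)
    g_b = predict_graph(state_b)
    return int(np.sum(g_a != g_b))  # count of edges added, removed, or reversed
```

The penalty could then be subtracted from the objective (e.g. reward minus lambda times `causal_impact_penalty(s_before, s_after, structure_model)`); the point is only that the measure compares two possibly-wrong predictions rather than requiring recovery of the true graph.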
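And a toy sketch of how ideas 2 and 3 might fit together, where the human-supplied causal assumptions are a set of directed edges and the AI's "explanation" is the causal path its plan uses to reach the proxy. The graph encoding, node names, and path format are all illustrative assumptions:

```python
# Check an AI-explained causal path against a human-supplied causal model.
HumanModel = set[tuple[str, str]]  # assumed directed edges: (cause, effect)

def violates_human_model(explained_path: list[str], human_model: HumanModel) -> bool:
    """Flag a plan whose explained path to the proxy uses a causal edge
    that the human's model does not contain."""
    edges_used = zip(explained_path, explained_path[1:])
    return any(edge not in human_model for edge in edges_used)

# The human's assumptions: the plan should act on happiness, and the
# questionnaire score should track happiness.
human_model: HumanModel = {
    ("intervention", "happiness"),
    ("happiness", "questionnaire_score"),
}

# Plan A influences the proxy through the goal; plan B hacks the questionnaire server.
plan_a = ["intervention", "happiness", "questionnaire_score"]
plan_b = ["intervention", "server_access", "questionnaire_score"]

assert not violates_human_model(plan_a, human_model)  # consistent with the human model
assert violates_human_model(plan_b, human_model)      # flagged: unexpected causal route
```

You don't need to evaluate whether the hack would succeed to flag plan B; the explained path already departs from the causal route the human assumed.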
These are interesting ideas. I’m not sure I understand what you mean by the first; causal structure can be arbitrarily complex, so I’m unsure how to mitigate across the plausible structures. (It seems to be an AIXI-like problem.)
Ideas 2 and 3, however, require that humans understand the domain, and too often in existing systems we do not. A superhuman AI might be better than us at this, but if causal understanding scales more slowly than capability, it would still fail.