In other words, human preferences have a causal structure; can we learn its concepts and their causal relations?
[...]
Since I am not aware of anyone trying to use these techniques in AI Safety, I am not fully sure what particular sub-problem you propose to address (causal learning of the human reward function? Something different?), but here are some references to recent work you may find interesting:
Two recent workshops at NeurIPS 2021:

- Algorithmic Fairness through the Lens of Causality and Robustness, which is about human preferences in the area of fairness.
- Causal Inference & Machine Learning: Why now?, which is about using causal techniques more broadly in AI.
I have not read any of the papers in the above workshops yet, but I mention them as likely places where people discuss the latest status of the type of techniques you may be considering. Work I have written/read:
I have recently speculated in Demanding and Designing Aligned Cognitive Architectures, section 8, that counterfactuals, as in Pearl Causal Model counterfactuals, may actually play a major role in defining human values and human social contracts (a minimal sketch of such a counterfactual query appears below).
Along the same lines, Agent Morality via Counterfactuals in Logic Programming shows an example where a type of morality not directly related to computational fairness is encoded as a counterfactual.
Much of the above work is about hand-constructing a type of machine reasoning that is closer to human values, but the artefacts being hand-constructed might conceivably also be constructed by an ML process that leverages a certain training set in a certain way.
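To make concrete what a counterfactual in a Pearl Causal Model involves, here is a minimal, self-contained sketch of the standard three-step procedure (abduction, action, prediction). The tiny action-causes-harm model, its variable names, and its structural equations are all invented for illustration; nothing here is taken from the papers above.

```python
# Minimal sketch of a Pearl-style counterfactual query:
# "the agent acted and harm occurred; had the agent NOT acted,
# would harm still have occurred?"  This is the kind of "but for"
# judgement that counterfactual accounts of morality work with.
# The model below is a hypothetical illustration, not from the papers.

def f_action(u_intent):
    # Structural equation: the agent acts iff its exogenous intent is present.
    return u_intent

def f_harm(action, u_fragile):
    # Structural equation: harm occurs iff the agent acts AND the
    # (exogenous) environment happens to be fragile.
    return action and u_fragile

def counterfactual_harm(observed_action, observed_harm, forced_action):
    """Had the action been `forced_action`, would harm have occurred?"""
    # Step 1, abduction: keep only the exogenous settings that are
    # consistent with what was actually observed.
    consistent_u = [
        (u_intent, u_fragile)
        for u_intent in (False, True)
        for u_fragile in (False, True)
        if f_action(u_intent) == observed_action
        and f_harm(observed_action, u_fragile) == observed_harm
    ]
    # Step 2, action: replace the action equation with the constant
    # `forced_action` (Pearl's do-operator), severing its causes.
    # Step 3, prediction: recompute harm under each consistent setting.
    return [f_harm(forced_action, u_fragile) for _, u_fragile in consistent_u]

# Factually: the agent acted and harm occurred.  Counterfactually,
# in every exogenous world consistent with those facts, no harm
# occurs without the action:
print(counterfactual_harm(observed_action=True, observed_harm=True,
                          forced_action=False))   # -> [False]
```

The deterministic Boolean model keeps the abduction step a simple enumeration; in a probabilistic structural causal model, the same three steps would instead update a distribution over the exogenous variables.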
Hey Koen,
Thanks a lot for the pointers! The literature I am most aware of is https://crl.causalai.net/, https://githubmemory.com/repo/zhijing-jin/Causality4NLP_Papers, and Bernhard Schölkopf's webpage.