I like the direction you’re going with this. I agree that causal reasoning is necessary, but not sufficient, for getting alignable TAI. I think just getting step 2 done (Summarize the literature on causality from an AI safety perspective) could have a huge impact by creating a very helpful resource that AI alignment researchers can pull insights from.
Our worry is that ML researchers, once they figure out how, will introduce a similar “overidentifying causality” inductive bias into their models. This would mean that very powerful models with potentially large impacts would have the causal model of a political pundit rather than that of a scientist.
One possible workaround for this would be to take a Bayesian approach. Bayes’ rule is all about comparing the predictions (likelihoods) of different models (hypotheses) to assign higher probability mass to those models with greater predictive power.
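Written out for this setting (just restating Bayes’ rule over a hypothesis set): if $M_1, \dots, M_k$ are the candidate causal structures and $D$ is the data accumulated so far (both observational and interventional), then

$$P(M_i \mid D) \;=\; \frac{P(D \mid M_i)\, P(M_i)}{\sum_{j} P(D \mid M_j)\, P(M_j)},$$

so the structures whose likelihoods $P(D \mid M_i)$ track the data best end up with most of the posterior mass.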
Consider a system that uses an ensemble of differently structured causal models, each containing slots for different factors (e.g., {A, B} (no causal relationship), {A → B}, {A ← B}, {A ← C → B}, …). Then for any given phenomenon, the system could feed all the relevant factors into the slots of each model’s causal graph, then use each model to make predictions, both about passive observations and about the results of interventions. The causal (or acausal) models with the greatest predictive power would win out once enough evidence has accumulated.
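As a toy illustration of what that ensemble could look like (a minimal sketch in Python, not a proposal for how a real system would do it: the three candidate structures, the variable names A and B, and all parameter values below are made up for the example), here is Bayesian model comparison over {A ⫫ B, A → B, B → A}, where the two directed structures are observationally equivalent but come apart once interventional data arrives:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth used only to generate toy data: A -> B,
# with P(A=1) = 0.5, P(B=1|A=0) = 0.2, P(B=1|A=1) = 0.8.
def sample(n, do_a=None):
    """Draw n (a, b) pairs; if do_a is not None, A is set by intervention."""
    a = rng.integers(0, 2, n) if do_a is None else np.full(n, do_a)
    b = (rng.random(n) < np.where(a == 1, 0.8, 0.2)).astype(int)
    return list(zip(a.tolist(), b.tolist()))

def bern(p, x):
    return p if x == 1 else 1.0 - p

# Parameters are assumed known so the sketch stays short (in practice they would be
# learned or integrated out); they are chosen so that "A -> B" and "B -> A" imply
# exactly the same observational joint distribution.
P_A, P_B = 0.5, 0.5
P_B_GIVEN_A = {0: 0.2, 1: 0.8}
P_A_GIVEN_B = {0: 0.2, 1: 0.8}

def lik_indep(a, b, intervened):
    # {A, B} with no edge: A and B independent; do(A) just drops A's own factor.
    return (1.0 if intervened else bern(P_A, a)) * bern(P_B, b)

def lik_a_to_b(a, b, intervened):
    # A -> B: B listens to A whether A was observed or set by intervention.
    return (1.0 if intervened else bern(P_A, a)) * bern(P_B_GIVEN_A[a], b)

def lik_b_to_a(a, b, intervened):
    # B -> A: observationally factorises as P(B)P(A|B); under do(A) the edge into A
    # is cut, so B keeps its marginal and A's factor is dropped.
    if intervened:
        return bern(P_B, b)
    return bern(P_B, b) * bern(P_A_GIVEN_B[b], a)

HYPOTHESES = {"A _||_ B": lik_indep, "A -> B": lik_a_to_b, "B -> A": lik_b_to_a}

def posterior(data):
    """data: list of (a, b, intervened); uniform prior over the three hypotheses."""
    log_post = {name: 0.0 for name in HYPOTHESES}
    for a, b, intervened in data:
        for name, lik in HYPOTHESES.items():
            log_post[name] += np.log(lik(a, b, intervened))
    z = np.logaddexp.reduce(list(log_post.values()))
    return {name: round(float(np.exp(lp - z)), 3) for name, lp in log_post.items()}

obs = [(a, b, False) for a, b in sample(500)]
ints = [(a, b, True) for a, b in sample(250, do_a=1)] + \
       [(a, b, True) for a, b in sample(250, do_a=0)]

print("observations only:  ", posterior(obs))         # A->B and B->A stay tied
print("plus interventions: ", posterior(obs + ints))   # A->B pulls ahead
```

The point of the toy is just the last two lines: the two causal directions stay exactly tied on passive observations alone, and only the do(A) data lets the Bayesian update separate them.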
Of course, the question still remains of how the system chooses which factors are relevant, and of how it decides what kind of state transformation each causal arrow induces. But I think the general idea of multiple-hypothesis testing should be sufficient to get any causally reasoning AI to think more like a scientist than a pundit.