In order to allow the system to disagree with a human, we need to use some metric other than “is simple and assigns high probability to human judgments”.
Right: and the metric I would propose is “counterfactual-prediction power”, or in other words, the power to predict well in a causal fashion: to be able to answer counterfactual questions, or to predict well when we deliberately vary the experimental conditions.
To give a simple example: I train a system to recognize cats, but my training data contains only tabbies. What I want is a model that, while it may concentrate more probability on a tabby cat-like thingy being a cat than on a non-tabby cat-like thingy, will still predict appropriately if I actually condition it on “but what if cats weren’t tabby by nature?”.
I think you said you’re a follower of the probabilistic programming approach, and in terms of being able to condition those models on counterfactual parameterizations and make predictions, I think they’re very much on the right track.
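To make the tabby example concrete, here is a minimal hand-rolled sketch of the kind of conditioning being described, written in plain Python rather than in any particular probabilistic programming language. The generative story, the tabby_rate parameter, and all of the numbers are illustrative assumptions, not anything taken from the discussion itself.

```python
import random

# Toy sketch: a generative model that factors "being a cat" apart from
# "how cats tend to be patterned", so we can intervene on the latter.
# Everything here (names, rates, priors) is an illustrative assumption.

def generative_model(tabby_rate, p_cat=0.5, rng=random):
    """Sample (is_cat, looks_tabby) from a toy generative story.

    is_cat      ~ Bernoulli(p_cat)
    looks_tabby ~ Bernoulli(tabby_rate) if it is a cat,
                  Bernoulli(0.1)        otherwise (few non-cats look tabby)
    """
    is_cat = rng.random() < p_cat
    if is_cat:
        looks_tabby = rng.random() < tabby_rate
    else:
        looks_tabby = rng.random() < 0.1
    return is_cat, looks_tabby


def p_cat_given_pattern(looks_tabby, tabby_rate, n=100_000, seed=0):
    """Estimate P(is_cat | pattern) by rejection sampling under a given
    tabby_rate, i.e. by conditioning the model rather than a fixed dataset."""
    rng = random.Random(seed)
    cat_count = total = 0
    for _ in range(n):
        is_cat, sampled_pattern = generative_model(tabby_rate, rng=rng)
        if sampled_pattern == looks_tabby:
            total += 1
            cat_count += is_cat
    return cat_count / total if total else float("nan")


# "Training world": essentially every cat is a tabby, so a non-tabby
# animal is strong evidence against cat-hood (about 0.01 here).
print(p_cat_given_pattern(looks_tabby=False, tabby_rate=0.99))

# Counterfactual question: "but what if cats weren't tabby by nature?"
# We intervene on tabby_rate and re-ask the same query; the non-tabby
# observation no longer counts much against being a cat (about 0.47 here).
print(p_cat_given_pattern(looks_tabby=False, tabby_rate=0.2))
```

The point of the sketch is that, because the model separates the cat/non-cat variable from the coloring-given-cat parameter, the second query can change that parameter and re-answer the same question, which a purely associational fit to the tabby-only training set could not do.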