rusheb

rusheb 18 Jan 2023 13:29 UTC
on: Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
"If your hypothesis predicts that model performance will be preserved if you swap the input to any other input which has a particular property, but no other inputs in the dataset have that property, causal scrubbing can’t test your hypothesis."
Would it be possible to make interventions which we expect not to preserve the model’s behaviour, and assert that the behaviour does in fact change?
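To make the question concrete, here is a minimal sketch of such a "negative control", under loud assumptions: `toy_model`, `has_property`, and the tiny dataset are all hypothetical stand-ins, and the swap is applied to whole inputs rather than to internal activations, so this is not the causal scrubbing algorithm from the post. It only illustrates the shape of the check: swaps that respect the hypothesised property should leave behaviour unchanged, while swaps that violate it should measurably change behaviour.

```python
import random

def toy_model(x):
    # Hypothetical stand-in for a network: output depends only on the sign of x[0].
    return 1.0 if x[0] > 0 else 0.0

def has_property(x):
    # The property that the (assumed) hypothesis says is all that matters.
    return x[0] > 0

def same_property(x, dataset, rng):
    # Intervention the hypothesis predicts is behaviour-preserving.
    return rng.choice([y for y in dataset if has_property(y) == has_property(x)])

def different_property(x, dataset, rng):
    # Negative control: intervention the hypothesis predicts is NOT preserving.
    return rng.choice([y for y in dataset if has_property(y) != has_property(x)])

def mean_abs_change(model, dataset, pick_replacement, rng):
    # Average change in model output when each input is swapped for a replacement.
    total = 0.0
    for x in dataset:
        total += abs(model(x) - model(pick_replacement(x, dataset, rng)))
    return total / len(dataset)

rng = random.Random(0)
# Ten inputs: x[0] runs over -4.5, -3.5, ..., 4.5; x[1] is irrelevant noise.
dataset = [(i - 4.5, rng.uniform(-1, 1)) for i in range(10)]

preserved = mean_abs_change(toy_model, dataset, same_property, rng)
broken = mean_abs_change(toy_model, dataset, different_property, rng)

assert preserved == 0.0  # predicted-safe swaps leave behaviour unchanged
assert broken > 0.0      # predicted-unsafe swaps do change behaviour
```

Note that this check needs inputs on both sides of the property, so it does not by itself help when no other input in the dataset shares the property; and a change in behaviour under the negative control is only weak evidence for the hypothesis, since many wrong hypotheses would predict it too.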