Buck comments on Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

Buck 21 Jan 2023 20:15 UTC
Something like this might be a good idea :). We’ve thought about various ideas along these lines. The basic problem is that in such cases you might be taking the model importantly off-distribution, so it seems to me that your test might fail even if the hypothesis is a correct explanation of how the model works on-distribution.