Causal Scrubbing: My main problem with causal scrubbing as a solution here is that it only guarantees the sufficiency of your explanation, not its necessity. As a result, my understanding is that a causal-scrubbing-based evaluation would admit a trivial explanation that simply asserts that the entire model is relevant for every behavior.
Redwood has been experimenting with learning (via gradient descent) causal scrubbing explanations that somewhat address your necessity point. Specifically, as sketched below:
“Larger” explanations are penalized more (here, size is the number of residual-stream dimensions the explanation claims the model uses for a specific behavior).
Explanations must be adversarially robust: an adversary shouldn’t be able to include additional parts of the model that we claimed were unimportant and thereby have a sizable effect on the scrubbed model’s predictions.
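For concreteness, here is a minimal toy sketch of the shape such an objective could take (illustrative only, not our actual implementation: the masking scheme, coefficients, and final check are all stand-ins). The "explanation" is a learned soft mask over residual dimensions; dimensions the explanation omits are resample-ablated, and the loss trades scrubbed-task performance against explanation size:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d_in, d_resid = 256, 8, 32

# Toy stand-in for a model: embed -> residual activations -> readout.
embed = nn.Linear(d_in, d_resid)
readout = nn.Linear(d_resid, 2)
x = torch.randn(n, d_in)
y = (x[:, 0] > 0).long()  # the "behavior" the explanation should capture

# Pretrain the toy model on the behavior, then freeze it.
model_opt = torch.optim.Adam([*embed.parameters(), *readout.parameters()], lr=1e-2)
for _ in range(500):
    model_opt.zero_grad()
    nn.functional.cross_entropy(readout(embed(x)), y).backward()
    model_opt.step()
for p in [*embed.parameters(), *readout.parameters()]:
    p.requires_grad_(False)

# Learn the explanation: a soft mask over residual dimensions. Masked-out
# dimensions are resample-ablated (replaced with activations from a permuted
# batch), and larger explanations pay a size penalty.
mask_logits = nn.Parameter(torch.zeros(d_resid))
mask_opt = torch.optim.Adam([mask_logits], lr=0.1)
size_coeff = 0.05  # strength of the "smaller is better" pressure

for _ in range(300):
    mask = torch.sigmoid(mask_logits)
    resid = embed(x)
    scrub_src = resid[torch.randperm(n)]           # resampling-ablation source
    scrubbed = mask * resid + (1 - mask) * scrub_src
    task_loss = nn.functional.cross_entropy(readout(scrubbed), y)
    loss = task_loss + size_coeff * mask.sum()     # penalize larger explanations
    mask_opt.zero_grad()
    loss.backward()
    mask_opt.step()

# Crude stand-in for the adversarial-robustness check: restore every dimension
# the explanation called unimportant and see how much the predictions move.
with torch.no_grad():
    hard = (torch.sigmoid(mask_logits) > 0.5).float()
    resid = embed(x)
    scrub_src = resid[torch.randperm(n)]
    scrubbed_pred = readout(hard * resid + (1 - hard) * scrub_src)
    unscrubbed_pred = readout(resid)
    print("dims kept:", int(hard.sum()))
    print("max logit shift from restoring 'unimportant' dims:",
          (unscrubbed_pred - scrubbed_pred).abs().max().item())
```

In a real setup the adversary would also be learned (searching for the worst-case subset of "unimportant" components to restore) rather than restoring everything at once as this toy check does.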
This approach doesn’t address all the concerns one might have with using causal scrubbing to understand models, but I just wanted to flag that this is something we’re thinking about as well.