Resample ablation is not more expensive than mean ablation (both just replace activations with different values). But to answer the question: I think you would, since resample ablation biases the model toward some particular corrupt output.
Ah, I guess I was incorrectly imagining a more expensive version of resample ablation where you look at not just a single corrupted cache but at the result across all corrupted inputs. That is, in the simple toy model where you're computing $f(x, y)$, with $x$ the values for the circuit you care about and $y$ the cache of corrupted activations, mean ablation computes $f(x, \mathbb{E}_{y \sim D}[y])$, and we could imagine versions of resample ablation that compute $f(x, y)$ for some single $y$ drawn from $D$, or that compute $\mathbb{E}_{y \sim D}[f(x, y)]$. I would say that mean ablation and resample ablation as I'm imagining you're describing it are both attempts to cheaply approximate $\mathbb{E}_{y \sim D}[f(x, y)]$.
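For concreteness, here is a minimal sketch of the three quantities. The function `f` and the distribution `D` are hypothetical stand-ins (not from any real interpretability library), just to make the distinction between $f(x, \mathbb{E}[y])$, a single-sample $f(x, y)$, and $\mathbb{E}_{y \sim D}[f(x, y)]$ executable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: x is the clean activations for the circuit of
# interest, D is a batch of corrupted caches y ~ D, f combines the two.
x = np.array([1.0, 2.0])
D = rng.normal(size=(1000, 2))  # samples of corrupted caches

def f(x, y):
    # Toy nonlinear readout; any nonlinearity makes the quantities differ.
    return np.tanh(x @ y)

# Mean ablation: evaluate f at the average corrupted cache, f(x, E[y]).
mean_ablation = f(x, D.mean(axis=0))

# Resample ablation (as described): f(x, y) for one y drawn from D.
resample_ablation = f(x, D[rng.integers(len(D))])

# The quantity both are cheaply approximating: E_{y~D}[f(x, y)].
expected_f = np.mean([f(x, y) for y in D])

print(mean_ablation, resample_ablation, expected_f)
```

When `f` is nonlinear, $f(x, \mathbb{E}[y])$ and $\mathbb{E}[f(x, y)]$ generally differ, which is why both ablations are only approximations of the expectation.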