> But in other aspects there often isn’t a clearly correct methodology. For example, it’s unclear whether mean ablations are better than resample ablations for a particular experiment—even though this choice can dramatically change the outcome.
Would you ever really want mean ablation except as a cheaper approximation to resample ablation?
It seems to me that if you ask the question clearly enough, there’s a correct kind of ablation. For example, if the question is “how do we reproduce this behavior from scratch”, you want zero ablation.
Your table can be reorganized into the kinds of answers you’re seeking, namely:

- direct effect vs indirect effect corresponds to whether you ablate the complement of the circuit (direct effect) or restore the circuit itself (indirect effect, mediated by the rest of the model)
- necessity vs sufficiency corresponds to whether you ablate the circuit (direct effect necessary) or restore the complement of the circuit (indirect effect necessary), versus restore the circuit (indirect effect sufficient) or ablate the complement of the circuit (direct effect sufficient); see the sketch after this list
- typical case vs worst case, and over what data distribution:
  - “all tokens vs specific tokens” should be absorbed into the more general category of “what’s the reference dataset distribution under consideration” / “what’s the null hypothesis over”
  - zero ablation answers “reproduce behavior from scratch”
  - mean ablation is an approximation to resample ablation, which is itself an approximation to computing the expected/typical behavior over some distribution
  - pessimal ablation is for dealing with worst-case behaviors
- granularity and component are about the scope of the solution language, and can be generalized a bit
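Here is a compact restatement of the mapping in the first two bullets, as a sketch (the dictionary framing is mine, not from the post):

```python
# Which intervention answers which causal question, per the two bullets above.
# Keys are (effect type, question); values are the intervention that tests it.
interventions = {
    ("direct effect",   "necessary"):  "ablate the circuit",
    ("direct effect",   "sufficient"): "ablate the complement of the circuit",
    ("indirect effect", "necessary"):  "restore the complement of the circuit",
    ("indirect effect", "sufficient"): "restore the circuit",
}
```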
> Would you ever really want mean ablation except as a cheaper approximation to resample ablation?

Resample ablation is not more expensive than mean ablation (both just replace activations with different values). But to answer the question, I think you would: resample ablation biases the model toward some particular corrupted output.
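To make the cost point concrete, here is a minimal sketch (mine, not from the post; the shapes and values are made up) showing that mean and resample ablation are the same operation with different replacement values:

```python
import torch

# Hypothetical cached activations from a reference distribution,
# shaped [n_samples, seq_len, d_model].
acts = torch.randn(64, 10, 512)
clean = torch.randn(1, 10, 512)  # activations on the clean prompt

mean_value = acts.mean(dim=0, keepdim=True)            # fixed replacement value
resample_value = acts[torch.randint(len(acts), (1,))]  # one random draw

def ablate(activation, replacement):
    """Overwrite an activation wholesale; identical cost either way."""
    return replacement.expand_as(activation).clone()

mean_ablated = ablate(clean, mean_value)
resample_ablated = ablate(clean, resample_value)
```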
> It seems to me that if you ask the question clearly enough, there’s a correct kind of ablation. For example, if the question is “how do we reproduce this behavior from scratch”, you want zero ablation.

Yes, I agree. That’s the point we were trying to communicate with “the ablation determines the task.”
> direct effect vs indirect effect corresponds to whether you ablate the complement of the circuit (direct effect) or restore the circuit itself (indirect effect, mediated by the rest of the model)
>
> necessity vs sufficiency corresponds to whether you ablate the circuit (direct effect necessary) or restore the complement of the circuit (indirect effect necessary), versus restore the circuit (indirect effect sufficient) or ablate the complement of the circuit (direct effect sufficient)

Thanks! That’s a great perspective. We probably should have done more to connect ablations back to the causality literature.
> “all tokens vs specific tokens” should be absorbed into the more general category of “what’s the reference dataset distribution under consideration” / “what’s the null hypothesis over”
>
> mean ablation is an approximation to resample ablation, which is itself an approximation to computing the expected/typical behavior over some distribution

These don’t seem correct to me; could you explain further? “Specific tokens” means “we specify the token positions at which each edge in the circuit exists”.
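For concreteness, a hypothetical illustration of that convention (the component names and positions here are invented):

```python
# Invented example: each circuit edge is declared only at specific
# token positions, rather than existing at every position.
circuit_edges = [
    # (upstream node, downstream node, token positions where the edge exists)
    ("attn_head_0.5", "attn_head_9.6", [14]),
    ("mlp_3",         "attn_head_9.6", [10, 14]),
]
```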
> Resample ablation is not more expensive than mean ablation (both just replace activations with different values). But to answer the question, I think you would: resample ablation biases the model toward some particular corrupted output.
Ah, I guess I was incorrectly imagining a more expensive version of resample ablation where you look not just at a single corrupted cache but at the result across all corrupted inputs. That is, in the simple toy model where you’re computing $f(x, y)$, where $x$ is the values for the circuit you care about and $y$ is the cache of corrupted activations, mean ablation computes $f(x, \mathbb{E}_{y \sim D}[y])$, and we could imagine versions of resample ablation that compute $f(x, y)$ for some single $y$ drawn from $D$, or that compute $\mathbb{E}_{y \sim D}[f(x, y)]$. I would say that mean ablation and resample ablation as I imagine you’re describing it are both attempts to cheaply approximate $\mathbb{E}_{y \sim D}[f(x, y)]$.
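As a toy illustration of the three quantities (my own sketch; `f` is a stand-in nonlinearity, not anything from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(10_000, 4))  # samples of the corrupted cache y ~ D
x = np.ones(4)                    # fixed values for the circuit we care about

def f(x, y):
    # Any nonlinearity makes the expectations below differ.
    return np.tanh(x @ y)

mean_ablation   = f(x, D.mean(axis=0))           # f(x, E[y])
resample_once   = f(x, D[rng.integers(len(D))])  # f(x, y) for one draw y ~ D
expected_output = np.mean([f(x, y) for y in D])  # E[f(x, y)]

# For nonlinear f, f(x, E[y]) != E[f(x, y)] in general, which is the sense
# in which both ablations only approximate E[f(x, y)].
print(mean_ablation, resample_once, expected_output)
```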
Edit: This seems related to Hypothesis Testing the Circuit Hypothesis in LLMs