> Stamping behaviour down into a one-dimensional quantity like that is inevitably going to make behavioural comparison difficult.
The reason to stamp it down to a one-dimensional quantity is that sometimes the phenomenon that we wanted to explain is the expectation of a one-dimensional quantity, and we don’t want to require that our tests explain things other than that particular quantity. For example, in an alignment context, I might want to understand why my model does well on the validation set, perhaps in the hope that if I understand why the model performs well, I’ll be able to predict whether it will generalize correctly to a particular new distribution.
The main problem with evaluating a hypothesis by KL divergence is that if you do this, your explanation looks bad in cases like the following: There’s some interference in the model which manifests as random noise, and the explanation failed to preserve the interference pattern. In this case, your explanation has a bunch of random error in its prediction of what the model does, which will hurt the KL. But that interference was random and understanding it won’t help you know if the mechanism that the model was using is going to generalize well to another distribution.
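To make the contrast concrete, here is a minimal sketch of the two ways of scoring an explanation, in PyTorch. `model`, `ablated_model`, and `val_loader` are hypothetical stand-ins for the original network, the network with the hypothesis applied via ablation/resampling, and the distribution being evaluated on; this is an illustration of the metrics, not anyone's reference implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def compare_metrics(model, ablated_model, val_loader):
    loss_orig, loss_abl, kl, n = 0.0, 0.0, 0.0, 0
    for x, y in val_loader:
        logits_orig = model(x)
        logits_abl = ablated_model(x)

        # One-dimensional summary: average loss of each model on the distribution.
        loss_orig += F.cross_entropy(logits_orig, y, reduction="sum").item()
        loss_abl += F.cross_entropy(logits_abl, y, reduction="sum").item()

        # Full-distribution comparison: per-example KL(original || ablated).
        # Any per-example mismatch, e.g. unexplained interference noise,
        # accumulates here even when it averages out of the losses above.
        kl += F.kl_div(
            F.log_softmax(logits_abl, dim=-1),   # "approximating" distribution
            F.log_softmax(logits_orig, dim=-1),  # reference: the original model
            log_target=True,
            reduction="sum",
        ).item()
        n += y.shape[0]

    return {
        "loss_original": loss_orig / n,
        "loss_ablated": loss_abl / n,
        "mean_kl_orig_vs_ablated": kl / n,
    }
```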
There are other cases similar to the interference one. For example, if your model has a heuristic that fires on some of the subdistribution you’re trying to understand the model’s behavior on, but not in a way that ends up affecting the model’s average performance, that is basically another source of noise that you (at least often) don’t want your explanation to have to capture.
As an alternative summary statistic of the extent to which the ablated model performs worse on average, I would suggest the Bayesian Wilcoxon signed-rank test.
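Roughly what I have in mind, as a sketch rather than a reference implementation: treat the per-example losses of the original and ablated models as paired samples and ask how probable it is that ablation makes them worse, instead of only comparing the means. The `loss_original` and `loss_ablated` arrays below are hypothetical per-example losses on the evaluation distribution, and the “Bayesian” part is done with a simple Dirichlet-weighted bootstrap over Walsh averages in the spirit of Benavoli et al.’s Bayesian signed-rank test, not any particular library’s implementation.

```python
import numpy as np

def bayesian_signed_rank(loss_ablated, loss_original, n_draws=5000, seed=0):
    """Posterior probability that ablation makes per-example loss worse."""
    rng = np.random.default_rng(seed)
    z = np.asarray(loss_ablated) - np.asarray(loss_original)  # paired differences
    n = len(z)
    # Walsh averages (z_i + z_j) / 2 -- the quantities the signed-rank statistic is built on.
    walsh = (z[:, None] + z[None, :]) / 2.0                   # O(n^2); fine for modest n
    draws = np.empty(n_draws)
    for t in range(n_draws):
        w = rng.dirichlet(np.ones(n))                         # Bayesian-bootstrap weights over examples
        draws[t] = (w[:, None] * w[None, :] * (walsh > 0)).sum()
    # Fraction of posterior draws in which most of the weighted Walsh averages are
    # positive, i.e. in which the ablated model looks worse than the original.
    return (draws > 0.5).mean()

# Classical counterpart for comparison:
#   scipy.stats.wilcoxon(loss_ablated, loss_original, alternative="greater")
```

One appeal of a paired, rank-based comparison like this is that symmetric per-example noise mostly pushes the posterior towards 0.5 rather than automatically counting against the hypothesis.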
> The main problem with evaluating a hypothesis by KL divergence is that if you do this, your explanation looks bad in cases like the following:
I would take this as an indication that the explanation is inadequate. If I said that the linear combination of nodes n1−1.3⋅n2 at layer l of an NN implements the function f, but in fact it implements f+a⋅g, where g does some other thing, my hypothesis was incorrect, and I’d want the metric to show that. If I haven’t even disentangled the mechanism I claim to have found from all the other surrounding circuits, I don’t think I get to say my hypothesis is doing a good job. Otherwise it seems like I have a lot of freedom to make up spurious hypotheses that claim whatever, and hide the inadequacies as “small random fluctuations” in the ablated test loss.
> The reason to stamp it down to a one-dimensional quantity is that sometimes the phenomenon that we wanted to explain is the expectation of a one-dimensional quantity, and we don’t want to require that our tests explain things other than that particular quantity. For example, in an alignment context, I might want to understand why my model does well on the validation set, perhaps in the hope that if I understand why the model performs well, I’ll be able to predict whether it will generalize correctly to a particular new distribution.
I don’t see how the dimensionality of the quantity whose generative mechanism you want to understand relates to the dimensionality of the comparison you would carry out to evaluate a proposed generative mechanism.
I want to understand how the model computes its outputs to get loss l on distribution P1, so I can predict what loss it will get on another distribution P2. I make a hypothesis for what the mechanism is. The mechanism implies that doing intervention h on the network, say scaling x5 to 2.0⋅x5, should not change behaviour, because the NN only cares about sgn(x5), not its magnitude. If I then see that the intervention does shift the output behaviour, even if it does not change the value of l on net, my hypothesis was wrong: the magnitude of x5 does play a part in the network’s computations on P1. It has an influence on the output.
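Concretely, a check like that might look something like the following sketch (PyTorch; `model`, `val_loader`, `target_module`, and `unit_idx` are hypothetical stand-ins for the network, samples from P1, and wherever x5 lives): run the intervention with a forward hook and look at the full output shift alongside the net change in loss.

```python
import torch
import torch.nn.functional as F

def scale_unit_hook(unit_idx, scale=2.0):
    """Forward hook that rescales one unit of a module's output (x5 -> 2.0*x5)."""
    def hook(module, inputs, output):
        out = output.clone()                     # assumes the output is a plain tensor
        out[..., unit_idx] = scale * out[..., unit_idx]
        return out
    return hook

@torch.no_grad()
def intervention_effect(model, target_module, unit_idx, val_loader):
    kl_total, loss_change, n = 0.0, 0.0, 0
    for x, y in val_loader:
        logits = model(x)

        handle = target_module.register_forward_hook(scale_unit_hook(unit_idx))
        logits_intervened = model(x)
        handle.remove()

        # Does the intervention move the outputs at all? (per-example KL)
        kl_total += F.kl_div(
            F.log_softmax(logits_intervened, dim=-1),
            F.log_softmax(logits, dim=-1),
            log_target=True,
            reduction="sum",
        ).item()
        # ...and does it change the scalar loss on net? These two can come apart.
        loss_change += (
            F.cross_entropy(logits_intervened, y, reduction="sum")
            - F.cross_entropy(logits, y, reduction="sum")
        ).item()
        n += y.shape[0]

    return {"mean_kl_orig_vs_intervened": kl_total / n, "mean_loss_change": loss_change / n}
```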
> But that interference was random and understanding it won’t help you know if the mechanism that the model was using is going to generalize well to another distribution.
If it had no effect on how outputs for P1 are computed, then destroying it should not change behaviour on P1, so there should be no divergence between the original and ablated models’ outputs. If it did affect behaviour on P1, but not in a way that helps or hurts accuracy on that particular distribution on net, it seems that you still want to know about it, because once you understand what it does, you might see that it will help or hurt the model’s ability to do well on P2.
A heuristic that fires on some of P1, but doesn’t really help much, might turn out to be crucial for doing well on P2. A leftover memorised circuit that didn’t get cleaned up might add harmless “noise” on net on P1, but ruin generalisation to P2.
I would expect this to be reasonably common. A very general solution is probably overkill for a narrow sub-dataset: it may contain many circuits that check for possible exception cases but aren’t really necessary for that particular class of inputs. If you throw out everything that doesn’t do much to the loss on net, your explanations will miss the existence of these circuits, and you might wrongly conclude that the solution you are looking at is narrow and will not generalise.