> Stamping behaviour down into a one-dimensional quantity like that is inevitably going to make behavioural comparison difficult.
The reason to stamp it down to a one-dimensional quantity is that sometimes the phenomenon that we wanted to explain is the expectation of a one-dimensional quantity, and we don’t want to require that our tests explain things other than that particular quantity. For example, in an alignment context, I might want to understand why my model does well on the validation set, perhaps in the hope that if I understand why the model performs well, I’ll be able to predict whether it will generalize correctly to a particular new distribution.
The main problem with evaluating a hypothesis by KL divergence is that if you do this, your explanation looks bad in cases like the following: There’s some interference in the model which manifests as random noise, and the explanation failed to preserve the interference pattern. In this case, your explanation has a bunch of random error in its prediction of what the model does, which will hurt the KL. But that interference was random and understanding it won’t help you know if the mechanism that the model was using is going to generalize well to another distribution.
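To make the contrast concrete, here is a minimal sketch of the two ways of scoring an explanation, in PyTorch. `model`, `ablated_model`, and `val_loader` are hypothetical stand-ins for the original network, the network with the hypothesis applied via ablation/resampling, and the distribution being evaluated on; this is an illustration of the metrics, not anyone's reference implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def compare_metrics(model, ablated_model, val_loader):
    loss_orig, loss_abl, kl, n = 0.0, 0.0, 0.0, 0
    for x, y in val_loader:
        logits_orig = model(x)
        logits_abl = ablated_model(x)

        # One-dimensional summary: average loss of each model on the distribution.
        loss_orig += F.cross_entropy(logits_orig, y, reduction="sum").item()
        loss_abl += F.cross_entropy(logits_abl, y, reduction="sum").item()

        # Full-distribution comparison: per-example KL(original || ablated).
        # Any per-example mismatch, e.g. unexplained interference noise,
        # accumulates here even when it averages out of the losses above.
        kl += F.kl_div(
            F.log_softmax(logits_abl, dim=-1),   # "approximating" distribution
            F.log_softmax(logits_orig, dim=-1),  # reference: the original model
            log_target=True,
            reduction="sum",
        ).item()
        n += y.shape[0]

    return {
        "loss_original": loss_orig / n,
        "loss_ablated": loss_abl / n,
        "mean_kl_orig_vs_ablated": kl / n,
    }
```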
There are other cases similar to the interference one. For example, if your model has a heuristic that fires on some of the subdistribution you’re trying to understand the model’s behavior on, but not in a way that ends up affecting the model’s average performance, that is basically another source of noise that you (at least often) don’t want your explanation to have to capture.
As an alternative summary statistic of the extent to which the ablated model performs worse on average, I would suggest the Bayesian Wilcoxon signed-rank test.
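Roughly what I have in mind, as a sketch rather than a reference implementation: treat the per-example losses of the original and ablated models as paired samples and ask how probable it is that ablation makes them worse, instead of only comparing the means. The `loss_original` and `loss_ablated` arrays below are hypothetical per-example losses on the evaluation distribution, and the “Bayesian” part is done with a simple Dirichlet-weighted bootstrap over Walsh averages in the spirit of Benavoli et al.’s Bayesian signed-rank test, not any particular library’s implementation.

```python
import numpy as np

def bayesian_signed_rank(loss_ablated, loss_original, n_draws=5000, seed=0):
    """Posterior probability that ablation makes per-example loss worse."""
    rng = np.random.default_rng(seed)
    z = np.asarray(loss_ablated) - np.asarray(loss_original)  # paired differences
    n = len(z)
    # Walsh averages (z_i + z_j) / 2 -- the quantities the signed-rank statistic is built on.
    walsh = (z[:, None] + z[None, :]) / 2.0                   # O(n^2); fine for modest n
    draws = np.empty(n_draws)
    for t in range(n_draws):
        w = rng.dirichlet(np.ones(n))                         # Bayesian-bootstrap weights over examples
        draws[t] = (w[:, None] * w[None, :] * (walsh > 0)).sum()
    # Fraction of posterior draws in which most of the weighted Walsh averages are
    # positive, i.e. in which the ablated model looks worse than the original.
    return (draws > 0.5).mean()

# Classical counterpart for comparison:
#   scipy.stats.wilcoxon(loss_ablated, loss_original, alternative="greater")
```

One appeal of a paired, rank-based comparison like this is that symmetric per-example noise mostly pushes the posterior towards 0.5 rather than automatically counting against the hypothesis.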
> The main problem with evaluating a hypothesis by KL divergence is that if you do this, your explanation looks bad in cases like the following:
I would take this as an indication that the explanation is inadequate. If I said that the linear combination of nodes n1−1.3⋅n2 at layer l of an NN implements the function f, but in fact it implements f+a⋅g, where g does some other thing, my hypothesis was incorrect, and I’d want the metric to show that. If I haven’t even disentangled the mechanism I claim to have found from all the other surrounding circuits, I don’t think I get to say my hypothesis is doing a good job. Otherwise it seems like I have a lot of freedom to make up spurious hypotheses that claim whatever, and hide the inadequacies as “small random fluctuations” in the ablated test loss.
> The reason to stamp it down to a one-dimensional quantity is that sometimes the phenomenon that we wanted to explain is the expectation of a one-dimensional quantity, and we don’t want to require that our tests explain things other than that particular quantity. For example, in an alignment context, I might want to understand why my model does well on the validation set, perhaps in the hope that if I understand why the model performs well, I’ll be able to predict whether it will generalize correctly to a particular new distribution.
I don’t see how the dimensionality of the quantity whose generative mechanism you want to understand relates to the dimensionality of the comparison you would carry out to evaluate a proposed generative mechanism.
I want to understand how the model computes its outputs to get loss l on distribution P1, so I can predict what loss it will get on another distribution P2. I make a hypothesis for what the mechanism is. The mechanism implies that doing intervention h on the network, say scaling x5 to 2.0⋅x5, should not change behaviour, because the NN only cares about sgn(x5), not its magnitude. If I then see that the intervention does shift the output behaviour, even if it does not change the value of l on net, my hypothesis was wrong: the magnitude of x5 does play a part in the network’s computations on P1. It has an influence on the output.
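Concretely, a check like that might look something like the following sketch (PyTorch; `model`, `val_loader`, `target_module`, and `unit_idx` are hypothetical stand-ins for the network, samples from P1, and wherever x5 lives): run the intervention with a forward hook and look at the full output shift alongside the net change in loss.

```python
import torch
import torch.nn.functional as F

def scale_unit_hook(unit_idx, scale=2.0):
    """Forward hook that rescales one unit of a module's output (x5 -> 2.0*x5)."""
    def hook(module, inputs, output):
        out = output.clone()                     # assumes the output is a plain tensor
        out[..., unit_idx] = scale * out[..., unit_idx]
        return out
    return hook

@torch.no_grad()
def intervention_effect(model, target_module, unit_idx, val_loader):
    kl_total, loss_change, n = 0.0, 0.0, 0
    for x, y in val_loader:
        logits = model(x)

        handle = target_module.register_forward_hook(scale_unit_hook(unit_idx))
        logits_intervened = model(x)
        handle.remove()

        # Does the intervention move the outputs at all? (per-example KL)
        kl_total += F.kl_div(
            F.log_softmax(logits_intervened, dim=-1),
            F.log_softmax(logits, dim=-1),
            log_target=True,
            reduction="sum",
        ).item()
        # ...and does it change the scalar loss on net? These two can come apart.
        loss_change += (
            F.cross_entropy(logits_intervened, y, reduction="sum")
            - F.cross_entropy(logits, y, reduction="sum")
        ).item()
        n += y.shape[0]

    return {"mean_kl_orig_vs_intervened": kl_total / n, "mean_loss_change": loss_change / n}
```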
> But that interference was random and understanding it won’t help you know if the mechanism that the model was using is going to generalize well to another distribution.
If it had no effect on how outputs for P1 are computed, then destroying it should not change behaviour on P1, so there should be no divergence between the original and ablated models’ outputs. If it did affect behaviour on P1, but not in a way that helps or hurts accuracy on that particular distribution on net, it seems that you still want to know about it, because once you understand what it does, you might see that it will help or hurt the model’s ability to do well on P2.
A heuristic that fires on some of P1, but doesn’t really help much, might turn out to be crucial for doing well on P2. A leftover memorised circuit that didn’t get cleaned up might add harmless “noise” on net on P1, but ruin generalisation to P2.
I would expect this to be reasonably common. A very general solution is probably overkill for a narrow sub-dataset: it may contain many circuits that check for possible exception cases but aren’t really necessary for that particular class of inputs. If you throw out everything that doesn’t do much to the loss on net, your explanations will miss the existence of these circuits, and you might wrongly conclude that the solution you are looking at is narrow and will not generalise.