CaSc can fail to reject a hypothesis if it is too unspecific and is extensionally equivalent to the true hypothesis.
Seems to me like this is easily resolved so long as you don’t screw up your bookkeeping. In your example, the hypothesis implicitly only makes a claim about the information going out of the bubble. So long as you always write down which nodes or layers of the network your hypothesis makes what claims about, I think this should be fine?
On the input-output level, we found that CaSc can fail to reject false hypotheses due to cancellation, i.e. because the task has a certain structural distribution that does not allow resampling to differentiate between different hypotheses.
I don’t know that much about CaSc, but why are you comparing the ablated graphs to the originals via their separate loss on the data in the first place? Stamping behaviour down into a one-dimensional quantity like that is inevitably going to make behavioural comparison difficult.
Wouldn’t you want to directly compare the divergence on outputs between the original graph G and the ablated graph I instead? The KL divergence D_KL between their output distributions over the data is the first thing that’d come to my mind. Or keep whatever the original loss function is, but with the outputs of G as the new ground-truth labels.
That’s still somewhat ad hoc, of course, but it should at least take care of the failure mode you point out here. Is this really not part of current CaSc?
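For concreteness, here’s a minimal sketch of the kind of comparison I have in mind, assuming both graphs produce softmax outputs on the same inputs (all function and variable names here are made up for illustration):

```python
import numpy as np

def kl_rows(p, q, eps=1e-12):
    """Per-sample D_KL(p || q) for rows of probability vectors."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=-1)

def compare_graphs(probs_g, probs_i):
    """Compare the ablated graph I directly against the original graph G.

    probs_g, probs_i: (n_samples, n_classes) softmax outputs of G and I
    evaluated on the same data points.
    """
    # Direct comparison: average KL between the two output distributions.
    mean_kl = kl_rows(probs_i, probs_g).mean()

    # Alternative: keep the original loss (cross-entropy here), but use
    # G's outputs as soft targets instead of the dataset labels.
    ce_vs_g = -np.mean(np.sum(probs_g * np.log(np.clip(probs_i, 1e-12, 1.0)), axis=-1))

    return {"mean_kl_I_vs_G": mean_kl, "ce_of_I_against_G": ce_vs_g}
```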
Stamping behaviour down into a one-dimensional quantity like that is inevitably going to make behavioural comparison difficult.
The reason to stamp it down to a one-dimensional quantity is that sometimes the phenomenon that we wanted to explain is the expectation of a one-dimensional quantity, and we don’t want to require that our tests explain things other than that particular quantity. For example, in an alignment context, I might want to understand why my model does well on the validation set, perhaps in the hope that if I understand why the model performs well, I’ll be able to predict whether it will generalize correctly onto a particular new distribution.
The main problem with evaluating a hypothesis by KL divergence is that if you do this, your explanation looks bad in cases like the following: There’s some interference in the model which manifests as random noise, and the explanation failed to preserve the interference pattern. In this case, your explanation has a bunch of random error in its prediction of what the model does, which will hurt the KL. But that interference was random and understanding it won’t help you know if the mechanism that the model was using is going to generalize well to another distribution.
There are other cases similar to the interference one. For example, if your model has a heuristic that fires on some of the subdistribution you’re trying to understand the model’s behavior on, but not in a way that ends up affecting the model’s average performance, this is basically another source of noise that you (at least often) end up not wanting your explanation to have to capture.
As an alternative summary statistic of the extent to which the ablated model performs worse on average, I would suggest the Bayesian Wilcoxon signed-rank test.
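For concreteness, a rough sketch of that per-sample comparison, using the ordinary (frequentist) Wilcoxon signed-rank test from SciPy as a stand-in for the Bayesian version, on made-up per-sample losses:

```python
import numpy as np
from scipy.stats import wilcoxon

# Per-sample losses of the original and the scrubbed model on the same
# inputs (made-up numbers; in practice these come from evaluating both
# computational graphs on the reference dataset).
loss_original = np.array([0.21, 0.35, 0.18, 0.50, 0.27, 0.33, 0.41, 0.29])
loss_scrubbed = np.array([0.23, 0.40, 0.26, 0.49, 0.30, 0.39, 0.45, 0.36])

# Paired, rank-based test of whether the scrubbed model is systematically
# worse per sample, rather than a comparison of the two average losses.
stat, p_value = wilcoxon(loss_scrubbed, loss_original, alternative="greater")
print(f"W = {stat}, one-sided p = {p_value:.3f}")
```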
The main problem with evaluating a hypothesis by KL divergence is that if you do this, your explanation looks bad in cases like the following:
I would take this as an indication that the explanation is inadequate. If I said that the linear combination of nodes n_1 − 1.3⋅n_2 at layer l of a NN implements the function f, but it in fact implements f + a⋅g, where g does some other thing, my hypothesis was incorrect, and I’d want the metric to show that. If I haven’t even disentangled the mechanism I claim to have found from all the other surrounding circuits, I don’t think I get to say my hypothesis is doing a good job. Otherwise it seems like I have a lot of freedom to make up spurious hypotheses that claim whatever, and hide the inadequacies as “small random fluctuations” in the ablated test loss.
The reason to stamp it down to a one-dimensional quantity is that sometimes the phenomenon that we wanted to explain is the expectation of a one-dimensional quantity, and we don’t want to require that our tests explain things other than that particular quantity. For example, in an alignment context, I might want to understand why my model does well on the validation set, perhaps in the hope that if I understand why the model performs well, I’ll be able to predict whether it will generalize correctly onto a particular new distribution.
I don’t see how the dimensionality of the quantity whose generative mechanism you want to understand relates to the dimensionality of the comparison you would want to carry out to evaluate a proposed generative mechanism.
I want to understand how the model computes its outputs to get loss l on distribution P_1, so I can predict what loss it will get on another distribution P_2. I make a hypothesis for what the mechanism is. The mechanism implies that doing intervention h on the network, say shifting x_5 to 2.0⋅x_5, should not change behaviour, because the NN only cares about sgn(x_5), not its magnitude. If I then see that the intervention does shift output behaviour, even if it does not change the value of l on net, my hypothesis was wrong. The magnitude of x_5 does play a part in the network’s computations on P_1. It has an influence on the output.
But that interference was random and understanding it won’t help you know if the mechanism that the model was using is going to generalize well to another distribution.
If it had no effect on how outputs for P_1 are computed, then destroying it should not change behaviour on P_1. So there should be no divergence between the original and ablated models’ outputs. If it did affect behaviour on P_1, but not in ways that contribute net negatively or net positively to the accuracy on that particular distribution, it seems that you would still want to know about it, because once you understand what it does, you might see that it will contribute net negatively or net positively to the model’s ability to do well on P_2.
A heuristic that fires on some of P_1, but doesn’t really help much, might turn out to be crucial for doing well on P_2. A leftover memorised circuit that didn’t get cleaned up might add harmless “noise” on net on P_1, but ruin generalisation to P_2.
I would expect this to be reasonably common. A very general solution is probably overkill for a narrow sub-dataset, containing many circuits that check for possible exception cases but aren’t really necessary for that particular class of inputs. If you throw out everything that doesn’t do much to the loss on net, your explanations will miss the existence of these circuits, and you might wrongly conclude that the solution you are looking at is narrow and will not generalise.
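To make the x_5 example concrete, here is a toy numerical version (the “network”, the target, and the numbers are all invented for illustration): the intervention leaves the average loss essentially unchanged, but the outputs move on essentially every sample, which is exactly the evidence that the sign-only hypothesis is wrong.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: the relevant feature x_5 is uniform on [-1, 1] and the
# regression target is sgn(x_5).
n = 100_000
x5 = rng.uniform(-1.0, 1.0, size=n)
y = np.sign(x5)

# The actual "network" outputs x_5 itself, i.e. it uses the magnitude of
# x_5, so the hypothesis "only sgn(x_5) matters" is wrong.
def model(x5):
    return x5

# Intervention h implied by the hypothesis: rescaling x_5 should not matter.
out_original = model(x5)
out_intervened = model(2.0 * x5)

def mse(pred):
    return np.mean((pred - y) ** 2)

print(f"loss before intervention: {mse(out_original):.3f}")    # ~0.333
print(f"loss after  intervention: {mse(out_intervened):.3f}")  # ~0.333, unchanged on net
print(f"mean squared divergence between outputs: "
      f"{np.mean((out_intervened - out_original) ** 2):.3f}")   # ~0.333, clearly nonzero
```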
Note: assuming the test/validation distribution is an empirical dataset (i.e. a finite mixture of Dirac deltas) and the original graph G is deterministic, the D_KL between the pushforward distributions on the outputs of the computational graph will typically be infinite. In this context you would need to use a Wasserstein divergence, or to “thicken” the distributions by adding absolutely continuous noise to the input and/or output.
Or maybe you meant the case where the output is a softmax layer and is interpreted as a probability distribution, in which case E_x D_KL(I(x) || G(x)) does seem reasonable. That does seem like a special case of your next sentence, where you suggest using the original loss function but substituting the unablated model’s outputs for the supervision targets, which also seems like a good summary statistic to look at.

Second paragraph is what I meant, thanks.
Seems to me like this is easily resolved so long as you don’t screw up your bookkeeping. In your example, the hypothesis implicitly only makes a claim about the information going out of the bubble. So long as you always write down which nodes or layers of the network your hypothesis makes what claims about, I think this should be fine?
Yes, totally agree. We are not claiming that this is a failure mode of CaSc; it can “easily” be resolved by making your hypothesis more specific. We are merely pointing out that “In theory, this is a trivial point, but we found that in practice, it is easy to miss this distinction when there is an ‘obvious’ algorithm to implement a given function.”
I don’t know that much about CaSc, but why are you comparing the ablated graphs to the originals via their separate loss on the data in the first place? Stamping behaviour down into a one-dimensional quantity like that is inevitably going to make behavioural comparison difficult.
You are right that this is a failure mode that is mostly due to reducing the behavior down to a single aggregate quantity like the average loss recovered. It can be remedied by looking at the loss on individual samples rather than averaging the metric across the whole dataset. In the footnote, we point out that researchers at Redwood Research have also started looking at the per-sample loss instead of the aggregate loss.
CaSc was, however, introduced by looking at the average scrubbed loss (even though its authors say that this metric is not ideal). Also, in practice, when one iterates on generating hypotheses and testing them with CaSc, it’s more convenient to look at aggregate metrics. We thus think it is useful to have concrete examples that show how this can lead to problems.
Your suggestion of using D_KL seems a useful improvement over most metrics. It’s, however, still possible that cancellation could occur. Cancellation is mostly due to aggregating a metric (e.g., taking the mean) and less due to the specific metric used (although I could imagine that some metrics like D_KL could allow for less ambiguity).
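As a minimal numerical illustration of the cancellation point (with made-up per-sample losses): the aggregate “loss recovered” can look perfect even though the scrubbed model behaves quite differently on every individual sample.

```python
import numpy as np

# Made-up per-sample losses: the scrubbed model is clearly worse on half
# of the samples and clearly better on the other half.
loss_original = np.array([0.30, 0.30, 0.30, 0.30, 0.30, 0.30])
loss_scrubbed = np.array([0.10, 0.50, 0.10, 0.50, 0.10, 0.50])

print(loss_original.mean(), loss_scrubbed.mean())    # 0.30 vs 0.30: average loss fully "recovered"
print(np.abs(loss_scrubbed - loss_original).mean())  # 0.20: large per-sample disagreement
```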
Your suggestion of using D_KL seems a useful improvement over most metrics. It’s, however, still possible that cancellation could occur. Cancellation is mostly due to aggregating a metric (e.g., taking the mean) and less due to the specific metric used (although I could imagine that some metrics like D_KL could allow for less ambiguity).
It’s not about D_KL vs. some other loss function. It’s about using a one-dimensional summary of a high-dimensional comparison, instead of a one-dimensional comparison. There are many ways for two neural networks to both diverge from some training labels y by an average loss l while spitting out very different outputs. There are tautologically no ways for two neural networks to have different output behaviour without having non-zero divergence in label assignment for at least some data points. Thus, it seems that you would want a metric that aggregates the divergence of the two networks’ outputs from each other, not a metric that compares their separate aggregated divergences from some unrelated data labels and so throws away most of the information.
A low-dimensional summary of a high-dimensional comparison between the networks seems fine(ish). A low-dimensional comparison between the networks based on the summaries of their separate comparisons to a third distribution throws away a lot of the relevant information.
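A tiny numerical illustration of the difference (binary classifiers, made-up probabilities): both models have exactly the same average loss against the labels y, yet they disagree substantially with each other on every data point, which only a direct comparison of their outputs reveals.

```python
import numpy as np

y = np.array([1, 1, 0, 0])              # data labels
p_g = np.array([0.9, 0.6, 0.4, 0.1])    # model G: predicted P(class 1)
p_i = np.array([0.6, 0.9, 0.1, 0.4])    # model I: predicted P(class 1)

def cross_entropy(p, y):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def mean_kl(p, q):
    """Average D_KL between the two models' per-sample Bernoulli outputs."""
    return np.mean(p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q)))

print(cross_entropy(p_g, y), cross_entropy(p_i, y))  # identical summaries vs the labels
print(mean_kl(p_g, p_i))                             # clearly nonzero: the models disagree
```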