These seem like important intuitions, but I’m not sure I understand or share them. Suppose I identify a sentiment feature. I agree there’s a lot of room for variation in what precise notion of sentiment the model is using, and there are lots of different ways this sentiment feature could be interacting with the network that are difficult to understand. But maybe I don’t really care about that, I just want a classifier for something which is close enough to my internal notion of sentiment.
Sure, but then why not just train a probe? If we don’t care much about precision, what goes wrong with the probe approach?
It’s possible to improve on just a probe trained on the data we can construct, of course, but you’ll need non-trivial precision to do so.
The key question here is “why does selecting a feature work while just naively training a probe fails”.
We have to be getting some additional bits from our selection of the feature.
In more detail, let’s suppose we use the following process:
Select all features which individually get loss < X on our training set. Choose X such that if we get that loss on our training set, we’re only worried about generalization error rather than errors which show up on the training set (~equivalently, we’re well into diminishing returns on loss).
Try to pick among these features or combine these features to produce a better classifier (see the sketch after the two issues below).
Then there are two issues:
Maybe there isn’t any feature which gets < X loss. (We can relax our requirements, but we wanted to compete with probes!)
When we select among these features do we get a non-trivial number of “bits” of improvement? Is that enough bits to achieve what we wanted? I’m somewhat skeptical we can get much if any improvement here. (Of course, note that this doesn’t mean the idea has no promise!)
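To make this concrete, here’s a minimal sketch of the two-step process, assuming we already have a matrix of SAE feature activations (`feature_acts`) and binary labels for the concept on a training set; the variable names and the use of scikit-learn are illustrative assumptions, not anything from this discussion:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def select_and_combine(feature_acts: np.ndarray, labels: np.ndarray, loss_threshold: float):
    """Step 1: keep every feature that, used on its own, gets loss < loss_threshold.
    Step 2: fit a small classifier over just the surviving features."""
    kept = []
    for i in range(feature_acts.shape[1]):
        single = feature_acts[:, [i]]
        probe = LogisticRegression().fit(single, labels)
        if log_loss(labels, probe.predict_proba(single)) < loss_threshold:
            kept.append(i)
    if not kept:
        return None, kept  # issue 1: maybe no feature clears the bar
    combined = LogisticRegression().fit(feature_acts[:, kept], labels)
    return combined, kept
```

Both issues show up directly here: `kept` can come back empty, and the final combination step only helps if selecting among the surviving features actually buys bits beyond what a probe trained on the same data would get.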
IMO there are two kinda separate (and separable) things going on:
Maybe features from the decomposition you picked are, for some reason, a good prior for classifiers. (I don’t really see why this would be true for currently used decompositions.) Why would selecting features based on performing well on our dataset be better than just training a probe?
Maybe looking at the connections of your classifier (what earlier features it connects to and what these connect to) and applying selection to the classifier based on the connections will be good. This can totally be applied to probe-based classifiers. (Maybe there is some reason why looking at connections will be especially good for classifiers based on picking a feature rather than training a probe, but if so, why?)
“Can this be applied to probes” is a crux for me. It sounds like you’re imagining something like:
Train a bunch of truthfulness probes regularized to be distinct from each other.
Train a bunch of probes for “blacklisted” features which we don’t think should be associated with truth (e.g. social reasoning, intent to lie, etc.).
(Unsure about this step.) Check which truth directions are causally downstream of blacklisted feature directions (with patching experiments? see the sketch below). Use that to discriminate among the probes.
Is that right?
This is not an option I had considered, and it would be very exciting to me if it worked. I have some vague intuition that this should all go better when you are working with features (e.g. because the causal dependencies among the features should be sparse), but I would definitely need to think about that position more.
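For concreteness, here’s a rough sketch of the causal check in the third step, written as a steering-style stand-in for a patching experiment. It assumes a PyTorch model whose blocks return the residual stream as a plain tensor, a hypothetical helper `late_layer_acts` that returns the activations the truth probes read from, and candidate directions `truth_dir` / `blacklist_dir`; all of these names are illustrative assumptions:

```python
import torch

@torch.no_grad()
def downstream_sensitivity(model, batch, early_block, late_layer_acts,
                           truth_dir, blacklist_dir, alpha=4.0):
    """Push the residual stream along `blacklist_dir` at an early block and measure
    how much the truth probe's score at the later layer moves. A large shift suggests
    this truth direction is causally downstream of the blacklisted direction."""
    baseline = late_layer_acts(model, batch) @ truth_dir

    def steer(module, inputs, output):
        # Assumes the block's output is the residual stream tensor itself.
        return output + alpha * blacklist_dir

    handle = early_block.register_forward_hook(steer)
    try:
        steered = late_layer_acts(model, batch) @ truth_dir
    finally:
        handle.remove()
    return (steered - baseline).abs().mean().item()
```

The selection step would then keep the truth probes whose scores move least under steering along every blacklisted direction.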
“Can this be applied to probes” is a crux for me. It sounds like you’re imagining something like:
I was actually imagining a hybrid between probes and features. The actual classifier doesn’t need to be part of a complete decomposition, but for the connections we maybe do want the complete decomposition, so we can fully analyze connections, including the recursive case.
So:
Train a bunch of truthfulness probes regularized to be distinct from each other.
Check feature connections for these probes and select accordingly.
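As a sketch of what “check feature connections and select accordingly” could look like in the simplest (non-recursive) case: assume we have the probe directions and an SAE decoder at the same layer, plus indices of features we’ve labeled as blacklisted (all of these are illustrative assumptions). Score each probe against the decoder directions and prefer probes that don’t lean on blacklisted features:

```python
import numpy as np

def probe_feature_connections(probe_dirs: np.ndarray, sae_decoder: np.ndarray) -> np.ndarray:
    """Cosine similarity between each probe direction (n_probes, d_model) and each SAE
    decoder direction (n_features, d_model): which features does each probe read from?"""
    probes = probe_dirs / np.linalg.norm(probe_dirs, axis=1, keepdims=True)
    feats = sae_decoder / np.linalg.norm(sae_decoder, axis=1, keepdims=True)
    return probes @ feats.T  # (n_probes, n_features)

def rank_probes(similarities: np.ndarray, blacklisted_features: list) -> np.ndarray:
    """Prefer probes whose alignment with blacklisted features is smallest."""
    penalty = np.abs(similarities[:, blacklisted_features]).max(axis=1)
    return np.argsort(penalty)  # probe indices, least blacklist-aligned first
```

The recursive case would repeat the same kind of scoring for the features each probe reads from, one layer earlier.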
I also think there’s a pretty straightforward way to do this without needing to train a bunch of probes (e.g. train probes to be orthogonal to undesirable stuff or whatever rather than needing to train a bunch).
As you mentioned, you probably can do this entirely with just learned probes, via a mechanism like the one you described (but this is less clean than using a decomposition).
You could also apply amnesic probing, which has somewhat different properties than looking at the decomposition (amnesic probing is where you remove some dimensions, e.g. via LEACE, so that certain classes can no longer be discriminated, as we discussed in the measurement tampering paper).
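A crude stand-in for that amnesic-probing step, just to show the shape of the procedure: fit a probe for the blacklisted concept, project its direction out of the activations, then refit the truth probe on the erased activations. Real LEACE uses a covariance-aware affine erasure with guarantees rather than this plain orthogonal projection, and all the names here are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def erase_direction(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of each activation vector along `direction`."""
    d = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ d, d)

def amnesic_truth_probe(acts, blacklist_labels, truth_labels):
    # Fit a probe for the blacklisted concept, erase its direction, then refit truth.
    blacklist_probe = LogisticRegression().fit(acts, blacklist_labels)
    erased = erase_direction(acts, blacklist_probe.coef_[0])
    return LogisticRegression().fit(erased, truth_labels)
```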
(TBC, doesn’t seem that useful to argue about “what is mech interp”, I think the more central question is “how likely is it that all this prior work and ideas related to mech interp are useful”. This is a strictly higher bar, but we should apply the same adjustment for work in all other areas etc.)
More generally, it seems good to be careful about thinking through questions like “does using X have a principled reason to be better than applying the ‘default’ approach (e.g. training a probe)”. Good to do this regardless of whether we actually use the default approach, so we know where the juice is coming from.
In the case of mech interp style decompositions, I’m pretty skeptical that there is any juice in finding your classifier by doing something like selecting over components rather than training a probe. But, there could theoretically be juice in trying to understand how a probe works by looking at its connections (and the connections of its connections etc).
Sure, but then why not just train a probe? If we don’t care much about precision, what goes wrong with the probe approach?
Here’s a reasonable example where naively training a probe fails. The model lies if any of N features is “true”. One of the features is almost always activated at the same time as some others, such that in the training set it never solely determines whether the model lies.
Then, a probe trained on the activations may not pick up on that feature. Whereas if we can look at model weights, we can see that this feature also matters, and include it in our lying classifier.
This particular case can also be solved by adversarially attacking the probe though.
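The basic failure is easy to reproduce in a toy setting (purely synthetic; nothing here comes from a real model): the label is an OR of three binary features, but the third feature only ever fires together with the first in training, so a logistic-regression probe puts almost all of its weight on the first two features and misses the held-out case where the third fires alone:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
f0 = rng.integers(0, 2, 5000)
f1 = rng.integers(0, 2, 5000)
f2 = f0 & rng.integers(0, 2, 5000)    # f2 only ever fires alongside f0 in training
X_train = np.stack([f0, f1, f2], axis=1)
y_train = f0 | f1 | f2                # the model "lies" if any feature is on

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(probe.coef_, probe.intercept_)  # the weight on f2 ends up much smaller than on f0, f1

# At deployment f2 can fire on its own, a combination never seen in training.
print(probe.predict(np.array([[0, 0, 1]])))  # predicts "no lie", missing this case
```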