Maybe looking at the connections of your classifier (what earlier features it connects to and what these connect to) and applying selection to the classifier based on the connections will be good. This can totally be applied to probes. (Maybe there is some reason why looking at connections will be especially good for features but not probes, but if so, why?)
“Can this be applied to probes” is a crux for me. It sounds like you’re imagining something like:
Train a bunch of truthfulness probes regularized to be distinct from each other.
Train a bunch of probes for “blacklisted” features which we don’t think should be associated with truth (e.g. social reasoning, intent to lie, etc.).
(Unsure about this step.) Check which truth directions are causally downstream of blacklisted feature directions (with patching experiments?). Use that to discriminate among the probes.
Is that right?
This is not an option I had considered, and it would be very exciting to me if it worked. I have some vague intuition that this should all go better when you are working with features (e.g. because the causal dependencies among the features should be sparse), but I would definitely need to think about that position more.
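A toy sketch of the three-step proposal above, under loud assumptions: the "model" is just a fixed rotation between an early and a later layer, "regularized to be distinct" is approximated by fitting each new probe on activations with the earlier probe directions projected out, and the patching experiment is a mean-ablation of the blacklisted direction at the early layer. All names (`truth`, `intent`, `later_layer`, `fit_distinct_probes`, `patch_out`) are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 6

# Toy latents: `truth` is the label we want; `intent` is a blacklisted
# feature that agrees with truth 90% of the time on this distribution.
truth = rng.integers(0, 2, n).astype(float)
intent = np.where(rng.random(n) < 0.9, truth, 1.0 - truth)

# Early-layer activations: truth on axis 0, intent on axis 1, plus noise.
h_early = 0.1 * rng.normal(size=(n, d))
h_early[:, 0] += truth
h_early[:, 1] += intent

# Assumed stand-in for the rest of the network: a fixed random rotation.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

def later_layer(h):
    return h @ Q.T

def fit_direction(acts, labels):
    """Least-squares linear probe, returned as a unit vector."""
    X = acts - acts.mean(axis=0)
    w, *_ = np.linalg.lstsq(X, labels - labels.mean(), rcond=None)
    return w / np.linalg.norm(w)

def fit_distinct_probes(acts, labels, k):
    """Step 1: fit k probes, projecting out each found direction before the next fit."""
    X = acts - acts.mean(axis=0)
    probes = []
    for _ in range(k):
        w = fit_direction(X, labels)
        probes.append(w)
        X = X - np.outer(X @ w, w)
    return np.stack(probes)

def patch_out(h, v):
    """Mean-ablate the component of h along the unit direction v."""
    centered = h - h.mean(axis=0)
    return h - np.outer(centered @ v, v)

truth_probes = fit_distinct_probes(later_layer(h_early), truth, k=2)  # step 1
blacklist_dir = fit_direction(h_early, intent)                        # step 2

# Step 3: how much does each truth probe's output move when the
# blacklisted direction is ablated upstream?
clean = later_layer(h_early)
patched = later_layer(patch_out(h_early, blacklist_dir))
effects = np.abs((patched - clean) @ truth_probes.T).mean(axis=0)

# Prefer the probe that is least causally downstream of the blacklisted feature.
selected = int(np.argmin(effects))
```

In this toy setup the first probe mostly reads the truth axis and the second mostly reads the intent axis, so the ablation moves the second probe far more. Whether anything like this separation holds in a real network is exactly the open question.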
“Can this be applied to probes” is a crux for me. It sounds like you’re imagining something like:
I was actually imagining a hybrid between probes and features. The actual classifier doesn’t need to be part of a complete decomposition, but to fully analyze its connections, including the recursive case (the connections of its connections), we maybe do want the complete decomposition.
So:
Train a bunch of truthfulness probes regularized to be distinct from each other.
Check feature connections for these probes and select accordingly.
I also think there’s a pretty straightforward way to do this without needing to train a bunch of probes (e.g. train probes to be orthogonal to undesirable stuff or whatever rather than needing to train a bunch).
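One way the "train probes to be orthogonal to undesirable stuff" option could look, as a minimal sketch: project the activations onto the orthogonal complement of the undesirable directions before fitting, so the fitted probe has no component along them. The least-squares probe, the function name `fit_probe_orthogonal_to`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def fit_probe_orthogonal_to(acts, labels, undesirable):
    """Fit a least-squares probe constrained to be orthogonal to the given directions.

    acts: (n, d) activations; labels: (n,) targets;
    undesirable: (k, d) directions the probe must not use.
    """
    X = acts - acts.mean(axis=0)
    # Orthonormal basis (d, k) for the span of the undesirable directions.
    B, _ = np.linalg.qr(np.atleast_2d(undesirable).T)
    # Remove that span from the activations before fitting.
    X_proj = X - (X @ B) @ B.T
    # The minimum-norm least-squares solution lies in the row space of X_proj,
    # which is orthogonal to the removed span.
    w, *_ = np.linalg.lstsq(X_proj, labels - labels.mean(), rcond=None)
    return w

# Synthetic example: the label depends on axes 0 and 1, and axis 1 is "undesirable".
rng = np.random.default_rng(1)
acts = rng.normal(size=(500, 5))
labels = (acts[:, 0] + acts[:, 1] > 0).astype(float)
undesirable = np.array([[0.0, 1.0, 0.0, 0.0, 0.0]])

w = fit_probe_orthogonal_to(acts, labels, undesirable)
```

The probe still picks up the label through axis 0 but cannot use axis 1. This only enforces linear orthogonality to the listed directions, so it says nothing about what the probe's direction is causally connected to.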
As you mentioned, you can probably do this entirely with learned probes, via a mechanism like the one you described (but this is less clean than using a decomposition).
You could also apply amnesic probing, which has somewhat different properties than looking at the decomposition (amnesic probing is where you remove some dimensions so that certain classes can no longer be discriminated, e.g. via LEACE, as we discussed in the measurement tampering paper).
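A rough sketch of the erasure step in amnesic probing, in the spirit of LEACE: whiten the activations, project out the directions that covary with the class labels, then unwhiten. This is written from the general shape of that closed form and should be treated as an assumption, not as the reference implementation; `fit_linear_eraser` and `erase` are hypothetical names.

```python
import numpy as np

def fit_linear_eraser(X, Z, eps=1e-10):
    """Return (mu, A) so that r(x) = x - A (x - mu) has zero cross-covariance with Z."""
    n = len(X)
    Z = np.asarray(Z, dtype=float).reshape(n, -1)
    mu = X.mean(axis=0)
    Xc = X - mu
    Zc = Z - Z.mean(axis=0)
    sigma_xx = Xc.T @ Xc / n
    sigma_xz = Xc.T @ Zc / n  # (d, k)

    # Whitening map W = sigma_xx^(-1/2) and its pseudo-inverse, via eigendecomposition.
    evals, evecs = np.linalg.eigh(sigma_xx)
    keep = evals > eps
    V, lam = evecs[:, keep], evals[keep]
    W = (V / np.sqrt(lam)) @ V.T
    W_pinv = (V * np.sqrt(lam)) @ V.T

    # Orthogonal projector onto the column space of W @ sigma_xz.
    U, s, _ = np.linalg.svd(W @ sigma_xz, full_matrices=False)
    U = U[:, s > eps]
    P = U @ U.T

    return mu, W_pinv @ P @ W

def erase(X, mu, A):
    """Apply the eraser to row-stacked activations."""
    return X - (X - mu) @ A.T

def cross_cov(X, z):
    """Cross-covariance between each activation dimension and the label."""
    return (X - X.mean(axis=0)).T @ (z - z.mean()) / len(X)

# Synthetic example: a binary class that is linearly readable from the activations.
rng = np.random.default_rng(2)
z = rng.integers(0, 2, 1000).astype(float)
X = rng.normal(size=(1000, 4))
X[:, 0] += 2.0 * z
X[:, 1] += 0.5 * z

mu, A = fit_linear_eraser(X, z)
X_erased = erase(X, mu, A)
```

After erasure the cross-covariance with the label is zero up to floating-point error, so a least-squares linear probe trained on `X_erased` cannot do better than a constant prediction. Nonlinear classifiers may still recover the class, which is one way this differs from inspecting connections in a decomposition.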
(TBC, it doesn’t seem that useful to argue about “what is mech interp”; I think the more central question is “how likely is it that all this prior work and ideas related to mech interp are useful”. This is a strictly higher bar, but we should apply the same adjustment for work in all other areas, etc.)
More generally, it seems good to be careful about thinking through questions like “does using X have a principled reason to be better than applying the ‘default’ approach (e.g. training a probe)”. It’s good to do this regardless of whether we actually use the default approach, so we know where the juice is coming from.
In the case of mech interp style decompositions, I’m pretty skeptical that there is any juice in finding your classifier by doing something like selecting over components rather than training a probe. But there could theoretically be juice in trying to understand how a probe works by looking at its connections (and the connections of its connections, etc.).