Here’s a hope which (I think) isn’t on the list. It’s somewhat related to the hope that Habryka raised, though a bit different and more specific.
Approach: maybe model internals overtly represent qualities which distinguish desired vs. undesired cognition, but probing is insufficient for some reason (e.g. because we don’t have good enough oversight to produce labeled data to train a probe with).
I don’t think this exact thing is directly mentioned by my list. Thanks for the addition.
Let me try to state something which captures most of that approach to make sure I understand:
Approach: Maybe we can find some decomposition of model internals[1] such that all or most components directly related to some particular aspect of cognition are overtly obvious and there are also a small number of such components. Then, maybe we can analyze, edit, or build a classifier using these components in cases where baseline training techniques (e.g. probing) are insufficient.
Then, it seems like there are two cases where this is useful:
Merely capturing all (or most) of the components like this is directly useful for ablation or building a classifier (e.g. because we want an ALL over the truth notions in the model like in your example, or because we want to do precise removal of some behavior or capability). This requires that our list of components can be sufficiently small such that using all of them doesn’t bite too hard (e.g. low enough FPR) and that this list of components includes enough of the action that using all of them is sufficiently reliable (e.g. high enough TPR). (See the sketch after this list.)
Even without detailed understanding, and potentially without capturing “all” components, we can further identify components by looking at their connections or doing similar intervention experiments at a high level. Then, we can use our analysis of these components to do something useful (e.g. determine which components correspond to humans merely thinking something is true).
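As a rough sketch of the first case (all names, thresholds, and shapes here are hypothetical): given the activations and directions of components we’ve labeled as truth-related, the “ALL over truth notions” classifier and the precise-removal ablation might look something like this:

```python
import numpy as np

def all_components_agree(component_acts: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
    """Conservative classifier: only accept a statement as 'believed true' when
    every identified truth-like component is active (an ALL over truth notions).
    component_acts: (n_examples, n_components); thresholds: (n_components,)."""
    return (component_acts > thresholds).all(axis=-1)

def ablate_components(acts: np.ndarray, component_dirs: np.ndarray) -> np.ndarray:
    """Precise-removal sketch: project activations off the subspace spanned by the
    identified component directions. acts: (n, d_model); component_dirs: (k, d_model)."""
    q, _ = np.linalg.qr(component_dirs.T)   # orthonormal basis for the component directions
    return acts - (acts @ q) @ q.T          # remove that subspace from the activations
```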
This impact story seems overall somewhat reasonable to me. It’s worth noting that I can’t imagine this resulting in very ambitious applications, though the reduction in doom could still be substantial. My main concerns are:
Both stories depend on our decomposition resulting in components which it is possible to recognize and label to quite a high level of specificity despite not building much understanding of the exact behavior. This seems like a strong property, and it seems unlikely we’d be able to find an unsupervised decomposition which consistently has this property for the cases we care about. (I don’t see why sparsity would have this property to the extent we need, but it seems at least a bit plausible and it’s probably better than the default.)
More generally, it feels like this story is supposing some level of “magic” on the part of our decomposition. If we don’t understand much of the training compute then there will be decompositions which look to us like a good enough decomposition while hiding arbitrary stuff in the residual between our understanding and what’s going on. It’s possible that in practice, some unsupervised decomposition (e.g. SAE) cleanly breaks things apart into components which are easy to label while simultaneously these labels are quite specific and quite accurate. But why would this be true? (Maybe forthcoming research will demonstrate this, but my inside view thinks this is unlikely.)
If we want to look at connections, then imperfect understanding will probably bite pretty hard, particularly as the effect size of the connection gets smaller and smaller (either due to path length >1 or just there being many things which are directly connected but have a small effect).
If you thought that current fundamental science in mech interp was close to doing this, I think I’d probably be excited about building test bed(s) where you think this sort of approach could be usefully applied and which aren’t trivially solved by other methods. If you don’t think the fundamentals of mech interp are close, it would be interesting to understand what you think will change to make this story viable in the future (better decompositions? something else?).
Let me try to state something which captures most of that approach to make sure I understand:
Everything you wrote describing the hope looks right to me.
It’s worth noting that I can’t imagine this resulting in very ambitious applications, though the reduction in doom could still be substantial.
To be clear, what does “ambitious” mean here? Does it mean “producing a large degree of understanding?”
If we don’t understand much of the training compute then there will be decompositions which look to us like a good enough decomposition while hiding arbitrary stuff in the residual between our understanding and what’s going on.
[...]
If we want to look at connections, then imperfect understanding will probably bite pretty hard, particularly as the effect size of the connection gets smaller and smaller (either due to path length >1 or just there being many things which are directly connected but have a small effect).
These seem like important intuitions, but I’m not sure I understand or share them. Suppose I identify a sentiment feature. I agree there’s a lot of room for variation in what precise notion of sentiment the model is using, and there are lots of different ways this sentiment feature could be interacting with the network that are difficult to understand. But maybe I don’t really care about that, I just want a classifier for something which is close enough to my internal notion of sentiment.
Just so with truth: there are probably lots of subtly different notions of truth, but for the application of “detecting whether my AI believes statement X to be true” I don’t care about that. I do care about the difference between “true” and “humans think is true,” but that’s a big difference that I can understand (even if I can’t produce examples), and where I can articulate the sorts of cognition which probably should/shouldn’t be involved in it.
What’s the specific way you imagine this failing? Some options:
None of the features we identify really seem to correspond to something resembling our intuitive notion of “truth” (e.g. because they frequently activate on unrelated concepts).
We get a bunch of features that look like truth, but can’t really tell what goes into computing them.
We get a bunch of features that look like truth and we have some vague sense of how they’re computed, but they don’t seem differentiated in how “sketchy” these computational graphs look: either they all seem to rely on social reasoning or they all don’t seem to.
Maybe a better question would be—why didn’t these issues (lack of robust explanation) get in the way of the Steinhardt paper I linked? They were in fact able to execute something like the plan I sketch here: use vague understanding to guess which model components attend to features which are spuriously correlated with the thing you want, then use the rest of the model as an improved classifier for the thing you want.
What’s the specific way you imagine this failing? Some options:
My proposed list (which borrows from your list):
We find a large number (e.g. 30,000) of features which all sorta look somewhat like truth, though none exactly look like truth. Further analysis doesn’t make it clear which of these are “real” or “actually truth”. Some features look more like truth and some look a bit less like truth, but broadly there is a smooth fall-off in how “truth-like” the features look, such that there isn’t a small set of discrete truth features. No single feature both looks like truth and correlates perfectly with our labeled datasets.
We get a bunch of features that look (at least somewhat) like truth and we have some vague sense of how they’re computed, but they don’t seem differentiated in how “sketchy” these computational graphs look: either they all seem to rely on social reasoning or they all don’t seem to.
We get a bunch of features that look like truth, but looking at what they connect to doesn’t make much sense and just makes us more confused overall. There are many diffuse connections and it’s unclear what they do.
Everything looks fine and we apply the method, but it turns out there isn’t any feature we’ve identified which corresponds to “actual truth” as this isn’t very salient for the model in the regime we are interested in.
These seem like important intuitions, but I’m not sure I understand or share them. Suppose I identify a sentiment feature. I agree there’s a lot of room for variation in what precise notion of sentiment the model is using, and there are lots of different ways this sentiment feature could be interacting with the network that are difficult to understand. But maybe I don’t really care about that, I just want a classifier for something which is close enough to my internal notion of sentiment.
Sure, but then why not just train a probe? If we don’t care about much precision, what goes wrong with the probe approach?
It’s possible to improve on just a probe trained on the data we can construct, of course, but you’ll need non-trivial precision to do so.
The key question here is “why does selecting a feature work while just naively training a probe fails”.
We have to be getting some additional bits from our selection of the feature.
In more detail, let’s suppose we use the following process:
Select all features which individually get loss < X on our training set. Choose X such that if we get that loss on our training set, we’re only worried about generalization error rather than errors which show up on the training set (~equivalently, we’re well into diminishing returns on loss).
Try to pick among these features or combine these features to produce a better classifier (see the sketch below).
Then there are two issues:
Maybe there isn’t any feature which gets < X loss. (We can relax our requirements, but we wanted to compete with probes!)
When we select among these features, do we get a non-trivial number of “bits” of improvement? Is that enough bits to achieve what we wanted? I’m somewhat skeptical we can get much if any improvement here. (Of course, note that this doesn’t mean the idea has no promise!)
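Here’s a minimal sketch of that two-step process under simplifying assumptions: `feature_acts` is a hypothetical (n_examples, n_features) matrix of decomposition-component activations, `labels` is our (imperfect) labeled training set, and `max_loss` plays the role of X above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def select_candidate_features(feature_acts, labels, max_loss):
    """Step 1: keep every feature which individually achieves loss < X on the training set."""
    candidates = []
    for i in range(feature_acts.shape[1]):
        clf = LogisticRegression().fit(feature_acts[:, [i]], labels)
        loss = log_loss(labels, clf.predict_proba(feature_acts[:, [i]]))
        if loss < max_loss:
            candidates.append(i)
    return candidates

def combine_candidates(feature_acts, labels, candidates):
    """Step 2: pick among / combine the surviving features into one classifier.
    (This is where any extra 'bits' beyond a plain probe would have to come from.)"""
    return LogisticRegression().fit(feature_acts[:, candidates], labels)
```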
IMO there are two kinda separate (and separable) things going on:
Maybe, for the decomposition you picked, features are a good prior for classifiers for some reason. (I don’t really see why this would be true for currently used decompositions.) Why would selecting features based on performing well on our dataset be better than just training a probe?
Maybe looking at the connections of your classifier (what earlier features it connects to and what these connect to) and applying selection to the classifier based on the connections will be good. This can totally be applied to probe based classifiers. (Maybe there is some reason why looking at connections will be especially good for classifiers based on picking a feature but not training a probe, but if so, why?)
Maybe looking at the connections of your classifier (what earlier features it connects to and what these connect to) and applying selection to the classifier based on the connections will be good. This can totally be applied to probes. (Maybe there is some reason why looking at connections will be especially good for features but not probes, but if so, why?)
“Can this be applied to probes” is a crux for me. It sounds like you’re imagining something like:
Train a bunch of truthfulness probes regularized to be distinct from each other.
Train a bunch of probes for “blacklisted” features which we don’t think should be associated with truth (e.g. social reasoning, intent to lie, etc.).
(Unsure about this step.) Check which truth directions are causally downstream of blacklisted feature directions (with patching experiments?). Use that to discriminate among the probes.
Is that right?
This is not an option I had considered, and it would be very exciting to me if it worked. I have some vague intuition that this should all go better when you are working with features (e.g. because the causal dependencies among the features should be sparse), but I would definitely need to think about that position more.
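A minimal sketch of the first two steps of this recipe (the patching step is omitted), assuming we already have residual-stream activations, imperfect truth labels, and trained unit directions for the blacklisted concepts; all names and hyperparameters are hypothetical:

```python
import torch

def train_distinct_truth_probes(acts, truth_labels, blacklist_dirs,
                                n_probes=8, reg_weight=1.0, n_steps=2000, lr=1e-2):
    """acts: (n, d_model) activations; truth_labels: (n,) floats in {0, 1};
    blacklist_dirs: (n_blacklist, d_model) unit directions for blacklisted concepts."""
    d_model = acts.shape[1]
    probes = torch.nn.Parameter(torch.randn(n_probes, d_model) * 0.01)
    opt = torch.optim.Adam([probes], lr=lr)
    for _ in range(n_steps):
        logits = acts @ probes.T  # (n, n_probes): each probe's truth prediction
        task_loss = torch.nn.functional.binary_cross_entropy_with_logits(
            logits, truth_labels[:, None].expand_as(logits))
        unit = probes / probes.norm(dim=-1, keepdim=True)
        # Penalize similarity between the truth probes (keep them distinct)
        # and similarity to the blacklisted directions.
        probe_sim = (unit @ unit.T - torch.eye(n_probes)).pow(2).sum()
        blacklist_sim = (unit @ blacklist_dirs.T).pow(2).sum()
        loss = task_loss + reg_weight * (probe_sim + blacklist_sim)
        opt.zero_grad(); loss.backward(); opt.step()
    return probes.detach()
```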
“Can this be applied to probes” is a crux for me. It sounds like you’re imagining something like:
I was actually imagining a hybrid between probes and features. The actual classifier doesn’t need to be part of a complete decomposition, but for the connections we maybe do want the complete decomposition, to fully analyze connections including the recursive case.
So:
Train a bunch of truthfulness probes regularized to be distinct from each other.
Check feature connections for these probes and select accordingly.
I also think there’s a pretty straightforward way to do this without needing to train a bunch of probes (e.g. train probes to be orthogonal to undesirable stuff or whatever rather than needing to train a bunch).
As you mentioned, you can probably do this entirely with learned probes, via a mechanism like the one you described (but this is less clean than using a decomposition).
You could also apply amnesic probing, which has somewhat different properties than looking at the decomposition (amnesic probing is where you remove some dimensions, e.g. via LEACE, so that certain classes can no longer be discriminated, as we discussed in the measurement tampering paper).
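As a simplified stand-in for this kind of erasure (not the exact LEACE algorithm, which gives stronger guarantees for linear predictors), assuming we have activations plus labels for the classes we want a probe to be unable to use:

```python
import numpy as np

def erase_linear_direction(acts: np.ndarray, class_labels: np.ndarray) -> np.ndarray:
    """Estimate the mean-difference direction separating the two classes and
    project it out of the activations; a probe would then be trained on the result."""
    direction = acts[class_labels == 1].mean(axis=0) - acts[class_labels == 0].mean(axis=0)
    direction /= np.linalg.norm(direction)
    return acts - np.outer(acts @ direction, direction)
```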
(TBC, it doesn’t seem that useful to argue about “what is mech interp”; I think the more central question is “how likely is it that all this prior work and these ideas related to mech interp are useful”. This is a strictly higher bar, but we should apply the same adjustment for work in all other areas, etc.)
More generally, it seems good to be careful about thinking through questions like “does using X have a principled reason to be better than applying the ‘default’ approach (e.g. training a probe)?”. It’s good to do this regardless of whether we actually use the default approach, so we know where the juice is coming from.
In the case of mech interp style decompositions, I’m pretty skeptical that there is any juice in finding your classifier by doing something like selecting over components rather than training a probe. But, there could theoretically be juice in trying to understand how a probe works by looking at its connections (and the connections of its connections etc).
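To make “looking at its connections” a bit more concrete, here is one hypothetical first-order version: score each earlier SAE feature by how strongly its decoder direction projects onto the probe direction. This only captures the direct path through the residual stream and ignores everything the intervening layers do:

```python
import numpy as np

def direct_path_connections(probe_dir: np.ndarray, sae_decoder: np.ndarray, top_k: int = 20):
    """probe_dir: (d_model,) probe direction; sae_decoder: (n_features, d_model) decoder
    rows of an SAE at an earlier point in the residual stream. Returns the top_k
    features by direct-path contribution to the probe output."""
    strengths = sae_decoder @ probe_dir
    top = np.argsort(-np.abs(strengths))[:top_k]
    return [(int(i), float(strengths[i])) for i in top]
```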
Sure, but then why not just train a probe? If we don’t care about much precision, what goes wrong with the probe approach?
Here’s a reasonable example where naively training a probe fails. The model lies if any of N features is “true”. One of the features is almost always activated at the same time as some others, such that in the training set it never solely determines whether the model lies.
Then, a probe trained on the activations may not pick up on that feature. Whereas if we can look at model weights, we can see that this feature also matters, and include it in our lying classifier.
This particular case can also be solved by adversarially attacking the probe though.
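A toy version of this failure mode (purely illustrative, with made-up data): the “model” lies when any of three latent features fires, but in training the third feature only ever fires together with the first, so a probe trained on the activations can fit the training set while ignoring it:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
train = rng.integers(0, 2, size=(n, 3)).astype(float)
train[:, 2] *= train[:, 0]                    # feature 2 never fires without feature 0
lies_train = train.any(axis=1).astype(int)    # the model lies if any feature fires

probe = LogisticRegression().fit(train, lies_train)

test = np.array([[0.0, 0.0, 1.0]])            # at test time, feature 2 fires alone
print(probe.predict(test))                    # the probe may well miss this lie
```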
Maybe a better question would be—why didn’t these issues (lack of robust explanation) get in the way of the Steinhardt paper I linked? They were in fact able to execute something like the plan I sketch here: use vague understanding to guess which model components attend to features which are spuriously correlated with the thing you want, then use the rest of the model as an improved classifier for the thing you want.
My guess is that the classification task for waterbirds is sufficiently easy that butchering a substantial part of the model is fine. It won’t usually be viable to ablate everything that looks similar to an undesirable property. In some cases, this might be fine due to redundancy, but if there is heavy redundancy, I also expect that you’ve missed some stuff if you just look for components which look similar to a given target.
Not super high confidence overall.
Edit: it also seems likely to me that there is a more principled and simpler approach like using LEACE which works just as well or better (but I’m unsure and I’m not familiar with that paper or the literature here).
[1] Either a “default” decomposition like neurons/attention heads or a “non-default” decomposition like a sparse autoencoder.
(On “ambitious”: I mean reducing doom by a large amount for very powerful models.)