I feel pretty confused, but my overall view is that many of the routes I currently feel are most promising don’t require solving superposition.
It seems quite plausible there might be ways to solve mechanistic interpretability which frame things differently. However, I presently expect that they’ll need to do something which is equivalent to solving superposition, even if they don’t solve it explicitly. (I don’t fully understand your perspective, so it’s possible I’m misunderstanding something though!)
To give a concrete example (although this is easier than what I actually envision), let’s consider this model from Adam Jermyn’s repeated data extension of our paper:
If you want to know whether the model is “generalizing” rather than “triggering a special case” you need to distinguish the “single data point feature” direction from normal linear combinations of features. Now, it happens to be the case that the specific geometry of the 2D case we’re visualizing here means that isn’t too hard. But we need to solve this in general. (I’m imagining this as a proxy for a model which has one “special case backdoor/evil feature” in superposition with lots of benign features. We need to know if the “backdoor/evil feature” activated rather than an unusual combination of normal features.)
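To make the ambiguity concrete, here is a minimal sketch under an assumed geometry. The directions and the firing probability are hypothetical choices for illustration; they are not taken from the model above. Two benign features sit on the axes and one special feature sits on the diagonal, so the same activation vector can be produced either by the special feature alone or by an unusual co-firing of the benign ones.

```python
import math

# Hypothetical 2D geometry (an assumption for illustration only):
# two benign features on the axes, one "special case" feature on the diagonal.
BENIGN = [(1.0, 0.0), (0.0, 1.0)]
SPECIAL = (1.0 / math.sqrt(2), 1.0 / math.sqrt(2))


def activation(coeffs, directions):
    """Return the sum of coefficient-weighted feature directions."""
    return tuple(
        sum(c * d[j] for c, d in zip(coeffs, directions))
        for j in range(2)
    )


# Reading 1: only the special feature fired, with magnitude 1.
special_only = activation([1.0], [SPECIAL])

# Reading 2: both benign features fired together, each with magnitude 1/sqrt(2).
benign_mix = activation([1.0 / math.sqrt(2), 1.0 / math.sqrt(2)], BENIGN)

# Assumed firing rate: if benign features fire independently with probability
# P_FIRE, then both firing at once is rare, but not impossible.
P_FIRE = 0.01
p_cofire = P_FIRE ** 2

print("special only:", special_only)
print("benign mix:  ", benign_mix)
print("probability both benign features fire:", p_cofire)
```

The two activations coincide, so the activation vector alone cannot say which reading is right. Some extra structure (sparsity, downstream behavior, or something else) has to break the tie.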
Of course, there may be ways to distinguish this without the language of features and superposition. Maybe those are even better framings! But if you can, it seems to me that you should then be able to backtrack that solution into a sparse coding solution (if you know whether a feature has fired, it’s now easy to learn the true sparse code!). So it seems to me that you end up having done something equivalent.
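As a rough illustration of the parenthetical, here is a sketch of recovering feature directions once you are handed an oracle for whether each feature fired. Everything in it is an assumption made for the toy: five features evenly spaced in 2D, independent firing, magnitudes uniform on [0.5, 1], and the helper names `make_directions`, `sample_activations`, and `recover_direction`. Under independence, the mean activation when feature k fired minus the mean when it did not points along feature k’s direction, so a simple difference of means recovers it up to scale.

```python
import math
import random


def make_directions(n_features):
    """Unit vectors evenly spaced on the circle (more features than dimensions)."""
    return [
        (math.cos(2 * math.pi * i / n_features),
         math.sin(2 * math.pi * i / n_features))
        for i in range(n_features)
    ]


def sample_activations(n_samples, directions, p_fire, rng):
    """Sample (activation, fired_flags) pairs with independent sparse features."""
    samples = []
    for _ in range(n_samples):
        fired = [rng.random() < p_fire for _ in directions]
        x = [0.0, 0.0]
        for is_on, (dx, dy) in zip(fired, directions):
            if is_on:
                magnitude = rng.uniform(0.5, 1.0)
                x[0] += magnitude * dx
                x[1] += magnitude * dy
        samples.append((x, fired))
    return samples


def recover_direction(samples, k):
    """Estimate feature k's direction from the oracle labels via a difference of means."""
    sums = {True: [0.0, 0.0], False: [0.0, 0.0]}
    counts = {True: 0, False: 0}
    for x, fired in samples:
        key = fired[k]
        sums[key][0] += x[0]
        sums[key][1] += x[1]
        counts[key] += 1
    diff = [
        sums[True][j] / counts[True] - sums[False][j] / counts[False]
        for j in range(2)
    ]
    norm = math.hypot(diff[0], diff[1])
    return (diff[0] / norm, diff[1] / norm)


rng = random.Random(0)
directions = make_directions(5)
samples = sample_activations(20000, directions, p_fire=0.1, rng=rng)
for k, true_dir in enumerate(directions):
    est = recover_direction(samples, k)
    cosine = est[0] * true_dir[0] + est[1] * true_dir[1]
    print(f"feature {k}: cosine(estimated, true) = {cosine:.4f}")
```

This only covers the easy direction of the argument: the oracle is assumed, and correlated features would bias the difference of means. It says nothing about how to obtain the oracle in the first place.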
Again, all of these comments are without really understanding your view of how these problems might be solved. It’s very possible I’m missing something.
Overall, my view is that we will need to solve the optimization problem of ‘what properties of the activation distribution are sufficient to explain how the model behaves’, but this solution can be represented somewhat implicitly and I don’t currently see how you’d transition it into a solution to superposition in the sense I think you mean.
I’ll try to explain why I have this view, but it seems likely I’ll fail (at least partially because of my own confusions).
Quickly, some background so we’re hopefully on the same page (or at least closer):
I’m imagining the setting described here. Note that anomalies are detected with respect to a distribution D (for a new datapoint x∗)! So, we need a distribution where we’re happy with the reason why the model works.
This setting is restrictive in various ways (e.g., see here), but I think that practical and robust solutions would be a large advancement regardless (extra points for an approach which has no on-paper counterexamples).
Now the approaches to anomaly detection I typically think about work roughly like this:
Try to find an ‘explanation’/‘hypothesis’ for the variance on D which doesn’t also ‘explain’ deviation from the mean on x∗. (We’re worst-casing over explanations.)
If we succeed, then x∗ is anomalous. Otherwise, it’s considered non-anomalous.
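The two steps above can be sketched in a deliberately crude toy. This is not causal scrubbing or heuristic arguments. As a stand-in, an ‘explanation’ here is just a subset of coordinate axes, it ‘explains’ the variance on D if those axes carry at least `var_threshold` of the total variance, and it ‘explains’ x∗ if they carry at least `dev_threshold` of x∗’s squared deviation from the mean. The hypothesis class, both thresholds, and all function names are assumptions made for illustration.

```python
import itertools
import random


def mean(points):
    """Per-axis mean of a list of equal-length tuples."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def variance_per_axis(points, mu):
    """Per-axis variance around the given mean."""
    n = len(points)
    return [sum((p[j] - mu[j]) ** 2 for p in points) / n for j in range(len(mu))]


def fraction_explained(per_axis, axes):
    """Share of the total carried by the chosen axes (1.0 if there is nothing to explain)."""
    total = sum(per_axis)
    if total == 0.0:
        return 1.0
    return sum(per_axis[j] for j in axes) / total


def is_anomalous(reference, x_star, var_threshold=0.99, dev_threshold=0.5):
    """Search every axis subset for one that explains D but not x_star."""
    mu = mean(reference)
    var = variance_per_axis(reference, mu)
    dev = [(x_star[j] - mu[j]) ** 2 for j in range(len(mu))]
    dim = len(mu)
    for size in range(1, dim + 1):
        for axes in itertools.combinations(range(dim), size):
            explains_d = fraction_explained(var, axes) >= var_threshold
            explains_x = fraction_explained(dev, axes) >= dev_threshold
            if explains_d and not explains_x:
                return True  # found an explanation of D that leaves x_star unexplained
    return False


# Assumed reference distribution D: axes 0 and 1 vary, axis 2 is almost constant.
rng = random.Random(0)
reference = [
    (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0), rng.gauss(0.0, 0.01))
    for _ in range(2000)
]

print(is_anomalous(reference, (1.0, -0.5, 0.0)))  # deviates along the usual axes
print(is_anomalous(reference, (0.1, 0.1, 3.0)))   # deviates along the unused axis
```

With these assumed settings the first point is not flagged and the second is, since axes 0 and 1 account for nearly all the variance on D but almost none of the second point’s deviation. The exhaustive search over subsets is the toy version of worst-casing over explanations, and it is exponential in the number of axes.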
Note that I’m using scare quotes around explanation/hypothesis: I’m referring to an object which matches some of the intuitive properties of explanations and/or hypotheses, but it’s not clear exactly which properties we will and won’t need.
This stated approach is very inefficient (it requires learning an explanation for each new datum x∗!), but various optimizations are plausible (e.g., having a minimal base explanation for D which we can quickly finetune for each datum x∗).
I’m typically thinking about anomaly detection schemes which use approaches similar to causal scrubbing, though Paul, Mark, and other people at ARC typically think about heuristic arguments (which have quite different properties).
Now back to superposition.
A working solution must let you know if atypical features have fired, but not which atypical features or what direction those atypical features use. Beyond this, we might hope that the ‘explanation’ for the variance on D can tell us which directions the model uses for representing important information. This will sometimes be true, but I think this is probably false in general, though I’m having trouble articulating my intuitions for this. Minimally, I think it’s very unclear how you would extract this information if you use causal-scrubbing-based approaches.
In a future comment, I plan on walking through an example which is similar to how we plan on tackling anomaly detection with causal scrubbing, but I need to go get lunch.