Overall, my view is that we will need to solve the optimization problem of ‘what properties of the activation distribution are sufficient to explain how the model behaves’, but this solution can be represented somewhat implicitly, and I don’t currently see how you’d turn it into a solution to superposition in the sense I think you mean.
I’ll try to explain why I have this view, but it seems likely I’ll fail (at least partially because of my own confusions).
Quickly, some background so we’re hopefully on the same page (or at least closer):
I’m imagining the setting described here. Note that anomalies are detected with respect to a distribution D (for a new datapoint x∗). So, we need a distribution where we’re happy with the reason why the model works.
This setting is restrictive in various ways (e.g., see here), but I think that practical and robust solutions would be a large advance regardless (extra points for an approach which doesn’t have on-paper counterexamples).
Now the approaches to anomaly detection I typically think about work roughly like this:
Try to find an ‘explanation’/‘hypothesis’ for the variance on D which doesn’t also ‘explain’ the deviation from the mean on x∗. (We’re worst-casing over explanations.)
If we succeed, then x∗ is anomalous. Otherwise, it’s considered non-anomalous.
Note that I’m using scare quotes around explanation/hypothesis: I’m referring to an object which matches some of the intuitive properties of explanations and/or hypotheses, but it’s not clear exactly which properties we will and won’t need.
This stated approach is very inefficient (it requires learning an explanation for each new datum x∗!), but various optimizations are plausible (e.g., having a minimal base explanation for D which we can quickly fine-tune for each datum x∗); a toy sketch of the overall procedure follows below.
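To make the shape of this recipe concrete, here is a numerical toy (to be clear: this is not causal scrubbing or heuristic arguments, and nothing in it is a real proposal; the subspace parameterization of ‘explanations’, the is_anomalous function, and every threshold are made up for illustration). Pretend an ‘explanation’ is a k-dimensional subspace of activation space: it ‘explains the variance on D’ if projecting onto it retains most of D’s variance, and it ‘explains x∗’s deviation’ if x∗’s deviation from the mean mostly lies inside it. Worst-casing over explanations is then a search for a subspace that accounts for D but not for x∗.

```python
# Toy stand-in for the 'worst-case over explanations' recipe above.
# NOT causal scrubbing or heuristic arguments -- the subspace notion of
# 'explanation' and all thresholds are invented for this illustration.
import numpy as np

def is_anomalous(D, x_star, k=1, var_threshold=0.9, dev_threshold=0.5,
                 n_tries=2000, seed=0):
    """Search (by crude random sampling) for an 'explanation' (a k-dim subspace)
    that accounts for the variance on D but not for x_star's deviation from the mean."""
    rng = np.random.default_rng(seed)
    mu = D.mean(axis=0)
    Dc = D - mu
    total_var = (Dc ** 2).sum()
    dev = x_star - mu
    dev_norm2 = (dev ** 2).sum() + 1e-12

    for _ in range(n_tries):
        # Sample a random k-dimensional subspace as a candidate 'explanation'.
        Q, _ = np.linalg.qr(rng.standard_normal((D.shape[1], k)))
        var_explained = ((Dc @ Q) ** 2).sum() / total_var   # how much of D's variance it covers
        dev_explained = ((dev @ Q) ** 2).sum() / dev_norm2  # how much of x*'s deviation it covers
        if var_explained >= var_threshold and dev_explained <= dev_threshold:
            return True   # found an explanation of D that doesn't explain x* -> anomalous
    return False          # every explanation of D also explained x* -> non-anomalous

# D varies (almost) only along the first axis; x* can deviate on or off that axis.
rng = np.random.default_rng(1)
D = rng.standard_normal((500, 1)) @ np.array([[1.0, 0.0, 0.0]]) \
    + 0.01 * rng.standard_normal((500, 3))
print(is_anomalous(D, np.array([0.0, 3.0, 0.0])))  # True: deviation is off D's axis of variance
print(is_anomalous(D, np.array([3.0, 0.0, 0.0])))  # False: deviation lies along D's variance
```

Note that the toy shares the inefficiency mentioned above: the search runs from scratch for every new x∗, which is exactly what the ‘minimal base explanation plus fine-tuning’ optimization would amortize.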
I’m typically thinking about anomaly detection schemes which use approaches similar to causal scrubbing, though Paul, Mark, and other people at ARC typically think about heuristic arguments (which have quite different properties).
Now back to superposition.
A working solution must let you know whether atypical features have fired, but it need not tell you which atypical features fired or what directions those features use. Beyond this, we might hope that the ‘explanation’ for the variance on D can tell us which directions the model uses for representing important information. This will sometimes be true, but I think it’s probably false in general, though I’m having trouble articulating my intuitions for this. Minimally, I think it’s very unclear how you would extract this information if you use causal-scrubbing-based approaches.
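To gesture at what I mean by the explanation staying implicit, here is another toy (again made up for this comment, not anyone’s actual method): put more feature directions than dimensions into activation space (superposition), let only some of them ever fire on D, and use nothing but D’s empirical mean and covariance as the ‘explanation’. That summary can flag that some feature which never fires on D has fired, without ever naming which directions the model uses for its features.

```python
# Toy: a bare summary of the activation distribution (mean + covariance) flags
# atypical feature firings without identifying any feature direction.
# Hypothetical illustration only -- not causal scrubbing, not a real proposal.
import numpy as np

rng = np.random.default_rng(0)
d, n_feats, n = 10, 20, 5000

# Random (non-orthogonal) feature directions: more features than dimensions.
feat_dirs = rng.standard_normal((n_feats, d))
feat_dirs /= np.linalg.norm(feat_dirs, axis=1, keepdims=True)

# On D, only the first 5 features ever fire (sparsely), plus a little noise.
fires = rng.random((n, n_feats)) < 0.1
fires[:, 5:] = False
D = (fires * rng.random((n, n_feats))) @ feat_dirs + 0.01 * rng.standard_normal((n, d))

# The 'explanation' is just D's mean and covariance -- it never names a feature direction.
mu = D.mean(axis=0)
prec = np.linalg.inv(np.cov(D, rowvar=False))

def anomaly_score(x):
    """Mahalanobis-style distance of x from D."""
    delta = x - mu
    return float(delta @ prec @ delta)

typical = feat_dirs[3] * 0.8    # a feature that does fire on D
atypical = feat_dirs[17] * 0.8  # a feature that never fires on D
print(anomaly_score(typical), anomaly_score(atypical))  # the atypical firing scores far higher
```

This toy obviously lacks the worst-case character of the scheme above; it’s only meant to show that ‘detects that atypical features fired’ doesn’t require ‘recovers which features or which directions’.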
In a future comment, I plan on walking through an example which is similar to how we plan on tackling anomaly detection with causal scrubbing, but I need to go get lunch.