Thanks for the comment Lawrence, I appreciate it!
I agree this doesn’t distinguish superposition vs no superposition at all; I was thinking more about the “error correction” aspect of MCIS (and just assuming superposition to be true). But I’m excited for the SAE application too, we’ve got some experiments in the pipeline!
Your “Correct behaviour” point sounds reasonable, but I feel like it’s not an explanation? I would have the same intuitive expectation, but that doesn’t explain how the model manages to not be sensitive. Explanations I can think of, in increasing order of probability:
Story 0: Perturbations change the activations and the logprobs, but the answer doesn’t change because the logprob margin was large to begin with. I don’t think the KL divergence would behave like that, though.
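As a rough toy check of what I mean (made-up logits, nothing from our actual setup): even when the argmax answer survives thanks to a large margin, the KL divergence still picks up the shift in the rest of the distribution.

```python
import torch
import torch.nn.functional as F

# Two logit vectors with the same argmax, but the runner-up gains a lot of probability.
clean_logits = torch.tensor([5.0, 0.0, 0.0, 0.0])
perturbed_logits = torch.tensor([5.0, 4.0, 0.0, 0.0])

clean_logprobs = F.log_softmax(clean_logits, dim=-1)
perturbed_logprobs = F.log_softmax(perturbed_logits, dim=-1)

# F.kl_div(input, target) with input = log-probs and target = probs computes
# KL(target || input), i.e. KL(clean || perturbed) here.
kl = F.kl_div(perturbed_logprobs, clean_logprobs.exp(), reduction="sum")

print(clean_logits.argmax() == perturbed_logits.argmax())  # True: the "answer" is unchanged
print(kl)  # roughly 0.28 nats: the KL metric clearly registers the perturbation
```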
Story 1: Perturbations do change the activations but the difference in the logprobs is small due to layer norm, unembed, or softmax shenanigans.
We did a test experiment where we perturbed the 12th layer rather than the 2nd layer, and the difference between real-other and random disappeared. So I don’t think it’s a weird effect from how activations get converted into outputs.
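For concreteness, the kind of perturbation test I have in mind looks roughly like the sketch below. This is not our actual code: I’m assuming a TransformerLens-style setup on gpt2-small (blocks indexed 0–11, so block 11 is the last one), isotropic Gaussian noise of a fixed norm, and KL at the final position; the real experiments differ in the details.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog")

def make_perturb_hook(eps):
    # Add Gaussian noise of norm eps to the residual stream at this hook point.
    def hook(resid, hook):
        noise = torch.randn_like(resid)
        return resid + eps * noise / noise.norm()
    return hook

def kl_after_perturbation(layer, eps=5.0):
    clean_logits = model(tokens)
    perturbed_logits = model.run_with_hooks(
        tokens,
        fwd_hooks=[(f"blocks.{layer}.hook_resid_pre", make_perturb_hook(eps))],
    )
    clean_lp = clean_logits[0, -1].log_softmax(-1)
    pert_lp = perturbed_logits[0, -1].log_softmax(-1)
    # KL(clean || perturbed) at the final position
    return (clean_lp.exp() * (clean_lp - pert_lp)).sum()

print(kl_after_perturbation(layer=2))   # perturb an early layer
print(kl_after_perturbation(layer=11))  # perturb the last block instead
```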
Story 2: Perturbations in a lower layer cause less perturbation in later layers if the model is on-distribution (+ similar story for sensitivity).
This is what the L2-metric plots (right panel) suggest, and also what I understand your story to be.
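Concretely, by the L2 metric I mean something like the following (again just an illustrative sketch under the same TransformerLens-style assumptions; the layer indices and perturbation norm are arbitrary): perturb the residual stream at an early layer and check how much of that perturbation is left at a later layer.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog")

EARLY = "blocks.2.hook_resid_pre"   # where the perturbation goes in
LATE = "blocks.10.hook_resid_pre"   # where we measure its downstream size
EPS = 5.0                           # arbitrary perturbation norm

def late_resid(perturb_hook=None):
    """Run the model and grab the residual stream at LATE, optionally perturbing EARLY."""
    stored = {}
    def store(resid, hook):
        stored["resid"] = resid.detach().clone()
        return resid
    hooks = [(LATE, store)]
    if perturb_hook is not None:
        hooks.append((EARLY, perturb_hook))
    model.run_with_hooks(tokens, fwd_hooks=hooks)
    return stored["resid"]

def add_noise(resid, hook):
    noise = torch.randn_like(resid)
    return resid + EPS * noise / noise.norm()

clean = late_resid()
perturbed = late_resid(add_noise)

# Ratio < 1: the early-layer perturbation shrank by the later layer; > 1: it grew.
print((perturbed - clean).norm() / EPS)
```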
But this doesn’t explain how the model does this, right? Are there simple stories for how this happens?
I guess there are lots of stories, not limited to MCIS; anything along the lines of “ReLUs require thresholds to be passed”?
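As a toy version of that threshold story (entirely made-up numbers, not tied to any real model): a neuron that only fires when its feature is clearly present will zero out small off-distribution perturbations, so they never make it to later layers.

```python
import torch

w = torch.tensor([1.0, 0.0])   # neuron reads off one feature direction
b = torch.tensor(-0.5)         # bias acts as a threshold: pre-activation must exceed 0.5

clean = torch.tensor([1.0, 0.0])   # the feature is clearly present
noise = 0.2 * torch.randn(2)       # a small perturbation

print(torch.relu(clean @ w + b))            # fires at ~0.5
print(torch.relu((clean + noise) @ w + b))  # still fires at roughly the same value
print(torch.relu(noise @ w + b))            # almost always 0: noise alone stays below threshold
```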
Based on that, I think the results still require some “error-correction” explanation, though you’re right that this doesn’t have to be MCIS (it’s just that there’s no other theory that doesn’t also conflict with superposition?).